# Week 3


## 1. Hierarchical clustering

### 1.1. Components

1.1.1. Group things

1.1.2. Interpret the grouping

1.1.3. Define close

1.1.4. How to visualize

### 1.2. Approach

1.2.1. Find the 2 closest points

1.2.2. Merge them into 1 super point

1.2.3. Find the next closest points, treating the super point as a single point, and repeat

### 1.3. Distance

1.3.1. Continuous

1.3.1.1. Euclidean distance

1.3.1.1.1. sqrt((A1 - A2)^2 + (B1 - B2)^2 + ... + (Z1 - Z2)^2)

1.3.1.2. Correlation similarity

1.3.2. Discrete

1.3.2.1. Manhattan distance

1.3.2.1.1. |A1 - A2| + |B1 - B2| + ... + |Z1 - Z2|
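The two formulas above can be checked on a small example. This is a minimal sketch: the points `p1` and `p2` are made-up values, and turning correlation into a distance as `1 - cor()` is one common convention, assumed here rather than taken from the notes.

```r
# Two hypothetical observations with three features each
p1 <- c(1, 2, 3)
p2 <- c(4, 6, 3)

# Hand-computed distances following the formulas above
euclidean <- sqrt(sum((p1 - p2)^2))   # sqrt(9 + 16 + 0) = 5
manhattan <- sum(abs(p1 - p2))        # 3 + 4 + 0 = 7

# dist() computes the same quantities between the rows of a matrix
pts <- rbind(p1, p2)
dEuclidean <- as.numeric(dist(pts))                        # method = "euclidean" is the default
dManhattan <- as.numeric(dist(pts, method = "manhattan"))

# Assumed convention: correlation similarity turned into a distance
corDistance <- 1 - cor(p1, p2)
```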

### 1.4. Code in R

1.4.1. dataFrame <- data.frame(x = x, y = y)

1.4.2. distxy <- dist(dataFrame)

1.4.2.1. Calculates distances between points

1.4.2.2. By default uses Euclidean distance

1.4.3. clusters <- hclust(distxy)

1.4.4. plot(clusters)
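Put together, the calls above might look like the following runnable sketch. The simulated `x` and `y` (three groups of four points) and the seed are assumptions made for illustration; the notes do not say how the data were generated.

```r
set.seed(1234)

# Simulated data: three groups of four points each (illustrative assumption)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

distxy <- dist(dataFrame)     # pairwise Euclidean distances between rows
clusters <- hclust(distxy)    # agglomerative clustering, complete linkage by default
plot(clusters)                # draws the dendrogram
```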

### 1.5. Analysis

1.5.1. Doesn't say how many clusters there are

1.5.2. Need to cut the tree at some height to get clusters

1.5.2.1. Not always obvious

1.5.3. It is deterministic

1.5.4. Yet may be unstable

1.5.5. Primarily used for exploration
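Cutting the tree can be done with `cutree()` from base R. The sketch below reuses simulated data (an assumption, as is the cut height of 1.0) and shows both ways to cut: by number of clusters and by height.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
clusters <- hclust(dist(data.frame(x = x, y = y)))

# Cut so that exactly 3 clusters remain
groupsK <- cutree(clusters, k = 3)

# Or cut at a given height on the dendrogram (1.0 is an arbitrary choice)
groupsH <- cutree(clusters, h = 1.0)

# Cluster sizes for the k = 3 cut
table(groupsK)

# Deterministic: rerunning on the same data gives the same assignment
groupsAgain <- cutree(hclust(dist(data.frame(x = x, y = y))), k = 3)
```

Different heights give different numbers of clusters, which is why the choice is not always obvious.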

### 1.6. Prettier diagram

1.6.1. myplclust(clusters, lab = , lab.col = ): a helper function from the course (not part of base R) that colors the leaf labels

1.6.2. heatmap(dataMatrix)

1.6.2.1. Runs cluster analysis on rows and columns

1.6.2.2. Closest are together
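A minimal sketch of the `heatmap()` call, assuming a random 40 x 10 matrix as stand-in data. The return value holds the row and column orderings chosen by the two cluster analyses.

```r
set.seed(143)

# Stand-in data: 40 rows, 10 columns of random noise (assumption)
dataMatrix <- matrix(rnorm(400), nrow = 40)

# heatmap() clusters rows and columns and reorders both before drawing
hm <- heatmap(dataMatrix)

# Permutations of the original row and column indices
rowOrder <- hm$rowInd
colOrder <- hm$colInd
```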

### 1.7. How do you merge points?

1.7.1. Average - the average location of the points in each cluster

1.7.2. Complete - the farthest points of the two clusters
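In `hclust()` the merging rule is chosen with the `method` argument. The sketch below compares two of them on simulated data (an assumption). Note that `method = "average"` in `hclust()` uses the mean of all pairwise distances between two clusters; `method = "centroid"` is probably the closer match to "average location", so treat the mapping as approximate.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
distxy <- dist(data.frame(x = x, y = y))

# Same distances, two different merging rules
hcComplete <- hclust(distxy, method = "complete")  # farthest pair of points
hcAverage <- hclust(distxy, method = "average")    # mean pairwise distance

# Side-by-side dendrograms; merge heights differ between the two methods
oldPar <- par(mfrow = c(1, 2))
plot(hcComplete, main = "Complete linkage")
plot(hcAverage, main = "Average linkage")
par(oldPar)
```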

## 2. K-means clustering

### 2.1. Principles

2.1.1. How to define close?

2.1.1.1. Same as in hierarchical clustering

2.1.2. ... the other components are the same

### 2.3. Approach

2.3.1. Fixed number of clusters

2.3.2. Get centroids

2.3.3. Assign things to closest centroid

2.3.4. Recalculate centroids

2.3.5. Not deterministic

### 2.4. Produces

2.4.1. Final centroids

2.4.2. Which point is related to which centroid

### 2.5. Code

2.5.1. dataFrame <- data.frame(x, y)

2.5.2. kmeansObj <- kmeans(dataFrame, centers = 3)

2.5.3. names(kmeansObj)

2.5.4. plot(x, y, col = kmeansObj\$cluster, pch = 19, cex = 2)

2.5.5. points(kmeansObj\$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)

2.5.6. image(t(dataMatrix)[, order(kmeansObj\$cluster)], yaxt = "n")
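A runnable sketch of the k-means steps, assuming simulated data with three groups. The `nstart = 10` argument is an addition not found in the notes: it reruns the algorithm from several random starts, which helps with the non-determinism mentioned above. The centroids are drawn with `points()` so that they are added to the existing scatter plot rather than starting a new one.

```r
set.seed(1234)

# Simulated data: three groups of four points each (illustrative assumption)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)

# Fixed number of clusters; nstart = 10 is an added safeguard (assumption)
kmeansObj <- kmeans(dataFrame, centers = 3, nstart = 10)

# Components of the result, including "cluster" and "centers"
names(kmeansObj)

# Points colored by assigned cluster, final centroids overlaid as crosses
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```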

## 3. Dimension reduction

### 3.1. Find new set of variables that are uncorrelated and explain as much variance as possible

3.1.1. statistical goal

3.1.2. data compression

### 3.2. Lower-rank matrix

3.2.1. Capture as much of the variance in the data as possible with as few components as possible

### 3.3. Solutions

3.3.1. SVD

3.3.1.1. X = UDV^T

3.3.1.2. Code

3.3.1.2.1. svd1 <- svd(scale(dataMatrixOrdered))

3.3.1.2.2. plot(svd1\$v[, 1], pch = 19)

3.3.2. PCA

3.3.3. SVD and PCA produce the same result on scaled data: essentially the same thing

3.3.4. Understand variance

3.3.4.1. plot(svd1\$d^2/sum(svd1\$d^2))

3.3.4.2. Building an approximation: approx <- svd1\$u[, 1:k] %*% diag(svd1\$d[1:k]) %*% t(svd1\$v[, 1:k]), where k is the number of components you want to use

3.3.5. Can be computationally expensive
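The SVD steps in this section can be sketched end to end as follows. The data matrix and the pattern added to it are made up for illustration. The `drop = FALSE` and `nrow = k` arguments are additions so that the approximation also works for k = 1, and the `prcomp()` comparison is one way to check that PCA and SVD agree on scaled data (up to sign).

```r
set.seed(678910)

# Stand-in data: noise plus a shift in the last five columns of half the rows
dataMatrix <- matrix(rnorm(400), nrow = 40)
dataMatrix[21:40, 6:10] <- dataMatrix[21:40, 6:10] + 3

# X = U D V^T on the centered and scaled matrix
svd1 <- svd(scale(dataMatrix))

# Proportion of variance explained by each component
varExplained <- svd1$d^2 / sum(svd1$d^2)
plot(varExplained, pch = 19, xlab = "Component", ylab = "Proportion of variance")

# Rank-k approximation of the scaled matrix
k <- 2
approxK <- svd1$u[, 1:k, drop = FALSE] %*%
  diag(svd1$d[1:k], nrow = k) %*%
  t(svd1$v[, 1:k, drop = FALSE])

# PCA on the same data: first loading vector matches the first right
# singular vector up to sign
pca1 <- prcomp(dataMatrix, scale. = TRUE)
maxDiff <- max(abs(abs(pca1$rotation[, 1]) - abs(svd1$v[, 1])))
```

Using all components (k equal to the number of columns) reproduces the scaled matrix exactly; smaller k trades accuracy for compression.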

### 3.4. Missing values

3.4.1. Cannot run if there are missing values

3.4.2. Solution - fill the data

3.4.2.1. impute

3.4.2.1.1. dataMatrix <- impute.knn(dataMatrix)\$data

3.4.2.1.2. Replaces each missing value with the average of the k nearest rows
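The sketch below shows the failure and one way around it using base R only. It is not the k-nearest-neighbour imputation named above: as a simpler stand-in, each missing value is replaced by its column mean. The data and the positions of the missing values are made up.

```r
set.seed(1)

# Stand-in data with 40 values knocked out at random (assumption)
dataMatrix <- matrix(rnorm(400), nrow = 40)
dataMatrix[sample(1:400, size = 40, replace = FALSE)] <- NA

# svd() stops with an error when the input contains missing values
svdFails <- inherits(try(svd(scale(dataMatrix)), silent = TRUE), "try-error")

# Stand-in imputation: replace each NA with the mean of its column
filled <- dataMatrix
for (j in seq_len(ncol(filled))) {
  missing <- is.na(filled[, j])
  filled[missing, j] <- mean(filled[, j], na.rm = TRUE)
}

# With the gaps filled, the decomposition runs
svd2 <- svd(scale(filled))
```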

## 4. Plotting and Color in R

### 4.2. grDevices package

4.2.1. colorRamp()

4.2.1.1. pal <- colorRamp(c("red", "blue"))

4.2.1.2. pal(0) - returns RGB 255 0 0

4.2.1.3. pal(1) - returns 0 0 255

4.2.1.4. pal(0.5) - returns 127.5 0 127.5

4.2.1.5. pal(seq(0, 1, len = 10))

4.2.2. colorRampPalette()

4.2.2.1. pal <- colorRampPalette(c("red", "yellow"))

4.2.2.2. pal(10) - returns 10 #HEX colors

4.2.3. Helps to interpolate between colors
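A short sketch of both interpolation functions. `colorRamp()` returns a function that maps numbers in [0, 1] to RGB rows, while `colorRampPalette()` returns a function that maps a count to that many hex colors. The name `palHex` is chosen here only to keep the two apart.

```r
# colorRamp(): values between 0 and 1 are mapped to RGB triples (0-255)
pal <- colorRamp(c("red", "blue"))
redRgb <- pal(0)                     # 255 0 0
blueRgb <- pal(1)                    # 0 0 255
midRgb <- pal(0.5)                   # 127.5 0 127.5
tenRgb <- pal(seq(0, 1, len = 10))   # matrix with 10 rows and 3 columns

# colorRampPalette(): an integer n is mapped to n hex color strings
palHex <- colorRampPalette(c("red", "yellow"))
tenHex <- palHex(10)                 # runs from "#FF0000" to "#FFFF00"
```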

### 4.3. RColorBrewer package

4.3.1. types of palette

4.3.1.1. Sequential

4.3.1.1.1. numerical, ordered data

4.3.1.2. Diverging

4.3.1.2.1. for showing deviation, can be positive or negative

4.3.1.3. Qualitative

4.3.1.3.1. for factors

4.3.2. Palettes can be passed to colorRamp()

4.3.3. Palettes can be passed to colorRampPalette()

4.3.4. Code

4.3.4.1. library(RColorBrewer)

4.3.4.2. cols <- brewer.pal(3, "BuGn"), where 3 is the number of colors

4.3.4.3. pal <- colorRampPalette(cols)

4.3.4.4. image(volcano, col = pal(20))

### 4.4. rgb() function

4.4.1. alpha parameter sets transparency (0 = fully transparent, 1 = opaque)

4.4.2. rgb(0, 0, 0, 0.2)
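One typical use of the alpha parameter is a scatter plot with many overlapping points. This is a minimal sketch with made-up data: where points pile up, the semi-transparent black gets darker.

```r
set.seed(1)

# Many overlapping points (stand-in data)
x <- rnorm(2000)
y <- rnorm(2000)

# Black at 20% opacity; the fourth argument is alpha
semiBlack <- rgb(0, 0, 0, 0.2)   # "#00000033"

plot(x, y, pch = 19, col = semiBlack)
```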

### 4.5. colorspace package

4.5.1. Offers a different kind of control over colors