吴恩达机器学习第七次编程作业-K-means Clustering and Principal Component Analysis

2019-09-16

机器学习

Programming Exercise 7: K-means Clustering and Principal Component Analysis

环境： Python3.5 , Pycharm2019.01

参考资料：

K-means Clustering完整代码
 Principal Component Analysis完整代码

Part 1: K-Means Clustering

实现K-Means算法

Finding closest centroids

根据所给出的簇中心将所有的样本点进行标记，返回一维数组代表每个样本所属的簇，簇编号从1开始。

def Find_Closets_Centroid(X,centroid):
    idx=np.zeros(X.shape[0],dtype=int)
    for i in range(X.shape[0]):
        tmp_x=(X[i]-centroid)**2
        idx[i]=np.argmin(np.sum(tmp_x,axis=1)) # 先按行相加得到[1,3]的距离数组，再找出最小值下标
    return idx  # 下标从0开始
    
K=3 # 簇中心数量
init_centroid=[[3,3],[6,2],[8,5]] 
idx=Find_Closets_Centroid(X,init_centroid)
# print(idx[:3]) # except to see [0 2 1]

Computing centroid means

既然每个样本都有一个所属簇，那么分别计算这些不同簇的样本点集的均值作为新的簇中心。

def Computer_Centroids(X,idx,K):
    centroids=np.zeros((K,X.shape[1]))
    for i in range(K):
        centroids[i]=np.mean(X[np.where(idx==i)],axis=0) # 先将样本点集区分出来成为新的二维数组，再按列取均值
    return centroids

centroids=Computer_Centroids(X,idx,K)
print(centroids) # except to see [[2.42830111 3.15792418],[5.81350331 2.63365645],[7.11938687 3.6166844 ]]

K-means on example dataset

手写K-Means算法，观察簇中心在数据集上的变化情况。

def Record_Centroid(i,centroids,history_centroids): # 记录簇中心变化
    history_centroids_tmp = centroids.copy()
    history_centroids[0][i] = history_centroids_tmp[0]
    history_centroids[1][i] = history_centroids_tmp[1]
    history_centroids[2][i] = history_centroids_tmp[2]
    return history_centroids

def kMeans(X,K,init_centroid,max_iter): # K-Means算法运行过程
    centroids=init_centroid
    history_centroids=np.zeros((K,max_iter,X.shape[1])) # 申请三维数组记录K个簇变化
    idx=np.zeros(X.shape[0],dtype=int)
    for i in range(max_iter):
        idx=Find_Closets_Centroid(X,centroids)
        history_centroids=Record_Centroid(i,centroids,history_centroids)
        centroids=Computer_Centroids(X,idx,K) # new centroids
    return centroids,idx,history_centroids

def Plot_kMeans_Progress(X,history_centroids,idx,K): # 观察K-Means簇中心变化
    cluster1=X[np.where(idx==0)] # 最终所属的簇
    cluster2 = X[np.where(idx == 1)]
    cluster3 = X[np.where(idx == 2)]
    plt.plot(cluster1[:,0],  cluster1[:,1],'r.', label='Cluster 1')
    plt.plot(cluster2[:, 0], cluster2[:, 1],'g.', label='Cluster 2')
    plt.plot(cluster3[:, 0], cluster3[:, 1],'b.', label='Cluster 3')
    plt.plot(history_centroids[0][:,0],  history_centroids[0][:,1],'kX-',alpha=0.5) # alpha设置透明度
    plt.plot(history_centroids[1][:, 0], history_centroids[1][:, 1],'kX-',alpha=0.5)
    plt.plot(history_centroids[2][:, 0], history_centroids[2][:, 1],'kX-',alpha=0.5)
    plt.xlim(-1,9)
    plt.legend()
    plt.title('Iteration number 10',fontsize=12)
    plt.show()

Random initialization

给定样本集和簇中心数量随机选取中心簇作为初始化。

def Random_Init_Centroids(X,K):
    rand_idx=np.arange(X.shape[0]) # 生成排列
    np.random.shuffle(rand_idx)  # 随机打乱，每次都不一样
    return X[rand_idx[:K]] # 返回前K个样本点作为簇中心

Image compression with K-means

使用K-Means算法将所给的图片进行压缩。
原理：

关于图片：这里所使用的是png格式的彩图，即每个像素点24位，每8位分别代表RGB。像素点越多越清晰，这里图片大小为128*128，代表长宽也代表像素点个数。
使用plt.imread(picture_path)读入图片，将二进制信息以三维矩阵形式读出(128,128,3)。
重塑图片，使其次成为二维[128*128,3]，然后每行代表一个像素点，接着使用K-Means算法找到16个簇中心，然后用这些簇中心的值代替其所属簇的像素点的值。这样就将24位压缩成了16位，然后再重塑回三维[128,128,3]，显示效果对比。
使用不同的K，观察效果
A=A/255是特征缩放，以便计算距离，最终显示还要乘上255，还是24位

为什么能实现压缩？

之前说的像素点每个点都有一个值，范围是[0-255]，实际上压缩上述代码实现压缩之后图片大小貌似并没有变化，那么做这个意义何在？
其实，原图每个像素点可以看成一种颜色，我们做的事就是找出16种主要的颜色代替其余颜色。值的范围没有变，但我们就可以改变之前像素点所存储的信息了，我们先在只要4bit就可以知道这个像素点所属的簇，那么再找到这个簇的颜色就是这个像素点的颜色了。
故，我们需要额外的一点点空间存储簇的颜色，然后(以下面图片为例)图片大小就可以用128*128*4替换，而不是218*128*24。有种类似时间换空间的感觉。
下面代码也就是实现了完整的图像压缩核心部分以显示效果，并没有实际压缩图片。

def Clustering_On_Pixels(path,K,max_iter):
    A=plt.imread(path)
    # print(type(A),A.shape) # except to see: <class 'numpy.ndarray'> (128, 128, 3)
    A=A/255 # Divide by 255 so that all values are in the range 0 - 1
    X=A.reshape(A.shape[0]*A.shape[1],3)  # [128*128,3]
    init_centroids=Random_Init_Centroids(X,K) # 初始化簇中心
    centroids,idx,history=kMeans(X,K,init_centroids,max_iter) # 迭代找到16个簇中心
    return A,X,centroids

def Image_Compression(path):
    K, max_iter = 16, 10  # try different values of K and max_iters here
    A,X,centroids=Clustering_On_Pixels(path,K,max_iter)
    idx=Find_Closets_Centroid(X,centroids)
    X_recovered=centroids[idx] # 值代替
    X_recovered=X_recovered.reshape(A.shape[0],A.shape[1],3) # 重塑回三维
    plt.subplot(1,2,1)
    plt.imshow(A*255)
    plt.title('Original')
    plt.subplot(1, 2, 2)
    plt.imshow(X_recovered*255)
    plt.title('Compressed, with %d colors' %K,fontsize=10)
    plt.show()

Image_Compression('./ML/ex7/bird_small.png')

Part2: Principal Component Analysis

2.1 Example Dataset

def  Get_data(path):
    data=sio.loadmat(path)
    # for key in data:
    #     print(key)
    return data,data['X']

''' Visualizing example dataset for PCA '''
data,X=Get_data('./ML/ex7/ex7data1.mat')
# print(X.shape) # (50,2)
plt.scatter(X[:,0], X[:,1], facecolors='none', edgecolors='b')
plt.show()

2.2 Implementing PCA

def Feature_Normalize(X):
    means = np.mean(X, axis=0)  # the mean of the each feature
    std = np.std(X, axis=0, ddof=1)  # 取每列标准偏差
    return (X-means)/std,means

def PCA(X):
    sigma=(X.T@X)/X.shape[0] # covariance matrix
    U,S,V=np.linalg.svd(sigma) # wait to learn
    return U,S,V

X_normal,means=Feature_Normalize(X) # normalize X
U,S,V=PCA(X_normal)
# print(means.shape,S.shape,U.shape)
# print(U[:,0]) # except to  see: [-0.70710678 -0.70710678]
plt.plot([means[0], means[0] + 1.5*S[0]*U[0,0]], [means[1], means[1] + 1.5*S[0]*U[0,1]],c='r', linewidth=3, label='First Principal Component')
plt.plot([means[0], means[0] + 1.5*S[1]*U[1,0]], [means[1], means[1] + 1.5*S[1]*U[1,1]],c='g', linewidth=3, label='Second Principal Component')
plt.show()

Dimensionality Reduction with PCA

def Project_Data(X,U,K): # 数据投影
    return X@U[:,:K]
def Recover_Data(Z,U,K): # 数据重构
    return Z@U[:,:K].T
def Visualizing_Projections(X_normal,X_rec):
    plt.scatter(X_normal[:, 0], X_normal[:, 1], facecolors='none', edgecolors='b', label='X_normal')
    plt.scatter(X_rec[:, 0], X_rec[:, 1], facecolors='none', edgecolors='r', label='X_rec')
    plt.xlim(-4, 3)
    plt.ylim(-4, 3)
    for i in range(X_normal.shape[0]):
        plt.plot([X_normal[i, 0], X_rec[i, 0]], [X_normal[i, 1], X_rec[i, 1]], 'k--')
    plt.legend(loc=2)
    plt.grid(True)
    plt.show()

''' Dimensionality Reduction with PCA '''
Z=Project_Data(X_normal,U,1) # Projecting the data onto the principal components
# print(Z.shape,Z[0]) # except to see: (50, 1) [1.48127391]
X_rec=Recover_Data(Z,U,1)
# print(X_rec.shape,X_rec[0]) # (50, 2) [-1.04741883 -1.04741883]
Visualizing_Projections(X_normal,X_rec)

Face Image Dataset

Faces dataset

5000*1024的灰度图像，在显示的时候重塑乘32*32的就行。

def Show_Face(X,num):
    plt.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9,hspace=0.02, wspace=0.02)
    for i in range(num):
        plt.subplot(10,10,i+1,xticks=[], yticks=[]) # 不显示刻度
        plt.imshow(X[i].reshape(32,32).T,cmap = 'Greys_r',interpolation='nearest') # 指定为灰度，默认RGB
    plt.show()

PCA on Faces

data,X=Get_data('./ML/ex7/ex7faces.mat')
# Show_Face(X,100)
X_noraml,means=Feature_Normalize(X)
U,S,V=PCA(X_noraml)
# print(U.shape) # (1024,1024)
Show_Face(U.T,36) # change subplot=6*6

Reconstruct Data

Z=Project_Data(X_noraml,U,100) # Dimensionality Reduction
# print(Z.shape) # (5000, 100)
X_rec=Recover_Data(Z,U,100)
# print(X_rec.shape) # (5000, 1024)
# Show_Face(X_noraml,100)
Show_Face(X_rec,100)