Andrew Ng's Machine Learning, Programming Exercise 8: Recommender Systems

Programming Exercise 8: Recommender Systems

Goal: implement the collaborative filtering algorithm and apply it to a movie rating system


Tools: PyCharm, Python 3.6


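References:

  1. ex8_cofi.m
  2. Usage of and differences between ravel(), flatten(), and squeeze() in numpy
  3. numpy.r_ and numpy.c_ in Python
  4. Regularization
  5. Matrix factorization applied to collaborative filtering recommendation algorithms
  6. Recommender systems
  7. Using the various scipy.optimize optimizers

Full code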

  1. Recommender Systems

Movie ratings dataset

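This part mainly introduces the data format (the description gives a feature dimension of 100, while the data loaded below has 10 features). Omitted.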

Collaborative filtering learning algorithm

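Feature matrix $X$: shape $(n_m, 100)$
Parameter matrix $\Theta$: shape $(n_u, 100)$
Rating matrix $Y$: approximated by $X\Theta^T$, with $y^{(i,j)} = (\theta^{(j)})^T x^{(i)}$ (the actual dimensions follow the loaded data)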
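The relation between the matrix product and the per-entry formula can be checked on toy data. The following is a minimal sketch; the sizes and the random values are assumptions chosen for illustration and are not taken from the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)
num_m, num_u, num_f = 5, 4, 3            # movies, users, features (toy sizes)
X = rng.normal(size=(num_m, num_f))      # one row per movie
theta = rng.normal(size=(num_u, num_f))  # one row per user

pred = X @ theta.T                       # (num_m, num_u) matrix of predicted ratings
i, j = 2, 1
single = theta[j] @ X[i]                 # prediction of user j for movie i

print(pred.shape, np.isclose(pred[i, j], single))
```

Entry (i, j) of the product is the inner product of user j's parameter vector with movie i's feature vector, so rows index movies and columns index users.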

data1=Get_Data('./ex8/ex8_movies.mat')
Y,R=data1['Y'],data1['R']

data2=Get_Data('./ex8/ex8_movieParams.mat')
X,theta=data2['X'],data2['Theta']

num_u,num_m,num_f=map(int,[data2['num_users'],data2['num_movies'],data2['num_features']])
print(X.shape,theta.shape) # (1682, 10) (943, 10)
print(num_u,num_m,num_f) # 943 1682 10

# Reduce the data set size so that this runs faster
num_u,num_m,num_f=4,5,3
X,theta=X[:num_m,:num_f],theta[:num_u,:num_f]
Y,R=Y[:num_m,:num_u],R[:num_m,:num_u]
# print(X.shape,theta.shape) # (5, 3) (4, 3)
print(cofiCostFunc(Merge(X,theta),Y,R,num_u,num_m,num_f,0)) # 22.224603725685675

For brevity, the figures and implementations below already include regularization.

Collaborative filtering cost function


To use an advanced optimizer, first merge the parameters into a single vector, then unpack it again inside the function.

def Merge(X,theta):
    return np.r_[X.flatten(),theta.flatten()] # flatten each matrix row by row, then concatenate
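A round trip makes the merge and unpack convention concrete. This is a minimal sketch: merge mirrors the function above, while unpack is a hypothetical helper name introduced here for illustration (the post performs the same slicing inline).

```python
import numpy as np

def merge(X, theta):
    """Concatenate the row-major flattened X and theta into one 1-D vector."""
    return np.r_[X.flatten(), theta.flatten()]

def unpack(params, num_m, num_f, num_u):
    """Inverse of merge: split the vector and restore both matrix shapes."""
    split = num_m * num_f
    X = params[:split].reshape(num_m, num_f)
    theta = params[split:].reshape(num_u, num_f)
    return X, theta

X = np.arange(6.0).reshape(3, 2)             # 3 movies, 2 features
theta = np.arange(10.0, 14.0).reshape(2, 2)  # 2 users, 2 features
params = merge(X, theta)
X_back, theta_back = unpack(params, num_m=3, num_f=2, num_u=2)
```

Because flatten and reshape both use row-major order by default, the unpacked matrices are identical to the originals.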

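Collaborative filtering has no bias term, so all parameters can be regularized directly.
On regularization: I went back to check why θ0 is not penalized and found a source saying that it is simply a convention not to penalize the 0th term. Lesson 57 of Andrew Ng's machine learning video course also only briefly mentions "treating it differently".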

def cofiCostFunc(params,Y,R,num_u,num_m,num_f,lamda):
    X,theta=params[:num_m*num_f].reshape(num_m,num_f),params[num_m*num_f:].reshape(num_u,num_f) # unpack the merged vector
    # print(X.shape,theta.shape,Y.shape,R.shape)
    error=0.5*np.square((X.dot(theta.T)-Y)*R).sum()
    reg1=0.5*lamda*np.square(theta).sum()
    reg2=0.5*lamda*np.square(X).sum()
    return error + reg1 + reg2

Note that the error must be multiplied by R so that only rated entries contribute and unrated entries are ignored.
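For reference, the regularized cost that the function above computes can be written out as follows. This is a transcription of the code rather than a quotation from the exercise handout; the symbols $n_m$, $n_u$, $n$ (number of features) and $r(i,j)$ (1 if user $j$ rated movie $i$, else 0) are notation assumed here.

```latex
J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2
  + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2
  + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2
```

The first term corresponds to error, the second to reg1, and the third to reg2 in the code.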

Collaborative filtering gradient


To use an advanced optimizer, this function also returns the two gradient matrices merged into a single vector.
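The gradients of the regularized cost with respect to each feature and each parameter can be written as below. This is a sketch in assumed notation, where $r(i,j)$ is 1 if user $j$ rated movie $i$ and 0 otherwise, and $k$ indexes the features.

```latex
\frac{\partial J}{\partial x_k^{(i)}}
  = \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)}

\frac{\partial J}{\partial \theta_k^{(j)}}
  = \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)}
```

In matrix form, with $E = (X\Theta^T - Y) \circ R$, these are $E\,\Theta + \lambda X$ and $E^T X + \lambda \Theta$.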

def cofiGradient(params,Y,R,num_u,num_m,num_f,lamda):
    X, theta = params[:num_m * num_f].reshape(num_m, num_f), params[num_m * num_f:].reshape(num_u, num_f)
    X_grad=((X.dot(theta.T)-Y)*R)@theta+lamda*X
    theta_grad=((X.dot(theta.T)-Y)*R).T@X+lamda*theta
    return Merge(X_grad,theta_grad)
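Before handing the gradient to an optimizer, it can be compared against a finite-difference estimate of the cost. The following is a minimal, self-contained sketch of such a check; the function names, the toy sizes, and the random data are assumptions for illustration, and the cost and gradient simply restate the formulas used in the post.

```python
import numpy as np

def unpack(params, num_u, num_m, num_f):
    X = params[:num_m * num_f].reshape(num_m, num_f)
    theta = params[num_m * num_f:].reshape(num_u, num_f)
    return X, theta

def cost(params, Y, R, num_u, num_m, num_f, lam):
    X, theta = unpack(params, num_u, num_m, num_f)
    err = (X @ theta.T - Y) * R              # errors on rated entries only
    reg = np.square(X).sum() + np.square(theta).sum()
    return 0.5 * np.square(err).sum() + 0.5 * lam * reg

def grad(params, Y, R, num_u, num_m, num_f, lam):
    X, theta = unpack(params, num_u, num_m, num_f)
    err = (X @ theta.T - Y) * R
    X_grad = err @ theta + lam * X
    theta_grad = err.T @ X + lam * theta
    return np.r_[X_grad.flatten(), theta_grad.flatten()]

def numerical_grad(f, params, eps=1e-4):
    """Central-difference estimate of the gradient of f at params."""
    num = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        num[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return num

rng = np.random.default_rng(0)
num_u, num_m, num_f, lam = 4, 5, 3, 1.5
R = (rng.random((num_m, num_u)) > 0.5).astype(float)      # random rating mask
Y = rng.integers(1, 6, size=(num_m, num_u)).astype(float) * R
params = rng.normal(size=(num_m + num_u) * num_f)

analytic = grad(params, Y, R, num_u, num_m, num_f, lam)
numeric = numerical_grad(lambda p: cost(p, Y, R, num_u, num_m, num_f, lam), params)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A very small maximum difference suggests that the analytic gradient matches the cost function; a large one usually points to a missing mask, a transposed product, or a missing regularization term.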

Learning movie recommendations

  1. Load the movie list, then manually add some ratings

    movies=[]
    with open('./ex8/movie_ids.txt','r',encoding='utf-16') as f:
        for line in f:
            movies.append(' '.join(line.split()[1:]))

    my_ratings = np.zeros(len(movies)) # 1682

    my_ratings[0] = 4
    my_ratings[97] = 2
    my_ratings[6] = 3
    my_ratings[11] = 5
    my_ratings[53] = 4
    my_ratings[63] = 5
    my_ratings[65] = 3
    my_ratings[68] = 5
    my_ratings[182] = 4
    my_ratings[225] = 5
    my_ratings[354] = 5
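  2. Append your own rating vector to the rating matrix (the rating indicator matrix R must be updated accordingly), then randomly initialize the feature matrix X and the parameter matrix θ and run an advanced optimizer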

    Y = np.c_[Y, my_ratings]  # (1682, 944)
    R = np.c_[R, my_ratings!=0] # (1682, 944)
    num_m,num_u=Y.shape
    num_f=10

    Y_mean,Y_norm=Normalize_rating(Y,R)

    lambd = 10

    # Set Initial Parameters (Theta, X)
    X=np.random.random((num_m,num_f))
    theta=np.random.random((num_u,num_f))
    # Optimize on the mean-normalized ratings, because Y_mean is added back at prediction time
    res=opt.minimize(fun=cofiCostFunc,
                     x0=Merge(X,theta),
                     args=(Y_norm,R,num_u,num_m,num_f,lambd),
                     method='TNC',
                     jac=cofiGradient,
                     options={'maxiter':100})
  3. Use the returned parameters to predict ratings and make recommendations

    ret=res.x
    fit_x,fit_theta=ret[:num_m*num_f].reshape(num_m,num_f),ret[num_m*num_f:].reshape(num_u,num_f)

    p=fit_x@fit_theta.T # predicted rating matrix

    my_predict=p[:,-1]+Y_mean.flatten() # add the per-movie mean back (mean normalization was applied earlier)

    idx=np.argsort(my_predict)[::-1] # indices that sort the predictions in descending order

    print("Top recommendations for you:")
    for i in range(10):
        print('Predicting rating %.1f for movie %s.' \
              %(my_predict[idx[i]],movies[idx[i]]))

    print("\nOriginal ratings provided:")
    for i in range(len(my_ratings)):
        if my_ratings[i] > 0:
            print('Rated %d for movie %s.'% (my_ratings[i],movies[i]))
Top recommendations for you:
Predicting rating 8.3 for movie Shawshank Redemption, The (1994).
Predicting rating 8.3 for movie Titanic (1997).
Predicting rating 8.3 for movie Star Wars (1977).
Predicting rating 8.2 for movie Schindler's List (1993).
Predicting rating 8.2 for movie Close Shave, A (1995).
Predicting rating 8.2 for movie Wrong Trousers, The (1993).
Predicting rating 8.2 for movie Raiders of the Lost Ark (1981).
Predicting rating 8.1 for movie Casablanca (1942).
Predicting rating 8.1 for movie Usual Suspects, The (1995).
Predicting rating 8.1 for movie Good Will Hunting (1997).

Original ratings provided:
Rated 4 for movie Toy Story (1995).
Rated 3 for movie Twelve Monkeys (1995).
Rated 5 for movie Usual Suspects, The (1995).
Rated 4 for movie Outbreak (1995).
Rated 5 for movie Shawshank Redemption, The (1994).
Rated 3 for movie While You Were Sleeping (1995).
Rated 5 for movie Forrest Gump (1994).
Rated 2 for movie Silence of the Lambs, The (1991).
Rated 4 for movie Alien (1979).
Rated 5 for movie Die Hard 2 (1990).
Rated 5 for movie Sphere (1998).
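Step 2 above calls Normalize_rating, whose body is not shown in this post. Below is a minimal sketch of mean normalization, assuming the function returns the per-movie mean (computed over rated entries only) followed by the normalized matrix, in that order. The name normalize_ratings and the handling of movies without any rating are assumptions made here, not the author's code. Adding Y_mean back at prediction time is only appropriate when the optimizer was run on the normalized matrix; otherwise every prediction is shifted upward by the movie mean, which would be consistent with predicted ratings above 5 like those in the output above.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, using rated entries only."""
    Y = np.asarray(Y, dtype=float)
    R = np.asarray(R, dtype=float)
    counts = R.sum(axis=1, keepdims=True)      # number of ratings per movie
    sums = (Y * R).sum(axis=1, keepdims=True)  # sum of ratings per movie
    # Movies without ratings get mean 0 instead of a division by zero.
    Y_mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    Y_norm = (Y - Y_mean) * R                  # unrated entries stay 0
    return Y_mean, Y_norm

Y = np.array([[5, 0, 3],
              [0, 0, 0],
              [4, 4, 0]])
R = Y != 0
Y_mean, Y_norm = normalize_ratings(Y, R)
```

Y_mean has shape (num_movies, 1), so Y_mean.flatten() gives the vector that is added back to one user's column of predictions.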

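Summary

The overall procedure consists of the following steps:

  1. Implement the cost function correctly
  2. Implement the gradient computation correctly
  3. Handle the details, such as mean normalization
  4. Call the optimizer to find the optimal parameters
  5. Make predictions; if mean normalization was applied, add the mean back
  6. Recommend movies based on the predicted ratings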
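For step 6, one possible refinement is to skip movies the user has already rated, since the code in this post ranks all movies and can therefore recommend titles that were rated by hand. The following is a minimal sketch of that idea; top_k_unrated is a hypothetical helper and the toy arrays are made up for illustration.

```python
import numpy as np

def top_k_unrated(predictions, my_ratings, k):
    """Indices of the k highest predictions among movies not yet rated."""
    # Rated movies get a score of -inf so they sort to the end.
    scores = np.where(my_ratings > 0, -np.inf, predictions)
    order = np.argsort(scores)[::-1]           # descending by score
    n_unrated = int((my_ratings == 0).sum())
    return order[:min(k, n_unrated)]

predictions = np.array([4.8, 3.1, 4.9, 2.0, 4.5])
my_ratings = np.array([5, 0, 0, 0, 0])         # movie 0 is already rated
top = top_k_unrated(predictions, my_ratings, k=2)
print(top)
```

Here movie 0 has the second-highest prediction but is excluded because it already carries a rating, so the two recommendations are movies 2 and 4.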