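An Intelligent Steam Game Recommendation System Based on Matrix Factorization: Deep Learning Algorithm Application (with Python and ipynb project source code) + Dataset (Part 1)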

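Preface

This project uses a matrix factorization algorithm to analyze the games each player has already played. The goal is to pick, from a large catalog, the games that best suit that player, yielding a reasonably accurate game recommendation system.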

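First, the project collects and analyzes data on the games each player has played, such as the game title, whether the game was purchased or played, and the hours played. These records form a large user-game interaction matrix in which each row is a player, each column is a game, and each value describes the interaction between that player and that game.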

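Next, the project applies matrix factorization to approximate the sparse user-game matrix by the product of two much smaller matrices: a user-feature matrix and a feature-game matrix. The factorization maps players and games into a shared latent feature space, from which the latent relationship between a player and a game can be inferred.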
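
To make the factorization concrete, here is a minimal NumPy sketch. It is not the project's TensorFlow model: it assumes a tiny made-up interaction matrix, a hypothetical feature count of 2, and plain gradient descent on squared error with L2 regularization. The names `R`, `U`, `V`, and `n_features` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-game interaction matrix (4 users x 5 games); values are made up.
R = np.array([[1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1]], dtype=float)

n_users, n_games = R.shape
n_features = 2  # hypothetical number of latent features

# User-feature matrix U and feature-game matrix V, initialised with small noise.
U = rng.normal(scale=0.1, size=(n_users, n_features))
V = rng.normal(scale=0.1, size=(n_features, n_games))

lr, reg = 0.05, 0.01  # assumed learning rate and L2 regularization strength
initial_error = np.mean((R - U @ V) ** 2)
for _ in range(2000):
    err = R - U @ V                 # reconstruction error
    grad_U = -err @ V.T + reg * U   # gradient of the regularized squared error w.r.t. U
    grad_V = -U.T @ err + reg * V   # gradient w.r.t. V
    U -= lr * grad_U
    V -= lr * grad_V
final_error = np.mean((R - U @ V) ** 2)

print(U.shape, V.shape)             # (4, 2) (2, 5)
print(initial_error, final_error)
```

The two small matrices hold 4 x 2 + 2 x 5 = 18 numbers instead of the 20 in `R`; on a real catalog with thousands of games the saving is far larger.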

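Once the model is trained, the system can predict which games a player is likely to enjoy based on their play history. The prediction draws on the similarity between players and the similarity between games, so the system can give each player personalized recommendations that reflect their preferences and past behavior.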
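
The prediction step can be sketched as follows. This is an illustrative example, not the project's code: the factor matrices `U` and `V` hold made-up values, and `recommend` is a hypothetical helper that scores every game for one player, hides the games the player already owns, and returns the top-ranked game indices.

```python
import numpy as np

# Made-up factors: 3 users x 2 features, 2 features x 4 games.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
V = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.1, 0.9, 0.3, 0.7]])

# 1 marks a game the user already owns.
owned = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])

def recommend(user, n=2):
    """Return indices of the n highest-scoring games the user does not own."""
    scores = U[user] @ V                                  # predicted preference per game
    scores = np.where(owned[user] == 1, -np.inf, scores)  # exclude owned games
    return np.argsort(-scores)[:n]

print(recommend(0))   # [2 3]
```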

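In short, the project aims to deliver more accurate game recommendations through matrix factorization and a latent factor model. Personalized recommendations improve the player experience and also help the platform promote games and increase user retention.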

Overall Design

This section presents the overall system architecture diagram and the system flowchart.

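Overall System Architecture

The overall system architecture is shown in the figure.

[Figure: overall system architecture diagram]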

System Flowchart

The system workflow is shown in the figure.

[Figure: system flowchart]

Runtime Environment

This section covers the Python, TensorFlow, and PyQt5 environments.

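Python Environment

Python 3.7 or later is required. On Windows, Anaconda is the recommended way to set up the Python environment; it can be downloaded from https://www.anaconda.com/. Alternatively, the code can be run on Linux inside a virtual machine.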

Install NumPy:

conda install numpy

Install TensorFlow:

pip install tensorflow

Install Pandas:

conda install pandas

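Once these commands finish without errors, the installation is complete. The code later in this article also imports scikit-learn and joblib; if they are not already present in the environment, they can be installed with pip in the same way.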

TensorFlow Environment

Run Anaconda Prompt as administrator and enter the following in the terminal:

conda create -n your_env_name python==3.7

Then enter the following command to activate the environment:

conda activate your_env_name

PyQt5 Environment

Open Anaconda Prompt and enter the command:

conda install pyqt

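Enter y when prompted to proceed with the installation.

To package the application as an executable, install pyinstaller by entering the following in the terminal: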

pip install pyinstaller
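
Before moving on, it can help to confirm that the packages are visible to the active interpreter. The snippet below is a small standard-library sketch; the package list is an assumption based on the install commands and imports in this article, and `find_missing` is a hypothetical helper name.

```python
import importlib.util
import sys

# Import names assumed from this article's install commands and imports.
REQUIRED = ["numpy", "pandas", "tensorflow", "sklearn", "joblib", "PyQt5"]

def find_missing(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Python", sys.version.split()[0])
missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

Run it inside the activated conda environment; any name it reports can then be installed with conda or pip.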

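Module Implementation

The project consists of four modules: data preprocessing, model construction, model training and saving, and model testing. The function of each module and the related code are described below.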

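1. Data Preprocessing

The dataset comes from Kaggle at https://www.kaggle.com/tamber/steam-video-games. Each record holds the user ID, the game title, whether the action was a purchase or a play session, and the hours played. The dataset covers 12,393 users and 5,155 games. Place the dataset in the steam-video-games folder under the Jupyter working directory.
The code is as follows: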

import numpy as np
import pandas as pd
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import random
from collections import Counter
from sklearn.metrics import roc_curve, auc, average_precision_score
import joblib
#Load the dataset and display the first rows as a table
path = './steam-video-games/steam-200k.csv'
df = pd.read_csv(path, header = None, names = ['UserID', 'Game', 'Action', 'Hours', 'Not Needed'])
df.head()

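The loaded dataset is shown in the figure.
[Figure: first rows of the loaded dataset]

The raw data is messy, so it must be preprocessed to obtain each user's playtime per game. Purchase rows carry 1.0 in the hours column rather than real playtime, so they are reset to 0, and for each user-game pair only the row with the most hours is kept. The code is as follows: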

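#Derive hours played from the purchase and play records
df['Hours_Played'] = df['Hours'].astype('float32')
#Purchase rows store 1.0 in Hours as a placeholder, so count them as 0 hours played
df.loc[(df['Action']=='purchase')&(df['Hours']==1.0), 'Hours_Played'] = 0
#Sort so that, for each user-game pair, the row with the most hours comes last
df.UserID = df.UserID.astype('int')
df = df.sort_values(['UserID', 'Game', 'Hours_Played'])
#Build the cleaned table clean_df, keeping one row per user-game pair
clean_df = df.drop_duplicates(['UserID', 'Game'], keep = 'last').drop(['Action', 'Hours', 'Not Needed'], axis = 1)
clean_df.head()
#Print the number of users and games in the dataset
n_users = len(clean_df.UserID.unique())
n_games = len(clean_df.Game.unique())
print('The user-game dataset contains {0} users and {1} games'.format(n_users, n_games))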
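
The effect of this cleaning step is easiest to see on a few rows. The sketch below applies the same logic to a tiny made-up table (the user IDs, titles, and hours are invented for illustration): the purchase placeholder is zeroed, and only the row with the most hours survives for each user-game pair.

```python
import pandas as pd

# Invented rows in the same layout as the dataset's first four columns.
toy = pd.DataFrame({
    'UserID': [1, 1, 1, 2],
    'Game':   ['Dota 2', 'Dota 2', 'Portal', 'Portal'],
    'Action': ['purchase', 'play', 'purchase', 'purchase'],
    'Hours':  [1.0, 12.5, 1.0, 1.0],
})

toy['Hours_Played'] = toy['Hours'].astype('float32')
# A purchase row stores 1.0 as a placeholder, so it counts as 0 hours played.
toy.loc[(toy['Action'] == 'purchase') & (toy['Hours'] == 1.0), 'Hours_Played'] = 0
# After sorting, the row with the most hours is last within each user-game pair.
toy = toy.sort_values(['UserID', 'Game', 'Hours_Played'])
clean = (toy.drop_duplicates(['UserID', 'Game'], keep='last')
            .drop(['Action', 'Hours'], axis=1)
            .reset_index(drop=True))
print(clean)
```

User 1 keeps 12.5 hours for Dota 2, while both Portal rows end up with 0 hours because those games were bought but never played.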

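Because the user-game matrix is sparse, matrix factorization is a good fit for it. The code below measures how sparse the matrix is and builds sequential IDs for users and games: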

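#Measure sparsity: the value computed is the share of cells that hold data
sparsity = clean_df.shape[0] / float(n_users * n_games)
print('Share of filled cells in the user-game matrix: {:.2%}'.format(sparsity))
#Build sequential integer IDs for easier indexing
#Dictionary from user ID to sequential user index
user2idx = {user: i for i, user in enumerate(clean_df.UserID.unique())}
#Dictionary from sequential user index to user ID
idx2user = {i: user for user, i in user2idx.items()}
#Dictionary from game title to sequential game index
game2idx = {game: i for i, game in enumerate(clean_df.Game.unique())}
#Dictionary from sequential game index to game title
idx2game = {i: game for game, i in game2idx.items()}
#Save the dictionaries for use in the PyQt5 application
#joblib.dump fails if the target folder is missing, so create it first
import os
os.makedirs('./Save_data', exist_ok = True)
joblib.dump(idx2game, './Save_data/idx2game.pkl')
joblib.dump(game2idx, './Save_data/game2idx.pkl')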
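
As a quick illustration of the ID mapping and the density calculation, consider a handful of made-up records (the user IDs and titles below are invented). Each distinct user and game receives a consecutive integer index in order of first appearance, and the density is the number of records divided by the number of matrix cells.

```python
# Invented (user ID, game title) records; each pair appears once.
records = [(1001, 'Fallout 4'),
           (1001, 'Spore'),
           (1002, 'Dota 2'),
           (1003, 'Spore')]

# Unique values in order of first appearance.
users = list(dict.fromkeys(u for u, _ in records))
games = list(dict.fromkeys(g for _, g in records))

user2idx = {user: i for i, user in enumerate(users)}
idx2user = {i: user for user, i in user2idx.items()}
game2idx = {game: i for i, game in enumerate(games)}
idx2game = {i: game for game, i in game2idx.items()}

# 4 records spread over a 3 x 3 matrix.
density = len(records) / float(len(users) * len(games))
print(game2idx)
print('{:.2%}'.format(density))
```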

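Store the user IDs, game titles, and hours played as separate arrays. For convenience, the user IDs and game titles are stored as the sequential indices built in the previous step. The code is as follows: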

#Sequential user index, sequential game index, hours played
user_idx = clean_df['UserID'].apply(lambda x: user2idx[x]).values
game_idx = clean_df['gamesIdx'] = clean_df['Game'].apply(lambda x:game2idx[x]).values
hours = clean_df['Hours_Played'].values
#Build and save the hours-played matrix (rows are users, columns are games)
hours_save = np.zeros(shape = (n_users, n_games))
for i in range(len(user_idx)):
    hours_save[user_idx[i], game_idx[i]] = hours[i]
joblib.dump(hours_save, './Save_data/hours.pkl')
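
The loop above fills one cell at a time. As a hedged alternative, NumPy integer-array indexing can fill all cells in a single assignment. The sketch below checks on made-up index arrays that both approaches agree; it relies on each (user, game) pair appearing only once, which holds here because duplicates were dropped during cleaning.

```python
import numpy as np

# Made-up sequential indices and hours for 5 interactions.
user_idx = np.array([0, 0, 1, 2, 2])
game_idx = np.array([0, 2, 1, 0, 3])
hours = np.array([3.5, 0.0, 12.0, 1.2, 40.0], dtype='float32')
n_users, n_games = 3, 4

# Loop version, as in the article.
looped = np.zeros(shape=(n_users, n_games))
for i in range(len(user_idx)):
    looped[user_idx[i], game_idx[i]] = hours[i]

# Vectorized version: one assignment with paired index arrays.
vectorized = np.zeros(shape=(n_users, n_games))
vectorized[user_idx, game_idx] = hours

print(np.array_equal(looped, vectorized))   # True
```

Either way the matrix is dense: at the 12,393 users and 5,155 games quoted for this dataset, a float64 array of that shape needs roughly 0.5 GB of memory.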

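Build a purchase matrix from each user's purchases: games not purchased are marked 0 and purchased games are marked 1. Then build a confidence matrix from the hours played. The longer a player has played a game, the more they presumably like it, so confidence rises with playtime. The minimum confidence for a purchased game is 1, because a value of 0 would make it indistinguishable from a game the user never bought, even though the purchase itself signals interest.

The code is as follows: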

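#Build the matrices (NumPy arrays that are mostly zeros) that hold the full dataset
#Purchase matrix: 0 = not purchased, 1 = purchased
#Confidence matrix: grows with hours played, minimum 1 for purchased games
zero_matrix = np.zeros(shape = (n_users, n_games))
#Purchase matrix
user_game_pref = zero_matrix.copy()
user_game_pref[user_idx, game_idx] = 1
#Save the purchase matrix
joblib.dump(user_game_pref, './Save_data/buy.pkl')
#Confidence matrix
user_game_interactions = zero_matrix.copy()
user_game_interactions[user_idx, game_idx] = hours + 1
#To keep the evaluation reliable, require a minimum number of purchases per user: 2 * k = 10 games
k = 5
#Count the games each user purchased (column 1 of the bincount result is the number of 1s)
purchase_counts = np.apply_along_axis(np.bincount, 1, user_game_pref.astype(int))
buyers_idx = np.where(purchase_counts[:, 1] >= 2 * k)[0] 
#buyers_idx now holds the users who purchased at least 2*k games
print('{0} players purchased at least {1} games'.format(len(buyers_idx), 2 * k))
#Save the list of eligible buyers
joblib.dump(buyers_idx, './Save_data/buyers.pkl')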
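
A toy example may clarify how the two matrices differ. The values below are made up, and `min_purchases` is a hypothetical stand-in for `2 * k`; the row sum is used here as a simpler equivalent of the `np.bincount` count.

```python
import numpy as np

# Made-up interactions for 3 users and 4 games.
user_idx = np.array([0, 0, 0, 1, 2, 2])
game_idx = np.array([0, 1, 3, 1, 0, 2])
hours = np.array([0.0, 5.0, 20.0, 0.0, 2.5, 0.0])
n_users, n_games = 3, 4

pref = np.zeros((n_users, n_games))
pref[user_idx, game_idx] = 1                  # 1 = purchased, 0 = not purchased

confidence = np.zeros((n_users, n_games))
confidence[user_idx, game_idx] = hours + 1    # purchased games get at least 1

min_purchases = 2                             # hypothetical threshold for this toy data
purchase_counts = pref.sum(axis=1)            # number of purchased games per user
buyers = np.where(purchase_counts >= min_purchases)[0]

print(confidence[0])   # [ 1.  6.  0. 21.]
print(buyers)          # [0 2]
```

User 0 bought game 0 without playing it, so its confidence is 1 rather than 0, which keeps it distinct from game 2, which user 0 never bought.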

The 2,189 eligible users are split into training, test, and validation sets in proportions of 80%, 10%, and 10%. The code is as follows:

test_frac = 0.2 #Hold out 20% of users: 10% for validation and 10% for testing
test_users_idx = np.random.choice(buyers_idx, 
                            size = int(np.ceil(len(buyers_idx) * test_frac)),
                            replace = False)
val_users_idx = test_users_idx[:int(len(test_users_idx) / 2)]
test_users_idx = test_users_idx[int(len(test_users_idx) / 2):]
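
The split can be checked on a stand-in list of user indices. The sketch below uses a made-up `buyers_idx` of 50 users and a fixed seed, an addition for reproducibility that the article's code does not set. It confirms that the validation and test groups are disjoint and together make up the held-out 20%.

```python
import numpy as np

np.random.seed(42)                     # assumption: fixed seed for repeatability
buyers_idx = np.arange(100, 150)       # 50 made-up eligible user indices

test_frac = 0.2
held_out = np.random.choice(buyers_idx,
                            size=int(np.ceil(len(buyers_idx) * test_frac)),
                            replace=False)
half = len(held_out) // 2
val_users_idx = held_out[:half]        # first half of the held-out users
test_users_idx = held_out[half:]       # second half of the held-out users
train_only_idx = np.setdiff1d(buyers_idx, held_out)

print(len(train_only_idx), len(val_users_idx), len(test_users_idx))   # 40 5 5
```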

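Accuracy is computed as follows: for each held-out user, k = 5 purchased games are masked; the model then recommends 5 games, and the recommendations are compared with the masked games to obtain the hit rate. The code is as follows: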

#For each listed user, mask k purchased games in the training matrix and record them in the test matrix
def data_process(dat, train, test, user_idx, k):
    for user in user_idx:
        purchases = np.where(dat[user, :] == 1)[0]
        mask = np.random.choice(purchases, size = k, replace = False)
        train[user, mask] = 0
        test[user, mask] = dat[user, mask]
    return train, test
train_matrix = user_game_pref.copy()
test_matrix = zero_matrix.copy()
val_matrix = zero_matrix.copy()
train_matrix, val_matrix = data_process(user_game_pref, train_matrix,
                                        val_matrix, val_users_idx, k)
train_matrix, test_matrix = data_process(user_game_pref, train_matrix,
                                         test_matrix, test_users_idx, k)
#Check the masking: the masked games should be 1 in the test matrix and 0 in the training matrix
test_matrix[test_users_idx[0],test_matrix[test_users_idx[0],:].nonzero()[0]]
train_matrix[test_users_idx[0],test_matrix[test_users_idx[0],:].nonzero()[0]]
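
To show how the masked games feed into the accuracy figure, here is a minimal sketch of a precision-at-k calculation. It is an assumption about how the comparison could be done, not the article's evaluation code: `scores` is a made-up score matrix standing in for model output, and `precision_at_k` is a hypothetical helper.

```python
import numpy as np

def precision_at_k(scores, train, test, users, k):
    """Mean share of each user's top-k recommendations that are masked games."""
    hits = []
    for user in users:
        s = scores[user].astype(float).copy()
        s[train[user] == 1] = -np.inf      # never recommend games still in the training data
        top_k = np.argsort(-s)[:k]         # indices of the k highest scores
        hits.append(test[user, top_k].sum() / k)
    return float(np.mean(hits))

# Toy data: 2 users, 6 games, k = 2 masked games per user.
train = np.array([[1, 0, 0, 1, 0, 0],
                  [0, 1, 0, 0, 0, 0]])
test = np.array([[0, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 0]])
scores = np.array([[0.9, 0.8, 0.7, 0.6, 0.2, 0.1],
                   [0.1, 0.9, 0.8, 0.7, 0.3, 0.2]])

print(precision_at_k(scores, train, test, users=[0, 1], k=2))   # 0.75
```

User 0 gets both masked games back (2 of 2) and user 1 gets one (1 of 2), so the mean is 0.75.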

Project Source Code Download

See the resource download page of the author's blog.
