
A walkthrough of the tensorflow-recommenders tutorial from the official site

This post follows the TensorFlow official site and a WeChat public-account article to learn tensorflow-recommenders.

Installing tensorflow-recommenders

Install tensorflow-recommenders in your environment:

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets

Imports:

import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import ssl

# Proxy settings and SSL workaround for downloading the dataset (adjust to your environment).
os.environ['HTTP_PROXY'] = 'http://0.0.0.0:8888'
os.environ['HTTPS_PROXY'] = 'http://0.0.0.0:8888'
ssl._create_default_https_context = ssl._create_unverified_context

Loading the dataset

Since the MovieLens dataset has no predefined splits, all of its data lives in the train split.

# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train", data_dir = os.path.join(os.getcwd(), "data"))
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train", data_dir = os.path.join(os.getcwd(), "data"))

Print one example from the ratings dataset to see the data and its types: each example is a dictionary containing the movie ID, the user ID, the assigned rating, a timestamp, and movie and user information:

for x in ratings.take(1).as_numpy_iterator():
    pprint.pprint(x)

# Output:

{'bucketized_user_age': 45.0,
'movie_genres': array([7]),
'movie_id': b'357',
'movie_title': b"One Flew Over the Cuckoo's Nest (1975)",
'raw_user_age': 46.0,
'timestamp': 879024327,
'user_gender': True,
'user_id': b'138',
'user_occupation_label': 4,
'user_occupation_text': b'doctor',
'user_rating': 4.0,
'user_zip_code': b'53211'}

Similarly, the movies dataset contains the movie ID, the movie title, and data about the genres it belongs to.

for x in movies.take(1).as_numpy_iterator():
    pprint.pprint(x)

# Output:
{'movie_genres': array([4]),
'movie_id': b'1681',
'movie_title': b'You So Crazy (1994)'}

In this post we focus on the ratings data, so from the two datasets we only keep the relevant movie_id, user_id, and movie_title fields:

ratings = ratings.map(lambda x: {
    "movie_id": x["movie_id"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: {
    "movie_id": x["movie_id"],
    "movie_title": x["movie_title"],
})

Since the ratings dataset has no train/test split of its own, we create one with a random split, putting 80% of the ratings in the training set and 20% in the test set.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

We also need to find the unique user IDs and movie IDs present in the data.

This matters because we need to be able to map the raw values of the categorical features to embedding vectors in the model. To do that, we need a vocabulary that maps each raw feature value to an integer in a contiguous range: this lets us look up the corresponding embedding in an embedding table.

movie_ids = movies.batch(1_000).map(lambda x: x["movie_id"])
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_ids = np.unique(np.concatenate(list(movie_ids)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

# We convert bytes to strings since bytes are not serializable
unique_movie_id_strings = [id.decode("utf-8") for id in unique_movie_ids]
unique_user_id_strings = [id.decode("utf-8") for id in unique_user_ids]

unique_movie_id_strings[:10]

# ['1', '10', '100', '1000', '1001', '1002', '1003', '1004', '1005', '1006']
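Under the hood, the feature columns we use below do something like the following: map each raw string ID to an integer via the vocabulary, then use that integer to select a row of an embedding matrix. A minimal NumPy sketch of that idea (the vocabulary and embedding table here are made-up placeholders, not the real MovieLens data):

```python
import numpy as np

# Hypothetical vocabulary and embedding table, for illustration only.
vocab = ["1", "10", "100", "1000"]
id_to_index = {v: i for i, v in enumerate(vocab)}

embedding_dimension = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dimension))

def embed(raw_ids):
    # Vocabulary lookup: raw string -> integer index -> embedding row.
    indices = [id_to_index[r] for r in raw_ids]
    return embedding_table[indices]

vectors = embed(["10", "1000"])
print(vectors.shape)  # (2, 4)
```

In the real models this lookup is learned end to end: the rows of the embedding table are trainable parameters.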

Building the models

Next we build the User model and the Movie model.

embedding_dimension = 32

class UserModel(tf.keras.Model):

    def __init__(self, embedding_dimension):
        super(UserModel, self).__init__()
        # The model itself is a single embedding layer.
        # However, we could expand this to an arbitrarily complicated Keras model,
        # as long as the output is a vector `embedding_dimension` wide.
        user_features = [tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_vocabulary_list(
                "user_id", unique_user_id_strings,
            ),
            embedding_dimension,
        )]
        self.embedding_layer = tf.keras.layers.DenseFeatures(user_features, name="user_embedding")

    def call(self, inputs):
        return self.embedding_layer(inputs)

# We initialize these models and later pass them to the full model.
user_model = UserModel(embedding_dimension)

class MovieModel(tf.keras.Model):

    def __init__(self, embedding_dimension):
        super(MovieModel, self).__init__()
        movie_features = [tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_vocabulary_list(
                "movie_id", unique_movie_id_strings,
            ),
            embedding_dimension,
        )]
        self.embedding_layer = tf.keras.layers.DenseFeatures(movie_features, name="movie_embedding")

    def call(self, inputs):
        return self.embedding_layer(inputs)

movie_model = MovieModel(embedding_dimension)

In our training data we have positive (user, movie) pairs. To figure out how good our model is, we need to compare the affinity score the model computes for such a pair against the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.

To do this, we can use the tfrs.metrics.FactorizedTopK metric. The metric has one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our case, that is the movies dataset, converted into embeddings via our movie model:

metrics = tfrs.metrics.FactorizedTopK(
    candidates=movies.batch(128).map(lambda x: {"movie_id": x["movie_id"]}).map(movie_model)
)
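Conceptually, the top-K metric checks whether the true movie's affinity score (the dot product of the user and movie embeddings) ranks within the top K among all candidates. A rough NumPy sketch of that check, with random placeholder embeddings standing in for the trained ones:

```python
import numpy as np

rng = np.random.default_rng(42)
num_candidates, dim = 100, 32

user_embedding = rng.normal(size=(dim,))
candidate_embeddings = rng.normal(size=(num_candidates, dim))
true_index = 17  # position of the positive movie among the candidates

# Affinity score of the user against every candidate movie.
scores = candidate_embeddings @ user_embedding

# Rank of the positive: how many candidates score strictly higher (0 = best).
rank = int((scores > scores[true_index]).sum())

top_10_hit = rank < 10  # contributes to top_10_categorical_accuracy
```

Averaging such hits over many (user, movie) pairs gives the top_k_categorical_accuracy figures printed during training below.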

The next component is the loss used to train the model. TFRS has several loss layers and tasks to make this easy.

In this case, we will use the Retrieval task object: a convenience wrapper that bundles the loss function and metric computation together:

task = tfrs.tasks.Retrieval(
    metrics=metrics
)
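The Retrieval task's default loss treats the other examples in the batch as negatives: it scores every user in the batch against every movie in the batch and applies a softmax cross-entropy where the matching pairs on the diagonal are the labels. A simplified NumPy sketch of this in-batch softmax loss (batch size and embeddings are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, dim = 8, 32

user_embeddings = rng.normal(size=(batch_size, dim))
movie_embeddings = rng.normal(size=(batch_size, dim))

# Score every user in the batch against every movie in the batch.
scores = user_embeddings @ movie_embeddings.T  # shape (batch, batch)

# Softmax cross-entropy with the diagonal (the true pairs) as labels.
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
```

Minimizing this pushes each user's embedding toward the movies they actually rated and away from the other movies in the batch.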

Now we can put it all together into a model. TFRS exposes a base model class, tfrs.models.Model, which streamlines building models: all we need to do is set up the components in the __init__ method and implement the compute_loss method, which takes the raw features and returns a loss value.

The base model then takes care of creating an appropriate training loop to fit our model.

class MovielensModel(tfrs.models.Model):

    def __init__(self, user_model, movie_model):
        super().__init__()
        self.movie_model: tf.keras.Model = movie_model
        self.user_model: tf.keras.Model = user_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model({"user_id": features["user_id"]})
        # And pick out the movie features and pass them into the movie model,
        # getting embeddings back.
        positive_movie_embeddings = self.movie_model({"movie_id": features["movie_id"]})

        # The task computes the loss and the metrics.
        return self.task(user_embeddings, positive_movie_embeddings)

Fitting and evaluating

After defining the model, we can use standard Keras fit and evaluate routines to fit and evaluate it.

Let's first instantiate the model.

model = MovielensModel(user_model, movie_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

# Then train the model
model.fit(cached_train, epochs=3)

# Output:
Epoch 1/3
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
10/10 [==============================] - 18s 2s/step - factorized_top_k/top_1_categorical_accuracy: 3.5000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0031 - factorized_top_k/top_10_categorical_accuracy: 0.0068 - factorized_top_k/top_50_categorical_accuracy: 0.0390 - factorized_top_k/top_100_categorical_accuracy: 0.0772 - loss: 70253.9602 - regularization_loss: 0.0000e+00 - total_loss: 70253.9602
Epoch 2/3
10/10 [==============================] - 17s 2s/step - factorized_top_k/top_1_categorical_accuracy: 0.0025 - factorized_top_k/top_5_categorical_accuracy: 0.0211 - factorized_top_k/top_10_categorical_accuracy: 0.0418 - factorized_top_k/top_50_categorical_accuracy: 0.1770 - factorized_top_k/top_100_categorical_accuracy: 0.2975 - loss: 67883.1371 - regularization_loss: 0.0000e+00 - total_loss: 67883.1371
Epoch 3/3
10/10 [==============================] - 16s 2s/step - factorized_top_k/top_1_categorical_accuracy: 0.0036 - factorized_top_k/top_5_categorical_accuracy: 0.0261 - factorized_top_k/top_10_categorical_accuracy: 0.0511 - factorized_top_k/top_50_categorical_accuracy: 0.2022 - factorized_top_k/top_100_categorical_accuracy: 0.3294 - loss: 66398.5795 - regularization_loss: 0.0000e+00 - total_loss: 66398.5795
<tensorflow.python.keras.callbacks.History at 0x7f988e8e0ed0>

Evaluation

model.evaluate(cached_test, return_dict=True)

# Output (excerpt):
'factorized_top_k/top_10_categorical_accuracy': 0.014349999837577343

Summary

This post mainly walked through the recommendation workflow following the tutorial from the official site; the next step is to study in depth how to use tensorflow-recommenders.
