Coding01

Coding Tidbits


A while ago I briefly explored Chinese word segmentation with tensorflow_text. Combined with what I had been learning about Rasa, the idea came up to imitate Rasa's Jieba tokenizer and build a TensorFlow-Text tokenizer.

Creating a Rasa tokenizer mainly involves the following steps:

  1. Setup
  2. Tokenizer
  3. Registry File
  4. Train and Test
  5. Conclusion

Understanding the Jieba tokenizer code

To get started on a custom component, we first take the JiebaTokenizer source code for a test run and print the segmentation results inside the tokenize method:

...
def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba

    text = message.get(attribute)

    tokenized = jieba.tokenize(text)
    print('******')
    print(f"{[t for t in tokenized]}")
    print('******')

    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return self._apply_token_pattern(tokens)
...

In the config, add the custom component:

language: zh

pipeline:
  - name: components.fanlyJiebaTokenizer.JiebaTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: '(?u)\b\w+\b'
  - name: KeywordIntentClassifier

Train and test:

NLU model loaded. Type a message and press enter to parse it.
Next message:
我想找地方吃饭
******
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/cz/kq5sssg12jx887hj62hwczrr0000gn/T/jieba.cache
Loading model cost 0.729 seconds.
Prefix dict has been built successfully.
[('我', 0, 1), ('想', 1, 2), ('找', 2, 3), ('地方', 3, 5), ('吃饭', 5, 7)]
******
{
  "text": "我想找地方吃饭",
  "intent": {
    "name": "eat_search",
    "confidence": 1.0
  },
  "entities": []
}
Next message:
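As the log shows, jieba.tokenize yields (word, start, end) triples with character offsets. For reference, a minimal standalone sketch of the same call outside Rasa:

# Standalone check of jieba's offset API, outside Rasa.
import jieba

for word, start, end in jieba.tokenize("我想找地方吃饭"):
    print(word, start, end)
# 我 0 1 / 想 1 2 / 找 2 3 / 地方 3 5 / 吃饭 5 7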

Building the TF-Text tokenizer

Note: Rasa currently only supports TensorFlow 2.3, while the latest TensorFlow-Text requires TensorFlow 2.4. For compatibility, we download the Rasa source code and bump the version pins for TensorFlow and its related dependencies so that we can use TensorFlow-Text's Chinese segmentation.

In the Rasa source tree, under the path:

/rasa/nlu/tokenizers

create the file tensorflow_text_tokenizer.py:

import glob
import logging
import os
import shutil
import typing
from typing import Any, Dict, List, Optional, Text

from rasa.nlu.components import Component
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message

logger = logging.getLogger(__name__)


if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class TensorFlowTextTokenizer(Tokenizer):
    """This tokenizer is a wrapper for tensorflow_text (https://www.tensorflow.org/tutorials/tensorflow_text/intro)."""

    supported_language_list = ["zh"]

    defaults = {
        # TF Hub handle of the Chinese segmentation model
        "model_handle": "https://hub.tensorflow.google.cn/google/zh_segmentation/1",
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens
        "token_pattern": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the TensorFlow framework."""
        super().__init__(component_config)

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["tensorflow", "tensorflow_text"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import tensorflow_text as tftext
        import tensorflow as tf

        # Set the model URL
        self.model_handle = self.component_config.get("model_handle")
        segmenter = tftext.HubModuleTokenizer(self.model_handle)

        text = message.get(attribute)
        print(text)
        tokens, starts, ends = segmenter.tokenize_with_offsets(text)
        tokens_list = tokens.numpy()
        starts_list = starts.numpy()
        print('******')
        print(f"{[t.decode('utf-8') for t in tokens_list]}")
        print(f"{[t for t in starts_list]}")
        print('******')

        # Decode the byte strings so Token receives text, not bytes.
        tokens_data = [
            Token(tokens_list[i].decode("utf-8"), starts_list[i])
            for i in range(len(tokens_list))
        ]
        return self._apply_token_pattern(tokens_data)

This is a first pass that mimics the Jieba tokenizer code and prints the log directly so we can inspect the segmentation results.

Register our new component in registry.py:

from rasa.nlu.tokenizers.tensorflow_text_tokenizer import TensorFlowTextTokenizer

...

component_classes = [
    # utils
    SpacyNLP,
    MitieNLP,
    HFTransformersNLP,
    # tokenizers
    MitieTokenizer,
    SpacyTokenizer,
    WhitespaceTokenizer,
    ConveRTTokenizer,
    JiebaTokenizer,
    TensorFlowTextTokenizer,
    ...
]

Testing

Under the examples directory, we init a demo straight from the Rasa source checkout's execution environment:

poetry run rasa init

Add a set of test data to nlu.yml:

nlu:
- intent: eat_search
  examples: |
    - 我想找地方吃饭
    - 我想吃[火锅](food)了
    - 找个吃[拉面](food)的地方
    - 附近有什么好吃的地方吗?

Now we can train on this data. Add the pipeline to config.yml, including the TensorFlowTextTokenizer we just created:

language: zh

pipeline:
  - name: TensorFlowTextTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: '(?u)\b\w+\b'
  - name: KeywordIntentClassifier

With that done, let's train and check the segmentation results:

# train
poetry run rasa train nlu

And the test results:

Summary

The next step is to polish the TensorFlowTextTokenizer's segmentation, submit the code to Rasa, and see whether there is a chance to contribute to the Rasa open-source project.

Also note: the starts values returned by tensorflow_text segmentation are byte offsets.
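A quick sketch of what that means, assuming the model segments the sentence the same way as in the test above; each of these CJK characters occupies 3 bytes in UTF-8:

import tensorflow_text as tftext

MODEL_HANDLE = "https://hub.tensorflow.google.cn/google/zh_segmentation/1"
segmenter = tftext.HubModuleTokenizer(MODEL_HANDLE)

tokens, starts, ends = segmenter.tokenize_with_offsets("我想找地方吃饭")
print(starts.numpy())
# Byte offsets [0, 3, 6, 9, 15], not character offsets [0, 1, 2, 3, 5],
# because each of these characters is 3 bytes long in UTF-8.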

Today we take a simple function, f(x) = 5x³ + 2x² − 3x, and use Swift for TensorFlow to find its local maximum and minimum.

First, let's work out the local maximum and minimum by hand:
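Taking f(x) = 5x³ + 2x² − 3x from the code below, set f′(x) = 0 and check the sign of f″:

f(x) = 5x^3 + 2x^2 - 3x

f'(x) = 15x^2 + 4x - 3 = 0
\implies x = \frac{-4 \pm \sqrt{4^2 + 4 \cdot 15 \cdot 3}}{2 \cdot 15} = \frac{-4 \pm 14}{30}
\implies x = -\tfrac{3}{5} \quad\text{or}\quad x = \tfrac{1}{3}

f''(x) = 30x + 4
f''(-\tfrac{3}{5}) = -14 < 0 \implies \text{local maximum, } f(-\tfrac{3}{5}) = \tfrac{36}{25} = 1.44
f''(\tfrac{1}{3}) = 14 > 0 \implies \text{local minimum, } f(\tfrac{1}{3}) = -\tfrac{16}{27} \approx -0.593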

Local maximum

In code:

import TensorFlow
import TrainingLoop

var x: Float = 0
let η: Float = 0.01
let maxIterations = 100

@differentiable
func f(_ x: Float) -> Float {
    return 5 * pow(x, 3) + 2 * pow(x, 2) - 3 * x
}

print("Before optimization, ", terminator: "")
print("x: \(x) and f(x): \(f(x))")

// Optimization loop
for _ in 1...maxIterations {
    /// Derivative of `f` w.r.t. `x`.
    let 𝛁xF = gradient(at: x) { x -> Float in
        return f(x)
    }
    // Optimization step: update `x` to maximize `f`
    x += η * 𝛁xF
}

print("After gradient ascent, ", terminator: "")
print("input: \(x) and output: \(f(x))")

Local minimum

for _ in 1...maxIterations {
    let 𝛁xF = gradient(at: x) { x -> Float in
        return f(x)
    }
    // Optimization step: update `x` to minimize `f`
    x.move(along: 𝛁xF.scaled(by: -η))
}
print("After gradient descent, ", terminator: "")
print("input: \(x) and output: \(f(x))")

Explanation

The principle is easy to understand. Around a local maximum, the gradient points uphill toward the peak, so repeatedly adding η·∇f moves x closer to it. Likewise, around a local minimum, the gradient points away from the bottom, so stepping against it (subtracting η·∇f) moves x toward the minimum.
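In update-rule form, the two loops differ only in the sign of the step:

x_{t+1} = x_t + \eta\,\nabla f(x_t) \quad \text{(gradient ascent, toward a maximum)}
x_{t+1} = x_t - \eta\,\nabla f(x_t) \quad \text{(gradient descent, toward a minimum)}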

This post mainly uses the function gradient(at:in:):

@inlinable public func gradient<T, R>(at x: T, in f: @differentiable (T) -> Tensor<R>) -> T.TangentVector where T : Differentiable, R : TensorFlowFloatingPoint
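For a scalar input, this simply evaluates the derivative of f at the given point:

\texttt{gradient(at: } x\texttt{, in: } f\texttt{)} \;=\; \nabla f(x) \;=\; \left.\frac{\mathrm{d}f}{\mathrm{d}x}\right|_{x}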

Today we build a Swift for TensorFlow development environment by creating a Swift Package that pulls in tensorflow/swift-models.

Swift Package Manager is Apple's official tool for managing the distribution of source code, designed to make it easier to consume code shared by others. It handles dependency management between packages, versioning, compilation, and linking. Functionally it plays the same role as CocoaPods or Carthage on the iOS platform.

The tensorflow/swift-models project is mainly organized as a Swift Package:

Read more »

This post explores zh_segmentation, a Chinese word segmentation model built on Chinese Treebank 6.0.

Installing the package

This model requires tensorflow_text 2.4.0b0 or later:

pip install "tensorflow_text>=2.4.0b0"

Segmentation

Segment the string 「新华社北京」:

import tensorflow_text as text
import tensorflow as tf

# Set the model URL
MODEL_HANDLE = "https://hub.tensorflow.google.cn/google/zh_segmentation/1"
segmenter = text.HubModuleTokenizer(MODEL_HANDLE)

# Segment "新华社北京".
input_text = ["新华社北京"]
tokens, starts, ends = segmenter.tokenize_with_offsets(input_text)

Print the tokens and take a look:

print(tokens.to_list())

The list contents are byte strings and need decoding before printing:

first = tokens.to_list()[0][0]
second = tokens.to_list()[0][1]

print(first.decode('utf-8'))
print(second.decode('utf-8'))
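Equivalently, the whole token list can be decoded in one pass with a comprehension (the expected output assumes the model splits the string into 新华社 / 北京, as the two decodes above suggest):

# Decode every token of the first (and only) input string at once.
print([t.decode('utf-8') for t in tokens.to_list()[0]])
# ['新华社', '北京']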
Read more »

This post follows the TensorFlow website and a WeChat public account article to learn tensorflow-recommenders.

Installing tensorflow-recommenders

Install tensorflow-recommenders in the environment:

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets

Imports:

import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import ssl

os.environ['HTTP_PROXY'] = 'http://0.0.0.0:8888'
os.environ['HTTPS_PROXY'] = 'http://0.0.0.0:8888'
ssl._create_default_https_context = ssl._create_unverified_context

Loading the dataset

Since the MovieLens dataset has no predefined splits, all of the data lives in the train split.

# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train", data_dir = os.path.join(os.getcwd(), "data"))
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train", data_dir = os.path.join(os.getcwd(), "data"))
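Because everything sits in train, a manual split is needed downstream. A minimal sketch along the lines of the TFRS tutorial (the 80k/20k sizes are just the tutorial's choice for the 100k ratings):

# Shuffle once, deterministically, then carve out train/test subsets.
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)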
Read more »

Importing Python:

Swift For TensorFlow supports Python interoperability.

You can import Python modules from Swift, call Python functions, and convert values between Swift and Python.

import PythonKit
print(Python.version)

MNIST dataset recognition

This post covers how to train on and recognize MNIST using S4TF.

Reading the dataset

In the Swift version:

First, import the components:

import TensorFlow
import PythonKit
import Foundation

let os = Python.import("os")
let np = Python.import("numpy")
let metrics = Python.import("sklearn.metrics")

In the earlier post Getting Started with TensorFlow Using the MNIST Dataset, the Python version reads the dataset like this:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    path = os.path.join(os.getcwd(), "data/mnist.npz")
)

In the Swift version, load the dataset:

if !Bool(os.path.exists("mnist.npz"))! {
    os.popen("wget https://s3.amazonaws.com/img-datasets/mnist.npz").read()
}

let mnist = np.load("mnist.npz")
Read more »

CIFAR-10 image recognition

This post covers fetching the CIFAR-10 dataset and training a simple model on it for recognition.

Downloading the dataset

As before, use the http_proxy proxy:

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow_datasets as tfds
import os
import ssl

os.environ['HTTP_PROXY'] = 'http://0.0.0.0:8888'
os.environ['HTTPS_PROXY'] = 'http://0.0.0.0:8888'
ssl._create_default_https_context = ssl._create_unverified_context

Download the CIFAR-10 dataset:

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

Return value

Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test).
x_train, x_test: uint8 arrays of RGB image data with shape (num_samples, 3, 32, 32) if tf.keras.backend.image_data_format() is ‘channels_first’, or (num_samples, 32, 32, 3) if the data format is ‘channels_last’.

y_train, y_test: uint8 arrays of category labels (integers in range 0-9) each with shape (num_samples, 1).

The dataset contains a training set of 50,000 32×32 color images with their class labels, plus 10,000 test images. x_train holds the training images and y_train the corresponding labels; x_test holds the test images and y_test their labels.
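A quick sanity check of those shapes and dtypes (a sketch; assumes the default 'channels_last' data format):

print(x_train.shape, x_train.dtype)  # (50000, 32, 32, 3) uint8
print(y_train.shape)                 # (50000, 1)
print(x_test.shape)                  # (10000, 32, 32, 3)

# Pixel values are uint8 in [0, 255]; scale to [0, 1] before training.
x_train, x_test = x_train / 255.0, x_test / 255.0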

Read more »

This post records how to attach VSCode to a Laradock container, and how to configure and use the container's PHP environment and plugins such as phpcs.

Since VSCode already integrates many of these tools, we can reach our goal with very little setup.

Opening the code from the container

First, make sure the containers are already running:

Then launch VSCode, press F1, and choose Remote Explorer: Focus on Containers View.

Select our workspace container:

A new window opens asking you to choose the path where the code lives; after confirming, the code opens just as it would from a local path:

Read more »

Preface

I have recently been learning Swift development with WidgetKit, plus TensorFlow, so combining the two was a natural step.


Swift for TensorFlow is a next-generation system for deep learning and differentiable computing. Source code: tensorflow/swift.

How it works

Installing S4TF

First, download the corresponding swift toolchain:

Install Swift for TensorFlow

After installation, add the bin path to ~/.zshrc:

$ export PATH=/Library/Developer/Toolchains/swift-latest/usr/bin:"${PATH}"

Verification

Write a test.swift:

import TensorFlow

var x = Tensor<Float>([[1, 2], [3, 4]])
print(x + x)

Run it on the command line and check the output; it should print the element-wise sum [[2.0, 4.0], [6.0, 8.0]].

Daily step count:

import Foundation
import HealthKit

class StepsInteractor {
    let healthStore = HKHealthStore()
    let stepCountType = HKObjectType.quantityType(forIdentifier: HKQuantityTypeIdentifier.stepCount)!
    // Access Step Count
    let healthKitTypes: Set = [ HKObjectType.quantityType(forIdentifier: HKQuantityTypeIdentifier.stepCount)! ]

    func retrieveStepsWithAuth(completion: @escaping (Double) -> Void) {
        // Check for Authorization
        if (healthStore.authorizationStatus(for: stepCountType) != HKAuthorizationStatus.sharingAuthorized) {
            healthStore.requestAuthorization(toShare: healthKitTypes, read: healthKitTypes) { (success, error) in
                if (success) {
                    // Authorization Successful
                    self.getSteps { (result) in
                        completion(result)
                    }
                } else {
                    completion(-1)
                }
            }
        } else {
            self.getSteps { (result) in
                completion(result)
            }
        }
    }

    func getSteps(completion: @escaping (Double) -> Void) {
        let stepsQuantityType = HKQuantityType.quantityType(forIdentifier: .stepCount)!

        let now = Date()
        // `2.days` relies on a DateComponents convenience extension (e.g. SwiftDate);
        // with plain Foundation this would be Calendar.current.date(byAdding: .day, value: -2, to: now)!
        let startOfDay = now - 2.days
        var interval = DateComponents()
        interval.day = 1

        let query = HKStatisticsCollectionQuery(
            quantityType: stepsQuantityType,
            quantitySamplePredicate: nil,
            options: [.cumulativeSum],
            anchorDate: startOfDay,
            intervalComponents: interval)

        query.initialResultsHandler = { _, result, error in
            var resultCount = 0.0
            result!.enumerateStatistics(from: startOfDay, to: now) { statistics, _ in
                if let sum = statistics.sumQuantity() {
                    // Get steps (they are of double type)
                    resultCount = sum.doubleValue(for: HKUnit.count())
                } // end if
                // Return
                completion(resultCount)
            }
        }

        query.statisticsUpdateHandler = {
            query, statistics, statisticsCollection, error in
            // If new statistics are available
            if let sum = statistics?.sumQuantity() {
                let resultCount = sum.doubleValue(for: HKUnit.count())
                // Return
                completion(resultCount)
            } // end if
        }

        healthStore.execute(query)
    }
}
Read more »