
Using TensorFlow_text (3): Building a Chinese Tokenizer for Rasa

A while ago I took a quick look at Chinese word segmentation with tensorflow_text. Combined with what I had been learning about Rasa, the idea came up to imitate Rasa's Jieba tokenizer and build a TensorFlow_text tokenizer of my own.

Creating a Rasa tokenizer mainly involves the following steps:

  1. Setup
  2. Tokenizer
  3. Registry File
  4. Train and Test
  5. Conclusion

Understanding the Jieba tokenizer code

To get started with a custom component, let's take the JiebaTokenizer source code as a test case and print the segmentation result inside its tokenize method:

...
def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba

    text = message.get(attribute)

    # jieba.tokenize returns a generator, so materialize it as a list
    # before the debug print; otherwise the print would exhaust it.
    tokenized = list(jieba.tokenize(text))
    print('******')
    print(f"{[t for t in tokenized]}")
    print('******')

    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return self._apply_token_pattern(tokens)
...

Add the custom component to the config:

language: zh

pipeline:
- name: components.fanlyJiebaTokenizer.JiebaTokenizer
- name: CRFEntityExtractor
- name: CountVectorsFeaturizer
  OOV_token: oov
  token_pattern: '(?u)\b\w+\b'
- name: KeywordIntentClassifier

Train and test:

NLU model loaded. Type a message and press enter to parse it.
Next message:
我想找地方吃饭
******
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/cz/kq5sssg12jx887hj62hwczrr0000gn/T/jieba.cache
Loading model cost 0.729 seconds.
Prefix dict has been built successfully.
[('我', 0, 1), ('想', 1, 2), ('找', 2, 3), ('地方', 3, 5), ('吃饭', 5, 7)]
******
{
  "text": "我想找地方吃饭",
  "intent": {
    "name": "eat_search",
    "confidence": 1.0
  },
  "entities": []
}
Next message:

Building the TF-Text tokenizer

Note: Rasa currently only supports TensorFlow 2.3, while the latest TensorFlow-Text requires TensorFlow 2.4. For compatibility, we download the Rasa source code and adjust the version constraints it declares for TensorFlow and the related plugins so that TensorFlow-Text's Chinese segmentation can be used.
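To double-check that the patched environment resolves to compatible versions, a quick sanity check like the following can help (a minimal sketch; the exact versions depend on your environment, and the 2.4 requirement comes from the note above):

import tensorflow as tf
# Importing tensorflow_text fails if it was built against an incompatible
# TensorFlow version, so the import itself is part of the check.
import tensorflow_text as tftext

print(tf.__version__)  # expect >= 2.4 for the latest tensorflow_text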

In the Rasa source tree, under the path:

/rasa/nlu/tokenizers

create the file tensorflow_text_tokenizer.py:

import glob
import logging
import os
import shutil
import typing
from typing import Any, Dict, List, Optional, Text

from rasa.nlu.components import Component
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message

logger = logging.getLogger(__name__)


if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class TensorFlowTextTokenizer(Tokenizer):
    """This tokenizer is a wrapper for tensorflow_text (https://www.tensorflow.org/tutorials/tensorflow_text/intro)."""

    supported_language_list = ["zh"]

    defaults = {
        # TF Hub handle of the Chinese segmentation model
        "model_handle": "https://hub.tensorflow.google.cn/google/zh_segmentation/1",
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens
        "token_pattern": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer based on tensorflow_text."""

        super().__init__(component_config)

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["tensorflow", "tensorflow_text"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import tensorflow_text as tftext
        import tensorflow as tf

        # Load the segmentation model from the configured handle (URL)
        self.model_handle = self.component_config.get("model_handle")
        segmenter = tftext.HubModuleTokenizer(self.model_handle)

        text = message.get(attribute)
        print(text)
        tokens, starts, ends = segmenter.tokenize_with_offsets(text)
        tokens_list = tokens.numpy()
        starts_list = starts.numpy()
        print('******')
        print(f"{[t.decode('utf-8') for t in tokens_list]}")
        print(f"{[t for t in starts_list]}")
        print('******')

        tokens_data = [
            # decode the byte tokens so that Token receives str, not bytes
            Token(tokens_list[i].decode("utf-8"), starts_list[i])
            for i in range(len(tokens_list))
        ]
        return self._apply_token_pattern(tokens_data)

This is a first pass modeled on the Jieba tokenizer code; it simply prints a log so we can inspect the segmentation result.
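Before wiring the tokenizer into a Rasa pipeline, the tensorflow_text API it relies on can be tried on its own. Below is a minimal standalone sketch, assuming tensorflow 2.4 and tensorflow_text are installed and the hub model handle is reachable; the example sentence is just for illustration:

import tensorflow_text as tftext

MODEL_HANDLE = "https://hub.tensorflow.google.cn/google/zh_segmentation/1"

segmenter = tftext.HubModuleTokenizer(MODEL_HANDLE)
tokens, starts, ends = segmenter.tokenize_with_offsets("我想找地方吃饭")

# tokens come back as UTF-8 byte strings; starts/ends are offsets into the text
print([t.decode("utf-8") for t in tokens.numpy()])
print(starts.numpy())
print(ends.numpy())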

Register our component in registry.py:

from rasa.nlu.tokenizers.tensorflow_text_tokenizer import TensorFlowTextTokenizer

...

component_classes = [
    # utils
    SpacyNLP,
    MitieNLP,
    HFTransformersNLP,
    # tokenizers
    MitieTokenizer,
    SpacyTokenizer,
    WhitespaceTokenizer,
    ConveRTTokenizer,
    JiebaTokenizer,
    TensorFlowTextTokenizer,
    ...
]

Testing

Under the examples directory, we use the Rasa source execution environment directly to init a demo project:

poetry run rasa init

Add a set of test data to nlu.yml:

nlu:
- intent: eat_search
  examples: |
    - 我想找地方吃饭
    - 我想吃[火锅](food)了
    - 找个吃[拉面](food)的地方
    - 附近有什么好吃的地方吗?

Now we can train on this data. Add the pipeline to config.yml, including the TensorFlowTextTokenizer we just created:

language: zh

pipeline:
- name: TensorFlowTextTokenizer
- name: CRFEntityExtractor
- name: CountVectorsFeaturizer
  OOV_token: oov
  token_pattern: '(?u)\b\w+\b'
- name: KeywordIntentClassifier

Almost done. Let's run the training and see how the segmentation performs:

# train the NLU model
poetry run rasa train nlu

Let's look at the test results:

Conclusion

The next step is to round out the TensorFlowTextTokenizer's segmentation features and submit the code to Rasa, to see whether there is a chance to take part in the Rasa open-source project.

Note: the start values returned by Tensorflow_text segmentation are offsets.
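Unlike the Jieba output shown earlier (e.g. ('地方', 3, 5), which uses character positions), these offsets count positions in the encoded text. A small illustration, assuming they are byte offsets into the UTF-8 string:

# Assumption: the starts/ends above are byte offsets in the UTF-8 encoded text.
text = "我想找地方吃饭"

# Each of these CJK characters takes 3 bytes in UTF-8, so the token "地方"
# (character index 3) starts at byte offset 9 rather than 3.
print(len("我想找".encode("utf-8")))               # 9
print(text.encode("utf-8")[9:15].decode("utf-8"))  # 地方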
