Introduction
- Character Filter
  Processes the text before it reaches the Tokenizer, for example adding, removing, or replacing characters. Multiple Character Filters can be configured, and they affect the position and offset information seen by the Tokenizer.
  Built-in: html_strip, mapping, pattern_replace
- Tokenizer
  Splits the original text into terms (tokens) according to a set of rules.
  Built-in: whitespace, standard, pattern, keyword, path_hierarchy
- Token Filter
  Adds, modifies, or removes the terms produced by the Tokenizer.
  Built-in examples: lowercase, stop, synonym (adds synonyms). See the combined request right after this list.
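As a quick sketch, all three stages can be chained in a single _analyze request; the sample text here is arbitrary:

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello World</p>"
}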
Defining analyzers
Stripping HTML tags
# Strip HTML tags with the html_strip character filter
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
The result after filtering:
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
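html_strip can also keep selected tags through its escaped_tags parameter; the sketch below assumes we want to preserve <b> and uses a made-up sample text:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
  ],
  "text": "<p>hello <b>world</b></p>"
}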
Replacement
The mapping character filter replaces one character with another:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "a-b word-ok"
}
Replacement result:
{
  "tokens" : [
    {
      "token" : "a_b",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "word_ok",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
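A single mapping filter may carry several rules at once; the extra `+ => _plus_` rule and the text below are made up for illustration:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "- => _",
        "+ => _plus_"
      ]
    }
  ],
  "text": "a-b c+d"
}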
Regex replacement
The pattern_replace character filter applies a custom regular expression:
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.google.com"
}
Result after the regex replacement:
{
  "tokens" : [
    {
      "token" : "www.google.com",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
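The same approach covers both http and https with a slightly wider pattern; this sketch only tweaks the regex and uses an https URL as sample input:

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "https?://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "https://www.google.com"
}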
Path splitting
# Split a path with the path_hierarchy tokenizer
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/elasticsearch"
}
The result shows the path expanded level by level:
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/elasticsearch",
      "start_offset" : 0,
      "end_offset" : 24,
      "type" : "word",
      "position" : 0
    }
  ]
}
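path_hierarchy also takes a delimiter parameter for paths that are not separated by "/"; a sketch assuming _analyze accepts an inline tokenizer definition, with an arbitrary sample string:

POST _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "delimiter": "-"
  },
  "text": "one-two-three"
}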
Whitespace splitting
Split on whitespace and drop stop words with the stop token filter:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["This is a apple"]
}
Tokenization result:
{
  "tokens" : [
    {
      "token" : "This",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "apple",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}
Because the stop filter is case-sensitive by default, "This" survived in the result above. Adding the lowercase token filter before stop fixes that, leaving only "apple":
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The is A apple"]
}
Custom analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
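The analyzer can then be attached to a field mapping; a sketch assuming Elasticsearch 7+ typeless mappings, where the content field is just an example name:

PUT my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}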
Test the custom analyzer:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": [":) person man, HELLO"]
}
Result:
{
  "tokens" : [
    {
      "token" : "_happy_",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "person",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "man",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hello",
      "start_offset" : 15,
      "end_offset" : 20,
      "type" : "word",
      "position" : 3
    }
  ]
}