Python Natural Language Processing (NLP), Part 1

Learning resource

NLTK and NLP


NLTK design goals:

  1. Simplicity
  2. Consistency
  3. Extensibility
  4. Modularity


Importing the package

  1. First install the nltk package. The environment used here is CentOS 7.0 + Python 3.5 + PyCharm 2019.1, so the package can be installed directly from the interpreter settings page.
  2. Once installed, import it with import nltk.
  3. Download the corpora that ship with NLTK via nltk.download() (Chinese is not supported).
  4. After the download finishes, import the corpora with from nltk.book import *.
  5. If a corpus cannot be found (the automatic search of the usual paths fails), download it into the nltk_data directory as prompted, e.g. nltk.download('gutenberg').
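
Putting these steps together, a minimal session might look like this (a sketch: it assumes the downloads succeed; calling nltk.download() with no argument opens an interactive downloader instead):

import nltk

nltk.download('book')       # the collection of corpora used by nltk.book
nltk.download('gutenberg')  # or fetch a single missing corpus

from nltk.book import *     # binds text1..text9 and sent1..sent9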

Using the NLTK corpora


from nltk.book import *  # load the example texts (text1..text9) and sentences

''' search word/text '''
text1.concordance('monstrous')  # show each occurrence of the word in context
print(text5.count('lol'))  # 704

''' search similar words '''
text1.similar('monstrous', num=20)  # words used in similar contexts (often similar meaning); num caps the list
text2.similar('monstrous')

''' common contexts '''  # find contexts shared by the given words; most frequent common contexts first
text2.common_contexts(['monstrous', 'very'])  # a_pretty a_lucky be_glad am_glad is_pretty

''' word distribution '''  # plot where the given words occur across the text
text4.dispersion_plot(['citizen', 'democracy', 'freedom', 'duties', 'America'])

''' generate text '''  # print random text from a trigram language model; not available in some newer NLTK versions
text3.generate()

''' counting vocabulary '''
print(len(text3))  # total number of tokens
print(sorted(set(text3)))  # the vocabulary: unique tokens, sorted
print(len(set(text3)))  # vocabulary size

''' lexical diversity '''
print(text3.count('smote'))
print(len(text3) / len(set(text3)))  # average number of uses per distinct token

(Figure: the lexical dispersion plot produced by dispersion_plot above)
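
The tokens-per-vocabulary ratio computed above is often wrapped in a small helper, following the NLTK book's convention (the name lexical_diversity is just that convention, not an NLTK API):

from nltk.book import text3  # assumes the corpora were downloaded as above

def lexical_diversity(text):
    # average number of times each distinct token is used
    return len(text) / len(set(text))

print(lexical_diversity(text3))  # roughly 16 for text3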

Word lists

print(sent1, sent2)
print(sent1 + sent2)  # lists concatenate
print(type(sent1))  # <class 'list'>

- - - - - - - - - - - - - - - - - - - - -
['Call', 'me', 'Ishmael', '.'] ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
['Call', 'me', 'Ishmael', '.', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
<class 'list'>

As the output shows, a sentence is a plain Python list, so every list operation (indexing, slicing, insertion, deletion, and so on) works on it.
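
For example, indexing and slicing behave exactly as on any other list (sent1 and sent2 come from the earlier from nltk.book import *):

print(sent1[0])    # 'Call' -- indexing
print(sent2[1:4])  # ['family', 'of', 'Dashwood'] -- slicing
print(sent1 + ['!'])  # concatenation produces a new list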

Simple statistics

  • Frequency distribution
fdist1 = FreqDist(sent3)  # word frequency counts; a dict-like mapping word -> count, iterated here in descending frequency
for key in fdist1:
    print(key, ':', fdist1[key])
# print(fdist1.keys())
fdist1.plot(5, cumulative=False)  # plot the 5 most frequent samples; cumulative toggles cumulative counts
print(fdist1.hapaxes())  # hapaxes: words that occur only once

fdist2 = FreqDist(sent3)
print(fdist2.items())
print(fdist2.max())  # the most frequent sample

- - - - - - -  -
the : 3
In : 1
created : 1
. : 1
and : 1
beginning : 1
God : 1  
earth : 1  
heaven : 1
['earth', 'In', 'and', 'beginning', 'created', '.', 'heaven', 'God']
dict_items([('In', 1), ('created', 1), ('and', 1), ('earth', 1), ('God', 1), ('beginning', 1), ('the', 3), ('.', 1), ('heaven', 1)])
the

(Figure: frequency plot of the five most common samples in sent3)
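
The top samples can also be read off without plotting: FreqDist is built on collections.Counter, so most_common is available. A quick sketch:

from nltk import FreqDist
from nltk.book import sent3

fdist = FreqDist(sent3)
print(fdist.most_common(3))  # [('the', 3), ...]; ties among the count-1 words may come in any order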

  • Fine-grained selection of words (see the sketch after this list)
    a. { w | w ∈ V ∧ P(w) }
    b. [ w for w in V if P(w) ]
  • Collocations
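
Pattern (b) in practice: selecting, say, the long words of a text with a list comprehension (a sketch in the style of the NLTK book; the length threshold 15 is arbitrary):

from nltk.book import *  # provides text1

V = set(text1)  # the vocabulary of text1
long_words = [w for w in V if len(w) > 15]  # P(w): word longer than 15 characters
print(sorted(long_words))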
from nltk.book import *
from nltk.util import bigrams

print(list(bigrams(['more', 'is', 'said', 'than', 'done'])))  # all adjacent word pairs
a = text2.collocation_list(num=20)  # the 20 most common collocations in the text, ignoring stopwords
print(a)

- - - - -  - - - - -
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
['Colonel Brandon', 'Sir John', 'Lady Middleton', 'Miss Dashwood', 'every thing', 'thousand pounds', 'dare say', 'Miss Steeles', 'said Elinor', 'Miss Steele', 'every body', 'John Dashwood', 'great deal', 'Harley Street', 'Berkeley Street', 'Miss Dashwoods', 'young man', 'Combe Magna', 'every day', 'next morning']


A note on string escaping

s = '123\n456'
print(s)  # \n is a real newline, so this prints two lines
print(s.replace('\n', r'\n'))  # replace the newline with the two characters backslash + n
print('\\')  # an escaped backslash prints as a single \
ss = '123\\456'
print(ss.replace('\\', r'\\'))  # each single backslash becomes two

>>>123
>>>456
>>>123\n456
>>>\
>>>123\\456
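
When it is unclear whether a string holds a real newline or a literal backslash followed by n, repr() makes the escape sequences visible:

s = '123\n456'
print(repr(s))  # '123\n456' -- shows the escape sequence instead of breaking the line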