Python Natural Language Processing (NLP), Part 5


Part-of-Speech Tagging

''' POS tagger '''
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text1 = word_tokenize('Never find someone like you')
text2 = word_tokenize('Because of you, I never stray too far from the sidewalk')
print(pos_tag(text1))  # [('Never', 'RB'), ('find', 'VBP'), ('someone', 'NN'), ('like', 'IN'), ('you', 'PRP')]

# words that occur in similar contexts to 'the' (distributional similarity over the Brown corpus)
text3 = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
print(type(text3), len(text3))  # <class 'nltk.text.Text'> 1161192
text3.similar('the')  # a his this their its her an that our any all one these my in your no some other and
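
If a tag abbreviation such as RB or PRP is unfamiliar, NLTK's built-in tagset documentation can be queried. A small aside (not in the original post), assuming the 'tagsets' data package has been downloaded via nltk.download():

import nltk
nltk.help.upenn_tagset('RB')    # prints the definition and examples for the adverb tag
nltk.help.upenn_tagset('NN.*')  # a regular expression matches a whole family of tags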
  1. Tagged corpora: word lists made up of (word, tag) tuples; different corpora use different tagsets, and the same word can carry more than one tag (see the ambiguity check after the code below)
    ''' Tagged corpora '''
    import nltk
    from nltk.corpus import brown

    tagged_token = nltk.tag.str2tuple('fly/NN')  # parse a 'word/TAG' string into a tuple
    print(tagged_token)  # ('fly', 'NN')

    brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')  # (word, tag) pairs from the news category, mapped to the universal tagset
    # print(brown_news_tagged)  # [('The', 'DET'), ('Fulton', 'NOUN'), ...]
    tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)  # count how often each tag is used
    print(tag_fd.most_common())  # [('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]
    tag_fd.plot()  # plot the tag frequency distribution
    word_tag_pairs = list(nltk.bigrams(brown_news_tagged))  # which tags most often precede a noun?
    print(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN').most_common())  # [('NOUN', 7959), ('DET', 7373), ('ADJ', 4761), ('ADP', 3781), ('.', 2796), ('VERB', 1842), ('CONJ', 938), ('NUM', 894), ('ADV', 186), ('PRT', 94), ('PRON', 19), ('X', 11)]

(Figure: frequency plot of the universal POS tags in the Brown news category, produced by tag_fd.plot())
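
    Because a word can be tagged differently in different contexts, the same tagged corpus can also be queried per word. A minimal ambiguity check (not in the original post), assuming the Brown corpus is installed; the word 'yield' is only an illustrative choice:

    ''' Per-word tag ambiguity (sketch) '''
    import nltk
    from nltk.corpus import brown

    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news', tagset='universal'))
    print(cfd['yield'].most_common())  # may show more than one tag (e.g. VERB and NOUN) for the same word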

  2. Automatic taggers

    ''' Default tagger '''
    import nltk
    from nltk.corpus import brown

    raw = 'I like eggs and ham,I also like milk except sheep'
    tokens = nltk.word_tokenize(raw)
    default_tagger = nltk.DefaultTagger('NN')  # a tagger that assigns the same tag ('NN') to every token
    print(default_tagger.tag(tokens))
    temp = brown.tagged_sents(categories='news')  # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]]
    print(default_tagger.evaluate(temp))  # 0.13089484257215028 (accuracy measured against the gold-standard tags)

    ''' Lookup tagger: use existing tagged data to give new text each word's most likely tag, by frequency '''
    fd = nltk.FreqDist(brown.words(categories='news'))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    most_freq_words = fd.most_common()[:100]  # the 100 most frequent words
    # print(most_freq_words)  # [('the', 5580), (',', 5188), ...]
    most_likely_tag = dict((word, cfd[word].max()) for (word, freq) in most_freq_words)
    baseline_tagger = nltk.UnigramTagger(model=most_likely_tag)
    print(baseline_tagger.evaluate(brown.tagged_sents(categories='news')))  # 0.45578495136941344
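
    The lookup tagger above returns None for any word outside its 100-word model. A common refinement, sketched here following the NLTK book's backoff idea rather than the original post, is to fall back to the default tagger for those unknown words:

    # reuse most_likely_tag from above; words not in the model back off to the 'NN' default tagger
    baseline_tagger2 = nltk.UnigramTagger(model=most_likely_tag, backoff=nltk.DefaultTagger('NN'))
    print(baseline_tagger2.evaluate(brown.tagged_sents(categories='news')))  # noticeably higher than the lookup tagger alone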
  3. N-gram taggers (see the training sketch below)
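
    The original post leaves this item empty. As a hedged sketch of the usual approach (the 90/10 split and the 'NN' default are illustrative assumptions), an N-gram tagger chain can be trained on part of the Brown news sentences and scored on the held-out rest, each tagger backing off to a simpler one when its context never occurred in training:

    ''' N-gram taggers (sketch) '''
    import nltk
    from nltk.corpus import brown

    brown_tagged_sents = brown.tagged_sents(categories='news')
    size = int(len(brown_tagged_sents) * 0.9)           # 90/10 train/test split (illustrative)
    train_sents, test_sents = brown_tagged_sents[:size], brown_tagged_sents[size:]

    t0 = nltk.DefaultTagger('NN')                       # last resort: tag everything 'NN'
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)    # most frequent tag per word
    t2 = nltk.BigramTagger(train_sents, backoff=t1)     # also uses the previous word's tag as context
    print(t2.evaluate(test_sents))                      # accuracy on the held-out sentences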

  4. Hidden Markov Model (HMM) taggers (a training sketch follows below)
    Hidden Markov Models:
    1. Generating patterns: deterministic (the current state follows uniquely from the previous state) vs. non-deterministic (the next state is uncertain; it is assumed to depend only on the preceding few states, which defines a transition matrix)
    2. Hidden patterns: the states of interest are not directly observable; we only see output symbols, which are tied to the hidden states by emission probabilities (in POS tagging, the tags are the hidden states and the words are the observations)
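
    NLTK provides a supervised HMM trainer that estimates the transition and emission probabilities from tagged sentences. A minimal sketch (not from the original post; the training slice and the Lidstone smoothing value are illustrative assumptions):

    ''' HMM tagger (sketch) '''
    import nltk
    from nltk.corpus import brown
    from nltk.probability import LidstoneProbDist
    from nltk.tag.hmm import HiddenMarkovModelTrainer

    train_sents = brown.tagged_sents(categories='news')[:3000]   # illustrative training slice
    est = lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)       # smoothing, so unseen words do not get zero probability
    trainer = HiddenMarkovModelTrainer()
    hmm_tagger = trainer.train_supervised(train_sents, estimator=est)  # learns transition + emission probabilities
    print(hmm_tagger.tag(nltk.word_tokenize('I never stray too far from the sidewalk')))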
