Python Natural Language Processing (NLP), Part 6

2019-09-26

NLP

Study link

Text classification

Uses of classification

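Identifying gender from a name
Text classification
Part-of-speech classification
Sentence segmentation
Identifying dialogue act types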

Classification algorithms

Naive Bayes classifier
Decision tree
KNN
Neural network
SVM

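Gender identification with naive Bayes

Using a single feature
Feature: the last letter of the name
Classes: male, female
Bayes' theorem: P(B|A) = P(AB) / P(A) = P(A|B) * P(B) / P(A), where A is the observed feature and B is the class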
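
To make the formula concrete, here is a minimal sketch that applies it by hand to the single last-letter feature. The name list `toy_names` and the helper `posterior` are hypothetical and made up for illustration; they are not part of NLTK, and the counts are not taken from the NLTK names corpus.

```python
from collections import Counter

# Hypothetical toy data, invented for this example (not from the NLTK corpus)
toy_names = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Karen", "female"), ("John", "male"), ("Brian", "male"),
    ("Peter", "male"), ("Joshua", "male"),
]


def posterior(last_letter, data):
    """Return P(gender | last_letter) for each gender using Bayes' theorem."""
    gender_counts = Counter(gender for _, gender in data)
    total = len(data)
    joint = {}
    for gender, n_gender in gender_counts.items():
        n_match = sum(
            1 for name, g in data
            if g == gender and name[-1].lower() == last_letter
        )
        likelihood = n_match / n_gender       # P(A|B): letter given gender
        prior = n_gender / total              # P(B): gender
        joint[gender] = likelihood * prior    # P(AB)
    evidence = sum(joint.values())            # P(A): letter
    if evidence == 0:
        raise ValueError("last letter not seen in the data")
    return {gender: p / evidence for gender, p in joint.items()}


print(posterior("a", toy_names))  # {'female': 0.75, 'male': 0.25}
```

In this toy set, three of the four names ending in "a" are female, so the posterior for "female" is 0.75. A trained classifier such as the one in NLTK additionally smooths the counts so that unseen letters do not produce zero probabilities; this sketch leaves that out.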

'''Gender classifier.'''
import random

import nltk
from nltk.corpus import names  # requires the corpus: nltk.download('names')


# Feature extractor: the last letter of the name is the only feature
def gender_features(word):
    return {'last_letter': word[-1]}


# Label every name in the corpus with its gender
names_set = ([(name, 'male') for name in names.words('male.txt')] +
             [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names_set)  # shuffle so both splits mix male and female names
# print(names_set[:10])  # e.g. [('Samantha', 'female'), ('Nanette', 'female'), ('Layney', 'female'), ('Bertie', 'male'), ('Godfry', 'male'), ...
features = [(gender_features(n), g) for (n, g) in names_set]  # extract features
train_set, test_set = features[500:], features[:500]  # first 500 for testing, the rest for training
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train the classifier
print(classifier.classify(gender_features('Neo')))  # male
print(nltk.classify.accuracy(classifier, test_set))  # accuracy on the test set, about 0.77 (varies with the shuffle)
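# Splitting a large dataset
from nltk.classify import apply_features
# apply_features returns a lazy list-like object that computes features on demand,
# so the feature sets for the whole dataset need not be held in memory at once.
# Slice the labeled list names_set, not the corpus reader names.
train_set = apply_features(gender_features, names_set[500:])
test_set = apply_features(gender_features, names_set[:500])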
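
The idea behind computing features on demand can be sketched in plain Python. The class `LazyFeatureList` below is a hypothetical stand-in written for illustration only; it is not NLTK's actual implementation, and the three labeled names are made up.

```python
from collections.abc import Sequence


class LazyFeatureList(Sequence):
    """Hypothetical illustration: compute (features, label) pairs only on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing stays lazy: wrap the sliced tokens again
            return LazyFeatureList(self._func, self._tokens[index])
        token, label = self._tokens[index]
        return (self._func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}


# Made-up labeled names for the demonstration
pairs = [("Neo", "male"), ("Trinity", "female"), ("Morpheus", "male")]
lazy = LazyFeatureList(gender_features, pairs)
print(lazy[0])        # ({'last_letter': 'o'}, 'male')
print(len(lazy[1:]))  # 2
```

Only the raw (name, label) pairs are stored; each feature dictionary is built when an element is read and discarded afterwards, which is the trade-off that makes this approach attractive for large datasets.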