The Data Science Behind Natural Language Processing

未来飞机

2023-07-25 12:27:23

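Human communication is one of the most complex and fascinating phenomena there is. We communicate every day in many ways, whether through conversation or written symbols. Even a glance can convey information. Chris Manning, a machine learning professor at Stanford University, has described communication as "a discrete, symbolic, categorical signaling system." This means that our senses (sight, touch, hearing, even smell) all help us communicate. So what happens when we bring computing into this process? That is the domain of natural language processing (NLP).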

Natural Language Processing (NLP)

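NLP is a discipline that combines computer science and artificial intelligence, with the aim of effective communication between humans and machines. The field dates back to the 1950s, when Alan Turing proposed the famous "Turing test" to assess whether a computer can imitate human behavior. Since then NLP has made great progress, particularly in data science and linguistics.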

The rest of this article introduces some basic NLP functions and demonstrates them with Python code examples.

Tokenization

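Tokenization is the process of breaking text into its smallest units, such as words. For example, given the sentence "The red fox jumps over the moon.", each word is treated as a token, for a total of seven tokens.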

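Tokenizing with Python:

```python
myText = 'The red fox jumps over the moon.'
myLowerText = myText.lower()      # normalize case
myTextList = myLowerText.split()  # split on whitespace
print(myTextList)
```

Output:

```
['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon.']
```

Because `split()` only breaks on whitespace, the final period stays attached to the last token.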
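Splitting on whitespace keeps punctuation glued to the neighboring word (`'moon.'`). As a minimal sketch of one way around this, the standard-library `re` module can pull out words and punctuation marks as separate tokens. The function name `simple_tokenize` and the pattern are illustrative assumptions, not something from the original article.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("The red fox jumps over the moon.")
print(tokens)  # ['the', 'red', 'fox', 'jumps', 'over', 'the', 'moon', '.']
```

With punctuation split off, the sentence yields the seven word tokens plus one token for the period.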

Part-of-Speech Tagging

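Part-of-speech tagging determines the grammatical function of each word in a sentence. In English, the common parts of speech are adjectives, pronouns, nouns, verbs, adverbs, prepositions, conjunctions, and interjections. Some words fit more than one category: "permit", for example, can be used as either a noun or a verb. The NLTK library makes part-of-speech tagging straightforward:

```python
import nltk

# One-time setup: the tokenizer and tagger models must be fetched with nltk.download();
# the resource names (e.g. 'punkt', 'averaged_perceptron_tagger') vary between NLTK versions.
myText = nltk.word_tokenize('the red fox jumps over the moon.')
print('Parts of Speech:', nltk.pos_tag(myText))
```

Output:

```
Parts of Speech: [('the', 'DT'), ('red', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]
```

The tags follow the Penn Treebank convention: DT is a determiner, JJ an adjective, NN a singular noun, NNS a plural noun, and IN a preposition. Note that in this output "jumps" is labeled as a plural noun even though it acts as a verb in the sentence, the same kind of noun/verb ambiguity as "permit".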
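A tagged sentence is just a list of `(word, tag)` pairs, so it is easy to post-process with plain Python. The sketch below copies the pairs from the output shown above and groups the words by tag; the helper name `group_by_tag` is a hypothetical choice for illustration and needs no NLTK install.

```python
from collections import defaultdict

# (word, tag) pairs copied from the tagger output shown in the article
tagged = [('the', 'DT'), ('red', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'),
          ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]

def group_by_tag(pairs):
    """Collect the words that share each part-of-speech tag."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)

by_tag = group_by_tag(tagged)
print(by_tag['NN'])  # ['fox', 'moon']
print(by_tag['DT'])  # ['the', 'the']
```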

Stop Word Removal

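Many sentences and paragraphs contain words such as "a", "and", "an", and "the" that carry little meaning on their own. Removing these stop words simplifies text analysis. NLTK makes this easy:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stop word list and tokenizer data (fetched once with nltk.download()).
example_sent = "a red fox is an animal that is able to jump over the moon."
stop_words = set(stopwords.words('english'))  # separate name so the imported module is not shadowed
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
```

Output:

```
['red', 'fox', 'animal', 'able', 'jump', 'moon', '.']
```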
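To make the filtering step concrete without the NLTK data files, here is a minimal sketch that uses only the standard library. The stop list `STOP_WORDS` is a hypothetical, deliberately tiny set chosen for this sentence; NLTK's English list is far longer.

```python
import re

# Hypothetical, hand-picked stop list for illustration only
STOP_WORDS = {"a", "an", "and", "the", "is", "that", "to", "over"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Tokenize into words and punctuation, then drop any token in the stop list."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words("A red fox is an animal that is able to jump over the moon."))
# ['red', 'fox', 'animal', 'able', 'jump', 'moon', '.']
```

Storing the stop list in a `set` keeps each membership check constant-time, which matters once the text grows beyond a single sentence.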

Stemming

Stemming reduces a word to its base form in order to collapse inflectional variants. For example, "fishing" can be reduced to "fish". NLTK can perform stemming:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["likes", "likely", "likes", "liking"]
for w in words:
    # Print each word next to its Porter stem
    print(w, ":", ps.stem(w))
```

Output:

```
likes : like
likely : like
likes : like
liking : like
```
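To show the basic idea behind stemming, the sketch below strips a few common suffixes. This is a naive illustration and not the Porter algorithm: the suffix list `SUFFIXES`, the minimum stem length, and the function name `naive_stem` are all assumptions made for this example.

```python
# Hypothetical suffix list, tried in this order; not the Porter rule set
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "likely", "liking", "likes"]:
    print(w, ":", naive_stem(w))
```

This sketch turns "fishing" into "fish" but "liking" into "lik". Cases like the second one are why a full stemmer such as Porter's applies additional ordered rules, for example restoring a trailing "e", rather than bare suffix stripping.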

Lemmatization

Lemmatization is similar to stemming, but it takes the word's part of speech into account and therefore produces more readable results. NLTK can be used to compare stemming with lemmatization:

```python
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Requires the WordNet data (fetched once with nltk.download('wordnet')).
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()

words = ["corpora", "constructing", "better", "done", "worst", "pony"]

for w in words:
    # pos='a' lemmatizes each word as an adjective, which is what the output below reflects
    print(w, "STEMMING:", ps.stem(w), "LEMMATIZATION", lemmatizer.lemmatize(w, pos='a'))
```

Output:

```
corpora STEMMING: corpora LEMMATIZATION corpora
constructing STEMMING: construct LEMMATIZATION constructing
better STEMMING: better LEMMATIZATION good
done STEMMING: done LEMMATIZATION done
worst STEMMING: worst LEMMATIZATION bad
pony STEMMING: poni LEMMATIZATION pony
```
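The reason the part of speech matters can be sketched with a lookup table keyed by `(word, pos)`. This is a toy model under stated assumptions: the table `EXCEPTIONS` and the function `toy_lemmatize` are hypothetical and hand-written for illustration, whereas a real lemmatizer draws on the WordNet database.

```python
# Hypothetical exception table keyed by (word, part of speech); 'a' = adjective, 'v' = verb, 'n' = noun
EXCEPTIONS = {
    ("better", "a"): "good",
    ("worst", "a"): "bad",
    ("done", "v"): "do",
    ("corpora", "n"): "corpus",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos) if the table has one, else the lowercased word."""
    word = word.lower()
    return EXCEPTIONS.get((word, pos), word)

print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better", "n"))  # better
print(toy_lemmatize("done", "v"))    # do
```

The same surface form maps to different lemmas depending on the tag supplied, which is why lemmatization is usually run after part-of-speech tagging.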