【Natural Language Processing】DeepLearning.AI-1-w1-Logistic Regression

Learn to extract features from text into numerical vectors, then build a binary classifier for tweets using logistic regression!

Typical text preprocessing steps

  • Preprocessing
    • Remove handles and URLs
    • Tokenize the text
    • Remove stop words, e.g. in English: and, is, a, on, etc.
    • Reduce words to their stems (stemming), e.g. dancer, dancing, danced all become 'danc'
    • Lowercase all uppercase letters
```python
# example
import nltk    # Python library for NLP
import re      # library for regular expression operations
import string  # for string operations

from nltk.corpus import stopwords         # module for stop words that come with NLTK
from nltk.stem import PorterStemmer       # module for stemming
from nltk.tokenize import word_tokenize   # module for tokenizing strings

tweet_text = 'My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i'

def preprocess(text, remove_tag=True, tokenize=True, remove_stop_p=True, stem=True, lower=True):
    """
    Preprocess a piece of text: strip retweet tags, URLs, and hash signs,
    then tokenize, remove stopwords/punctuation, stem, and lowercase.
    """
    print(text)
    if remove_tag:
        # remove old style retweet text "RT"
        text = re.sub(r'^RT[\s]+', '', text)
        # remove hyperlinks
        text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
        # remove hashtags: only removing the hash # sign from the word
        text = re.sub(r'#', '', text)
        print('\033[92m' + text)

    text_list = []
    if tokenize:
        nltk.download('punkt')
        text_list = word_tokenize(text)
        print('\033[94m' + str(text_list))

    text_clean = []
    if remove_stop_p:
        nltk.download('stopwords')
        stopwords_english = stopwords.words('english')
        for word in text_list:                        # go through every word in the token list
            if (word not in stopwords_english and     # remove stopwords
                    word not in string.punctuation):  # remove punctuation
                text_clean.append(word)
        print('\033[92m' + str(text_clean))

    text_stem = []
    if stem:
        stemmer = PorterStemmer()           # instantiate the stemming class
        for word in text_clean:
            stem_word = stemmer.stem(word)  # stem the word
            text_stem.append(stem_word)     # append to the list
        print('\033[94m' + str(text_stem))

    text_lower = []
    if lower:
        for word in text_stem:
            text_lower.append(word.lower())
        print('\033[92m' + str(text_lower))
    return text_lower

preprocess(tweet_text)
```

Two ways to turn a sentence into a feature vector

Method 1 - Word presence

  • Denote the vocabulary of all texts to be processed as \(V\)
    • For each sentence, mark the words that appear as 1 and the rest as 0
    • Each sentence is therefore a vector of dimension \(|V|\)
  • Problem
    • Such sparse, high-dimensional vectors make training and prediction slow
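A minimal sketch of this presence encoding (the vocabulary and sentence below are made-up illustrations, not from the course):

```python
def presence_vector(sentence, vocabulary):
    """Map a sentence to a |V|-dimensional 0/1 vector:
    1 if the vocabulary word appears in the sentence, else 0."""
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

vocab = ['i', 'am', 'happy', 'sad', 'because', 'learning', 'nlp']
vec = presence_vector('I am happy because I am learning NLP', vocab)
print(vec)  # [1, 1, 1, 0, 1, 1, 1]
```

With a realistic vocabulary \(|V|\) is in the tens of thousands, so almost every entry is 0, which is exactly the sparsity problem noted above.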

Method 2 - Word frequencies

  • Suppose a corpus has four sentences: two positive and two negative

  • Count, for each unique word, its frequency in the positive class and in the negative class, building a frequency table

  • Represent each sentence with a 3-dimensional vector
    • The first dimension is the bias term
    • The second dimension is the sum of the positive frequencies of the sentence's words found in the table
    • The third dimension is the sum of the negative frequencies of the sentence's words found in the table
    • The example sentence from the course is then represented as \((1, 8, 11)\)
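The frequency-table construction and the 3-dimensional feature extraction can be sketched as follows (the tiny corpus here is a made-up stand-in for the course's example, so the resulting numbers differ from \((1,8,11)\)):

```python
from collections import defaultdict

# Tiny labeled corpus: (sentence, class) with 1 = positive, 0 = negative.
corpus = [
    ('i am happy because i am learning nlp', 1),
    ('i am happy', 1),
    ('i am sad i am not learning nlp', 0),
    ('i am sad', 0),
]

# Frequency table: (word, class) -> count across the corpus.
freqs = defaultdict(int)
for sentence, label in corpus:
    for word in sentence.split():
        freqs[(word, label)] += 1

def extract_features(sentence, freqs):
    """Map a sentence to (bias, sum of positive freqs, sum of negative freqs),
    summing over the sentence's unique words."""
    words = set(sentence.split())
    pos = sum(freqs[(w, 1)] for w in words)
    neg = sum(freqs[(w, 0)] for w in words)
    return (1, pos, neg)

print(extract_features('i am happy because i am learning', freqs))  # (1, 10, 7)
```

Every sentence is compressed to just 3 numbers regardless of vocabulary size, which is what makes this representation much cheaper to train on than the \(|V|\)-dimensional presence vectors of Method 1.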