
TensorFlow RNN (Recurrent Neural Network), Example 2: Text Sentiment Analysis


I previously wrote a text sentiment analysis using a fully connected neural network: /weiwei9363/article/details/78357670. This time, we build an RNN with TensorFlow to do sentiment analysis on text. Complete code with a detailed walkthrough (Solution): /jiemojiemo/deep-learning/tree/master/sentiment-rnn. Training data: /jiemojiemo/deep-learning/tree/master/sentiment-network

Step 1: Data Preprocessing

import numpy as np

# Read the data
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()

# Each \n separates one review
reviews[:2000]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life such as teachers . my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers . the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn t \nstory of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly . \nhomelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter . most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq pressuring kids to succeed technology the elections inflation or worrying if they ll be next to end up on the streets . br br but what if y'

from string import punctuation

# Remove punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
# Each \n separates one review
reviews = all_text.split('\n')
all_text = ' '.join(reviews)
# Collect all words
words = all_text.split()

all_text[:2000]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalledat high a classic line inspector i m here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn t story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy chantings of it s singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level it s better than you might think with some good cinematography by future great vilmos zsigmond future stars sally kirkland and frederic forrest can be seen briefly homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq pressuring kids to succeed technology the elections inflation or worrying if they ll be next to end up on the streets br br but what if you were given a bet to live on the st'

words[:100]

['bromwell','high','is','a','cartoon','comedy','it','ran','at','the','same','time','as','some','other','programs','about',..........'at','high']

Step 2: Convert Text to Numbers

A neural network cannot process strings, so we must convert the strings to numbers. Concretely, we assign each word an integer index. For the training that follows, we also convert every review in the training data from a string into a sequence of these integers.

from collections import Counter

def get_vocab_to_int(words):
    # Count how many times each word occurs
    counts = Counter(words)
    # Sort words by frequency, most frequent first
    vocab = sorted(counts, key=counts.get, reverse=True)
    # Build the word-to-integer mapping, i.e. give each word an integer index;
    # inside the network a word is represented by this index.
    # For example, 'apple' might simply be the number 500.
    # Indices start at 1; 0 is reserved for a special purpose (explained below)
    vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
    return vocab_to_int
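To see the mapping in action, here is a toy call (not from the original post; with equal frequencies, the exact tie order depends on dictionary insertion order):

# Toy illustration of get_vocab_to_int: 'the' occurs twice, so it gets index 1
get_vocab_to_int('the cat sat on the mat'.split())
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}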

def get_reviews_ints(vocab_to_int, reviews):
    # Convert each review to numbers, i.e. map every word in the review
    # through vocab_to_int.
    # For example, "I love this moive" might become [5, 36, 45, 12354]
    reviews_ints = []
    for each in reviews:
        reviews_ints.append([vocab_to_int[word] for word in each.split()])
    return reviews_ints

vocab_to_int = get_vocab_to_int(words)
reviews_ints = get_reviews_ints(vocab_to_int, reviews)

# As an example, see what "i love this moive" is converted to
get_reviews_ints(vocab_to_int, ['i love this moive'])

[[10, 115, 11, 59320]]

# There are 74072 unique words in total
len(vocab_to_int)

74072

Step 3: Encode the Output Labels

The labels fall into two classes, 'negative' and 'positive'. We encode 'negative' as 0 and 'positive' as 1.

labels = np.array([0 if label=='negative' else 1 for label in labels.split('\n')])

Step 4: Clean Up Bad Data

For some unknown reason, reviews_ints contains entries of length 0. These are meaningless, so we remove them. Also, the longest review has 2514 words, which is far too long for our network, so long reviews will have to be cut down.

review_lens = Counter([len(x) for x in reviews_ints])
print('Zero-length reviews:{}'.format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews:1
Maximum review length: 2514

# Get the indices of the reviews whose length is not 0
non_zeros_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
len(non_zeros_idx)

25000

# Remove the zero-length reviews from reviews_ints
reviews_ints = [reviews_ints[ii] for ii in non_zeros_idx]
labels = np.array([labels[ii] for ii in non_zeros_idx])

Step 5: Truncate and Pad

As mentioned above, some reviews are too long and must be truncated, while others are too short and must be padded. We fix the input sequence length at 200: reviews longer than 200 words are truncated, and reviews shorter than 200 are padded with 0 on the left. For example, 'i love this movie' is [10, 115, 11, 59320], so we pad 196 zeros on the left to get [0, 0, ..., 0, 10, 115, 11, 59320].

# Fixed sequence length
seq_len = 200
# Shape: number of reviews * seq_len
features = np.zeros((len(reviews_ints), seq_len), dtype=int)
for i, review in enumerate(reviews_ints):
    # Left-pad with zeros; NumPy clips the slice for reviews longer than seq_len
    features[i, -len(review):] = np.array(review)[:seq_len]
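As a quick sanity check (not in the original post), we can peek at the first row: for a review shorter than 200 words the leading entries are padding zeros and the word indices sit at the right end.

# Illustrative check: padding zeros at the front, word indices at the tail
print(features[0, :10])   # leading zeros (assuming the first review has fewer than 200 words)
print(features[0, -10:])  # indices of the first review's last ten words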

Step 6: Build the Training, Validation, and Test Sets

# Fraction of the data used for training
split_frac = 0.8
# Split off the training set
split_index = int(len(features)*split_frac)
train_x, val_x = features[:split_index], features[split_index:]
train_y, val_y = labels[:split_index], labels[split_index:]
# The part left over after the training set is split in half,
# one half for validation and one half for testing
test_index = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_index], val_x[test_index:]
val_y, test_y = val_y[:test_index], val_y[test_index:]
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set:		(20000, 200)
Validation set:	(2500, 200)
Test set:		(2500, 200)

Step 7: Build the Network

Set the basic parameters

# Number of LSTM units
lstm_size = 256
# Number of LSTM layers
lstm_layers = 1
batch_size = 512
learning_rate = 0.001

Define the inputs and outputs

import tensorflow as tf

# +1 because word indices start at 1 and index 0 is reserved for padding
n_words = len(vocab_to_int) + 1

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    # Input variable holding a batch of reviews.
    # Its shape is [None, None]: the first None is the batch size
    # (could be written as batch_size); the second None is the review
    # length (could be written as seq_len)
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    # Input labels
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    # Dropout keep probability, e.g. 0.8 means 80% of units are kept
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

Add the Embedding Layer

embed_size = 300
with graph.as_default():
    # Embedding matrix: one embed_size-dimensional vector per word index
    embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), stddev=0.01))
    # embed has shape [batch_size, seq_len, embed_size]
    embed = tf.nn.embedding_lookup(embedding, inputs_)

Build the LSTM Layer

with graph.as_default():
    # Build the LSTM layer; it contains lstm_size LSTM units
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    # Add dropout
    drop = tf.contrib.rnn.DropoutWrapper(lstm, keep_prob)
    # If one LSTM layer is not enough, stack several
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    # Every input sequence needs an initial state;
    # we feed batch_size sequences at a time, so there are batch_size initial states
    initial_state = cell.zero_state(batch_size, tf.float32)
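One caveat worth noting (not raised in the original post): with lstm_layers > 1, reusing the same wrapped cell via [drop] * lstm_layers makes the layers share variables, and newer TF 1.x releases reject it outright. A minimal sketch of the usual workaround, creating a fresh cell per layer:

# Sketch: build one independent cell per layer (would replace the
# MultiRNNCell line above, inside the same `with graph.as_default():` block)
def build_cell():
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    return tf.contrib.rnn.DropoutWrapper(lstm, keep_prob)

cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])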

RNN Forward Propagation

with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)

# outputs has shape (512, ?, 256)
# 512 is batch_size
# ? is seq_len
# 256 is the number of LSTM units
outputs

<tf.Tensor 'rnn/transpose:0' shape=(512, ?, 256) dtype=float32>

Define the Output

with graph.as_default():
    # We only care about the LSTM output at the final step, so outputs[:, -1]
    # takes the LSTM output for the last word of each review.
    # outputs[:, -1] has shape batch_size * lstm_size
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
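The post uses mean squared error on a sigmoid output. A more conventional choice for binary classification is cross-entropy on the raw logits; the sketch below is only an alternative to consider, not what the original trains with:

# Alternative (not in the original post): binary cross-entropy on logits
with graph.as_default():
    logits = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=None)
    xent_cost = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.cast(labels_, tf.float32), logits=logits))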

Validation Accuracy

with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Get Batches

def get_batches(x, y, batch_size=100):
    # Number of complete batches; samples that don't fill a batch are dropped
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
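A quick toy run (not from the original) makes the remainder-dropping behavior visible:

# With 25 samples and batch_size=10, the 5 leftover samples are dropped
xs = np.arange(25).reshape(25, 1)
ys = np.arange(25)
for bx, by in get_batches(xs, ys, batch_size=10):
    print(bx.shape, by.shape)  # prints (10, 1) (10,) twice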

Training

epochs = 10

# For persistence: save the trained model
with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)

            if iteration % 5 == 0:
                print('Epoch: {}/{}'.format(e, epochs),
                      'Iteration: {}'.format(iteration),
                      'Train loss: {}'.format(loss))

            if iteration % 25 == 0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print('Val acc: {:.3f}'.format(np.mean(val_acc)))
            iteration += 1
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/10 Iteration: 5 Train loss: 0.24799075722694397
Epoch: 0/10 Iteration: 10 Train loss: 0.24164661765098572
Epoch: 0/10 Iteration: 15 Train loss: 0.23779860138893127
Epoch: 0/10 Iteration: 20 Train loss: 0.23155733942985535
Epoch: 0/10 Iteration: 25 Train loss: 0.19295498728752136
Val acc: 0.694
Epoch: 0/10 Iteration: 30 Train loss: 0.16817498207092285
Epoch: 0/10 Iteration: 35 Train loss: 0.14103104174137115
Epoch: 1/10 Iteration: 40 Train loss: 0.4157596230506897
Epoch: 1/10 Iteration: 45 Train loss: 0.25596609711647034
Epoch: 1/10 Iteration: 50 Train loss: 0.14873309433460236
Val acc: 0.759
Epoch: 1/10 Iteration: 55 Train loss: 0.221963316637
Epoch: 1/10 Iteration: 60 Train loss: 0.22595466673374176
Epoch: 1/10 Iteration: 65 Train loss: 0.22170156240463257
Epoch: 1/10 Iteration: 70 Train loss: 0.21362364292144775
Epoch: 1/10 Iteration: 75 Train loss: 0.21025851368904114
Val acc: 0.637
Epoch: 2/10 Iteration: 80 Train loss: 0.19788420479
Epoch: 2/10 Iteration: 85 Train loss: 0.18369686603546143
Epoch: 2/10 Iteration: 90 Train loss: 0.15401005744934082
Epoch: 2/10 Iteration: 95 Train loss: 0.08480044454336166
Epoch: 2/10 Iteration: 100 Train loss: 0.21809038519859314
Val acc: 0.555
Epoch: 2/10 Iteration: 105 Train loss: 0.2156117707490921
Epoch: 2/10 Iteration: 110 Train loss: 0.2078854888677597
Epoch: 2/10 Iteration: 115 Train loss: 0.17866834998130798
Epoch: 3/10 Iteration: 120 Train loss: 0.2278885841369629
Epoch: 3/10 Iteration: 125 Train loss: 0.23644667863845825
Val acc: 0.574
Epoch: 3/10 Iteration: 130 Train loss: 0.15737152099609375
Epoch: 3/10 Iteration: 135 Train loss: 0.2996417284011841
Epoch: 3/10 Iteration: 140 Train loss: 0.3013457655906677
Epoch: 3/10 Iteration: 145 Train loss: 0.29811352491378784
Epoch: 3/10 Iteration: 150 Train loss: 0.29609352350234985
Val acc: 0.539
Epoch: 3/10 Iteration: 155 Train loss: 0.29265934228897095
Epoch: 4/10 Iteration: 160 Train loss: 0.3259274959564209
Epoch: 4/10 Iteration: 165 Train loss: 0.1977640688419342
Epoch: 4/10 Iteration: 170 Train loss: 0.10309533774852753
Epoch: 4/10 Iteration: 175 Train loss: 0.20305077731609344
Val acc: 0.722
Epoch: 4/10 Iteration: 180 Train loss: 0.21348100900650024
Epoch: 4/10 Iteration: 185 Train loss: 0.1976686418056488
Epoch: 4/10 Iteration: 190 Train loss: 0.17928491532802582
Epoch: 4/10 Iteration: 195 Train loss: 0.17746716737747192
Epoch: 5/10 Iteration: 200 Train loss: 0.12238124758005142
Val acc: 0.814
Epoch: 5/10 Iteration: 205 Train loss: 0.07527816295623779
Epoch: 5/10 Iteration: 210 Train loss: 0.05444170534610748
Epoch: 5/10 Iteration: 215 Train loss: 0.028456348925828934
Epoch: 5/10 Iteration: 220 Train loss: 0.02309001237154007
Epoch: 5/10 Iteration: 225 Train loss: 0.02358683943748474
Val acc: 0.544
Epoch: 5/10 Iteration: 230 Train loss: 0.0281759575009346
Epoch: 6/10 Iteration: 235 Train loss: 0.36734506487846375
Epoch: 6/10 Iteration: 240 Train loss: 0.27041739225387573
Epoch: 6/10 Iteration: 245 Train loss: 0.06518629193305969
Epoch: 6/10 Iteration: 250 Train loss: 0.27379676699638367
Val acc: 0.683
Epoch: 6/10 Iteration: 255 Train loss: 0.17366482317447662
Epoch: 6/10 Iteration: 260 Train loss: 0.11729621887207031
Epoch: 6/10 Iteration: 265 Train loss: 0.156696617603302
Epoch: 6/10 Iteration: 270 Train loss: 0.15894444286823273
Epoch: 7/10 Iteration: 275 Train loss: 0.14083260297775269
Val acc: 0.653
Epoch: 7/10 Iteration: 280 Train loss: 0.131819948554039
Epoch: 7/10 Iteration: 285 Train loss: 0.1406235545873642
Epoch: 7/10 Iteration: 290 Train loss: 0.12142431735992432
Epoch: 7/10 Iteration: 295 Train loss: 0.10793609172105789
Epoch: 7/10 Iteration: 300 Train loss: 0.1138591319322586
Val acc: 0.778
Epoch: 7/10 Iteration: 305 Train loss: 0.10069040209054947
Epoch: 7/10 Iteration: 310 Train loss: 0.08547944575548172
Epoch: 8/10 Iteration: 315 Train loss: 0.0743105486035347
Epoch: 8/10 Iteration: 320 Train loss: 0.08303466439247131
Epoch: 8/10 Iteration: 325 Train loss: 0.07770203053951263
Val acc: 0.749
Epoch: 8/10 Iteration: 330 Train loss: 0.05231660231947899
Epoch: 8/10 Iteration: 335 Train loss: 0.05823827162384987
Epoch: 8/10 Iteration: 340 Train loss: 0.06528615206480026
Epoch: 8/10 Iteration: 345 Train loss: 0.06311675161123276
Epoch: 8/10 Iteration: 350 Train loss: 0.07824704796075821
Val acc: 0.809
Epoch: 9/10 Iteration: 355 Train loss: 0.04236128553748131
Epoch: 9/10 Iteration: 360 Train loss: 0.03875266760587692
Epoch: 9/10 Iteration: 365 Train loss: 0.045075297355651855
Epoch: 9/10 Iteration: 370 Train loss: 0.0551967048645
Epoch: 9/10 Iteration: 375 Train loss: 0.051657453179359436
Val acc: 0.805
Epoch: 9/10 Iteration: 380 Train loss: 0.040323011577129364
Epoch: 9/10 Iteration: 385 Train loss: 0.03481965512037277
Epoch: 9/10 Iteration: 390 Train loss: 0.061715394258499146

Testing

test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.785
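As a possible extension (not part of the original post), the trained network can score a brand-new review by reusing the same preprocessing with a zero initial state. A minimal sketch, where predict_sentiment is a hypothetical helper; it assumes the checkpoint above exists and that the text contains at least one in-vocabulary word:

def predict_sentiment(text):
    # Hypothetical helper (not in the original post).
    # Convert words to indices, skipping out-of-vocabulary words, then truncate
    ints = [vocab_to_int[w] for w in text.split() if w in vocab_to_int][:seq_len]
    # Left-pad into a full batch, since initial_state expects batch_size sequences
    feat = np.zeros((batch_size, seq_len), dtype=int)
    feat[0, -len(ints):] = ints
    with tf.Session(graph=graph) as sess:
        saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
        state = sess.run(cell.zero_state(batch_size, tf.float32))
        pred = sess.run(predictions, feed_dict={inputs_: feat,
                                                keep_prob: 1,
                                                initial_state: state})
    # Close to 1 means positive, close to 0 means negative
    return pred[0, 0]

print(predict_sentiment('this movie was wonderful and moving'))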
