NLP-Based COVID-19 Fake News Detection (with Code)

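Author: Susan Li

Translator: 楊毅遠 (Yang Yiyuan)

Proofreader: 吳金笛 (Wu Jindi)

This article is about 4,400 characters long; suggested reading time is 8 minutes.

This article introduces natural language processing methods for detecting and visualizing COVID-19 fake news, and reproduces the detection and visualization process with a real news dataset and complete code.

Tags: natural language processing, data visualization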

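A recent news story reported that half of Canadians have been fooled by COVID-19 conspiracy theories, which is truly heartbreaking.

The World Health Organization (WHO) has said that the infodemic surrounding COVID-19 is as dangerous as the virus itself. Likewise, conspiracy theories, myths, and exaggerated facts can have consequences that reach beyond public health.

Thankfully, projects such as Lead Stories, Poynter, FactCheck.org, Snopes, and EuVsDisinfo monitor, identify, and fact-check disinformation spreading around the world.

To explore the content of COVID-19 fake news, I defined real news and fake news strictly. Real news is reporting that is known to be true and comes from trusted news organizations. Fake news is reporting that is known to be false and comes from well-known fake news websites that deliberately try to spread misinformation.

Based on these definitions, I collected 1,100 news articles and social network posts about COVID-19 from a variety of news sources and labeled them.

The dataset can be found here: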

https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/corona_fake.csv

Data

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go

# Requires a one-time nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

df = pd.read_csv('data/corona_fake.csv')

# Normalize inconsistent spellings of labels and sources
df.loc[df['label'] == 'Fake', ['label']] = 'FAKE'
df.loc[df['label'] == 'fake', ['label']] = 'FAKE'
df.loc[df['source'] == 'facebook', ['source']] = 'Facebook'

# Set labels for specific rows. df.loc[row, col] assigns in place,
# whereas the chained form df.loc[5]['label'] = ... may write to a copy.
df.loc[5, 'label'] = 'FAKE'
df.loc[15, 'label'] = 'TRUE'
df.loc[43, 'label'] = 'FAKE'
df.loc[131, 'label'] = 'TRUE'
df.loc[242, 'label'] = 'FAKE'

# Shuffle the rows, then count the articles per label
df = df.sample(frac=1).reset_index(drop=True)
df.label.value_counts()

process_data.py

After data cleaning, we can see that there are 586 real news items and 578 fake news items.

df.loc[df['label'] == 'TRUE'].source.value_counts()

Figure 1

The real news comes mainly from Harvard Health Publishing, The New York Times, the Johns Hopkins Bloomberg School of Public Health, the WHO, the CDC, and similar organizations.

df.loc[df['label'] == 'FAKE'].source.value_counts()

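Figure 2

Several of the fake news items were collected from Facebook posts, from a far-right website named Natural News, and from an alternative medicine website named orthomolecular.org. Some of the articles or posts have since been removed from the internet or from social networks, but they can still be found online.

With the function below, we can read any given news item and decide how to clean the text: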

def print_plot(index):
    # Look up the row with the given index and keep its text and label
    example = df[df.index == index][['text', 'label']].values[0]
    if len(example) > 0:
        print(example[0])
        print('label:', example[1])

print_plot(500)

print_plot.py

print_plot(1000)

Because the article text in our dataset is already fairly clean, we only need to remove punctuation and convert uppercase letters to lowercase.
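A minimal sketch of that cleaning step, assuming plain Python strings. The helper name `clean_text` is hypothetical and does not come from the original article, and it only strips ASCII punctuation (curly quotes and similar characters would be left in place):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    if not isinstance(text, str):
        return ''  # treat missing values (e.g. NaN) as empty text
    return text.lower().translate(_PUNCT_TABLE)

print(clean_text("COVID-19: Is it 'REAL' news?"))  # covid19 is it real news
```

With the article's DataFrame, this could be applied as `df['text'].map(clean_text)`.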

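Article length

In the next steps, we will:

Get the sentiment score of each news item. The score lies in the range [-1, 1], where 1 means positive sentiment and -1 means negative sentiment.
Get the length of each article (its word count).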

# Sentiment polarity in [-1, 1] for each article
df['polarity'] = df['text'].map(lambda text: TextBlob(text).sentiment.polarity)

def text_len(x):
    # Word count for strings; 0 for missing or non-string values
    if isinstance(x, str):
        return len(x.split())
    else:
        return 0

df['text_len'] = df['text'].apply(text_len)
nums_text = df.query('text_len > 0')['text_len']

fig = ff.create_distplot(hist_data=[nums_text], group_labels=['Text'])
fig.update_layout(title_text='Distribution of article length', template="plotly_white")
fig.show()

polarity_length.py

Figure 3

Most articles in the dataset are shorter than 1,000 words. A few, however, run to more than 4,000 words.
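To put a number on "most articles", one could compute the share of articles below a length threshold. The sketch below is an illustration only: `share_below` is a hypothetical helper, and the sample word counts are made up rather than taken from the dataset.

```python
def share_below(lengths, threshold):
    """Return the fraction of positive lengths that are strictly below threshold."""
    valid = [n for n in lengths if n > 0]  # skip empty articles, as the plot does
    if not valid:
        return 0.0
    return sum(1 for n in valid if n < threshold) / len(valid)

# Made-up word counts for illustration only, not values from the dataset.
sample_lengths = [120, 450, 980, 1500, 4200, 0]
print(share_below(sample_lengths, 1000))  # 3 of the 5 non-empty lengths
```

On the article's DataFrame the same check would be `share_below(df['text_len'], 1000)`.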

When we split the data by label, there is no obvious difference in article length between real and fake news, although most real news items in the dataset appear to be somewhat shorter than the fake ones.

fig = px.histogram(df, x="text_len", y="text", color="label",
                   marginal="box",
                   hover_data=df.columns, nbins=100)
fig.update_layout(title_text='Distribution of article length', template="plotly_white")
fig.show()

text_len_hist.py

Figure 4

To show the probability density of text length for each kind of news, we use a violin plot:

fig = px.violin(df, y='text_len', color='label',
                violinmode='overlay',
                hover_data=df.columns, template='plotly_white')
fig.show()

text_len_violin.py

Figure 5
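The by-label comparison of article length can also be checked numerically rather than read off the plots. The following is a small sketch using only the standard library; `median_by_group` is a hypothetical helper and the rows are made-up values, not data from the article.

```python
from collections import defaultdict
from statistics import median

def median_by_group(pairs):
    """Map each group key to the median of its values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}

# Made-up (label, word count) pairs for illustration only.
rows = [('TRUE', 300), ('TRUE', 500), ('TRUE', 700),
        ('FAKE', 400), ('FAKE', 900), ('FAKE', 1100)]
print(median_by_group(rows))  # {'TRUE': 500, 'FAKE': 900}
```

With the article's DataFrame, `median_by_group(zip(df['label'], df['text_len']))` would do the same job, as would the pandas expression `df.groupby('label')['text_len'].median()`.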

Facebook vs. Harvard

On average, Facebook posts are much shorter than Harvard Health articles:

# Keep only the Facebook posts and the Harvard Health articles
df_new = df.loc[(df['source'] == 'Facebook') | (df['source'] == 'https://www.health.harvard.edu/')]

fig = px.histogram(df_new, x="text_len", y="text", color='source',
                   marginal="box",
                   hover_data=df_new.columns, nbins=100)
fig.update_layout(title_text='Distribution of article length of two sources', template="plotly_white")
fig.show()

facebook_harvard_textlen_hist.py

Figure 6

We can also present this with a violin plot:

fig = px.violin(df_new, y='text_len', color='source',
                violinmode='overlay',
                hover_data=df_new.columns, template='plotly_white')
fig.show()

facebook_harvard_textlen_violin.py

Figure 7

As most of us probably know, fake Facebook posts tend to be short. Their authors try to win readers over with heuristics rather than with the strength of their arguments.

Sentiment polarity

x1 = df.loc[df['label'] == 'TRUE']['polarity']
x2 = df.loc[df['label'] == 'FAKE']['polarity']

group_labels = ['TRUE', 'FAKE']

colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)']

# One polarity distribution per label
fig = ff.create_distplot(
    [x1, x2], group_labels, colors=colors)

fig.update_layout(title_text='polarity', template="plotly_white")
fig.show()

label_polarity.py

Figure 8

Real news and fake news show no obvious difference in sentiment, which we can confirm with a violin plot:

fig = px.violin(df, y='polarity', color="label",
                violinmode='overlay',
                template='plotly_white')
fig.show()

polarity_violin.py

Figure 9

When we compare sentiment polarity across four sources, we can see that the distributions for The New York Times and Natural News are much narrower than those for Harvard Health and Facebook.

x1 = df.loc[df['source'] == 'Facebook']['polarity']
x2 = df.loc[df['source'] == 'https://www.health.harvard.edu/']['polarity']
x3 = df.loc[df['source'] == 'https://www.nytimes.com/']['polarity']
x4 = df.loc[df['source'] == 'https://www.naturalnews.com/']['polarity']
group_labels = ['Facebook', 'Harvard', 'nytimes', 'naturalnews']

colors = ['rgb(0, 0, 100)', 'rgb(0, 200, 200)', 'rgb(100, 0, 0)', 'rgb(200, 0, 200)']

# Create a distplot with one polarity distribution per source
fig = ff.create_distplot(
    [x1, x2, x3, x4], group_labels, colors=colors)

fig.update_layout(title_text='polarity', template="plotly_white")
fig.show()

polarity_source.py

Figure 10

This means that the New York Times articles and the Natural News articles in the data sound less emotional.

The following violin plot confirms this:

fig = go.Figure()

sources = ['https://www.health.harvard.edu/', 'https://www.nytimes.com/', 'Facebook', 'https://www.naturalnews.com/']

# Add one violin trace per source
for source in sources:
    fig.add_trace(go.Violin(x=df['source'][df['source'] == source],
                            y=df['polarity'][df['source'] == source],
                            name=source,
                            box_visible=True,
                            meanline_visible=True))
fig.update_layout(title_text='Polarity of four sources', template='plotly_white')
fig.show()

source_violin.py

Figure 11
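One way to put a number on how narrow a source's polarity distribution is would be the standard deviation per source. This is a sketch under assumptions: `spread_by_group` is a hypothetical helper, the source names 'A' and 'B' and their scores are made up, and the population standard deviation is just one possible measure of spread.

```python
from collections import defaultdict
from statistics import pstdev

def spread_by_group(pairs):
    """Map each group key to the population standard deviation of its values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: pstdev(values) for key, values in groups.items()}

# Made-up (source, polarity) pairs for illustration only.
rows = [('A', 0.10), ('A', 0.12), ('A', 0.08),
        ('B', -0.30), ('B', 0.10), ('B', 0.50)]
spreads = spread_by_group(rows)
print(spreads['A'] < spreads['B'])  # True: source A is the narrower one
```

In pandas, `df.groupby('source')['polarity'].std()` gives a comparable figure, although it uses the sample standard deviation rather than the population one.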

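Sentiment vs. article length vs. veracity

I noticed that the news items and posts I collected are neither very positive nor very negative. Most of them fall in a moderately positive range, and most are shorter than 1,000 words.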

fig = px.density_contour(df, x='polarity', y='text_len', marginal_x='histogram', marginal_y='histogram', template='plotly_white')
fig.update_layout(title_text='Sentiment vs. Article length')
fig.show()

len_polarity.py

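Figure 12

There is no obvious relationship between sentiment and article length. In general, neither the sentiment nor the length of an article tells us whether it is true. The difference between fake and real news may be fairly arbitrary.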

fig = px.scatter(df, x='polarity', y='text_len', color='label', template="plotly_white")
fig.update_layout(title_text='Sentiment polarity')
fig.show()

polarity_scatter.py

Figure 13
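The lack of a clear relationship between sentiment and article length could also be checked with a correlation coefficient. The sketch below computes Pearson's r by hand; `pearson_r` is a hypothetical helper and the sample values are made up. Note that Pearson's r only captures linear association, so a value near zero does not rule out other kinds of dependence.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only, not data from the article.
polarity = [0.05, 0.10, 0.15, 0.20]
length = [800, 300, 900, 400]
print(round(pearson_r(polarity, length), 2))
```

On the article's DataFrame, `df['polarity'].corr(df['text_len'])` computes the same coefficient with pandas.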

df.groupby(['source']).mean(numeric_only=True).sort_values('polarity', ascending=False)

Figure 14

I noticed that a post by Rudy Giuliani has one of the highest sentiment scores, so I was curious what it was about:

df.loc[df['source'] == 'RudyGiuliani']['text'][880]

Of course, it is about hydroxychloroquine.

Content of real and fake news

Now let's look at which topics the dataset covers.
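The calls that follow rely on a helper named `get_top_n_bigram`, which this text never defines. The article's imports (`CountVectorizer`, `STOPWORDS`) suggest the original was built on scikit-learn, but that is an assumption. The sketch below is a dependency-free stand-in with the same call shape, and its short stop-word list is purely illustrative:

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the article's list).
# The NLTK-based STOPWORDS set could be passed in instead.
DEFAULT_STOPWORDS = frozenset({
    'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'for', 'on',
    'that', 'this', 'it', 'with', 'as', 'are', 'was', 'be', 'by', 'at',
})

def get_top_n_bigram(corpus, n=None, stopwords=DEFAULT_STOPWORDS):
    """Return the n most frequent bigrams as (bigram, count) pairs."""
    counts = Counter()
    for doc in corpus:
        if not isinstance(doc, str):
            continue  # skip missing values such as NaN
        # Lowercase, keep tokens of two or more word characters, drop stop words
        tokens = [t for t in re.findall(r'\b\w\w+\b', doc.lower())
                  if t not in stopwords]
        # Count each pair of adjacent remaining tokens
        counts.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(n)

docs = ['Wash your hands. Wash your hands often.', 'Please wash your hands.']
print(get_top_n_bigram(docs, 2))  # [('wash your', 3), ('your hands', 3)]
```

Because stop words are removed before pairing, a bigram can join two words that were separated by a stop word in the original text, and bigrams may also span sentence boundaries within one document.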

# 20 most frequent bigrams in the real news
common_bigram_true = get_top_n_bigram(df.loc[df['label'] == 'TRUE']['text'], 20)
for word, freq in common_bigram_true:
    print(word, freq)

true_bigram.py

# 20 most frequent bigrams in the fake news
common_bigram_fake = get_top_n_bigram(df.loc[df['label'] == 'FAKE']['text'], 20)
for word, freq in common_bigram_fake:
    print(word, freq)

fake_bigram.py

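The fake news in the dataset falls into a few recurring themes:

Promoting cures: this includes the use of high-dose intravenous vitamin C.
Speculation about origins: this theme includes claims that the coronavirus was created in a laboratory for use as a bioweapon, or that 5G technology causes the disease.
Rumors about influential people: for example, that Bill Gates and Dr. Fauci orchestrated the coronavirus on behalf of pharmaceutical companies.
Playing on people's fears: for example, that the Melinda Gates Foundation and Johns Hopkins University predicted the coronavirus three months in advance through Event 201.

Judging from our data, one clear difference between real and fake news content is that fake news seems to mention people's names more often, which suggests that fake news may be more personalized.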

naturalnews.com vs orthomolecular.org

Both of these news sources promote conspiracy theories, but they focus on different topics.

naturalnews_bigram = get_top_n_bigram(df.loc[df['source'] == 'https://www.naturalnews.com/']['text'], 20)
for word, freq in naturalnews_bigram:
    print(word, freq)

natural_bigram.py

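naturalnews.com has been spreading false information, such as claims that the coronavirus was engineered as a bioweapon in a Chinese laboratory, and/or that the virus was spread to cover up the harmful health effects associated with exposure to 5G wireless technology.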

ortho_bigram = get_top_n_bigram(df.loc[df['source'] == 'http://orthomolecular.org/']['text'], 20)
for word, freq in ortho_bigram:
    print(word, freq)

ortho_bigram.py

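orthomolecular.org has been promoting high-dose intravenous vitamin C as a treatment, a claim that has no evidence behind it.

Based on the analysis above, feel free to judge the veracity of other news for yourself.

Summary

First, we do not know whether there was selection bias when the data was collected. Second, although all of the news items above had high user engagement, we cannot say how much actual traffic they generated. Despite these shortcomings, the dataset provides reasonable labels, and we know that every item in it has been widely read and shared.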
