TWEET CLASSIFICATION
Using NLP to classify Disaster and Non Disaster Tweets
- 1.) Defining our question
- 2.) Understanding our data.
- 3.) EDA
- 4.) Modelling
- 5.) Conclusions and Recommendations
- 6.) Follow up Questions
- 7.) Model Deployment
Many institutions, such as disaster-relief organizations and news agencies, are interested in using Twitter to get information on emergencies. However, some tweets describe real disasters while others do not. Due to the increased use of Twitter as a platform for announcing disasters, we have been tasked with building a machine learning model that determines whether or not a tweet describes a real disaster.
a) Specifying the Question
Predict which tweets are about real disasters and which ones are not.
b) Defining the Metric for Success
Our study will be successful if we are able to: build models that classify tweets as disaster or non-disaster, and achieve an accuracy of at least 80%.
c) Understanding the context
Twitter has become an important communication channel in times of emergency. The use of smartphones enables people to announce an emergency they're observing in real time, and as a result more agencies, such as disaster-relief organizations and news agencies, are interested in programmatically monitoring Twitter.
d) Recording the Experimental Design
1) Business understanding: understanding the business problem.
2) Reading the data: getting access to our train, test and sample submission data and reading it with pandas.
3) Checking our data: understanding our variables, the number of rows and columns per dataset, and the unique values in the data.
4) Data cleaning: checking for missing values and duplicates and dealing with them.
5) Text preprocessing: removing noise from the text, converting it to lowercase, removing stop words and tokenizing.
6) EDA: visualizing our data.
7) Implementing the solution (modelling): using classification algorithms such as SVM, Naive Bayes and BERT to make predictions.
8) Conclusion: concluding on the best model for our predictions.
# Loading libraries
import numpy as np
import pandas as pd
# Reading the data
train= pd.read_csv("/content/train (7).csv")
test= pd.read_csv("/content/test (5).csv")
submit= pd.read_csv("/content/sample_submission.csv")
# Loading the head
print("Train")
print(train.head())
print("")
print("Test")
print(test.head())
print("")
print("Sample Submission")
print (submit.head())
# dtypes
print(train.dtypes)
print("")
print(test.dtypes)
# Shape
print("Train number of rows and columns are : ", train.shape)
print("Test number of rows and columns are : ", test.shape)
print("Sample submission number of rows and columns are : ", submit.shape)
# Let's see what a non-disaster tweet looks like
non_disaster = train[train['target']==0]['text']
non_disaster.values[10]
# Let's see what a disaster tweet looks like
disaster_t = train[train['target']==1]['text']
disaster_t.values
# Null values
print("")
print("Train missing per column")
print(train.isnull().sum())
print("")
print("Test missing per column")
print(test.isnull().sum())
print("")
print("Sample submission missing per column")
print(submit.isnull().sum())
Missing values per column for the train, test and submission data are shown above.
# Dropping null values
train.dropna(inplace=True)
test.dropna(inplace=True)
# Confirming the number of missing values for both train and test
print("Number of missing values in train",train.isnull().sum().sum())
print("Number of missing values in test",test.isnull().sum().sum())
# Duplicates
print("Train duplicated? ", train.duplicated().any())
print("")
print("Test duplicated? ", test.duplicated().any())
print("")
print("Sample submission duplicated? ",submit.duplicated().any())
No duplicates.
# Checking for unique values per column
for column in train.columns:
    print('\n')
    print(train[column].nunique())
    print(train[column].unique())
Our target has two unique values.
# Editing the location entries that mean exactly the same thing
train['location'] = train['location'].replace(['United States'],'USA')
train['location'] = train['location'].replace(['United Kingdom'],'UK')
train['location'] = train['location'].replace(['NYC'],'New York')
train['location'] = train['location'].replace(['New York, NY'],'New York')
train['location'] = train['location'].replace(['Washington, D.C'],'Washington, DC')
train['location'] = train['location'].replace(['Los Angeles, CA'],'Los Angeles')
train['location'] = train['location'].replace(['Chicago, IL'],'Chicago')
train['location'] = train['location'].replace(['San Fransisco, CA'],'San Fransisco')
# checking for tweets that have been labelled as both disaster and not disaster
df_mislabeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_mislabeled = df_mislabeled[df_mislabeled['target'] > 1]['target']
df_mislabeled.index.tolist()
# Removing texts that are labelled both disaster and not disaster, as shown above
mislabelled_texts = [
    '.POTUS #StrategicPatience is a strategy for #Genocide; refugees; IDP Internally displaced people; horror; etc. https://t.co/rqWuoy1fm4',
    'I Pledge Allegiance To The P.O.P.E. And The Burning Buildings of Epic City. ??????',
    'like for the music video I want some real action shit like burning buildings and police chases not some weak ben winston shit',
    'RT NotExplained: The only known image of infamous hijacker D.B. Cooper. http://t.co/JlzK2HdeTG',
    'In #islam saving a person is equal in reward to saving all humans! Islam is the opposite of terrorism!',
    "Mmmmmm I'm burning.... I'm burning buildings I'm building.... Oooooohhhh oooh ooh...",
    '#Allah describes piling up #wealth thinking it would last #forever as the description of the people of #Hellfire in Surah Humaza. #Reflect',
    '#foodscare #offers2go #NestleIndia slips into loss after #Magginoodle #ban unsafe and hazardous for #humanconsumption'
]
train = train[~train.text.isin(mislabelled_texts)]
# Confirming that the double-labelled data has been removed
df_2labeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_2labeled = df_2labeled[df_2labeled['target'] > 1]['target']
df_2labeled.index.tolist()
As shown above, the tweets labelled as both disaster and not disaster have been removed.
# Code to display all details in the columns
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)
# Converting emojis and emoticons to words
!pip install emot
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
# Converting emojis to words
import re
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",", "").replace(":", "").split()))
    return text
# Converting emoticons to words
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'(' + emot + ')', "_".join(EMOTICONS[emot].replace(",", "").split()), text)
    return text
# Applying both functions to the 'text' column
train['text'] = train['text'].apply(convert_emoticons)
train['text'] = train['text'].apply(convert_emojis)
import nltk
from nltk.tokenize import RegexpTokenizer # For tokenization
from nltk.stem import WordNetLemmatizer,PorterStemmer # For lemmatization
from nltk.corpus import stopwords# To remove stop words
nltk.download('stopwords')
nltk.download('wordnet')
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")  # Removing the '{html}' marker
    cleanr = re.compile('<.*?>')  # Pattern matching HTML tags
    cleantext = re.sub(cleanr, '', sentence)  # Removing HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)  # Removing links
    rem_num = re.sub('[0-9]+', '', rem_url)  # Removing numbers
    tokenizer = RegexpTokenizer(r'\w+')  # Tokenization
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]
    stem_words = [stemmer.stem(w) for w in filtered_words]  # Stemmed and lemmatized versions
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]  # (computed but not used in the returned text)
    return " ".join(filtered_words)
# Applying the preprocessing function to both datasets
test['cleanText']=test['text'].map(lambda s:preprocess(s))
train['cleanText']=train['text'].map(lambda s:preprocess(s))
# Spelling correction
!pip install textblob
# Spell check using text blob for the first 5 records
from textblob import TextBlob
train['cleanText'][:5].apply(lambda x: str(TextBlob(x).correct()))
# Common word removal
# Checking the first 10 most frequent words
from collections import Counter
cnt = Counter()
for text in train["cleanText"].values:
    for word in text.split():
        cnt[word] += 1
cnt.most_common(10)
The 10 most frequent words are shown above; we remove them in the next step.
# Removing the frequent words
freq = set([w for (w, wc) in cnt.most_common(10)])
# function to remove the frequent words
def freqwords(text):
    return " ".join([word for word in str(text).split() if word not in freq])
# Passing the function freqwords
train["cleanText"] = train["cleanText"].apply(freqwords)
train["cleanText"].head()
#collapse-hide
import matplotlib.pyplot as plt
import seaborn as sns
#
fig, axes = plt.subplots(ncols=2, figsize=(17, 4), dpi=100)
plt.tight_layout()
train.groupby('target').count()['id'].plot(kind='pie', ax=axes[0], labels=['Not Disaster (57%)', 'Disaster (43%)'])
sns.countplot(x=train['target'], hue=train['target'], ax=axes[1])
axes[0].set_ylabel('')
axes[1].set_ylabel('')
axes[1].set_xticklabels(['Not Disaster (4342)', 'Disaster (3271)'])
axes[0].tick_params(axis='x', labelsize=15)
axes[0].tick_params(axis='y', labelsize=15)
axes[1].tick_params(axis='x', labelsize=15)
axes[1].tick_params(axis='y', labelsize=15)
axes[0].set_title('Target Distribution in Training Set', fontsize=13)
axes[1].set_title('Target Count in Training Set', fontsize=13)
plt.show()
#collapse-hide
import seaborn as sns
plt.figure(figsize=(5,6))
sns.countplot(y=train.location, order = train.location.value_counts().iloc[:25].index)
plt.title('Top 25 location from the tweets')
plt.show()
The highest number of records comes from the USA.
#collapse-hide
plt.figure(figsize=(5,6))
sns.countplot(y=train.keyword, order = train.keyword.value_counts().iloc[:25].index)
plt.title('Frequent 25 keywords from the tweets')
plt.show()
The most common keywords are collision, whirlwind and fatalities.
def length(text):
    'a function which returns the length of the text'
    return len(text)
#
train['length'] = train['text'].apply(length)
# #collapse-hide
plt.rcParams['figure.figsize'] = (18.0, 6.0)
bins = 150
plt.hist(train[train['target'] == 0]['length'], alpha = 0.6, bins=bins, label='Not')
plt.hist(train[train['target'] == 1]['length'], alpha = 0.8, bins=bins, label='Real')
plt.xlabel('length')
plt.ylabel('numbers')
plt.legend(loc='upper right')
#plt.xlim(0,150)
plt.show()
- The tweet-length distributions for both classes are skewed to the left.
# hide_output
train['target_mean'] = train.groupby('keyword')['target'].transform('mean')
fig = plt.figure(figsize=(8, 72), dpi=100)
sns.countplot(y=train.sort_values(by='target_mean', ascending=False)['keyword'],
hue=train.sort_values(by='target_mean', ascending=False)['target'])
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=12)
plt.legend(loc=1)
plt.title('Target Distribution in Keywords')
plt.show()
train.drop(columns=['target_mean'], inplace=True)
# hide_output
!pip install pandas_profiling
!pip3 install pandas_profiling --upgrade
# Creating a pandas profiling report for the train dataset
from pandas_profiling import ProfileReport
profile = ProfileReport(train,html={'style':{'full_width':True}})
profile
- From our dataset, the most repeated keywords are fatalities, deluge, armageddon, damage and harm.
- The highest recorded location is USA followed by UK and Canada.
- There is no correlation in our data.
# Getting the statistical distribution of our text data
lens = train.cleanText.str.split().apply(lambda x: len(x))
print(lens.describe())
lens.hist()
Our tweets have an average of 9 words each.
# The most used words in our text are
# In this step, I find the most frequent words in the data, extracting information about its content and topics.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stops = set(stopwords.words('english')+['com'])
co = CountVectorizer(stop_words=stops)
counts = co.fit_transform(train.cleanText)
w=pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(20)
w.head()
# Libraries
from collections import defaultdict
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
#
DISASTER_TWEETS = train['target'] == 1
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
N = 5
# Unigrams
disaster_unigrams = defaultdict(int)
nondisaster_unigrams = defaultdict(int)
for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet):
        disaster_unigrams[word] += 1
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet):
        nondisaster_unigrams[word] += 1
df_disaster_unigrams = pd.DataFrame(sorted(disaster_unigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_unigrams = pd.DataFrame(sorted(nondisaster_unigrams.items(), key=lambda x: x[1])[::-1])
# Bigrams
disaster_bigrams = defaultdict(int)
nondisaster_bigrams = defaultdict(int)
for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=2):
        disaster_bigrams[word] += 1
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=2):
        nondisaster_bigrams[word] += 1
df_disaster_bigrams = pd.DataFrame(sorted(disaster_bigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_bigrams = pd.DataFrame(sorted(nondisaster_bigrams.items(), key=lambda x: x[1])[::-1])
# Trigrams
disaster_trigrams = defaultdict(int)
nondisaster_trigrams = defaultdict(int)
for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=3):
        disaster_trigrams[word] += 1
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=3):
        nondisaster_trigrams[word] += 1
df_disaster_trigrams = pd.DataFrame(sorted(disaster_trigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_trigrams = pd.DataFrame(sorted(nondisaster_trigrams.items(), key=lambda x: x[1])[::-1])
# #collapse-hide
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)
plt.tight_layout()
sns.barplot(y=df_disaster_unigrams[0].values[:N], x=df_disaster_unigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_unigrams[0].values[:N], x=df_nondisaster_unigrams[1].values[:N], ax=axes[1], color='cyan')
for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)
axes[0].set_title(f'Top {N} most common unigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common unigrams in Non-disaster Tweets', fontsize=15)
plt.show()
The most used words are like, amp, new and fire.
# check for frequent bi-gram words
co = CountVectorizer(ngram_range=(2,2),stop_words=stops)
counts = co.fit_transform(train.cleanText)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(10)
#collapse-hide
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)
plt.tight_layout()
sns.barplot(y=df_disaster_bigrams[0].values[:N], x=df_disaster_bigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_bigrams[0].values[:N], x=df_nondisaster_bigrams[1].values[:N], ax=axes[1], color='cyan')
for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)
axes[0].set_title(f'Top {N} most common bigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common bigrams in Non-disaster Tweets', fontsize=15)
plt.show()
Some of the most used bi-gram words are burning buildings (44), suicide bomber (29), looks like (28) and youtube video (27).
# Showing trigrams in each target
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)
sns.barplot(y=df_disaster_trigrams[0].values[:N], x=df_disaster_trigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_trigrams[0].values[:N], x=df_nondisaster_trigrams[1].values[:N], ax=axes[1], color='cyan')
for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=11)
axes[0].set_title(f'Top {N} most common trigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common trigrams in Non-disaster Tweets', fontsize=15)
plt.show()
# To show distribution of stop words in our original data
import nltk
nltk.download('stopwords')
stop=set(stopwords.words('english'))
#
corpus=[]
new= train['text'].str.split()
new=new.values.tolist()
corpus=[word for i in new for word in i]
from collections import defaultdict
dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word] += 1
#collapse-hide
top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10]
x,y=zip(*top)
plt.bar(x,y)
The most frequent stop word is 'the', followed by 'a'.
Wordclouds
A word cloud is a technique for showing which words are the most frequent in a given text.
# Creating a word cloud
?WordCloud
# Joining all the tweets into one giant string
text = " ".join(review for review in train.cleanText)
print("There are {} characters in the combination of all tweets.".format(len(text)))
#All words in the dataset wordcloud
# Create stopword list:
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="grey").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
# creating a disaster tweet dataframe
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
disaster = train[train['target'] == 1]
# Joining all the disaster tweets into one giant string
disaster_text = " ".join(review for review in disaster.cleanText)
# creating a non-disaster tweet dataframe
non_disaster = train[train['target'] == 0]
# Joining all the non-disaster tweets into one giant string
non_disaster_text = " ".join(review for review in non_disaster.cleanText)
# Generate a word cloud image
wordcloud_disaster = WordCloud(stopwords=stopwords, background_color="black").generate(disaster_text)
# Generate a word cloud image
wordcloud_non_disaster = WordCloud(stopwords=stopwords, background_color="white").generate(non_disaster_text)
# Loading the thunderstorm mask image for the disaster word cloud
thunder_mask =np.array(Image.open("black_thunderstorm.jpg"))
thunder_mask
# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=thunder_mask,
stopwords=stopwords, contour_width=3, contour_color='grey')
# Generate a wordcloud
wc.generate(disaster_text)
# store to file
wc.to_file("disaster.png")
# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
# hide_output
earth_mask =np.array(Image.open("earth.jpg"))
earth_mask
#collapse-hide
wc = WordCloud(background_color="white", max_words=1000, mask=earth_mask,
stopwords=stopwords, contour_width=3, contour_color='green')
# Generate a wordcloud
wc.generate(non_disaster_text)
# store to file
wc.to_file("non_disaster.png")
# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
Naive Bayes
Assumptions
- Features in the dataset are mutually independent: the occurrence of one feature does not affect the probability of occurrence of another. This is reasonable here because each token is treated as independent of the others.
- The Bag of Words assumption: the position of the words in the document doesn't matter.
- Naive Bayes is relatively robust, easy to implement, fast and accurate, and it is used in many different fields such as email spam filtering and text classification.
Multinomial Naive Bayes uses term frequency, i.e. the number of times a given term appears in a document. After normalization, term frequency can be used to compute maximum likelihood estimates based on the training data to estimate the conditional probability.
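To make this concrete, here is a minimal sketch on a toy corpus (the sentences and labels below are made up for illustration, not drawn from the tweet data) showing the smoothed conditional probabilities P(word | class) that MultinomialNB estimates from term counts:
# Toy illustration of Multinomial Naive Bayes conditional probabilities (made-up data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
toy_texts = ["forest fire evacuation ordered", "flood rescue underway",
             "lovely sunny day outside", "great new song release"]
toy_labels = [1, 1, 0, 0]  # 1 = disaster, 0 = not disaster
vec = CountVectorizer()
X = vec.fit_transform(toy_texts)                  # term-frequency matrix
nb = MultinomialNB(alpha=1.0).fit(X, toy_labels)  # alpha=1.0 adds Laplace smoothing
# feature_log_prob_ holds log P(word | class); classes_ is [0, 1]
probs = np.exp(nb.feature_log_prob_)
feature_names = (vec.get_feature_names_out() if hasattr(vec, "get_feature_names_out")
                 else vec.get_feature_names())
for word, p_not, p_dis in zip(feature_names, probs[0], probs[1]):
    print(f"{word:12s} P(w|not)={p_not:.2f}  P(w|disaster)={p_dis:.2f}")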
# Making predictions on our test data using Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer # Transformer which we will use on our model
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline # It encapsulates transformers and predictors inside
#
unigrams_pipeline = Pipeline([('cleanText', CountVectorizer()),('target', MultinomialNB())])
unigrams_pipeline.fit(train.cleanText, train.target)
predictions = pd.DataFrame(
    unigrams_pipeline.predict(test.cleanText)
)
predictions['id'] = test['id']
# Classification report on the train prediction
pred = unigrams_pipeline.predict(train.cleanText)
# Printing the classification report
from sklearn.metrics import classification_report
print(classification_report(train.target, pred))
Our Naive Bayes model got a 95% accuracy score on the training data it was trained on.
# Printing out predictions on the test dataset
predictions.columns = ['target','id']
predictions.head()
# re-arranging the columns to match the submission file
cols = predictions.columns.tolist()
cols = cols[-1:] + cols[:-1]
predictions = predictions[cols]
# Converting our submission to CSV
predictions.to_csv("submissionsnd.csv", index=False)
predictions.head()
- Our model got a score of 79.252 on Kaggle
SVM (Support Vector Machines)
SVM is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not. In order to use SVM on our text, we first have to convert it to vectors (CountVectorizer); a minimal sketch follows the assumptions below.
Assumptions
- The margin is as large as possible.
- The support vectors are the most useful data points, because they are the ones closest to the boundary and most likely to be incorrectly classified.
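As a minimal sketch (toy sentences, not the tweet data), here is a linear SVM on count-vectorized text: decision_function gives the signed distance of each vector from the separating hyperplane, and support_ indexes the support vectors that define the margin.
# Toy illustration of a linear SVM on vectorized text (made-up data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
toy_texts = ["massive earthquake hits the city", "buildings collapsed after the storm",
             "enjoying a quiet evening at home", "new album out this friday"]
toy_labels = [1, 1, 0, 0]  # 1 = disaster, 0 = not disaster
vec = CountVectorizer()
X = vec.fit_transform(toy_texts)
clf = svm.SVC(kernel='linear', C=1.0).fit(X, toy_labels)
print("Support vector indices:", clf.support_)
print("Signed distance to the decision boundary:", clf.decision_function(X))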
# Making predictions using linear SVM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import model_selection, svm
from sklearn.pipeline import Pipeline # which encapsulates transformers and predictors inside
#
unigrams_pipelines = Pipeline([('cleanText', CountVectorizer()), ('target', svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto'))])
unigrams_pipelines.fit(train.cleanText, train.target)
predictsvm = pd.DataFrame(
    unigrams_pipelines.predict(test.cleanText)
)
#
predictsvm['id'] = test['id']
# Classification report on the train prediction
predi = unigrams_pipelines.predict(train.cleanText)
# Printing the classification report
print(classification_report(train.target,predi))
Our SVM model got a 95% accuracy score on the training data it was trained on.
predictsvm.columns = ['target','id']
predictsvm.head()
cols = predictsvm.columns.tolist()
cols = cols[-1:] + cols[:-1]
predictsvm = predictsvm[cols]
predictsvm.tail()
# Converting our predictions to CSV
predictsvm.to_csv("submissionsvm2.csv", index=False)
Our SVM scored 79.773 on Kaggle.
BERT (Bidirectional Encoder Representations from Transformers)
- A transformer model pre-trained on a large corpus of unlabelled text, including the whole of Wikipedia (2,500 million words) and BookCorpus (800 million words), which gives it a deeper understanding of how language works.
- BERT is also a "deep bidirectional" model: it learns information from both the left and the right side of a token's context during training, which helps it capture the meaning of language.
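Before the code, a toy illustration of the input representation BERT expects and that the bert_encode helper below builds for each tweet; the wordpieces and ids here are invented, only the structure (a [CLS] token, the text, a [SEP] token, padding, a mask and segment ids) matters.
# Toy illustration (made-up wordpieces and ids, no model download) of BERT's inputs
max_len = 12
wordpieces = ["forest", "fire", "near", "la", "ron", "##ge"]   # hypothetical WordPiece split
sequence = ["[CLS]"] + wordpieces + ["[SEP]"]
pad_len = max_len - len(sequence)
token_ids = list(range(1, len(sequence) + 1)) + [0] * pad_len  # fake ids for illustration
input_mask = [1] * len(sequence) + [0] * pad_len               # 1 = real token, 0 = padding
segment_ids = [0] * max_len                                    # single-sentence input
print(sequence + ["[PAD]"] * pad_len)
print(token_ids)
print(input_mask)
print(segment_ids)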
# Importing libraries
import tensorflow as tf
import pandas as pd
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
train=pd.read_csv('/content/train (7).csv')
test=pd.read_csv('/content/test (5).csv')
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
!pip install sentencepiece
import tokenization
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    for text in texts:
        text = tokenizer.tokenize(text)
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    return model
%%time
# To know the time it takes
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)
## Loading a tokenizer from the bert layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
##Encoding the text into tokens, masks, and segment flags:
import numpy as np
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values
train_labels
test_input[:1]
model = build_model(bert_layer, max_len=160)
model.summary()
train_history = model.fit(
train_input, train_labels,
validation_split=0.2,
epochs=5,
batch_size=16
)
model.save('model.h5')
test_pred = model.predict(test_input)
len(test_pred)
submit['target'] = test_pred.round().astype(int)
submit.to_csv('submission.csv', index=False)
submit.head()
This model ranked number 334 out of 1356 with an 83% accuracy. It's a good baseline model and we can brainstorm on how to improve its accuracy.
Flair
Flair is an NLP library which can be used for text classification.
Why use flair?
- It comprises popular and state-of-the-art word embeddings such as GloVe, BERT, ELMo and character embeddings.
- Flair's interface allows us to combine different word embeddings and use them to embed documents, which in turn leads to a significant uptick in results.
- Flair embeddings are contextual string embeddings (a small demo follows the install below).
- Flair supports a number of languages other than English.
# hide_output
!pip install flair
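As a small demo of what "contextual string embeddings" means (a sketch assuming the flair install above succeeded; the 'news-forward' model is downloaded on first use), the same word gets a different vector depending on its context:
# The word 'bank' gets different embeddings in different contexts
from flair.embeddings import FlairEmbeddings
from flair.data import Sentence
embedding = FlairEmbeddings('news-forward')   # character-level forward language model
s1 = Sentence('The river bank was flooded after the storm')
s2 = Sentence('She deposited the money at the bank')
embedding.embed(s1)
embedding.embed(s2)
# 'bank' is at index 2 in s1 and index 6 in s2 (0-indexed); the vectors differ
print(s1.tokens[2].embedding[:5])
print(s2.tokens[6].embedding[:5])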
## How It Works
# loading the data and visualizing flair steps
df=pd.read_csv('/content/train (7).csv')
from IPython.display import Image
Image('/content/flair.png')
import pandas as pd
import numpy as np
df=pd.read_csv('/content/train (7).csv')
df.head()
#Some preprocessing
import re
allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280
def preprocess(text):
    # Replace URLs with 'http', keep only allowed characters, pad punctuation with spaces, truncate to maxlen
    return ''.join([' ' + char + ' ' if char in punct else char
                    for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE)
                                 if char in allowed_chars]])[:maxlen]
df['text']=df['text'].apply(preprocess)
df.head()
#Data Preparation
df['target'].value_counts()
condition=[
(df['target']==1),
(df['target']==0)
]
values=['Disaster','No_Disaster']
df['labels']=np.select(condition,values)
df.head()
df1=df[['text','labels']]
df1.head()
#Converting to FastText
df_fast_text=df1.copy()
df_fast_text['labels'] = '__label__'+ df_fast_text['labels'].astype(str)
## Rearranging the dataframe to the required format
df_fast_text=df_fast_text[['labels','text']]
df_fast_text.head()
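For reference, every row now follows the FastText-style classification format that flair's ClassificationCorpus reads, i.e. a "__label__<class>" prefix followed by the text (the example below is made up, not an actual row from the data):
# Illustrative only: one line of the __label__<class> <text> format written to the files below
example_row = "__label__Disaster Wildfire spreading fast, residents told to evacuate"
print(example_row)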
## Splitting the data into the train, test and dev sets required by the model (a 60/20/20 split)
train_fst,test_fst,dev_fst=np.split(df_fast_text,[int(.6*len(df_fast_text)),int(.8*len(df_fast_text))])
# Viewing the shape of the three splits
print(train_fst.shape)
print(test_fst.shape)
print(dev_fst.shape)
### Storing all three in one folder
!mkdir -p data_fst
## storing them into one folder
train_fst.to_csv('/content/data_fst/train.csv',index=False,header=False,sep='\t')
test_fst.to_csv('/content/data_fst/test.csv',index=False,header=False,sep='\t')
dev_fst.to_csv('/content/data_fst/dev.csv',index=False,header=False,sep='\t')
## Checking to see if the files exist
!ls data_fst
The files have been stored.
# Creating the corpus
from flair.datasets import ClassificationCorpus
from flair.data import Corpus
# Loading the folder we just created
data_folder='/content/data_fst/'
# Applying the corpus on our folder
corpus_fst:Corpus=ClassificationCorpus(data_folder)
# Creating the label dictionary
label_dict=corpus_fst.make_label_dictionary()
# hide_output
from flair.embeddings import FlairEmbeddings, BertEmbeddings
# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')
# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
from flair.embeddings import DocumentLSTMEmbeddings,DocumentRNNEmbeddings
# Combining the embeddings
from flair.embeddings import StackedEmbeddings
# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])
from flair.data import Sentence
sentence = Sentence('The grass is green .')
# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)
# now check out the embedded tokens.
for token in sentence:
    print(token)
    print()
    print(token.embedding)
    print('-' * 100)
word_embeddings = [flair_forward_embedding, flair_backward_embedding, bert_embedding]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)
## Building and Training the Model
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
clf = TextClassifier(document_embeddings, label_dictionary=label_dict)
### initialise
trainer=ModelTrainer(clf,corpus_fst)
# hide_output
trainer.train('data',max_epochs=10)
The Flair model has given us an accuracy of 79%.
## Making Prediction on Raw Text
new_clf=TextClassifier.load('/content/data/best-model.pt')
# Importing sentence
from flair.data import Sentence
# Creating the first sentence
frst_sentence=Sentence('I was just sitted, when i suddenly heard a loud bang')
# Making our model predict the first sentence/ tweet
new_clf.predict(frst_sentence)
frst_sentence.labels
According to our model, there is a 76% chance that our first sentence is not a disaster tweet, which is accurate.
# Creating the second sentence/ tweet
second_sentence=Sentence('Many have beeninjured , Afew Died on the spot')
# Making our model predict the second sentence / tweet.
new_clf.predict(second_sentence)
second_sentence.labels
According to our model, there is a 67% chance that the second tweet is a disaster tweet, which is quite accurate.
# Creating the third sentence/ tweet
third_sentence=Sentence('Tsunamis swept through rice fields and flooded the towns.')
# Making our model predict the third sentence / tweet.
new_clf.predict(third_sentence)
third_sentence.labels
According to our model, there is a 93% chance that the third tweet is a disaster tweet.
The prototyping of the above model was done using the Streamlit library.
The code and the output are shown below.
NB: The code cannot run on Google Colab because Colab doesn't support some Streamlit dependencies.
The code should run well when written in a Python file and executed in an environment with the streamlit and flair libraries installed.
import datetime as dt
import re
import pandas as pd
import streamlit as st
from flair.data import Sentence
from flair.models import TextClassifier
from twitterscraper import query_tweets
# Preprocess function
allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280
@st.cache
def preprocess(text):
    # Delete URLs, space out punctuation, remove disallowed chars and cut to maximum length
    return ''.join([' ' + char + ' ' if char in punct else char
                    for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE)
                                 if char in allowed_chars]])[:maxlen]
def main():
    st.title('Fabulous Tokenizers Disaster Tweet Monitor')
    # Load the trained flair model once and cache it
    with st.spinner("Loading deep learning model..."):
        @st.cache(allow_output_mutation=True, show_spinner=False)
        def load_model(_):
            model = TextClassifier.load(r"C:\Users\ADMIN\Desktop\Fabulous Tokenizers Project\best-model.pt")
            return model
        model = load_model(10)
    st.write("Are you wondering if a tweet is a real disaster tweet or not?")
    app_mode = st.selectbox("CLASSIFY SINGLE TWEET or #HASH TAGS",
                            ["", "Single Tweet ", "# Tag"])
    if app_mode == "Single Tweet ":
        st.subheader('Single tweet classification')
        tweet_input = st.text_input('Tweet:')
        if tweet_input != '':
            # Pre-process tweet
            sentence = Sentence(preprocess(tweet_input))
            # Make predictions
            with st.spinner('Predicting...'):
                model.predict(sentence)
            st.write('Prediction:')
            st.write(sentence.labels[0].value + ' with ',
                     sentence.labels[0].score * 100, '% confidence')
    elif app_mode == "# Tag":
        # TWEET SEARCH AND CLASSIFY
        st.subheader('Search Twitter for Query')
        # Get user input
        query = st.text_input('Query:', '#')
        # As long as the query is valid (not empty or equal to '#')...
        if query != '' and query != '#':
            with st.spinner(f'Searching for and analyzing {query}...'):
                # Get English tweets from the past 16 weeks
                tweets = query_tweets(query, begindate=dt.date.today() - dt.timedelta(weeks=16), lang='en')
                # Initialize an empty dataframe
                tweet_data = pd.DataFrame({
                    'tweet': [],
                    'predicted-label': []
                })
                # Keep track of disaster vs. non-disaster tweets
                disaster_vs_non_disaster = {'Disaster': 0, 'No_Disaster': 0}
                # Add data for each tweet
                for tweet in tweets:
                    # Skip iteration if tweet is empty
                    if tweet.text in ('', ' '):
                        continue
                    # Make predictions
                    sentence = Sentence(preprocess(tweet.text))
                    model.predict(sentence)
                    label = sentence.labels[0].value
                    # Keep track of disaster vs. non-disaster tweets
                    disaster_vs_non_disaster[label] += 1
                    # Append new data
                    tweet_data = tweet_data.append({'tweet': tweet.text, 'predicted-label': label},
                                                   ignore_index=True)
            st.dataframe(tweet_data)
            # Show query data and the class ratio if available
            try:
                st.write(tweet_data)
                try:
                    st.write('Non-disaster to disaster tweet ratio:',
                             disaster_vs_non_disaster['No_Disaster'] / disaster_vs_non_disaster['Disaster'])
                except ZeroDivisionError:  # if no disaster tweets
                    st.write('All Non Disaster Tweets')
            except NameError:  # if no queries have been made yet
                pass
if __name__ == "__main__":
    main()
Requirements files
from IPython.display import Image
Image('/content/1.PNG')
Image('/content/2.PNG')