1.) Defining our question

Many institutions, such as disaster-relief organizations and news agencies, are interested in using Twitter to get information on emergencies. However, only some of the tweets on Twitter describe real disasters. Due to the increased use of Twitter as a platform for announcing disasters, we have been tasked with building a machine learning model that determines whether or not a tweet is about a real disaster.

a) Specifying the Question

Predict which tweets are about real disasters and which ones are not.

b) Defining the Metric for Success

Our study will be successful if we are able to:

  • Build models that classify tweets as being about a real disaster or not.
  • Build models with an accuracy of at least 80%.

c) Understanding the context

Twitter has become an important communication channel in times of emergency. The use of smartphones enables people to announce an emergency they’re observing in real time, and as a result, more agencies are interested in programmatically monitoring Twitter (e.g. disaster-relief organizations and news agencies).

d) Recording the Experimental Design

1) Business Understanding: Understanding the business problem.

2) Reading the data: Getting access to our train, test and sample submission data and reading it into pandas in Python.

3) Checking our data: Understanding our variables, the number of rows and columns per dataset as well as unique values in the data.

4) Data cleaning: Checking for any missing values, duplicates and solving them.

5) Text Preprocessing: removing noise from our text data, converting all our data to lowercase, removing stop-words and tokenizing the words.

6) EDA: visualizing our data.

7) Implementing the solution (Modelling): using classification algorithms like SVM and Naive Bayes, and transformer-based models like BERT, to make predictions.

8) Conclusion: concluding on the best model for our predictions.

2.) Understanding our data.

Reading our data

# Loading libraries
import numpy as np
import pandas as pd
# Reading the data
train= pd.read_csv("/content/train (7).csv")
test= pd.read_csv("/content/test (5).csv")
submit= pd.read_csv("/content/sample_submission.csv")
# Loading the head
print("Train")
print(train.head())
print("")
print("Test")
print(test.head())
print("")
print("Sample Submission")
print (submit.head())
Train
   id keyword  ...                                               text target
0   1     NaN  ...  Our Deeds are the Reason of this #earthquake M...      1
1   4     NaN  ...             Forest fire near La Ronge Sask. Canada      1
2   5     NaN  ...  All residents asked to 'shelter in place' are ...      1
3   6     NaN  ...  13,000 people receive #wildfires evacuation or...      1
4   7     NaN  ...  Just got sent this photo from Ruby #Alaska as ...      1

[5 rows x 5 columns]

Test
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan

Sample Submission
   id  target
0   0       0
1   2       0
2   3       0
3   9       0
4  11       0
# dtypes
print(train.dtypes)
print("")
print(test.dtypes)
id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

id           int64
keyword     object
location    object
text        object
dtype: object
# Shape
print("Train number of rows and columns are : ", train.shape)
print("Test number of rows and columns are : ", test.shape)
print("Sample submission number of rows and columns are : ", submit.shape)
Train number of rows and columns are :  (7613, 5)
Test number of rows and columns are :  (3263, 4)
Sample submission number of rows and columns are :  (3263, 2)
# Let's see what a non-disaster tweet looks like
non_disaster = train[train['target']==0]['text']
non_disaster.values[10]
"No way...I can't eat that shit"
# Let's see what a disaster tweet looks like
disaster_t = train[train['target']==1]['text']
disaster_t.values
array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       ...,
       'M1.94 [01:04 UTC]?5km S of Volcano Hawaii. http://t.co/zDtoyd8EbJ',
       'Police investigating after an e-bike collided with a car in Little Portugal. E-bike rider suffered serious non-life threatening injuries.',
       'The Latest: More Homes Razed by Northern California Wildfire - ABC News http://t.co/YmY4rSkQ3d'],
      dtype=object)

Data Cleaning

Missing values

# Null values
print("")
print("Train missing per column")
print(train.isnull().sum())
print("")
print("Test missing per column")
print(test.isnull().sum())
print("")
print("Sample submission missing per column")
print(submit.isnull().sum())
Train missing per column
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Test missing per column
id             0
keyword       26
location    1105
text           0
dtype: int64

Sample submission missing per column
id        0
target    0
dtype: int64

Missing values per column for train, test and submit are shown above.

# Dropping null values
train.dropna(inplace=True)
test.dropna(inplace=True)
# Confirming the number of missing values for both train and test
print("Number of missing values in train",train.isnull().sum().sum())
print("Number of missing values in test",test.isnull().sum().sum())
Number of missing values in train 0
Number of missing values in test 0

Duplicates

# Duplicates
print("Train duplicated? ", train.duplicated().any())

print("")
print("Test duplicated? ", test.duplicated().any())

print("")
print("Sample submission duplicated? ",submit.duplicated().any())
Train duplicated?  False

Test duplicated?  False

Sample submission duplicated?  False

No duplicates.

Checking for consistency and uniformity

# Checking for unique values per column
for column in train.columns:
  print('\n')
  print(train[column].nunique())
  print(train[column].unique())

5080
[   48    49    50 ... 10831 10832 10833]


221
['ablaze' 'accident' 'aftershock' 'airplane%20accident' 'ambulance'
 'annihilated' 'annihilation' 'apocalypse' 'armageddon' 'army' 'arson'
 'arsonist' 'attack' 'attacked' 'avalanche' 'battle' 'bioterror'
 'bioterrorism' 'blaze' 'blazing' 'bleeding' 'blew%20up' 'blight'
 'blizzard' 'blood' 'bloody' 'blown%20up' 'body%20bag' 'body%20bagging'
 'body%20bags' 'bomb' 'bombed' 'bombing' 'bridge%20collapse'
 'buildings%20burning' 'buildings%20on%20fire' 'burned' 'burning'
 'burning%20buildings' 'bush%20fires' 'casualties' 'casualty'
 'catastrophe' 'catastrophic' 'chemical%20emergency' 'cliff%20fall'
 'collapse' 'collapsed' 'collide' 'collided' 'collision' 'crash' 'crashed'
 'crush' 'crushed' 'curfew' 'cyclone' 'damage' 'danger' 'dead' 'death'
 'deaths' 'debris' 'deluge' 'deluged' 'demolish' 'demolished' 'demolition'
 'derail' 'derailed' 'derailment' 'desolate' 'desolation' 'destroy'
 'destroyed' 'destruction' 'detonate' 'detonation' 'devastated'
 'devastation' 'disaster' 'displaced' 'drought' 'drown' 'drowned'
 'drowning' 'dust%20storm' 'earthquake' 'electrocute' 'electrocuted'
 'emergency' 'emergency%20plan' 'emergency%20services' 'engulfed'
 'epicentre' 'evacuate' 'evacuated' 'evacuation' 'explode' 'exploded'
 'explosion' 'eyewitness' 'famine' 'fatal' 'fatalities' 'fatality' 'fear'
 'fire' 'fire%20truck' 'first%20responders' 'flames' 'flattened' 'flood'
 'flooding' 'floods' 'forest%20fire' 'forest%20fires' 'hail' 'hailstorm'
 'harm' 'hazard' 'hazardous' 'heat%20wave' 'hellfire' 'hijack' 'hijacker'
 'hijacking' 'hostage' 'hostages' 'hurricane' 'injured' 'injuries'
 'injury' 'inundated' 'inundation' 'landslide' 'lava' 'lightning'
 'loud%20bang' 'mass%20murder' 'mass%20murderer' 'massacre' 'mayhem'
 'meltdown' 'military' 'mudslide' 'natural%20disaster'
 'nuclear%20disaster' 'nuclear%20reactor' 'obliterate' 'obliterated'
 'obliteration' 'oil%20spill' 'outbreak' 'pandemonium' 'panic' 'panicking'
 'police' 'quarantine' 'quarantined' 'radiation%20emergency' 'rainstorm'
 'razed' 'refugees' 'rescue' 'rescued' 'rescuers' 'riot' 'rioting'
 'rubble' 'ruin' 'sandstorm' 'screamed' 'screaming' 'screams' 'seismic'
 'sinkhole' 'sinking' 'siren' 'sirens' 'smoke' 'snowstorm' 'storm'
 'stretcher' 'structural%20failure' 'suicide%20bomb' 'suicide%20bomber'
 'suicide%20bombing' 'sunk' 'survive' 'survived' 'survivors' 'terrorism'
 'terrorist' 'threat' 'thunder' 'thunderstorm' 'tornado' 'tragedy'
 'trapped' 'trauma' 'traumatised' 'trouble' 'tsunami' 'twister' 'typhoon'
 'upheaval' 'violent%20storm' 'volcano' 'war%20zone' 'weapon' 'weapons'
 'whirlwind' 'wild%20fires' 'wildfire' 'windstorm' 'wounded' 'wounds'
 'wreck' 'wreckage' 'wrecked']


3341
['Birmingham' 'Est. September 2012 - Bristol' 'AFRICA' ...
 'Vancouver, Canada' 'London ' 'Lincoln']


5028
['@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C'
 'We always try to bring the heavy. #metal #RT http://t.co/YAo1e0xngw'
 '#AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. http://t.co/2nndBGwyEi'
 ...
 "Three days off from work and they've pretty much all been wrecked hahaha shoutout to my family for that one"
 "#FX #forex #trading Cramer: Iger's 3 words that wrecked Disney's stock http://t.co/7enNulLKzM"
 '@engineshed Great atmosphere at the British Lion gig tonight. Hearing is wrecked. http://t.co/oMNBAtJEAO']


2
[1 0]

Our target has two unique values.

# Editing the location entries that mean exactly the same thing
train['location'] = train['location'].replace(['United States'],'USA')
train['location'] = train['location'].replace(['United Kingdom'],'UK')
train['location'] = train['location'].replace(['NYC'],'New York')
train['location'] = train['location'].replace(['New York, NY'],'New York')
train['location'] = train['location'].replace(['Washington, D.C'],'Washington, DC')
train['location'] = train['location'].replace(['Los Angeles, CA'],'Los Angeles')
train['location'] = train['location'].replace(['Chicago, IL'],'Chicago')
train['location'] = train['location'].replace(['San Fransisco, CA'],'San Fransisco')

Checking for anomalies

# checking for tweets that have been labelled as both disaster and not disaster
df_mislabeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_mislabeled = df_mislabeled[df_mislabeled['target'] > 1]['target']
df_mislabeled.index.tolist()
['.POTUS #StrategicPatience is a strategy for #Genocide; refugees; IDP Internally displaced people; horror; etc. https://t.co/rqWuoy1fm4',
 'I Pledge Allegiance To The P.O.P.E. And The Burning Buildings of Epic City. ??????',
 'like for the music video I want some real action shit like burning buildings and police chases not some weak ben winston shit',
 'RT NotExplained: The only known image of infamous hijacker D.B. Cooper. http://t.co/JlzK2HdeTG',
 'In #islam saving a person is equal in reward to saving all humans! Islam is the opposite of terrorism!',
 "Mmmmmm I'm burning.... I'm burning buildings I'm building.... Oooooohhhh oooh ooh...",
 '#Allah describes piling up #wealth thinking it would last #forever as the description of the people of #Hellfire in Surah Humaza. #Reflect',
 '#foodscare #offers2go #NestleIndia slips into loss after #Magginoodle #ban unsafe and hazardous for #humanconsumption']
# Removing texts that are labelled both disaster and not disaster (listed above)
train = train[~train.text.isin(df_mislabeled.index)]
# Confirming that the double-labelled texts have been removed
df_2labeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_2labeled = df_2labeled[df_2labeled['target'] > 1]['target']
df_2labeled.index.tolist()
[]

As shown above, the tweets labelled as both disaster and not disaster have been removed.

Text pre-processing

# Code to display the full contents of the columns
pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)
# Converting emojis and emoticons to words
!pip install emot
from emot.emo_unicode import UNICODE_EMO, EMOTICONS
# Converting emojis to words
import re
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return text
# Converting emoticons to words
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text
# Applying both functions to the text column
train['text'] = train['text'].apply(convert_emoticons)
train['text'] = train['text'].apply(convert_emojis)
Requirement already satisfied: emot in /usr/local/lib/python3.6/dist-packages (2.1)
import nltk
from nltk.tokenize import RegexpTokenizer # For tokenization
from nltk.stem import WordNetLemmatizer,PorterStemmer # For lemmatization
from nltk.corpus import stopwords# To remove stop words
nltk.download('stopwords')
nltk.download('wordnet')
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")  # removing the literal '{html}' placeholder
    cleanr = re.compile('<.*?>')  # pattern for HTML tags
    cleantext = re.sub(cleanr, '', sentence)  # removing HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)  # removing links
    rem_num = re.sub('[0-9]+', '', rem_url)  # removing numbers
    tokenizer = RegexpTokenizer(r'\w+')  # tokenization (also drops punctuation)
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]
    stem_words = [stemmer.stem(w) for w in filtered_words]  # computed for reference only
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]  # computed for reference only
    return " ".join(filtered_words)  # note: the unstemmed filtered words are what we keep

# Applying the preprocessing function to our datasets
test['cleanText']=test['text'].map(lambda s:preprocess(s))
train['cleanText']=train['text'].map(lambda s:preprocess(s))
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
# Spelling correction
!pip install textblob
# Spell check using text blob for the first 5 records
from textblob import TextBlob
train['cleanText'][:5].apply(lambda x: str(TextBlob(x).correct()))
Requirement already satisfied: textblob in /usr/local/lib/python3.6/dist-packages (0.15.3)
Requirement already satisfied: nltk>=3.1 in /usr/local/lib/python3.6/dist-packages (from textblob) (3.2.5)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from nltk>=3.1->textblob) (1.15.0)
31    bbcmtd wholesale markets ablaze                     
32    always try bring heavy metal                        
33    africanbaze breaking news siberia flag set ablaze ba
34    crying set ablaze                                   
35    plus side look sky last night ablaze                
Name: cleanText, dtype: object
# Common word removal
# Checking the first 10 most frequent words
from collections import Counter
cnt = Counter()
for text in train["cleanText"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)
[('like', 232),
 ('amp', 216),
 ('new', 178),
 ('fire', 175),
 ('via', 159),
 ('get', 155),
 ('people', 130),
 ('one', 125),
 ('news', 123),
 ('emergency', 120)]

These most common words will now be removed.

# Removing the frequent words
freq = set([w for (w, wc) in cnt.most_common(10)])
# Function to remove the frequent words
def freqwords(text):
    return " ".join([word for word in str(text).split() if word not in freq])
# Passing the function freqwords
train["cleanText"] = train["cleanText"].apply(freqwords)
train["cleanText"].head()
31    bbcmtd wholesale markets ablaze                 
32    always try bring heavy metal                    
33    africanbaze breaking nigeria flag set ablaze aba
34    crying set ablaze                               
35    plus side look sky last night ablaze            
Name: cleanText, dtype: object

3.) EDA

#collapse-hide
import matplotlib.pyplot as plt
import seaborn as sns
#
fig, axes = plt.subplots(ncols=2, figsize=(17, 4), dpi=100)
plt.tight_layout()

train.groupby('target').count()['id'].plot(kind='pie', ax=axes[0], labels=['Not Disaster (57%)', 'Disaster (43%)'])
sns.countplot(x=train['target'], hue=train['target'], ax=axes[1])

axes[0].set_ylabel('')
axes[1].set_ylabel('')
axes[1].set_xticklabels(['Not Disaster (4342)', 'Disaster (3271)'])
axes[0].tick_params(axis='x', labelsize=15)
axes[0].tick_params(axis='y', labelsize=15)
axes[1].tick_params(axis='x', labelsize=15)
axes[1].tick_params(axis='y', labelsize=15)

axes[0].set_title('Target Distribution in Training Set', fontsize=13)
axes[1].set_title('Target Count in Training Set', fontsize=13)
plt.show()

#collapse-hide
import seaborn as sns
plt.figure(figsize=(5,6))
sns.countplot(y=train.location, order = train.location.value_counts().iloc[:25].index)
plt.title('Top 25 location from the tweets')
plt.show()

The highest number of tweets comes from the USA.

#collapse-hide
plt.figure(figsize=(5,6))
sns.countplot(y=train.keyword, order = train.keyword.value_counts().iloc[:25].index)
plt.title('Frequent 25 keywords from the tweets')
plt.show()

The most common keywords are collision, whirlwind and fatalities.

def length(text):
    'a function which returns the character length of a text'
    return len(text)
#    
train['length'] = train['text'].apply(length)
# #collapse-hide
plt.rcParams['figure.figsize'] = (18.0, 6.0)
bins = 150
plt.hist(train[train['target'] == 0]['length'], alpha = 0.6, bins=bins, label='Not')
plt.hist(train[train['target'] == 1]['length'], alpha = 0.8, bins=bins, label='Real')
plt.xlabel('length')
plt.ylabel('numbers')
plt.legend(loc='upper right')
#plt.xlim(0,150)

plt.show()
  • The tweet-length distributions of both classes are skewed to the left.
# hide_output
train['target_mean'] = train.groupby('keyword')['target'].transform('mean')

fig = plt.figure(figsize=(8, 72), dpi=100)

sns.countplot(y=train.sort_values(by='target_mean', ascending=False)['keyword'],
              hue=train.sort_values(by='target_mean', ascending=False)['target'])

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=12)
plt.legend(loc=1)
plt.title('Target Distribution in Keywords')

plt.show()

train.drop(columns=['target_mean'], inplace=True)

Pandas Profiling

 # hide_output
 !pip install pandas_profiling
 !pip3 install pandas_profiling --upgrade
# Creating a pandas profiling report for the train dataset
from pandas_profiling import ProfileReport
profile = ProfileReport(train,html={'style':{'full_width':True}})
profile 



  • From our dataset, the most repeated keywords are fatalities, deluge, armageddon, damage and harm.
  • The highest recorded location is the USA, followed by the UK and Canada.
  • The profiling report shows no strong correlations between the variables.
# Getting the statistical distribution of our text data
lens = train.cleanText.str.split().apply(lambda x: len(x))
print(lens.describe())
lens.hist()
count    5061.000000
mean     8.543173   
std      3.287412   
min      0.000000   
25%      6.000000   
50%      9.000000   
75%      11.000000  
max      20.000000  
Name: cleanText, dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x7f4c2ea23f28>

Our cleaned tweets have an average of about 9 words each.

# The most used words in our text are
# In this step, I find the most frequent words in the data, extracting information about its content and topics.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
stops =  set(stopwords.words('english')+['com'])
co = CountVectorizer(stop_words=stops)
counts = co.fit_transform(train.cleanText)
w=pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(20)
w.head()
            0
police    105
video     102
disaster   99
man        90
still      90
# Library
from collections import defaultdict
from wordcloud import STOPWORDS
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
#
DISASTER_TWEETS = train['target'] == 1
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]

N = 5

# Unigrams
disaster_unigrams = defaultdict(int)
nondisaster_unigrams = defaultdict(int)

for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet):
        disaster_unigrams[word] += 1
        
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet):
        nondisaster_unigrams[word] += 1
        
df_disaster_unigrams = pd.DataFrame(sorted(disaster_unigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_unigrams = pd.DataFrame(sorted(nondisaster_unigrams.items(), key=lambda x: x[1])[::-1])

# Bigrams
disaster_bigrams = defaultdict(int)
nondisaster_bigrams = defaultdict(int)

for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=2):
        disaster_bigrams[word] += 1
        
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=2):
        nondisaster_bigrams[word] += 1
        
df_disaster_bigrams = pd.DataFrame(sorted(disaster_bigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_bigrams = pd.DataFrame(sorted(nondisaster_bigrams.items(), key=lambda x: x[1])[::-1])

# Trigrams
disaster_trigrams = defaultdict(int)
nondisaster_trigrams = defaultdict(int)

for tweet in train[DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=3):
        disaster_trigrams[word] += 1
        
for tweet in train[~DISASTER_TWEETS]['cleanText']:
    for word in generate_ngrams(tweet, n_gram=3):
        nondisaster_trigrams[word] += 1
        
df_disaster_trigrams = pd.DataFrame(sorted(disaster_trigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_trigrams = pd.DataFrame(sorted(nondisaster_trigrams.items(), key=lambda x: x[1])[::-1])
# #collapse-hide
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)
plt.tight_layout()

sns.barplot(y=df_disaster_unigrams[0].values[:N], x=df_disaster_unigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_unigrams[0].values[:N], x=df_nondisaster_unigrams[1].values[:N], ax=axes[1], color='cyan')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common unigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common unigrams in Non-disaster Tweets', fontsize=15)

plt.show()

Before the frequent-word removal, the most used words overall were like, amp, new and fire.

# check for frequent bi-gram words
co = CountVectorizer(ngram_range=(2,2),stop_words=stops)
counts = co.fit_transform(train.cleanText)
pd.DataFrame(counts.sum(axis=0),columns=co.get_feature_names()).T.sort_values(0,ascending=False).head(10)
                    0
burning buildings  38
suicide bomber     29
youtube video      27
oil spill          26
full read          26
liked youtube      26
prebreak best      25
mass murder        24
cross body         24
loud bang          23

 #collapse-hide
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)
plt.tight_layout()

sns.barplot(y=df_disaster_bigrams[0].values[:N], x=df_disaster_bigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_bigrams[0].values[:N], x=df_nondisaster_bigrams[1].values[:N], ax=axes[1], color='cyan')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common bigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common bigrams in Non-disaster Tweets', fontsize=15)

plt.show()

Some of the most used bigrams are burning buildings (38), suicide bomber (29), youtube video (27) and oil spill (26).

# Showing trigrams in each target
fig, axes = plt.subplots(ncols=2, figsize=(15, 5), dpi=100)

sns.barplot(y=df_disaster_trigrams[0].values[:N], x=df_disaster_trigrams[1].values[:N], ax=axes[0], color='orange')
sns.barplot(y=df_nondisaster_trigrams[0].values[:N], x=df_nondisaster_trigrams[1].values[:N], ax=axes[1], color='cyan')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=11)

axes[0].set_title(f'Top {N} most common trigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common trigrams in Non-disaster Tweets', fontsize=15)

plt.show()
# To show distribution of stop words in our original data
import nltk
nltk.download('stopwords')
stop=set(stopwords.words('english'))
#
corpus=[]
new= train['text'].str.split()
new=new.values.tolist()
corpus=[word for i in new for word in i]

from collections import defaultdict
dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word]+=1
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

#collapse-hide
top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10] 
x,y=zip(*top)
plt.bar(x,y)
<BarContainer object of 10 artists>

The most frequent stop word is 'the', followed by 'a'.

Wordclouds


A word cloud is a visualization that shows which words appear most frequently in a given text.

# Creating a word cloud
?WordCloud
#Joining all the tweets into one giant sentence
text = " ".join(review for review in train.cleanText)
print("There are {} characters in the combined cleaned tweets.".format(len(text)))
There are 316328 characters in the combined cleaned tweets.
#All words in the dataset wordcloud
# Create stopword list:
stopwords = set(STOPWORDS)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="grey").generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
#creating a disaster tweet dataframe
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
disaster=train[train['target']== 1]
#Joining all the disaster_tweets into one giant sentence
disaster_text = " ".join(review for review in disaster.cleanText)
#creating a non-disaster tweet dataframe
non_disaster=train[train['target']== 0]
#Joining all the non-disaster tweets into one giant sentence
non_disaster_text = " ".join(review for review in non_disaster.cleanText)


# Generate a word cloud image
wordcloud_disaster = WordCloud(stopwords=stopwords, background_color="black").generate(disaster_text)
# Generate a word cloud image
wordcloud_non_disaster = WordCloud(stopwords=stopwords, background_color="white").generate(non_disaster_text)
# Loading the thunderstorm image to use as a mask for the disaster word cloud
thunder_mask =np.array(Image.open("black_thunderstorm.jpg"))
thunder_mask
array([[255, 255, 255, ..., 221, 221, 221],
       [255, 255, 255, ..., 221, 221, 221],
       [255, 255, 255, ..., 221, 221, 221],
       ...,
       [221, 221, 221, ..., 255, 255, 255],
       [221, 221, 221, ..., 255, 255, 255],
       [221, 221, 221, ..., 255, 255, 255]], dtype=uint8)
# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=thunder_mask,
               stopwords=stopwords, contour_width=3, contour_color='grey')

# Generate a wordcloud
wc.generate(disaster_text)

# store to file
wc.to_file("disaster.png")

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
# hide_output
earth_mask =np.array(Image.open("earth.jpg"))
earth_mask

#collapse-hide
wc = WordCloud(background_color="white", max_words=1000, mask=earth_mask,
               stopwords=stopwords, contour_width=3, contour_color='green')

# Generate a wordcloud
wc.generate(non_disaster_text)

# store to file
wc.to_file("non_disaster.png")

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')

plt.axis("off")
plt.show()

4.) Modelling

Naive Bayes

Assumptions

  • Features in the dataset are mutually independent. Occurrence of one feature does not affect the probability of occurrence of the other feature. This is relevant as each token will be independent of each other.
  • The Bag of Words assumption which assumes that the position of the words in the document doesn’t matter.
  • Naive Bayes is relatively robust, easy to implement, fast, and accurate, it is used in many different fields like Spam filtering in emails and text classification.

Multinomial Naïve Bayes uses term frequency i.e. the number of times a given term appears in a document. After normalization, term frequency can be used to compute maximum likelihood estimates based on the training data to estimate the conditional probability.
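As a toy illustration of that estimate (made-up counts, not part of the pipeline below), the smoothed conditional probability of a word given the disaster class can be computed as follows:

# Toy sketch: estimating P(word | disaster) from term frequencies with add-one (Laplace) smoothing.
# The counts and vocabulary below are made up purely for illustration.
word_counts_disaster = {"fire": 30, "evacuate": 12, "happy": 1}
vocabulary_size = len(word_counts_disaster)
total_disaster_terms = sum(word_counts_disaster.values())

def conditional_probability(word):
    # Maximum-likelihood estimate with add-one smoothing so unseen words never get zero probability
    return (word_counts_disaster.get(word, 0) + 1) / (total_disaster_terms + vocabulary_size)

print(conditional_probability("fire"))     # frequent term -> high probability
print(conditional_probability("tsunami"))  # unseen term -> small but non-zero probability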

# Making predictions on our test data using Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer # Transformer which we will use in our pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report # For evaluating the predictions
from sklearn.pipeline import Pipeline # It encapsulates transformers and predictors inside
# 
unigrams_pipeline = Pipeline([('cleanText', CountVectorizer()),('target', MultinomialNB())])
unigrams_pipeline.fit(train.cleanText, train.target)

predictions = pd.DataFrame(
    unigrams_pipeline.predict(test.cleanText)
)
predictions['id'] = test['id']
# Classification report on the train prediction
pred = unigrams_pipeline.predict(train.cleanText)
# Printing the classification report
print(classification_report(train.target,pred))
              precision    recall  f1-score   support

           0       0.92      0.99      0.95      4333
           1       0.98      0.89      0.93      3260

    accuracy                           0.95      7593
   macro avg       0.95      0.94      0.94      7593
weighted avg       0.95      0.95      0.95      7593

Our Naive Bayes model got a 95% accuracy score on the training data it was fit on.
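Note that accuracy measured on the same data the model was fit on is optimistic. A quick sanity check, assuming scikit-learn's cross_val_score (a sketch only, not run in this notebook), would look roughly like this:

# Sketch: 5-fold cross-validated accuracy for the Naive Bayes pipeline (assumed, not run above)
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(unigrams_pipeline, train.cleanText, train.target, cv=5, scoring='accuracy')
print("Cross-validated accuracy: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))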

# Printing out predictions on the test dataset
predictions.columns = ['target','id']
predictions.head()
   target  id
0       1   0
1       1   2
2       1   3
3       1   9
4       1  11
# Re-arranging the columns to match the submission file
cols = predictions.columns.tolist()
cols = cols[-1:] + cols[:-1]
predictions = predictions[cols] 
# Converting our submission to CSV
predictions.to_csv("submissionsnd.csv", index=False)
predictions.head()
   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1
  • Our model got a score of 79.252 on Kaggle

SVM (Support Vector Machines)

SVM is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not belong to it. In order to use SVM on our data, we first have to convert our text to vectors with CountVectorizer, as sketched after the assumptions below.

Assumptions

  • The margin is as large as possible.
  • The support vectors are the most useful data points because they are the ones most likely to be incorrectly classified.
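To make the text-to-vector step concrete before fitting the pipeline, here is a minimal sketch (with two made-up tweets, not part of the dataset) of what CountVectorizer produces:

# Minimal sketch: how CountVectorizer turns raw text into count vectors (toy tweets, illustrative only)
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ["forest fire near the town", "i love this song"]
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(toy_tweets)

print(vectorizer.get_feature_names())  # the vocabulary learned from the toy tweets
print(vectors.toarray())               # one row of word counts per tweet
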
# Making predictions using linear SVM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import model_selection, svm
from sklearn.pipeline import Pipeline # which encapsulates transformers and predictors inside
#
unigrams_pipelines = Pipeline([('cleanText', CountVectorizer()),('target', svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto'))])
unigrams_pipelines.fit(train.cleanText, train.target)

predictsvm = pd.DataFrame(
    unigrams_pipelines.predict(test.cleanText)
)
#
predictsvm['id'] = test['id']
# Classification report on the train prediction
predi = unigrams_pipelines.predict(train.cleanText)
# Printing the classification report
print(classification_report(train.target,predi))
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      4342
           1       0.99      0.96      0.97      3271

    accuracy                           0.98      7613
   macro avg       0.98      0.98      0.98      7613
weighted avg       0.98      0.98      0.98      7613

Our SVM model got a 98% accuracy score on the training data it was fit on.

predictsvm.columns = ['target','id']
predictsvm.head()
   target  id
0       1   0
1       1   2
2       1   3
3       0   9
4       1  11
cols = predictsvm.columns.tolist()
cols = cols[-1:] + cols[:-1]

predictsvm = predictsvm[cols] 
predictsvm.tail()
         id  target
3258  10861       1
3259  10865       1
3260  10868       1
3261  10874       1
3262  10875       0
# Converting our predictions to CSV
predictsvm.to_csv("submissionsvm2.csv", index=False)

Our SVM scored 79.773 on Kaggle.

BERT (Bidirectional Encoder Representations from Transformers)

  • A transformer model that is pre-trained on a large corpus of unlabelled text, including the entire Wikipedia (that’s 2,500 million words!) and BookCorpus (800 million words), giving it a deep understanding of how language works.
  • BERT is also a “deep bidirectional” model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase, which helps it capture the meaning of language.
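As a conceptual sketch of the three inputs the bert_encode helper below builds for every tweet (token ids, attention mask and segment ids), here is roughly what they look like for a short example, with a tiny max_len and illustrative token ids:

# Conceptual sketch only: the shape of the inputs bert_encode() produces for one tweet.
# The token ids are illustrative; the real ids come from the BERT tokenizer loaded further down.
max_len = 8
tokens      = [101, 3224, 2543, 102, 0, 0, 0, 0]  # [CLS] forest fire [SEP] + zero padding
input_mask  = [1,   1,    1,    1,   0, 0, 0, 0]  # 1 for real tokens, 0 for padding
segment_ids = [0] * max_len                       # single-sentence input, so all zeros

print(len(tokens), len(input_mask), len(segment_ids))  # all three are padded to max_len
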
# Importing libraries
import tensorflow as tf
import pandas as pd
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
train=pd.read_csv('/content/train (7).csv')
test=pd.read_csv('/content/test (5).csv')
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
pip install sentencepiece
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (0.1.91)
import tokenization
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
%%time
# To know how long loading the BERT layer takes
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)
CPU times: user 12.7 s, sys: 2.35 s, total: 15 s
Wall time: 14.7 s
## Loading a tokenizer from the bert layer
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
##Encoding the text into tokens, masks, and segment flags:
import numpy as np
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values
train_labels
array([1, 1, 1, ..., 1, 1, 1])
test_input[:1]
(array([[  101,  2074,  3047, ...,     0,     0,     0],
        [  101,  2657,  2055, ...,     0,     0,     0],
        [  101,  2045,  2003, ...,     0,     0,     0],
        ...,
        [  101,  2665,  2240, ...,     0,     0,     0],
        [  101, 12669,  3314, ...,     0,     0,     0],
        [  101,  1001,  2103, ...,     0,     0,     0]]),)
model = build_model(bert_layer, max_len=160)
model.summary()
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]                 
                                                                 segment_ids[0][0]                
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 1024)]       0           keras_layer[0][1]                
__________________________________________________________________________________________________
dense (Dense)                   (None, 1)            1025        tf_op_layer_strided_slice[0][0]  
==================================================================================================
Total params: 335,142,914
Trainable params: 335,142,913
Non-trainable params: 1
__________________________________________________________________________________________________
train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=5,
    batch_size=16
)

model.save('model.h5')
Epoch 1/5
381/381 [==============================] - 647s 2s/step - loss: 0.4596 - accuracy: 0.7946 - val_loss: 0.3964 - val_accuracy: 0.8286
Epoch 2/5
381/381 [==============================] - 645s 2s/step - loss: 0.3337 - accuracy: 0.8624 - val_loss: 0.3932 - val_accuracy: 0.8365
Epoch 3/5
381/381 [==============================] - 645s 2s/step - loss: 0.2527 - accuracy: 0.8992 - val_loss: 0.4139 - val_accuracy: 0.8345
Epoch 4/5
381/381 [==============================] - 645s 2s/step - loss: 0.1717 - accuracy: 0.9378 - val_loss: 0.4511 - val_accuracy: 0.8293
Epoch 5/5
381/381 [==============================] - 645s 2s/step - loss: 0.1028 - accuracy: 0.9649 - val_loss: 0.5067 - val_accuracy: 0.8260
test_pred = model.predict(test_input)
len(test_pred)
3263
submit['target'] = test_pred.round().astype(int)
submit.to_csv('submission.csv', index=False)
submit.head()
   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1

This model ranked number 334 out of 1356 with an 83% accuracy. It's a good baseline model, and we can brainstorm on how to improve its accuracy.
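One simple direction for improvement, hinted at by the unused ModelCheckpoint import above and by the validation loss rising after epoch 2, would be to keep only the best weights and stop training early. A hedged sketch (assuming the same model and inputs defined above, not run here):

# Sketch: save the best weights and stop once validation loss stops improving (not run in this notebook)
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint('best_bert.h5', monitor='val_loss', save_best_only=True)
early_stop = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=5,
    batch_size=16,
    callbacks=[checkpoint, early_stop]
)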

Flair

Flair is an NLP library which can be used for text classification.

Why use flair?

  • It comprises popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, character embeddings, etc.
  • Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn leads to a significant uptick in results.
  • Flair Embedding uses contextual string embeddings.
  • Flair supports a number of languages other than English.
# hide_output
pip install flair
## How It Works
# loading the data and visualizing flair steps
df=pd.read_csv('/content/train (7).csv')
from IPython.display import Image
Image('/content/flair.png')
import pandas as pd
import numpy as np
df=pd.read_csv('/content/train (7).csv')
df.head()
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
#Some preprocessing
import re

allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

def preprocess(text):
    return ''.join([' ' + char + ' ' if char in punct else char for char in [char for char in re.sub(r'http\S+', 'http', text, flags=re.MULTILINE) if char in allowed_chars]])[:maxlen]
df['text']=df['text'].apply(preprocess)
df.head()
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this # earthquake... 1
1 4 NaN NaN Forest fire near La Ronge Sask . Canada 1
2 5 NaN NaN All residents asked to shelter in place are be... 1
3 6 NaN NaN 13 , 000 people receive # wildfires evacuatio... 1
4 7 NaN NaN Just got sent this photo from Ruby # Alaska a... 1
#Data Preparation
df['target'].value_counts()
condition=[
           (df['target']==1),
           (df['target']==0)
]

values=['Disaster','No_Disaster']

df['labels']=np.select(condition,values)
df.head()
id keyword location text target labels
0 1 NaN NaN Our Deeds are the Reason of this # earthquake... 1 Disaster
1 4 NaN NaN Forest fire near La Ronge Sask . Canada 1 Disaster
2 5 NaN NaN All residents asked to shelter in place are be... 1 Disaster
3 6 NaN NaN 13 , 000 people receive # wildfires evacuatio... 1 Disaster
4 7 NaN NaN Just got sent this photo from Ruby # Alaska a... 1 Disaster
df1=df[['text','labels']]
df1.head()
text labels
0 Our Deeds are the Reason of this # earthquake... Disaster
1 Forest fire near La Ronge Sask . Canada Disaster
2 All residents asked to shelter in place are be... Disaster
3 13 , 000 people receive # wildfires evacuatio... Disaster
4 Just got sent this photo from Ruby # Alaska a... Disaster
#Converting to FastText
df_fast_text=df1.copy()
df_fast_text['labels'] = '__label__'+ df_fast_text['labels'].astype(str)
## Rearranging the dataframe to the required format
df_fast_text=df_fast_text[['labels','text']]
df_fast_text.head()
labels text
0 __label__Disaster Our Deeds are the Reason of this # earthquake...
1 __label__Disaster Forest fire near La Ronge Sask . Canada
2 __label__Disaster All residents asked to shelter in place are be...
3 __label__Disaster 13 , 000 people receive # wildfires evacuatio...
4 __label__Disaster Just got sent this photo from Ruby # Alaska a...
## Splitting the FastText dataframe into the three splits required by the model

train_fst,test_fst,dev_fst=np.split(df_fast_text,[int(.6*len(df_fast_text)),int(.8*len(df_fast_text))])
# Viewing the shape of the three splits
print(train_fst.shape)
print(test_fst.shape)
print(dev_fst.shape)
(4567, 2)
(1523, 2)
(1523, 2)
### Storing all three splits in one folder
!mkdir -p data_fst
## Saving them as tab-separated files in that folder
train_fst.to_csv('/content/data_fst/train.csv',index=False,header=False,sep='\t')
test_fst.to_csv('/content/data_fst/test.csv',index=False,header=False,sep='\t')
dev_fst.to_csv('/content/data_fst/dev.csv',index=False,header=False,sep='\t')
## Checking to see if the files exist
!ls data_fst
dev.csv  test.csv  train.csv

The files have been stored.

# Creating the corpus
from flair.datasets import ClassificationCorpus
from flair.data import Corpus
# Loading the folder we just created
data_folder='/content/data_fst/'
# Applying the corpus on our folder
corpus_fst:Corpus=ClassificationCorpus(data_folder)
2020-10-07 07:13:49,184 Reading data from /content/data_fst
2020-10-07 07:13:49,185 Train: /content/data_fst/train.csv
2020-10-07 07:13:49,187 Dev: /content/data_fst/dev.csv
2020-10-07 07:13:49,188 Test: /content/data_fst/test.csv
# Creating the label dictionary
label_dict=corpus_fst.make_label_dictionary()
2020-10-07 07:13:49,259 Computing label dictionary. Progress:
100%|██████████| 6090/6090 [00:03<00:00, 1810.26it/s]
2020-10-07 07:13:52,735 [b'Disaster', b'No_Disaster']

# hide_output
from flair.embeddings import FlairEmbeddings, BertEmbeddings
# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')
from flair.embeddings import DocumentLSTMEmbeddings,DocumentRNNEmbeddings
# Combining the embeddings

from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])
from flair.data import Sentence
sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print()
    print(token.embedding)
    print('-'*100)
Token: 1 The

tensor([0.6800, 0.2429, 0.0012,  ..., 0.3829, 0.4721, 0.2985], device='cuda:0')
----------------------------------------------------------------------------------------------------
Token: 2 grass

tensor([ 2.9200e-01,  2.2066e-02,  4.5290e-05,  ...,  8.5283e-01,
        -5.0724e-02,  3.4476e-01], device='cuda:0')
----------------------------------------------------------------------------------------------------
Token: 3 is

tensor([-0.5447,  0.0229,  0.0078,  ..., -0.1828,  0.7153,  0.0051],
       device='cuda:0')
----------------------------------------------------------------------------------------------------
Token: 4 green

tensor([1.4772e-01, 1.0973e-01, 8.5618e-04,  ..., 1.0157e+00, 7.5358e-01,
        1.1230e-01], device='cuda:0')
----------------------------------------------------------------------------------------------------
Token: 5 .

tensor([-1.5555e-01,  6.7598e-03,  5.3829e-06,  ..., -6.0930e-01,
         9.0591e-01,  1.7857e-01], device='cuda:0')
----------------------------------------------------------------------------------------------------
word_embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)
## Building and Training the Model
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
clf = TextClassifier(document_embeddings, label_dictionary=label_dict)
### initialise
trainer=ModelTrainer(clf,corpus_fst)
# hide_output
trainer.train('data',max_epochs=10)

The Flair model has given us an accuracy of 79%.

## Making Predictions on Raw Text
new_clf=TextClassifier.load('/content/data/best-model.pt')
# Importing sentence
from flair.data import Sentence
2020-10-07 07:25:16,281 loading file /content/data/best-model.pt
# Creating the first sentence
frst_sentence=Sentence('I was just sitted, when i suddenly heard a loud bang')
# Making our model predict the first sentence/ tweet
new_clf.predict(frst_sentence)
frst_sentence.labels
[No_Disaster (0.7598)]

According to our model, there is a 76% chance that our first sentence is not a disaster tweet, which is correct.

# Creating the second sentence/ tweet
second_sentence=Sentence('Many have beeninjured , Afew Died on the spot')
# Making our model predict the second sentence / tweet.
new_clf.predict(second_sentence)
second_sentence.labels
[Disaster (0.6697)]

According to our model, there is a 67% chance that the second tweet is a disaster tweet, which is quite accurate.

# Creating the third sentence/ tweet
third_sentence=Sentence('Tsunamis swept through rice fields and flooded the towns.')
# Making our model predict the third sentence / tweet.
new_clf.predict(third_sentence)
third_sentence.labels
[Disaster (0.9355)]

According to our model, there is a 93% chance that the third tweet is a disaster tweet.

5.) Conclusions and Recommendations


  • For our project, the BERT transformer and Flair performed better than the ordinary classification models.

6.) Follow up Questions


Did we have the correct data?

Yes, the labelled tweets were sufficient to answer the question of whether a tweet describes a real disaster.

7.) Model Deployment

The prototyping of the above model was done using the Streamlit library.

The code and the output are shown below.

NB: The code cannot run on Google Colab because Colab does not support some Streamlit dependencies.

The code should run well when saved to a Python file and run in an environment with the Streamlit and Flair libraries installed.
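Assuming the script below is saved locally as app.py (a hypothetical file name), it can be launched from a terminal roughly as follows:

# Assumed local setup (hypothetical file name app.py):
#   pip install streamlit flair twitterscraper pandas
#   streamlit run app.py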

import datetime as dt
import re

import pandas as pd
import streamlit as st
from flair.data import Sentence
from flair.models import TextClassifier
from twitterscraper import query_tweets


# Preprocess function
allowed_chars = ' AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789~`!@#$%^&*()-=_+[]{}|;:",./<>?'
punct = '!?,.@#'
maxlen = 280

@st.cache
def preprocess(text):
    # Delete URLs, cut to maximum length, space out punctuation with spaces, and remove disallowed chars
    return ''.join(
        [' ' + char + ' ' if char in punct else char for char in [char for char in re.sub(r'http\S+', 'http', text,
                                                                                          flags=re.MULTILINE) if
                                                                  char in allowed_chars]])[:maxlen]



def main():
    st.title('Fabulous Tokenizers Disaster Tweet Monitor')

    # Load bert model once and cache it
    with st.spinner("Loading deep learning model..."):
        @st.cache(allow_output_mutation=True, show_spinner=False)
        def load_model(_):
            model = TextClassifier.load(r"C:\Users\ADMIN\Desktop\Fabulous Tokenizers  Project\best-model.pt")
            return model

        model = load_model(10)
    st.write("Are you wondering if A tweet is Real Disaster Tweet Or Not ??? ")
    app_mode = st.selectbox("CLASSIFY SINGLE TWEET  or #HASH TAGS",
                            ["", "Single Tweet ", "# Tag",])
    if app_mode == "Single Tweet ":
        st.subheader('Single tweet classification')
        tweet_input = st.text_input('Tweet:')
        if tweet_input != '':
            # Pre-process tweet
            sentence = Sentence(preprocess(tweet_input))

            # Make predictions
            with st.spinner('Predicting...'):
                model.predict(sentence)

                st.write('Prediction:')
                st.write(sentence.labels[0].value + ' with ',
                         sentence.labels[0].score * 100, '% confidence')

    elif app_mode =="# Tag":

        # TWEET SEARCH AND CLASSIFY
        st.subheader('Search Twitter for Query')
        # Get user input
        query = st.text_input('Query:', '#')

        # As long as the query is valid (not empty or equal to '#')...
        if query != '' and query != '#':
            with st.spinner(f'Searching for and analyzing {query}...'):
                # Get English tweets from the past 4 weeks
                tweets = query_tweets(query, begindate=dt.date.today() - dt.timedelta(weeks=16), lang='en')

                # Initialize empty dataframe
                tweet_data = pd.DataFrame({
                    'tweet': [],
                    'predicted-label': []
                })

                # Keep track of disaster vs. non-disaster tweets
                label_counts = {'Disaster': 0, 'No_Disaster': 0}

                # Add data for each tweet
                for tweet in tweets:
                    # Skip iteration if tweet is empty
                    if tweet.text in ('', ' '):
                        continue
                    # Make predictions
                    sentence = Sentence(preprocess(tweet.text))
                    model.predict(sentence)
                    sentiment = sentence.labels[0]
                    # Keep track of disaster vs. non-disaster tweets
                    label_counts[sentiment.value] += 1
                    # Append new data
                    tweet_data = tweet_data.append({'tweet': tweet.text, 'predicted-label': sentiment},
                                                   ignore_index=True)
                    st.dataframe(tweet_data)

        # Show query data and predicted labels if available
        try:
            st.write(tweet_data)
            try:
                st.write('Non-disaster to disaster tweet ratio:', label_counts['No_Disaster'] / label_counts['Disaster'])
            except ZeroDivisionError:  # if no disaster tweets
                st.write('All Non Disaster Tweets')
        except NameError:  # if no queries have been made yet
            pass












if __name__ == "__main__":
    main()

Requirements files

from IPython.display import Image
Image('/content/1.PNG')
Image('/content/2.PNG')