Reputation: 147
I am practicing using NLTK to remove certain features from raw tweets, and subsequently hoping to remove tweets that are (to me) irrelevant (e.g. empty tweets or single-word tweets). However, it seems that some of the single-word tweets are not removed. I am also unable to remove any stopword that appears at the beginning or end of a sentence.
Any advice? At the moment, I hope to pass back a sentence as an output rather than a list of tokenized words.
Any other comments on improving the code (processing time, elegance) are welcome.
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
cache_english_stopwords=stopwords.words('english')
cache_en_tweet_stopwords=stopwords.words('english_tweet')
# For clarity, df is a pandas dataframe with a column['text'] together with other headers.
def tweet_clean(df):
    temp_df = df.copy()
    # Remove hyperlinks
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('https?:\/\/.*\/\w*', '', regex=True)
    # Remove hashtags
    # temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#\w*', '', regex=True)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#', ' ', regex=True)
    # Remove citations
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\@\w*', '', regex=True)
    # Remove tickers
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\$\w*', '', regex=True)
    # Remove punctuation
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[' + string.punctuation + ']+', '', regex=True)
    # Remove stopwords
    for tweet in temp_df.loc[:, "text"]:
        tweet_tokenized = nltk.word_tokenize(tweet)
        for w in tweet_tokenized:
            if (w.lower() in cache_english_stopwords) | (w.lower() in cache_en_tweet_stopwords):
                temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\W*\s?\n?]' + w + '[\W*\s?]', ' ', regex=True)
                # print("w in stopword")
    # Remove quotes
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\&*[amp]*\;|gt+', '', regex=True)
    # Remove RT
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+rt\s+', '', regex=True)
    # Remove linebreak, tab, return
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\n\t\r]+', ' ', regex=True)
    # Remove via with blank
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('via+\s', '', regex=True)
    # Remove multiple whitespace
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('\s+\s+', ' ', regex=True)
    # Remove single word sentence
    for tweet_sw in temp_df.loc[:, "text"]:
        tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
        if len(tweet_sw_tokenized) <= 1:
            temp_df.loc["text"] = np.nan
    # Remove empty rows
    temp_df.loc[(temp_df["text"] == '') | (temp_df['text'] == ' ')] = np.nan
    temp_df = temp_df.dropna()
    return temp_df
Upvotes: 2
Views: 10468
Reputation: 147
With advice from mquantin, I have modified my code to clean tweets individually, each as a sentence. Here is my amateur attempt with a sample tweet that I believe covers most scenarios (let me know if you encounter any other case that deserves a clean-up):
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
cache_english_stopwords=stopwords.words('english')
def tweet_clean(tweet):
    # Remove tickers
    sent_no_tickers = re.sub(r'\$\w*', '', tweet)
    print('No tickers:')
    print(sent_no_tickers)
    tw_tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
    temp_tw_list = tw_tknzr.tokenize(sent_no_tickers)
    print('Temp_list:')
    print(temp_tw_list)
    # Remove stopwords
    list_no_stopwords = [i for i in temp_tw_list if i.lower() not in cache_english_stopwords]
    print('No Stopwords:')
    print(list_no_stopwords)
    # Remove hyperlinks
    list_no_hyperlinks = [re.sub(r'https?:\/\/.*\/\w*', '', i) for i in list_no_stopwords]
    print('No hyperlinks:')
    print(list_no_hyperlinks)
    # Remove hashtags
    list_no_hashtags = [re.sub(r'#', '', i) for i in list_no_hyperlinks]
    print('No hashtags:')
    print(list_no_hashtags)
    # Remove punctuation and split 's, 't, 've with a space for filtering
    list_no_punctuation = [re.sub(r'[' + string.punctuation + ']+', ' ', i) for i in list_no_hashtags]
    print('No punctuation:')
    print(list_no_punctuation)
    # Remove multiple whitespace
    new_sent = ' '.join(list_no_punctuation)
    # Remove any words with 2 or fewer letters
    filtered_list = tw_tknzr.tokenize(new_sent)
    list_filtered = [re.sub(r'^\w\w?$', '', i) for i in filtered_list]
    print('Clean list of words:')
    print(list_filtered)
    filtered_sent = ' '.join(list_filtered)
    clean_sent = re.sub(r'\s\s+', ' ', filtered_sent)
    # Remove any whitespace at the front of the sentence
    clean_sent = clean_sent.lstrip(' ')
    print('Clean sentence:')
    print(clean_sent)
s0=' RT @Amila #Test\nTom\'s newly listed Co. & Mary\'s unlisted Group to supply tech for nlTK.\nh.. $TSLA $AAPL https:// t.co/x34afsfQsh'
tweet_clean(s0)
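Since the goal stated in the question is to pass back a sentence rather than only print it, one small follow-up (my assumption, not part of the code above): end tweet_clean with return clean_sent, after which the cleaner can be mapped over the original dataframe column.

# Sketch only: assumes tweet_clean ends with `return clean_sent`
# and df is the pandas dataframe from the question with a "text" column.
df.loc[:, "text"] = df.loc[:, "text"].apply(tweet_clean)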
Upvotes: 3
Reputation: 1158
What is df? A list of tweets?
You should maybe consider cleaning the tweets one after the other rather than as a list of tweets: it would be easier to have a function tweet_cleaner(single_tweet).
nltk provides a TweetTokenizer to clean the tweets.
The re package provides good solutions for using regex.
I advise you to create a variable for easier use of temp_df.loc[:, "text"].
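A minimal sketch of that advice (not from the original code; tweet_cleaner, stoplist and tknzr are illustrative names): tokenize one tweet at a time with TweetTokenizer and drop the stopwords before rebuilding the sentence.

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

stoplist = stopwords.words('english')
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

def tweet_cleaner(single_tweet):
    # Tokenize a single tweet, drop stopwords, and rebuild the sentence
    tokens = tknzr.tokenize(single_tweet)
    return ' '.join(w for w in tokens if w.lower() not in stoplist)

print(tweet_cleaner("RT @user this is a test tweet about NLTK"))
# -> 'RT test tweet NLTK'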
Deleting stopwords in a sentence is described in the question "Stopword removal with NLTK":
clean_wordlist = [i for i in sentence.lower().split() if i not in stopwords]
If you want to use regex (with the re package), you can create a regex pattern composed of all the stopwords (outside of the tweet_clean function):
stop_pattern = re.compile('(?siu)' + '|'.join(stoplist))
((?siu) turns on the dotall, ignorecase and unicode flags; it must sit inside the pattern string)
and use this pattern to clean any string:
clean_string = stop_pattern.sub('', input_string)
(you can concatenate the two stoplists if keeping them separate is not needed)
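Putting that together as a runnable sketch (the \b word boundaries and re.escape are my additions, so that e.g. the stopword "on" does not eat the "on" inside "python"; plain '|'.join(stoplist) would also match substrings):

import re
from nltk.corpus import stopwords

stoplist = stopwords.words('english')
stop_pattern = re.compile(r'(?siu)\b(?:' + '|'.join(map(re.escape, stoplist)) + r')\b')

def remove_stopwords(text):
    # Cut out the stopwords, then collapse the leftover whitespace
    return re.sub(r'\s+', ' ', stop_pattern.sub('', text)).strip()

print(remove_stopwords("This is a tweet about the weather"))
# -> 'tweet weather'

Because the pattern works on the whole string rather than on a word surrounded by other characters, it also removes stopwords at the very beginning or end of a sentence, which the [\W*\s?\n?]...[\W*\s?] pattern in the question misses.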
To remove one-word tweets, you could keep only the ones longer than one word (appending them to a list such as kept_ones):
if len(tweet_sw_tokenized) > 1:
    kept_ones.append(tweet_sw)
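And since the question keeps everything in a dataframe, the same filter can be expressed directly in pandas instead of writing np.nan into the frame (a sketch, reusing the temp_df and nltk names from the question):

# Keep only rows whose text tokenizes to more than one word
mask = temp_df["text"].apply(lambda t: len(nltk.word_tokenize(t)) > 1)
temp_df = temp_df[mask]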
Upvotes: 3