Reputation: 37
I know that my explanation is rather long, but I found it necessary. Hopefully someone is patient and a helpful soul :) I'm doing a sentiment analysis project at the moment and I'm stuck in the pre-processing part. I imported the csv file, made it into a dataframe and transformed the variables/columns into the right data types. Then I did the tokenization like this, where I choose the variable I want to tokenize (tweet content) in the dataframe (df_tweet1):
# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)
The output is a list of lists with words (tokens).
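To illustrate (with a made-up tweet, not one from my data), TweetTokenizer keeps hashtags, mentions and emoticons together as single tokens:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
print(tknzr.tokenize("@user I love this movie! #awesome :)"))
# ['@user', 'I', 'love', 'this', 'movie', '!', '#awesome', ':)']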
Then I perform stop word removal:
# Stop word removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)
clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)
The output is the same but without the stop words.
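For example (made-up tokens again), a tokenized tweet like ['I', 'love', 'this', 'movie', '!'] should come out with the stop words and punctuation removed:
stop_m = [i for i in ['I', 'love', 'this', 'movie', '!'] if str(i).lower() not in new_stopwords_list]
print(stop_m)  # ['love', 'movie']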
The next two steps are confusing to me (part-of-speech tagging and lemmatization). I tried two things:
1) Convert the previous output into a list of strings
new_test = [' '.join(x) for x in clean_sents]
since I thought that would enable me to use this code to do both steps in one:
from pywsd.utils import lemmatize_sentence
text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)
I got this error: TypeError: expected string or bytes-like object
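I think this minimal example reproduces it, since (as far as I understand) lemmatize_sentence expects a single string, not a list of strings:
from pywsd.utils import lemmatize_sentence
lemmatize_sentence(['first sentence', 'second sentence'])
# TypeError: expected string or bytes-like object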
2) Perform POS tagging and lemmatization separately. First POS tagging, using clean_sents as input:
# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list
    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)
The output is a list of lists with words that have a tag attached. Then I want to lemmatize this output, but how? I tried two modules, but both gave me errors:
from pywsd.utils import lemmatize_sentence
lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
for s in output_POS_clean_sents]
# AND
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
for s in output_POS_clean_sents]
print(lemmatized)
The errors were respectively:
TypeError: expected string or bytes-like object
AttributeError: 'tuple' object has no attribute 'endswith'
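If it helps, I think the second error comes from the fact that output_POS_clean_sents contains (word, tag) tuples while the lemmatizer expects plain strings, so a minimal reproduction would be:
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lmtzr.lemmatize(('initial', 'JJ'))
# AttributeError: 'tuple' object has no attribute 'endswith'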
Upvotes: 1
Views: 806
Reputation: 1250
If you're using a dataframe, I suggest you store the results of the pre-processing steps in a new column. That way you can always check the output, and you can always create a list of lists to use as input for a model with a single line of code afterwards. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add other steps wherever you need without getting confused.
Regarding your code, it can be optimised (for example you could perform stop word removal and tokenisation at the same time), and I see a bit of confusion about the steps you performed. For example you perform lemmatisation multiple times, also using different libraries, and there is no point in doing that. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to deal with emojis, urls and hashtags, all stuff specifically related to tweets.
# I won't write all the imports, you get them from your code
# define new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)
tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)
# iterate through each tweet
for ind, row in df_tweet1.iterrows():
    # get initial tweet: ['This is the initial tweet']
    tweet = row['Tweet Content']
    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]
    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)
    # save processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet
So overall all you need are 4 lines: one to get the tweet string, two to preprocess the text, and another one to store the tweet. You can add extra processing steps, paying attention to the output of each step (for example tokenisation returns a list of strings, pos tagging returns a list of tuples, which is why you are getting into trouble).
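As a quick sanity check of those intermediate types, you could run something like this on a made-up sample:
sample = "I love this movie!"
tokens = tknzr.tokenize(sample)                # list of str: ['I', 'love', 'this', 'movie', '!']
lemmas = [lmtzr.lemmatize(t) for t in tokens]  # still a list of str
tagged = nltk.pos_tag(lemmas)                  # list of (str, str) tuples
print(type(tokens[0]), type(tagged[0]))        # <class 'str'> <class 'tuple'>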
If you want, you can then create a list of lists containing all the tweets in the dataframe:
# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
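If I'm not mistaken, the pandas built-in gives the same result:
all_tweets = df_tweet1['Tweet Content Clean'].tolist()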
Upvotes: 1
Reputation: 758
In the first part new_test is a list of strings. lemmatize_sentence expects a string, so passing new_test will raise an error like the one you got. You would have to pass each string separately and then create a list from the lemmatized strings. So:
text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]
should create a list of lemmatized sentences.
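If you want to sanity-check it first, you could try it on a single made-up sentence before running it on the whole list:
from pywsd.utils import lemmatize_sentence
print(lemmatize_sentence('The cats were running in the gardens', keepWordPOS=True))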
I actually once did a project that seems similar to what you are doing. I made the following function to lemmatize strings:
import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        #read the stopwords file
        stopwords = sw.read().split('\n')
    return [word for word in lst if word not in stopwords]

def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):
    """Function to lemmatize a string or a list of strings, i.e. reduce words to their base forms. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """
    if isinstance(body_text, str):
        body_text = [body_text] #Convert whatever is passed to a list to support passing of a single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language) #load lemmatizing dictionary
    lemma_list = [] #list to store each lemmatized string
    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #All characters and digits i.e. all possible words

    for string in body_text:
        #remove punctuation and split words
        matches = word_regex.findall(string)

        #lowercase the words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        #remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        #lemmatize each word and choose the shortest word of the suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        #remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if a list was passed, else return string
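For instance, usage might look something like this (a made-up Danish sentence, and it assumes a stopwords.txt file is present in the working directory):
print(lemmatize_strings('Dette er bare en tilfældig sætning'))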
You could have a look at that if you want, but don't feel obligated. I would be more than glad if it helps you get some ideas; I spent a lot of time trying to figure it out myself!
Let me know :-)
Upvotes: 1