Reputation: 117
I am trying to lemmatize words in a particular column ('body') using pandas.
I have tried the following code, which I found here:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
df['body'].head()
When I attempt to run the code, I get an error message that simply says
File "<ipython-input-41-c002479904b0>", line 33
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
^
SyntaxError: invalid syntax
I have also tried the solution presented in this post but didn't have any luck.
UPDATE: this is the full code so far
import pandas as pd
import re
import string
df1 = pd.read_csv('RP_text_posts.csv')
df2 = pd.read_csv('RP_text_comments.csv')
# Renaming columns so the post text column (currently 'selftext') matches the comment text column ('body')
df1.columns = ['author','subreddit','score','num_comments','retrieved_on','id','created_utc','body']
# Dropping columns that aren't subreddit or the post content
df1 = df1.drop(columns=['author','score','num_comments','retrieved_on','id','created_utc'])
df2 = df2.drop(columns=['author', 'score', 'created_utc'])
# Combining data
df = pd.concat([df1, df2])
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')
# Lemmatizing
df['body'] = df['body'].apply(lambda x: "".join([Word(word).lemmatize() for word in x)
df['body'].head()
Upvotes: 0
Views: 869
Reputation: 2819
The end of the lambda expression is missing its closing brackets:
df['words'] = df['words'].apply(lambda x: "".join([Word(word).lemmatize() for word in x]))
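Note that Word is never defined in your snippet: it comes from TextBlob, not NLTK, so the line above would still fail with a NameError once the syntax is fixed. A minimal sketch of the TextBlob variant, assuming the textblob package is installed:

from textblob import Word

# TextBlob's Word wraps a string and exposes .lemmatize(),
# which defaults to noun, like NLTK's WordNetLemmatizer
print(Word("cats").lemmatize())       # cat
print(Word("loving").lemmatize("v"))  # love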
Update: the line should look more like the one below, but note that this way you can only lemmatize with a single POS at a time (noun by default, or adjective, verb, ...):
df['words'] = df['body'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
print(df.head())
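By default WordNetLemmatizer.lemmatize treats every word as a noun; passing the pos argument lemmatizes it as another part of speech:

wordnet_lemmatizer.lemmatize("loving")           # 'loving' (treated as a noun)
wordnet_lemmatizer.lemmatize("loving", pos="v")  # 'love'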
If you want POS-aware lemmatization, you can try the following code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')

def nltk_tag_to_wordnet_tag(nltk_tag):
    # Map Penn Treebank tags (from nltk.pos_tag) to WordNet POS constants
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    # Tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # Tuples of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # If there is no matching WordNet tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            # Otherwise use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
# Lemmatizing
df['words'] = df['body'].apply(lemmatize_sentence)
print(df.head())
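You can sanity-check the helper on a single sentence before applying it to the whole column; row 4 of the result below comes from exactly this kind of call:

print(lemmatize_sentence("I am loving it"))  # I be love it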
df result:
body | words
0 Best scores, good cats, it rocks | Best score , good cat , it rock
1 You received best scores | You receive best score
2 Good news | Good news
3 Bad news | Bad news
4 I am loving it | I be love it
5 it rocks a lot | it rock a lot
6 it is still good to do better | it be still good to do good
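If NLTK raises a LookupError about missing resources when you run this, the code relies on a few corpora and models that need a one-time download (omw-1.4 is only needed on newer NLTK versions):

import nltk
nltk.download('punkt')                       # tokenizer behind word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger behind nltk.pos_tag
nltk.download('wordnet')                     # lemmatizer dictionary
nltk.download('omw-1.4')                     # extra wordnet data on newer NLTK
nltk.download('stopwords')                   # for stopwords.words('english')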
Upvotes: 1