Reputation: 499
I want to check whether the data in the 'tokenizing' and 'lemmatization' columns is the same or not, as in the table below, but I am getting an error.
| tokenizing | lemmatization | check |
|---|---|---|
| [pergi, untuk, melakukan, penanganan, banjir] | [pergi, untuk, laku, tangan, banjir] | False |
| [baca, buku, itu, asik] | [baca, buku, itu, asik] | True |
from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]
df = pd.DataFrame({'text': data})

# Tokenization: split each text into spaCy tokens
def tokenizer(words):
    return [token for token in nlp(words)]

# Lemmatization: map each token to its lemma string
def lemmatizer(token):
    return [lem.lemma_ for lem in token]

df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

# Copy the intermediate result to the clipboard for inspection
df.to_clipboard(sep='\s\s+')

# Check similarity
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df
How can I compare them?
Result of df.to_clipboard() before the error:
text tokenizing lemmatization
0 pergi untuk melakukan penanganan banjir [pergi, untuk, melakukan, penanganan, banjir] [pergi, untuk, laku, tangan, banjir]
1 baca buku itu asik [baca, buku, itu, asik] [baca, buku, itu, asik]
The error is fixed; it was caused by a typo. After fixing the typo, the output looks like the above, but the check column is all False. What I want is the result shown in the table.
Upvotes: 1
Views: 59
Reputation: 1851
Based on your code, you forgot the 'i' in df['lemmatizaton']. So change
df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)
to
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)
Then it may work.
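If the check column still comes out all False after that fix, a likely cause is that the tokenizing column holds spaCy Token objects while the lemmatization column holds plain strings, so the element-wise comparison never matches. A minimal sketch that compares the token texts instead, assuming the columns were built as in your code:

# Compare the surface text of each token with the lemma list (both are lists of strings)
df['check'] = [
    [tok.text for tok in tokens] == lemmas
    for tokens, lemmas in zip(df['tokenizing'], df['lemmatization'])
]

This gives False for the first row (where melakukan and penanganan were lemmatized to different forms) and True for the second, matching the table in the question.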
Upvotes: 1