caeruleum

Reputation: 499

Check data column is same or not with Pandas

I want to check whether the 'tokenizing' and 'lemmatization' columns are the same or not, like the table below, but it gives me an error:

(screenshot of the error)

tokenizing                                     lemmatization                         check
[pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]  False
[baca, buku, itu, asik]                        [baca, buku, itu, asik]               True
from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

data = [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]

df = pd.DataFrame({'text': data})

#Tokenization
def tokenizer(words):
    return [token for token in nlp(words)]


#Lemmatization
def lemmatizer(token):
    return [lem.lemma_ for lem in token]


df['tokenizing'] = df['text'].apply(tokenizer)
df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

#Check similarity
df.to_clipboard(sep='\s\s+')
df['check'] = df['tokenizing'].eq(df['lemmatization'])
df

How can I compare them? This is the result before the error, copied with df.to_clipboard():

                                      text                                     tokenizing                         lemmatization
0  pergi untuk melakukan penanganan banjir  [pergi, untuk, melakukan, penanganan, banjir]  [pergi, untuk, laku, tangan, banjir]
1                       baca buku itu asik                        [baca, buku, itu, asik]               [baca, buku, itu, asik]

Update

The error is fixed; it was caused by a typo. After fixing the typo, the check column is all False. What I want is like the table above.

Upvotes: 1

Views: 59

Answers (1)

AfterFray

Reputation: 1851

Based on your code, you forgot an i in df['lemmatizaton'].

So change

df['lemmatizaton'] = df['tokenizing'].apply(lemmatizer)

to

df['lemmatization'] = df['tokenizing'].apply(lemmatizer)

Then it should work.
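On the all-False check mentioned in the update: df['tokenizing'] holds spaCy Token objects while df['lemmatization'] holds plain strings, and a Token never compares equal to a str, so the element-wise comparison can't match even when the words look identical. A minimal sketch of one way to get the expected check column, keeping both columns as lists of strings (same setup as in the question):

from spacy.lang.id import Indonesian
import pandas as pd

nlp = Indonesian()
nlp.add_pipe('lemmatizer')
nlp.initialize()

df = pd.DataFrame({'text': [
    'pergi untuk melakukan penanganan banjir',
    'baca buku itu asik'
]})

# Keep plain strings (token.text) instead of spaCy Token objects
df['tokenizing'] = df['text'].apply(lambda s: [t.text for t in nlp(s)])

# Lemmas are already strings, so both columns now hold lists of str
df['lemmatization'] = df['text'].apply(lambda s: [t.lemma_ for t in nlp(s)])

# Element-wise list comparison now gives True/False per row
df['check'] = df['tokenizing'].eq(df['lemmatization'])
print(df[['tokenizing', 'lemmatization', 'check']])

With lists of strings on both sides, .eq() should give False for the first row (melakukan vs laku, penanganan vs tangan) and True for the second, matching the table in the question.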

Upvotes: 1
