Reputation: 1142
For the following data frame:
index sentences category
1 the side effects are terrible ! SSRI
2 They are killing me,,, I want to stop SNRI
3 I need to contact my physicians ? SSRI
4 How to stop it.. I am surprised because of its effect. SSRI
5 SSRI
6 NAN SNRI
I am trying to tokenize the text in the sentences column, which contains some null values. This is my code, but it does not work:
df["sentences"] = df.sentences.replace (r'[^a-zA-Z]', '', regex= True, inplace = True)
df["tokenized_sents"] = df["sentences"].apply(nltk.word_tokenize)
I also tried this:
df["sentences"] = df.sentences.replace (r'[^a-zA-Z]', 'null', regex= True, inplace = True)
It raises the following error:
expected string or bytes-like object
Any suggestions?
Upvotes: 2
Views: 7149
Reputation: 7316
# I added NaN and None to your data for demonstration; please check the first df below.
print(df)
df["tokenized_sents"] = df["sentences"].fillna("").map(nltk.word_tokenize)
print(df)
First print,
index sentences category
0 1 the side effects are terrible ! SSRI
1 2 They are killing me,,, I want to stop SNRI
2 3 I need to contact my physicians ? SSRI
3 4 How to stop it.. I am surprised because of its... SSRI
4 5 NaN SNRI
5 5 None None
Second print,
index sentences category \
0 1 the side effects are terrible ! SSRI
1 2 They are killing me,,, I want to stop SNRI
2 3 I need to contact my physicians ? SSRI
3 4 How to stop it.. I am surprised because of its... SSRI
4 5 NaN SNRI
5 5 None None
tokenized_sents
0 [the, side, effects, are, terrible, !]
1 [They, are, killing, me, ,, ,, ,, I, want, to,...
2 [I, need, to, contact, my, physicians, ?]
3 [How, to, stop, it.., I, am, surprised, becaus...
4 []
5 []
By the way, if you use inplace=True
explicitly, you don't have to assign the result back to your original df; the method modifies it in place and returns None.
df.sentences.replace(r'[^a-zA-Z]', '', regex=True, inplace=True)
# instead of: df["sentences"] = df.sentences.replace(r'[^a-zA-Z]', '', regex=True, inplace=True)
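A minimal sketch of that pitfall, using a standalone toy Series (not your actual data): replace(..., inplace=True) returns None, so assigning its result back is what wiped out the column in the original code.

```python
import pandas as pd

s = pd.Series(["stop it!!", "hello??"])

# With inplace=True the method mutates s and returns None,
# so assigning the result would overwrite the target with None.
result = s.replace(r"[^a-zA-Z ]", "", regex=True, inplace=True)

print(result)     # None
print(s.tolist()) # ['stop it', 'hello'] -- cleaned in place
```

So either keep the assignment and drop inplace=True, or keep inplace=True and drop the assignment; doing both leaves you with a column of None.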
Upvotes: 2