bravopapa

Reputation: 445

re.sub: How to solve TypeError: expected string or bytes-like object

I have a dataframe called tweet with the following contents:

                        Id                                               Text
0      1281015183687720961  @AngelaRuchTruck has @BubbaWallace beat, by fa...
1      1281015160803667968  I’m an old, white male. I marched in the 60s a...
2      1281014374744891392  This is me and I am saying #EnoughIsEnoughNS L...
3      1281014363193819139  The Ultimate Fighter Finale! Join in on the fu...
4      1281014339433095169                       This #blm $hit is about done
...                    ...                                                ...
12529  1279207822207725569  First thing I see, getting here #BLM #BLMDC #B...
12530  1279206857253543936  So here’s a thought for all of you #BLM people...
12531  1279206802035539969  #campingworld #Hamilton #BreakTheSilenceForSus...
12532  1279205845474127872  #Day 3.168 . . #artmenow #drawmenow #nodapl #n...
12533  1279205399535792128  Oh but wait ....... Breonna Taylor! #BreonnaTa...

I am trying to clean the text in tweet['Text'] using the following code:

tweet['cleaned_text'] = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", tweet['Text'])

tweet['cleaned_text'] = re.sub(r'^RT[\s]+', '', tweet['cleaned_text'])

But I get this error:

~\AppData\Local\Continuum\anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
    190     a callable, it's passed the Match object and must return
    191     a replacement string to be used."""
--> 192     return _compile(pattern, flags).sub(repl, string, count)
    193 
    194 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

A suggested answer is to use the following code:

cleaned = []
txt = list(tweet['Text'])
for i  in txt:
    cleaned.append(re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", i))
tweet['cleaned_text'] = cleaned

The code works fine. However, tweet['cleaned_text'] is still not a string; it is a column of strings. For example, when I use the following code:

Blobtweet = TextBlob(tweet["cleaned_text"]) 

I get this error:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\textblob\blob.py in __init__(self, text, tokenizer, pos_tagger, np_extractor, analyzer, parser, classifier, clean_html)
    368         if not isinstance(text, basestring):
    369             raise TypeError('The `text` argument passed to `__init__(text)` '
--> 370                             'must be a string, not {0}'.format(type(text)))
    371         if clean_html:
    372             raise NotImplementedError("clean_html has been deprecated. "

TypeError: The `text` argument passed to `__init__(text)` must be a string, not <class 'pandas.core.series.Series'>

Or, when I use the following code:

text=tweet['cleaned_text']
text = text.lower()  
tokens = tokenizer.tokenize(text)

I get the following error:

AttributeError: 'Series' object has no attribute 'lower'

All of these examples work fine when I pass a single string.

Upvotes: 0

Views: 2493

Answers (1)

Roshin Raphel

Reputation: 2689

tweet['cleaned_text'] returns a column (a pandas Series), not a string, and re.sub accepts only a single string. You have to iterate through each element of the column:

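import re

cleaned = []
for text in tweet['Text']:
    # Strip @RT mentions, URLs and www links from one tweet.
    t = re.sub(r"(?:\@RT|http?\://|https?\://|www)\S+", "", text)
    # Drop a leading "RT" marker, then collect the cleaned string.
    cleaned.append(re.sub(r'^RT[\s]+', '', t))
tweet['cleaned_text'] = cleaned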
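
As an alternative to the explicit loop, pandas can apply string operations to a whole column through the .str accessor, and .apply can call any single-string function once per row. The following is a minimal sketch, assuming a reasonably recent pandas where Series.str.replace accepts regex=True. The two sample tweets and the n_words column are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe in the question.
tweet = pd.DataFrame({
    "Text": [
        "RT  check this https://example.com/page now",
        "Hello #BLM www.example.org",
    ]
})

# Same patterns as in the question, applied to every row at once.
url_pattern = r"(?:\@RT|http?\://|https?\://|www)\S+"
cleaned = tweet["Text"].str.replace(url_pattern, "", regex=True)
cleaned = cleaned.str.replace(r"^RT[\s]+", "", regex=True)

# .str.lower() is the column-wise counterpart of str.lower().
tweet["cleaned_text"] = cleaned.str.lower()

# For a function that accepts only one string, call it per row with .apply.
# Here a simple word count stands in for such a function; with TextBlob
# installed, tweet["cleaned_text"].apply(TextBlob) follows the same shape.
tweet["n_words"] = tweet["cleaned_text"].apply(lambda s: len(s.split()))
```

The same idea covers the other errors in the question: text.lower() becomes tweet['cleaned_text'].str.lower(), and tokenizer.tokenize(text) becomes tweet['cleaned_text'].apply(tokenizer.tokenize), so each call receives one string and not the whole Series.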

Upvotes: 2
