Error while extracting only text from a string in a column into another column in python dataframe using regex

Question

I am trying to extract only the text part of the strings in one column of a pandas DataFrame into a new column.

When I tried something like this:

import pandas as pd
import re

df = pd.DataFrame({"Id": [1,2] , "Text" : ["This is 20/06/2019; 00:13:45; Time stamp", "This is another 23/04/2019 11:23:35 Time stamp"]})

print(df)

I got a dataframe as below:

   Id                                            Text
0   1        This is 20/06/2019; 00:13:45; Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp

This is fine.

Then I tried to extract only the text from the string in each row of the column:

df['Text2']= re.findall(r'\w+', df['Text'])

This is where I got the error:

TypeError                                 Traceback (most recent call last)
 in 
----> 1 df['Text2']= re.findall(r'\w+', df['Text'])
      2 
      3 print(df)

C:\Anaconda3\lib\re.py in findall(pattern, string, flags)
    221 
    222     Empty matches are included in the result."""
--> 223     return _compile(pattern, flags).findall(string)
    224 
    225 def finditer(pattern, string, flags=0):

TypeError: expected string or bytes-like object

Whereas I was expecting something like this:

   Id                        Text
0   1          This is Time stamp
1   2  This is another Time stamp

What have I done wrong here?

jezrael · Accepted Answer

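re.findall expects a single string, not a Series, which is why it raises the TypeError. Use Series.str.findall together with Series.str.join instead. The regex below keeps only runs of 3 or more letters, so short words such as "is" are dropped: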

df['Text2']= df['Text'].str.findall(r'[A-Za-z]{3,}').str.join(' ')
print(df)
   Id                                            Text                    Text2
0   1        This is 20/06/2019; 00:13:45; Time stamp          This Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp  This another Time stamp
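
If the short words should be kept too, as in the expected output in the question, one option is to match runs of letters of any length. The following is a minimal sketch, not part of the original answer: it swaps the pattern for [A-Za-z]+ and, for comparison, shows a row-by-row version that uses the re module through Series.apply. The column name Text3 is made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "Text": [
        "This is 20/06/2019; 00:13:45; Time stamp",
        "This is another 23/04/2019 11:23:35 Time stamp",
    ],
})

# Vectorised version: [A-Za-z]+ matches letter runs of any length,
# so "is" is kept while digits and punctuation are skipped.
df["Text2"] = df["Text"].str.findall(r"[A-Za-z]+").str.join(" ")

# Row-by-row version: re.findall works here because apply passes it
# one string at a time. "Text3" is a hypothetical column name.
pattern = re.compile(r"[A-Za-z]+")
df["Text3"] = df["Text"].apply(lambda s: " ".join(pattern.findall(s)))

print(df[["Id", "Text2"]])
```

Both columns should hold the same strings. The .str version is usually the more idiomatic choice in pandas, while the apply version stays closest to the original re.findall attempt.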
