loveR

Reputation: 509

Error while extracting only the text from strings in one column into another column of a pandas DataFrame using regex

I am trying to extract only the text part of the strings in one column of a pandas DataFrame into a new column.

When I tried something like this:

import pandas as pd
import re

df = pd.DataFrame({"Id": [1,2] , "Text" : ["This is 20/06/2019; 00:13:45; Time stamp", "This is another 23/04/2019 11:23:35 Time stamp"]})

print(df)

I got a dataframe as below:

   Id                                            Text
0   1        This is 20/06/2019; 00:13:45; Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp

This is fine.

Then I tried to extract only the text from the string in each row of the column:

df['Text2']= re.findall(r'\w+', df['Text']) 

This is where I got the error:

TypeError                                 Traceback (most recent call last)
<ipython-input-5-b04317009801> in <module>
----> 1 df['Text2']= re.findall(r'\w+', df['Text'])
      2 
      3 print(df)

C:\Anaconda3\lib\re.py in findall(pattern, string, flags)
    221 
    222     Empty matches are included in the result."""
--> 223     return _compile(pattern, flags).findall(string)
    224 
    225 def finditer(pattern, string, flags=0):

TypeError: expected string or bytes-like object

Whereas I was expecting something like this:

   Id                        Text
0   1          This is Time stamp
1   2  This is another Time stamp

What have I done wrong here?

Upvotes: 2

Views: 418

Answers (2)

Wiktor Stribiżew

Reputation: 627129

I suggest using

df['Text2'] = df['Text'].str.findall(r'[^\W\d_]{3,}').str.join(' ')

Or, to match only whole words (i.e. only those not "glued" to numbers or underscores):

df['Text2'] = df['Text'].str.findall(r'\b[^\W\d_]{3,}\b').str.join(' ')

The [^\W\d_]{3,} pattern matches runs of three or more Unicode letters in the values of the Text column.
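As a rough illustration of how the two patterns differ, here is a minimal sketch using the plain re module. The sample strings and variable names are assumptions made up for this example; they are not part of the question's data.

```python
import re

# Hypothetical sample strings, chosen only to illustrate the patterns.
glued = "abc123 hello"
accented = "na\u00efve caf\u00e9 42"  # "naive cafe 42" with accented i and e

letters = r'[^\W\d_]{3,}'           # 3+ letters anywhere in the string
whole_words = r'\b[^\W\d_]{3,}\b'   # 3+ letters that form a whole word

# Without \b, the letters 'abc' are taken out of 'abc123'.
loose_matches = re.findall(letters, glued)

# With \b, 'abc123' is skipped because 'abc' is glued to digits.
strict_matches = re.findall(whole_words, glued)

# Non-ASCII letters are matched as well, since str patterns are Unicode-aware.
unicode_matches = re.findall(letters, accented)

print(loose_matches, strict_matches, unicode_matches)
```

Which variant fits better depends on whether letter runs attached to digits or underscores should count as text.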

The .str.findall method finds all non-overlapping matches in each input string and returns them as a list per row.

The .str.join(' ') will convert the list of values to a single space-separated string.
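To tie the pieces together: the TypeError in the question appears to come from passing a whole Series to re.findall, which expects a single string, whereas the .str accessor applies the regex to each row. Below is a minimal sketch on the question's data; the intermediate names raised and matches are assumptions introduced only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "Text": ["This is 20/06/2019; 00:13:45; Time stamp",
             "This is another 23/04/2019 11:23:35 Time stamp"],
})

# re.findall works on one string at a time, so a Series is rejected.
try:
    re.findall(r'\w+', df['Text'])
    raised = False
except TypeError:
    raised = True

# .str.findall runs the regex row by row and returns a list per row.
matches = df['Text'].str.findall(r'[^\W\d_]{3,}')

# .str.join glues each row's list back into one space-separated string.
df['Text2'] = matches.str.join(' ')
print(df['Text2'].tolist())
```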

Upvotes: 2

jezrael

Reputation: 863281

Use Series.str.findall together with Series.str.join and a regex that matches runs of three or more letters:

df['Text2']= df['Text'].str.findall(r'[A-Za-z]{3,}').str.join(' ')
print(df)
   Id                                            Text                    Text2
0   1        This is 20/06/2019; 00:13:45; Time stamp          This Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp  This another Time stamp
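Note that the {3,} quantifier drops two-letter words such as "is", so the result differs slightly from the expected output shown in the question. If short words should be kept, one option is to match one or more letters instead. This is an assumption about the asker's intent, not something stated in the question, so treat the following as a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "Text": ["This is 20/06/2019; 00:13:45; Time stamp",
             "This is another 23/04/2019 11:23:35 Time stamp"],
})

# '+' instead of '{3,}': keep ASCII letter runs of any length, including "is".
df['Text2'] = df['Text'].str.findall(r'[A-Za-z]+').str.join(' ')
print(df['Text2'].tolist())
```

The trade-off is that any stray one- or two-letter fragments in the data would be kept as well.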

Upvotes: 2
