Reputation: 509
I am trying to extract only the text part of each string in a pandas DataFrame column and store it in a new column.
When I tried something like this:
import pandas as pd
import re
df = pd.DataFrame({"Id": [1,2] , "Text" : ["This is 20/06/2019; 00:13:45; Time stamp", "This is another 23/04/2019 11:23:35 Time stamp"]})
print(df)
I got the DataFrame below:
   Id                                            Text
0   1        This is 20/06/2019; 00:13:45; Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp
This is fine.
Then I tried to extract only the text data from the string in each row of the column:
df['Text2']= re.findall(r'\w+', df['Text'])
This is where I got the error:
TypeError Traceback (most recent call last)
<ipython-input-5-b04317009801> in <module>
----> 1 df['Text2']= re.findall(r'\w+', df['Text'])
2
3 print(df)
C:\Anaconda3\lib\re.py in findall(pattern, string, flags)
221
222 Empty matches are included in the result."""
--> 223 return _compile(pattern, flags).findall(string)
224
225 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
Whereas I was expecting something like this:
   Id                        Text
0   1          This is Time stamp
1   2  This is another Time stamp
What have I done wrong here?
Upvotes: 2
Views: 418
Reputation: 627129
I suggest using
df['Text2'] = df['Text'].str.findall(r'[^\W\d_]{3,}').str.join(' ')
Or, to match only whole words (i.e. only those not "glued" to numbers or underscores):
df['Text2'] = df['Text'].str.findall(r'\b[^\W\d_]{3,}\b').str.join(' ')
The [^\W\d_]{3,} pattern matches runs of three or more Unicode letters (word characters that are neither digits nor underscores) in the values of the Text column.
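To make the difference between the two patterns concrete, here is a minimal sketch using plain `re` on a made-up sample string (the string and variable names are illustrative assumptions, not from the question). It should show that the first pattern also picks up letter runs attached to digits or underscores, while the `\b` version keeps only standalone words:

```python
import re

# Hypothetical sample: letters glued to digits/underscores, a non-ASCII word,
# a plain word, a number, and a two-letter word.
sample = "abc123 word_x слово hello 42 ok"

# Any run of 3+ letters, even when attached to digits or underscores.
loose = re.findall(r"[^\W\d_]{3,}", sample)

# Only runs of 3+ letters that form a whole word on their own.
strict = re.findall(r"\b[^\W\d_]{3,}\b", sample)

print(loose)   # ['abc', 'word', 'слово', 'hello']
print(strict)  # ['слово', 'hello']
```

Note that "ok" is dropped by both patterns because of the `{3,}` minimum length.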
The .str.findall method finds all non-overlapping matches in each input string and returns them as a list.
The .str.join(' ') call then joins each list of matches into a single space-separated string.
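As for the error itself: `re.findall` expects a single string (or bytes-like object), and the question passes it a whole Series, hence the `TypeError`. A minimal sketch of the row-by-row equivalent, assuming the same DataFrame as in the question (the name `text2_vectorized` is only for illustration), could look like this:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "Text": [
        "This is 20/06/2019; 00:13:45; Time stamp",
        "This is another 23/04/2019 11:23:35 Time stamp",
    ],
})

pattern = re.compile(r"[^\W\d_]{3,}")

# re works on one string at a time, so apply it to each row separately.
df["Text2"] = df["Text"].apply(lambda s: " ".join(pattern.findall(s)))

# The vectorized .str accessor version should give the same result.
text2_vectorized = df["Text"].str.findall(r"[^\W\d_]{3,}").str.join(" ")

print(df["Text2"].tolist())
# ['This Time stamp', 'This another Time stamp']
```

The `.str` version is usually preferable because it also passes missing values through as NaN instead of raising.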
Upvotes: 2
Reputation: 863281
Use Series.str.findall together with Series.str.join, with a regex that matches runs of three or more ASCII letters:
df['Text2']= df['Text'].str.findall(r'[A-Za-z]{3,}').str.join(' ')
print(df)
   Id                                            Text                    Text2
0   1        This is 20/06/2019; 00:13:45; Time stamp          This Time stamp
1   2  This is another 23/04/2019 11:23:35 Time stamp  This another Time stamp
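Note that the `{3,}` quantifier drops the two-letter word "is", whereas the expected output in the question keeps it. Assuming short words should be kept, a sketch with `+` (one or more letters) in place of `{3,}` would be:

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "Text": [
        "This is 20/06/2019; 00:13:45; Time stamp",
        "This is another 23/04/2019 11:23:35 Time stamp",
    ],
})

# "+" keeps letter runs of any length, so short words such as "is" survive.
df["Text2"] = df["Text"].str.findall(r"[A-Za-z]+").str.join(" ")

print(df["Text2"].tolist())
# ['This is Time stamp', 'This is another Time stamp']
```

`[A-Za-z]` only covers ASCII letters; if the text may contain accented or non-Latin letters, `[^\W\d_]+` is the Unicode-aware alternative.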
Upvotes: 2