Reputation: 481
I have a pandas DataFrame with a column from which I have to retrieve specific names. The problem is that those names are not always in the same position, and the values in that column do not all have the same length, so I cannot use the split function. However, I have noticed that before each name there is always a combination of 4 to 7 digits. I believe it's the identifier for the name.
So how can I use a regular expression to go through that column and retrieve the names I need?
Here is an example from the Jupyter notebook:
df['info']
csx_Gb009_broken screen_231400_Iphone 7
000345_SamsungS8_tfes_Vodafone_is56t34_3G
Ins45_56003_Huawei P8_
What I want is something like this:
df['Phones']
Iphone 7
SamsungS8
Huawei P8
I want to get something like the above, knowing that those names come after a combination of 4 to 7 digits and end with an underscore.
Upvotes: 1
Views: 105
Reputation: 626794
You may use
df['Phones'] = df['info'].str.extract(r'\d{4}_([^_]+)')
The pattern matches:
\d{4} - 4 digits
_ - an underscore
([^_]+) - Capturing group 1 (this value will be returned by str.extract): one or more chars other than _
See the regex demo.
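As a quick check, here is a minimal sketch that rebuilds the three sample rows from the question and applies the pattern. The DataFrame construction and the use of expand=False (which returns a Series for a single capture group) are my additions. The stricter variant with a lookbehind and \d{4,7} is only an assumption about how one might enforce the "4 to 7 digits" rule literally, and it is stored in a hypothetical column named Phones_strict.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "info": [
        "csx_Gb009_broken screen_231400_Iphone 7",
        "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
        "Ins45_56003_Huawei P8_",
    ]
})

# Pattern from the answer: 4 digits, an underscore, then capture
# everything up to the next underscore (or the end of the string).
df["Phones"] = df["info"].str.extract(r"\d{4}_([^_]+)", expand=False)

# Hypothetical stricter variant (assumption): require a run of 4 to 7
# digits that is not preceded by another digit.
df["Phones_strict"] = df["info"].str.extract(
    r"(?<!\d)\d{4,7}_([^_]+)", expand=False
)

print(df["Phones"].tolist())
# ['Iphone 7', 'SamsungS8', 'Huawei P8']
```

On these three rows both patterns give the same result. Rows with no match would come back as NaN.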
Upvotes: 1