Reputation: 3608
I have
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
text
0 wePDFerglergl
1 htrZIPg
2 gemlHTML
a column 10k rows long. Each entry contains one of ['PDF','ZIP','HTML']. Each entry in text is at most 14 characters long.
How do I get:
pd.DataFrame({'text':['wePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
text file_type
0 wePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
I tried df.text[0].find('ZIP') for a single entry, but I don't know how to stitch it all together to test each row in the column and return the correct value.
Any suggestions?
Upvotes: 0
Views: 75
Reputation: 42946
We can use str.extract here with the inline regex flag (?i), which makes the match case-insensitive:
words = ['pdf','zip','html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')
Or we can use the flags=re.IGNORECASE argument:
import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)
Output
text file_type
0 fewfwePDFerglergl PDF
1 htrZIPg ZIP
2 gemlHTML HTML
If you want file_type in lower case, chain str.lower():
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()
text file_type
0 fewfwePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
Details:
The pipe (|) is the "or" operator in regular expressions. So with:
"|".join(words)
'pdf|zip|html'
We get the following in pseudocode:
extract "pdf" or "zip" or "html" from our string
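Putting the pieces together, here is a minimal self-contained sketch using the sample frame from the question. Two things in it are additions rather than part of the answer above: re.escape, which keeps any regex metacharacters in the word list literal (not needed for these three words, but a safe default), and expand=False, which makes str.extract return a Series so no [0] is needed before .str.lower().

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML']})
words = ['pdf', 'zip', 'html']

# Assumption: re.escape is added here as a guard for words such as 'c++'
pattern = '(' + '|'.join(re.escape(w) for w in words) + ')'

# expand=False returns a Series for a single capture group
df['file_type'] = (
    df['text']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)
```

Rows that contain none of the words end up as missing values in file_type rather than raising an error.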
Upvotes: 1
Reputation: 1299
You could use regex for this:
import re
regex = re.compile(r'(PDF|ZIP|HTML)')
This matches any of the desired substrings. To extract the match from each row, in row order and converted to lower case, here's a one-liner:
file_type = [regex.search(x).group().lower() for x in df['text']]
This returns the following list:
['pdf', 'zip', 'html']
Then to add the column:
df['file_type'] = file_type
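One caveat: re.search returns None when a row contains none of the substrings, so calling .group() on it raises an AttributeError. A minimal sketch that guards against that is below. The helper name find_type and the extra 'nomatch' row are made up for illustration and are not part of the question's data.

```python
import re
import pandas as pd

regex = re.compile(r'(PDF|ZIP|HTML)')

def find_type(text):
    # Hypothetical helper: returns the lower-cased match, or None if nothing matches
    match = regex.search(text)
    return match.group().lower() if match else None

# 'nomatch' is an illustrative row added to show the fallback
df = pd.DataFrame({'text': ['wePDFerglergl', 'htrZIPg', 'gemlHTML', 'nomatch']})
df['file_type'] = [find_type(x) for x in df['text']]
```

With 10k rows either approach is fast enough; the vectorised str.extract version mainly saves you from writing the None check yourself.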
Upvotes: 0