make new column based on presence of a word in another

Question

I have

pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
    text
0   wePDFerglergl
1   htrZIPg
2   gemlHTML

a column 10k rows long. Each entry contains one of ['PDF','ZIP','HTML']. Each entry in text is at most 14 characters long.

How do I get:

pd.DataFrame({'text':['wePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
    text            file_type
0   wePDFerglergl   pdf
1   htrZIPg         zip
2   gemlHTML        html

I tried df.text[0].find('ZIP') for a single entry, but I do not know how to stitch it all together to test every row in the column and return the correct value for each.

Any suggestions?

Erfan · Accepted Answer

We can use str.extract here with the inline regex flag (?i), which makes the match case-insensitive:

words = ['pdf', 'zip', 'html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')

Or we can pass the flags=re.IGNORECASE argument instead:

import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)

Output

                text file_type
0  fewfwePDFerglergl       PDF
1            htrZIPg       ZIP
2           gemlHTML      HTML

If you want file_type in lower case, select the extracted column with [0] and chain str.lower():

df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()

                text file_type
0  fewfwePDFerglergl       pdf
1            htrZIPg       zip
2           gemlHTML      html
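
Putting the pieces together, a minimal self-contained sketch might look like the following. It rebuilds the sample frame from the output above and passes expand=False, which is an addition not in the original answer: with a single capture group it makes str.extract return a Series, so str.lower() can be chained without the [0] lookup.

```python
import pandas as pd

# Sample data matching the output shown above
df = pd.DataFrame({'text': ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML']})

words = ['pdf', 'zip', 'html']
# Builds '(?i)(pdf|zip|html)': case-insensitive, one capture group
pattern = f'(?i)({"|".join(words)})'

# expand=False (an assumption, not in the original answer) returns a Series
df['file_type'] = df['text'].str.extract(pattern, expand=False).str.lower()

print(df)
```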

Details: The pipe (|) is the or operator in regular expressions. So with:

"|".join(words)

'pdf|zip|html'

We get the following in pseudocode:

extract "pdf" or "zip" or "html" from our string
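
The same idea as a runnable sketch, with two assumptions that are not part of the original answer: the words are passed through re.escape in case one of them contains a regex metacharacter, and the sample Series includes a hypothetical row with no file type, to show that unmatched rows come back as NaN.

```python
import re
import pandas as pd

words = ['pdf', 'zip', 'html']

# re.escape is a precaution for words containing characters such as '.' or '+';
# for these three plain words the pattern is simply '(pdf|zip|html)'
pattern = '(' + '|'.join(re.escape(w) for w in words) + ')'

# Hypothetical data: the second entry contains none of the words
s = pd.Series(['htrZIPg', 'no match here'])

result = s.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower()

print(result)  # first row is 'zip', second row is NaN
```

If rows without a match are possible in the real data, something like result.fillna('unknown') could supply a default label; the label itself is only an illustration.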
