frank

Reputation: 3608

Make a new column based on the presence of a word in another column

I have

pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
    text
0   wePDFerglergl
1   htrZIPg
2   gemlHTML

The column is 10k rows long. Each entry contains one of ['PDF','ZIP','HTML'], and each entry in text is at most 14 characters long.

How do I get:

pd.DataFrame({'text':['wePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
    text            file_type
0   wePDFerglergl   pdf
1   htrZIPg         zip
2   gemlHTML        html

I tried df.text[0].find('ZIP') for a single entry, but I do not know how to stitch it all together to test every row in the column and return the correct value for each.

Any suggestions?

Upvotes: 0

Views: 75

Answers (2)

Erfan

Reputation: 42946

We can use str.extract here with the inline regex flag (?i), which makes the match case-insensitive:

words =  ['pdf','zip','html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')

Or we use the flags=re.IGNORECASE argument:

import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)

Output

                text file_type
0  fewfwePDFerglergl       PDF
1            htrZIPg       ZIP
2           gemlHTML      HTML

If you want file_type as lower case, chain str.lower():

df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()
                text file_type
0  fewfwePDFerglergl       pdf
1            htrZIPg       zip
2           gemlHTML      html

Details: The pipe (|) is the or operator in regular expressions. So with:

"|".join(words)

'pdf|zip|html'

We get the following in pseudocode:

extract "pdf" or "zip" or "html" from our string
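
Putting the pieces above together, here is a minimal self-contained sketch, assuming pandas is installed. Two details are additions rather than part of the answer: `re.escape`, which guards against keywords that contain regex metacharacters, and `expand=False`, which makes `str.extract` return a Series instead of a one-column DataFrame so that `.str.lower()` can be chained directly.

```python
import re

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'text': ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML']})

words = ['pdf', 'zip', 'html']

# Escape each keyword (an added precaution, not in the answer), then join
# with "|" to build the alternation and wrap it in one capture group.
pattern = '(' + '|'.join(re.escape(w) for w in words) + ')'

# expand=False returns a Series for a single capture group.
# Rows without any match end up as NaN.
df['file_type'] = (
    df['text']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)

print(df)
```

Because the whole column is processed in one vectorized call, this should be comfortable for the 10k rows mentioned in the question.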

Upvotes: 1

neutrino_logic

Reputation: 1299

You could use regex for this:

import re
regex = re.compile(r'(PDF|ZIP|HTML)')

This matches any of the desired substrings. To extract the matches in row order and convert them to lower case, here's a one-liner:

file_type = [regex.search(x).group().lower() for x in df['text']]

This returns the following list:

['pdf', 'zip', 'html']

Then to add the column:

df['file_type'] = file_type
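
One caveat with the one-liner: `search` returns None for a row that contains none of the keywords, so calling `.group()` on it raises an AttributeError. The sketch below shows one possible way to guard against that. The helper name `file_type_of` and the extra `'nothing'` row are hypothetical, added only for illustration.

```python
import re

# Same pattern as in the answer, compiled once and reused for every row.
regex = re.compile(r'(PDF|ZIP|HTML)')


def file_type_of(text):
    """Return the lower-cased keyword found in text, or None if absent."""
    match = regex.search(text)
    return match.group().lower() if match else None


# Values from the question plus one hypothetical row with no keyword.
texts = ['fewfwePDFerglergl', 'htrZIPg', 'gemlHTML', 'nothing']
file_type = [file_type_of(t) for t in texts]
print(file_type)
```

With a DataFrame, the same helper could be applied as `df['text'].map(file_type_of)` in place of the list comprehension.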

Upvotes: 0
