Alejandro L

Reputation: 139

Extract strings from columns using regex

I want to extract the leading part of each string in the title column (the text before any " - " or ",") into a new column named hazard_extract, as in the example below.

test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other'], 'hazard_extract':['Other', 'Microbiological', 'Extraneous Material', 'Chemical', 'Chemical', 'Labelling']}
example = pd.DataFrame(test)
example

    title                       hazard_extract
0   Other                       Other
1   Microbiological - Listeria  Microbiological
2   Extraneous Material         Extraneous Material
3   Chemical                    Chemical
4   Chemical - Histamine        Chemical
5   Labelling, Other            Labelling

However, with the code below, nothing is extracted when the string contains neither a - nor a , (rows 0 and 3 come back as NaN), and only the first word of Extraneous Material is returned. How can I extract both words in a case like Extraneous Material, and the single word in a case like Chemical or Other?

example['hazard_extract'] = example['title'].str.extract(r'^(.*?),? ')
    title                       hazard_extract
0   Other                       NaN
1   Microbiological - Listeria  Microbiological
2   Extraneous Material         Extraneous
3   Chemical                    NaN
4   Chemical - Histamine        Chemical
5   Labelling, Other            Labelling

Thank you so much for all the help!

Upvotes: 1

Views: 146

Answers (3)

Mark Tolonen

Reputation: 177471

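No need for a complicated regular expression. Your pattern requires a space after the captured text, so single-word titles never match. Splitting on the delimiter and keeping the first piece handles every case: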

import pandas as pd

test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other']}
example = pd.DataFrame(test)
print(example)
print()
example['hazard_extract'] = example['title'].str.split(' -|,').str[0]
print(example)
                        title
0                       Other
1  Microbiological - Listeria
2         Extraneous Material
3                    Chemical
4        Chemical - Histamine
5            Labelling, Other

                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling

Upvotes: 1

Epsi95

Reputation: 9047

The easiest way is to use split, then strip the leftover whitespace:

example['title'].str.split(r'[-,]').str[0].str.strip()
0                  Other
1        Microbiological
2    Extraneous Material
3               Chemical
4               Chemical
5              Labelling
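
One caveat with splitting on any hyphen: a label that itself contains a hyphen gets cut short. The sketch below uses made-up titles (they are not from the question) to compare the `[-,]` pattern with one that only splits on a hyphen surrounded by whitespace. Treat it as an illustration under that assumption, not as a required change.

```python
import pandas as pd

# Hypothetical titles, not from the question: the first label contains a hyphen.
titles = pd.Series(['Non-food - Packaging', 'Chemical - Histamine', 'Labelling, Other'])

# Splitting on any hyphen or comma also cuts inside the hyphenated word.
any_hyphen = titles.str.split(r'[-,]').str[0].str.strip()

# Splitting only on a whitespace-delimited hyphen, or on a comma, keeps it intact.
spaced_hyphen = titles.str.split(r'\s+-\s+|,').str[0].str.strip()

print(any_hyphen.tolist())
print(spaced_hyphen.tolist())
```

If the real data never has hyphens inside a label, both patterns give the same result and the shorter one is fine.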

Upvotes: 1

ashkangh

Reputation: 1624

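Try this. It captures one word, or two words separated by a single space, and leaves out the whitespace before the - or ,:

example['title'].str.extract(r'^(\w+(?:\s\w+)?)\s*[,-]?.*')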
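
If you want to stay with str.extract and not limit the label to two words, a lazy capture group that stops at the first hyphen, comma, or end of string is one option. This is a minimal sketch using the question's data. The pattern is my own suggestion, and `expand=False` is used so that the result is a Series instead of a one-column DataFrame.

```python
import pandas as pd

example = pd.DataFrame({'title': ['Other', 'Microbiological - Listeria',
                                  'Extraneous Material', 'Chemical',
                                  'Chemical - Histamine', 'Labelling, Other']})

# Lazily capture from the start, then allow optional whitespace followed by
# a hyphen, a comma, or the end of the string.
pattern = r'^(.*?)\s*(?:[-,]|$)'

# expand=False returns a Series for a single capture group.
example['hazard_extract'] = example['title'].str.extract(pattern, expand=False)
print(example)
```

Because the `\s*` sits outside the group, the captured text has no trailing space and no extra strip step is needed.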

Upvotes: 0
