Archit

Reputation: 588

Extracting data from a column in a pandas DataFrame using a regular expression

I have a DataFrame df, defined below:

import pandas as pd
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)

df
Out[100]: 
   ID                                     name
0   1           Hello Kitty how=1234 when=2345
1   2           how=3456 Hello Puppy when=7685
2   3  how=646 It is an Helloexample when=9089
3   4     for how=6574 stackoverflow when=5764
4   5          Hello  when=3632 World how=7654

I want to extract the values that follow how= and when= into two separate columns, how and when. How can I do this using a regular expression?

For example, for the first record I should get 1234 in column how and 2345 in column when. For the last record I should get 7654 in column how and 3632 in column when.

Upvotes: 1

Views: 1494

Answers (2)

Valdi_Bo

Reputation: 31011

Use df.name.str.extract(...). The first argument of this method is the pattern. Include two named capturing groups in it, one for each fragment to capture.

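Because str.extract returns only the first match in each row, put each group inside its own lookahead so that both are filled. Something like:

df.name.str.extract(r'(?=.*how=(?P<how>[\d.]+))(?=.*when=(?P<when>[\d.]+))')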

Pass the pattern as a raw string, because it contains backslashes.
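
A minimal runnable sketch of the named-group idea, assuming the sample frame from the question. str.extract keeps only the first match per row, so each group is placed in its own lookahead here, which lets both values be captured whatever their order in the text. The variable names pattern, extracted and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)

# Each lookahead scans the string independently from the same position,
# so the relative order of how= and when= does not matter.
pattern = r"(?=.*how=(?P<how>[\d.]+))(?=.*when=(?P<when>[\d.]+))"

# The named groups become the column names; the values are strings.
extracted = df["name"].str.extract(pattern)
result = df.join(extracted)
print(result[["ID", "how", "when"]])
```

The extracted values are strings; something like pd.to_numeric can convert them if numbers are needed.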

Upvotes: 0

Rakesh

Reputation: 82815

Use str.extract, once for each key.

Example:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)
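# expand=False returns a Series for a single capturing group.
# \w+ matches letters, digits and underscores; use (\d+) if the values are digits only.
df['when'] = df['name'].str.extract(r"when=(\w+)", expand=False)
df['how'] = df['name'].str.extract(r"how=(\w+)", expand=False)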
print(df)

Output:

   ID                                     name  when   how
0   1           Hello Kitty how=1234 when=2345  2345  1234
1   2           how=3456 Hello Puppy when=7685  7685  3456
2   3  how=646 It is an Helloexample when=9089  9089   646
3   4     for how=6574 stackoverflow when=5764  5764  6574
4   5          Hello  when=3632 World how=7654  3632  7654
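
As a possible variation on the above, both keys can be collected in one pass with str.extractall and then reshaped into columns. This is only a sketch: it assumes every row contains exactly one how= and one when= with digit-only values (otherwise the unstack or the integer conversion would fail), and the names pairs, wide and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty how=1234 when=2345",
            "how=3456 Hello Puppy when=7685",
            "how=646 It is an Helloexample when=9089",
            "for how=6574 stackoverflow when=5764",
            "Hello  when=3632 World how=7654",
        ],
    }
)

# One output row per key=value pair; the index gains a "match" level.
pairs = df["name"].str.extractall(r"\b(?P<key>how|when)=(?P<value>\d+)")

# Drop the match counter, move the key into the index, then pivot the
# keys into columns and convert the captured strings to integers.
wide = (
    pairs.droplevel("match")
    .set_index("key", append=True)["value"]
    .unstack("key")
    .astype(int)
)

result = df.join(wide)
print(result)
```

Compared with calling str.extract once per key, this keeps a single pattern and scales to more keys by extending the alternation.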

Upvotes: 3
