BCArg

Reputation: 2250

Use regular expression to extract elements from a pandas data frame

From the following data frame:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

My ultimate goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method, which applies re.findall() to each element, as shown below:

# import the module
import re
# define the patterns
pat = 'a|b|c'

# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)

The problem is that the output, i.e. the letter a, b or c in each row, is wrapped in a single-element list, as shown below:

Out[301]: 
0    [a]
1    [b]
2    [c]
3    [a]

Instead, I would like to have the letters a, b or c as plain strings, as shown below:

0    a
1    b
2    c
3    a

I know that combining re.search() with .group() returns a string, but if I do:

df['col1'].str.search(pat).group()

I get the following error message:

AttributeError: 'StringMethods' object has no attribute 'search'

Using .str.split() won't do the job because, in my original data frame, the strings I want to capture might themselves contain the delimiter (e.g. I might want to capture a-b).

Does anyone know a simple solution, ideally one that avoids explicit iteration such as a for loop or a list comprehension?

Upvotes: 1

Views: 2662

Answers (3)

Always Sunny

Reputation: 38552

Simply try str.split() and keep the first piece, like this: df["col1"].str.split("-", n=1, expand=True)[0]

import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a DataFrame with one column per piece; [0] selects the part before the first "-"
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())

Output:

  col1
0    a
1    b
2    c
3    a

Upvotes: 0

BENY

Reputation: 323396

Fix your code by taking the first element of each list with .str[0]:

pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]: 
0    a
1    b
2    c
3    a
Name: col1, dtype: object
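
To make the behaviour of .str[0] concrete, here is a minimal sketch. It assumes a hypothetical extra row, 'x-999', that contains none of the letters; for such a row findall returns an empty list, and .str[0] yields a missing value rather than raising an error:

```python
import pandas as pd

# 'x-999' is a hypothetical row added to illustrate the no-match case
s = pd.Series(['a-1524112-124', 'b-1515', 'x-999'])

matches = s.str.findall('a|b|c')  # each element is a list of all matches
first = matches.str[0]            # first match per row, same as matches.str.get(0)

print(first)
```

Note that findall collects every match in the string, so a value such as 'a-b1' would produce ['a', 'b'] and .str[0] would keep only 'a'.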

Upvotes: 0

Dani Mesejo

Reputation: 61930

Use str.extract with a capturing group:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

result = df['col1'].str.extract('(a|b|c)')

print(result)

Output:

   0
0  a
1  b
2  c
3  a
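
Since the question asks for a Series rather than a DataFrame, one possible variation is to pass expand=False, which makes extract return a Series when the pattern has a single capturing group. The sketch below also assumes a hypothetical value, 'a-b-77', to show how a prefix containing the delimiter could be captured; the pattern '^(a-b|a|b|c)' is an illustration, not part of the original answer. Regex alternation tries alternatives from left to right, so the longer 'a-b' has to be listed before 'a'.

```python
import pandas as pd

# 'a-b-77' is a hypothetical value whose prefix contains the delimiter
s = pd.Series(['a-1524112-124', 'b-1515', 'c-584854', 'a-b-77'])

# expand=False returns a Series for a single capturing group;
# '^' anchors the match at the start of each string
letters = s.str.extract(r'^(a-b|a|b|c)', expand=False)

print(letters.tolist())
```

Rows that do not match the pattern come back as missing values, so they can be filtered afterwards with dropna() if needed.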

Upvotes: 1
