Hosea
Hosea

Reputation: 205

Faster method of extracting characters for multiple columns in dataframe

I have a Panda dataframe with multiple columns that has string data in a format like this:

id col1          col2          col3
1  '1:correct'   '0:incorrect' '1:correct'
2  '0:incorrect' '1:correct'   '1:correct'

What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:

id col1    col2    col3
1  1       0       1
2  0       1       1

What I have tried is using regex, like following:

colname = ['col1','col2','col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0,row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')

The second loop selects the first and only element in a list created by regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with few thousand rows. I have heard that loops are not very efficient in Python.

Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?

Upvotes: 1

Views: 443

Answers (2)

sammywemmy
sammywemmy

Reputation: 28699

An option is using list comprehension; since this involves strings, you should get fast speed:

import re

pattern = re.compile(r"\d(?=:)")

result = {key: [int(pattern.search(arr).group(0)) 
                if isinstance(arr, str) 
                else arr
               for arr in value.array]
           for key, value in df.items()}

pd.DataFrame(result)

       id   col1    col2    col3
   0    1      1       0    1
   1    2      0       1    1

Upvotes: 1

jezrael
jezrael

Reputation: 862791

Use Series.str.extract for get first value before : in DataFrame.apply for processing each column by lambda function:

colname = ['col1','col2','col3']

f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')

print (df)
   id  col1  col2  col3
0   1     1     0     1
1   2     0     1     1

Another solution with split and selecting first value before ::

colname = ['col1','col2','col3']

f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')

print (df)
   id  col1  col2  col3
0   1     1     0     1
1   2     0     1     1

Upvotes: 2

Related Questions