Faster method of extracting characters for multiple columns in dataframe

Question

I have a Panda dataframe with multiple columns that has string data in a format like this:

id col1          col2          col3
1  '1:correct'   '0:incorrect' '1:correct'
2  '0:incorrect' '1:correct'   '1:correct'

What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:

id col1    col2    col3
1  1       0       1
2  0       1       1

What I have tried is using regex, like following:

colname = ['col1','col2','col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0,row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')

The second loop selects the first and only element in a list created by regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with few thousand rows. I have heard that loops are not very efficient in Python.

Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?

jezrael · Accepted Answer

Use Series.str.extract for get first value before : in DataFrame.apply for processing each column by lambda function:

colname = ['col1','col2','col3']

f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')

print (df)
   id  col1  col2  col3
0   1     1     0     1
1   2     0     1     1

Another solution with split and selecting first value before ::

colname = ['col1','col2','col3']

f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')

print (df)
   id  col1  col2  col3
0   1     1     0     1
1   2     0     1     1

Faster method of extracting characters for multiple columns in dataframe

Answers (2)

Related Questions