Reputation: 205
I have a Panda dataframe with multiple columns that has string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like following:
colname = ['col1','col2','col3']
row = len(df)
for col in colname:
df[col] = df[col].str.findall(r"(\d+):")
for i in range(0,row):
df[col].iloc[i] = df[col].iloc[i][0]
df[col] = df[col].astype('int64')
The second loop selects the first and only element in a list created by regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?
Upvotes: 1
Views: 443
Reputation: 28699
An option is using list comprehension; since this involves strings, you should get fast speed:
import re
pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
if isinstance(arr, str)
else arr
for arr in value.array]
for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Upvotes: 1
Reputation: 862791
Use Series.str.extract
for get first value before :
in DataFrame.apply
for processing each column by lambda function:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another solution with split and selecting first value before :
:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Upvotes: 2