Reputation: 10011
If I have a dataframe like this:
id str
01 abc_d(a)
02 ab_d(a)
03 abcd_e(a)
04 a_b(a)
How can I get a dataframe like the following? Sorry, I made up this dataframe to represent my real issue. Thanks.
id str
01 d
02 d
03 e
04 b
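For anyone who wants to reproduce this, the sample frame can be built roughly as below. This is only a sketch: storing id as strings (to keep the leading zeros) is an assumption, and the outputs in the answers suggest id may actually be integers.

```python
import pandas as pd

# Sample input from the question. Keeping id as strings is an assumption
# made here so that the leading zeros survive.
df = pd.DataFrame({
    "id": ["01", "02", "03", "04"],
    "str": ["abc_d(a)", "ab_d(a)", "abcd_e(a)", "a_b(a)"],
})

# The goal: keep only the token between the underscore and the "(".
print(df)
```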
Upvotes: 3
Views: 1163
Reputation: 402303
(Bad Answer)
Series.str.split
df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
(Less Bad answer)
Series.str.extract
df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
Regex methods come with their fair share of overhead, and str.extract
does not do much to make things better.
(Better Answer)
re.search with list comp
import re
p = re.compile(r'(?<=_)[^_]+(?=\()')
df['str'] = [p.search(x)[0] for x in df['str'].tolist()]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
This should be faster than the methods above. In my experience, list comprehensions are considerably faster than most of pandas' vectorised string methods, even when a regex is involved. Compiling the pattern once up front avoids some of the overhead.
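One caveat with the list comp version: p.search returns None when a row does not match, so indexing with [0] raises a TypeError. If your real data may contain such rows, a guarded variant could look like the sketch below. The helper name extract_token and the non-matching sample row are made up for illustration.

```python
import re
import pandas as pd

# Third row deliberately has no "_x(" part (hypothetical example data).
df = pd.DataFrame({"id": [1, 2, 3],
                   "str": ["abc_d(a)", "a_b(a)", "no match here"]})

p = re.compile(r"(?<=_)[^_]+(?=\()")

def extract_token(text):
    """Return the token between '_' and '(', or None if there is no match."""
    m = p.search(text)
    return m[0] if m else None

df["str"] = [extract_token(x) for x in df["str"].tolist()]
print(df)
```

Rows without a match end up as missing values instead of stopping the whole assignment with an exception.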
(Also a better answer)
str.split with list comp
df['str'] = [
x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df
id str
0 1 d
1 2 d
2 3 e
3 4 b
This combines the best of both worlds: the low overhead of a list comp and the speed of pure Python string splitting. It should be the fastest.
Performance
df_test = pd.concat([df] * 10000, ignore_index=True)
%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1]
%timeit [p.search(x)[0] for x in df_test['str'].tolist()]
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]
70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # fastest but not by much
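%timeit only works in IPython/Jupyter. Outside of it, a rough equivalent for the two list comp variants using the standard timeit module might look like this. It is a sketch only: the function names via_regex and via_split are made up here, and the numbers will differ by machine and pandas version.

```python
import re
import timeit
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "str": ["abc_d(a)", "ab_d(a)", "abcd_e(a)", "a_b(a)"]})
df_test = pd.concat([df] * 10000, ignore_index=True)
p = re.compile(r"(?<=_)[^_]+(?=\()")

def via_regex():
    # Pre-compiled regex applied in a list comprehension.
    return [p.search(x)[0] for x in df_test["str"].tolist()]

def via_split():
    # Pure Python string splitting in a list comprehension.
    return [x.split("(", 1)[0].split("_")[1] for x in df_test["str"].tolist()]

for fn in (via_regex, via_split):
    # Average wall-clock time over 10 calls.
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per loop")
```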
Upvotes: 4
Reputation: 164623
Using pd.Series.str.split. Specific to your particular format.
df['str'] = df['str'].str.split('_').str[-1].str[0]
print(df)
id str
0 1 d
1 2 d
2 3 e
3 4 b
Upvotes: 1
Reputation: 18208
Maybe you can try split, similar to this example:
df['str'] = df['str'].str.split('_').str.get(1).str[0]
Or,
df['str'] = df['str'].str.split('_').str.get(1).str.split('(').str[0]
Upvotes: 1
Reputation: 323226
Using extract
df['str']=df['str'].str.extract(r"_(.*)\(",expand=False)
df
Out[585]:
id str
0 1 d
1 2 d
2 3 e
3 4 b
Upvotes: 3