Reputation: 262
I have a rather large bioinformatics dataset that I'm processing using pandas. It looks something like this:
>>> df = pd.DataFrame([['a=1|b=4', 'a=2|b=3', 'a=1|b=1'],
...                    [None]*3, [None]*3], index=['metadata', 'a', 'b']).T
>>> df
metadata a b
0 a=1|b=4 None None
1 a=2|b=3 None None
2 a=1|b=1 None None
I want to extract the metadata from the 'metadata' column and put it in the a and b columns (yes, bioinformatics file formats are terrible). The way I'm currently doing this is using df.iterrows():
>>> def make_dict(string):
...     return dict([kv.split('=') for kv in string.split('|')])
>>> for idx, row in df.iterrows():
...     for k, v in make_dict(row['metadata']).items():
...         df[k][idx] = v
>>> df
metadata a b
0 a=1|b=4 1 4
1 a=2|b=3 2 3
2 a=1|b=1 1 1
This works, but it is extremely slow for large datasets. My question is: is there a way to get the same result without iterrows, using vectorized pandas functions?
Upvotes: 2
Views: 367
Reputation: 28946
This should be faster:
In [124]: regex = r'\w=(\d+)\|\w=(\d+)'
In [127]: df[['a', 'b']] = df.metadata.str.extract(regex)
In [128]: df
Out[128]:
metadata a b
0 a=1|b=4 1 4
1 a=2|b=3 2 3
2 a=1|b=1 1 1
[3 rows x 3 columns]
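Note that str.extract returns strings, and the regex hard-codes the a-then-b order seen in your sample rows. As a minimal variation under that same assumption, named capture groups let pandas name the new columns for you, and pd.to_numeric converts them afterwards:

extracted = df.metadata.str.extract(r'a=(?P<a>\d+)\|b=(?P<b>\d+)')  # columns named from the capture groups
df[['a', 'b']] = extracted.apply(pd.to_numeric)  # extract yields strings, so convert to numbers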
EDIT:
Interesting, it actually looks slower on this tiny three-row frame, but I suspect the vectorized version will scale much better as the number of rows grows
In [131]: %%timeit
.....: df[['a', 'b']] = df.metadata.str.extract(regex)
.....:
1000 loops, best of 3: 973 µs per loop
In [135]: %%timeit
.....: for idx, row in df.iterrows():
.....:     for k, v in make_dict(row['metadata']).items():
.....:         df[k][idx] = v
.....:
1000 loops, best of 3: 440 µs per loop
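One rough way to check that scaling claim (the repetition factor below is arbitrary, purely for illustration) is to tile the sample rows into a much larger frame and time the extraction on that:

import timeit
import pandas as pd

# Tile the three sample rows into a larger frame; the vectorized extraction
# stays inside pandas' string machinery, while the iterrows version pays
# Python-level overhead on every row.
big = pd.DataFrame({'metadata': ['a=1|b=4', 'a=2|b=3', 'a=1|b=1'] * 10000})
regex = r'\w=(\d+)\|\w=(\d+)'
print(timeit.timeit(lambda: big.metadata.str.extract(regex), number=10))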
Upvotes: 2