Koppology

Reputation: 262

add multiple items to row using vectorized pandas function, not iterrows?

I have a rather large bioinformatics dataset that I'm processing using pandas. It looks something like this:

>>> df = pd.DataFrame([['a=1|b=4', 'a=2|b=3', 'a=1|b=1'], 
     [None]*3, [None]*3], index=['metadata', 'a', 'b']).T
>>> df
  metadata     a     b
0  a=1|b=4  None  None
1  a=2|b=3  None  None
2  a=1|b=1  None  None

I want to extract the metadata from the 'metadata' column and put it in the a and b columns (yes, bioinformatics file formats are terrible). The way I'm currently doing this is using df.iterrows():

>>> def make_dict(string):
        return dict([kv.split('=') for kv in string.split('|')])

>>> for idx, row in df.iterrows():
        for k, v in make_dict(row['metadata']).items():
            df.loc[idx, k] = v

>>> df
  metadata  a  b
0  a=1|b=4  1  4
1  a=2|b=3  2  3
2  a=1|b=1  1  1

This works, but it is extremely slow for large datasets. My question is: is there a way to get the same result using vectorized pandas functions, without using iterrows?

Upvotes: 2

Views: 367

Answers (1)

TomAugspurger

Reputation: 28946

This should be faster:

In [124]: regex = r'\w=(\d+)\|\w=(\d+)'

In [127]: df[['a', 'b']] = df.metadata.str.extract(regex)

In [128]: df
Out[128]: 
  metadata  a  b
0  a=1|b=4  1  4
1  a=2|b=3  2  3
2  a=1|b=1  1  1

[3 rows x 3 columns]
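A variation on the same idea: named capture groups let `str.extract` label the result columns itself, so the column names don't have to be repeated on the left-hand side. This is a minimal sketch (the regex is hard-coded to the `a=...|b=...` layout from the question, not a general metadata parser):

```python
import pandas as pd

df = pd.DataFrame({'metadata': ['a=1|b=4', 'a=2|b=3', 'a=1|b=1']})

# Named groups make extract() return a DataFrame whose columns
# are already called 'a' and 'b'
regex = r'a=(?P<a>\d+)\|b=(?P<b>\d+)'
extracted = df.metadata.str.extract(regex)
df[extracted.columns] = extracted
```

Note that `str.extract` returns strings; add `.astype(int)` on the extracted frame if numeric columns are needed.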

EDIT:

Interesting, it looks like it's actually slower on this tiny frame, but I suspect that this will scale well:

In [131]: %%timeit
   .....: df[['a', 'b']] = df.metadata.str.extract(regex)
   .....: 
1000 loops, best of 3: 973 µs per loop

In [135]: %%timeit
for idx, row in df.iterrows():
        for k, v in make_dict(row['metadata']).items():
            df.loc[idx, k] = v
   .....: 
1000 loops, best of 3: 440 µs per loop
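On three rows the fixed overhead of `str.extract` dominates, so the comparison above says little about large frames. A sketch like the following (the frame size and synthetic metadata are made up here, not from the original post) can probe how the two approaches compare as the data grows:

```python
import timeit
import pandas as pd

def make_dict(s):
    return dict(kv.split('=') for kv in s.split('|'))

# Hypothetical larger frame with the same a=...|b=... layout
n = 2000
df = pd.DataFrame({'metadata': ['a=%d|b=%d' % (i % 10, i % 7) for i in range(n)],
                   'a': None, 'b': None})

regex = r'\w=(\d+)\|\w=(\d+)'

def vectorized(frame):
    frame = frame.copy()
    frame[['a', 'b']] = frame.metadata.str.extract(regex)
    return frame

def looped(frame):
    frame = frame.copy()
    for idx, row in frame.iterrows():
        for k, v in make_dict(row['metadata']).items():
            frame.loc[idx, k] = v
    return frame

print('extract: ', timeit.timeit(lambda: vectorized(df), number=3))
print('iterrows:', timeit.timeit(lambda: looped(df), number=3))
```

Both functions work on a copy so the benchmark runs are independent and the two results can be compared for equality.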

Upvotes: 2
