user3650713

Reputation: 47

Pandas Dataframe: Reduce Diagonal Sub-Frame to Single Row or How to Fill a Dataframe Piece by Piece

In a client/server application, data is requested from the server and the incoming replies mapped using a request id:

--> Request data for item i using request_id 1
--> Request data for item j using request_id 2
:
<-- Data element i.p for request_id1
<-- Data element j.p for request_id2
<-- Data element i.q for request_id1
<-- Data element j.q for request_id2
<-- Data element i.r for request_id1
<-- Data element j.r for request_id2

The incoming chunks are added to a list of dictionaries, as simulated by the following code:

import pandas
import random

md = list()
md.append({'request_id': 1, 'p': random.random()})
md.append({'request_id': 2, 'p': random.random()})
md.append({'request_id': 1, 'q': random.random()})
md.append({'request_id': 2, 'q': random.random()})
md.append({'request_id': 1, 'r': random.random()})
md.append({'request_id': 2, 'r': random.random()})

df = pandas.DataFrame(md).set_index('request_id')

print(df)

When a DataFrame is created from the list md, each row has a value in only one column; all others are NaN. The code above produces the following result:

                   p         q         r
request_id                              
1           0.955755       NaN       NaN
2           0.920858       NaN       NaN
1                NaN  0.583634       NaN
2                NaN  0.456644       NaN
1                NaN       NaN  0.198991
2                NaN       NaN  0.774762

[6 rows x 3 columns]

How can I reduce the df to just one row per request_id? What I really need is the following:

                   p         q         r
request_id                              
1           0.955755  0.583634  0.198991
2           0.920858  0.456644  0.774762

[2 rows x 3 columns]

The chunks arrive in no particular order, and an end-of-request message is received for each request. Each chunk is guaranteed to be sent at most once, so each column holds at most one non-NaN value per request_id.

A typical application is the asynchronous data retrieval of an option chain with various data elements such as Price, IV, Delta, Gamma, Theta, Vega for every single option.

Upvotes: 2

Views: 930

Answers (1)

unutbu

Reputation: 879601

Instead of a list of dicts, perhaps make md a dict of dicts, keyed first by column name and then by request_id:

import pandas
import random
import collections

md = collections.defaultdict(dict)
md['p'][1] = random.random()
md['p'][2] = random.random()
md['q'][2] = random.random()
md['q'][1] = random.random()
md['r'][1] = random.random()
md['r'][2] = random.random()

df = pandas.DataFrame(md)
df.index.name = 'request_id'

print(df)

yields something like

                   p         q         r
request_id                              
1           0.127898  0.565351  0.966917
2           0.983144  0.593652  0.617639

[2 rows x 3 columns]
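
In the client/server setting from the question, the same dict of dicts could be filled as each chunk arrives. The following is only a minimal sketch: the callback name on_chunk and its signature are hypothetical and not taken from any particular library, and the values are made up for illustration.

```python
import collections
import pandas as pd

# Outer key: data element name (column); inner key: request_id (row).
md = collections.defaultdict(dict)

def on_chunk(request_id, field, value):
    """Hypothetical callback: store one incoming data element."""
    md[field][request_id] = value

# Simulated replies arriving in arbitrary order.
on_chunk(2, 'q', 0.4)
on_chunk(1, 'p', 0.9)
on_chunk(1, 'r', 0.1)
on_chunk(2, 'p', 0.8)
on_chunk(1, 'q', 0.5)
# Request 2 never receives 'r', so that cell stays NaN.

df = pd.DataFrame(md).sort_index()
df.index.name = 'request_id'
print(df)
```

Because every element is written straight into its final cell, no intermediate frame full of NaNs is ever built, and elements that never arrive simply remain NaN.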

It is inefficient to create a large DataFrame just to shrink it later, but if you must use a list of dicts, you could combine the rows like this:

import pandas as pd
import random

md = list()
md.append({'request_id': 1, 'p': random.random()})
md.append({'request_id': 2, 'p': random.random()})
md.append({'request_id': 1, 'q': random.random()})
md.append({'request_id': 2, 'q': random.random()})
md.append({'request_id': 1, 'r': random.random()})
md.append({'request_id': 2, 'r': random.random()})
df = pd.DataFrame(md).set_index('request_id')
df = pd.concat([df[col].dropna() for col in df.columns], axis=1)
print(df)

This drops the NaNs from each column, then uses pd.concat to combine the list of Series into one DataFrame.
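
A possible alternative, not part of the approach above, is to group the rows by the index and take the first non-null value in each column with groupby(level=0).first(). This sketch assumes, as the question guarantees, that each column has at most one non-NaN value per request_id; the fixed values stand in for random.random() so the result is reproducible.

```python
import pandas as pd

# Same shape of input as above, with fixed values instead of random ones.
md = [
    {'request_id': 1, 'p': 0.1},
    {'request_id': 2, 'p': 0.2},
    {'request_id': 1, 'q': 0.3},
    {'request_id': 2, 'q': 0.4},
    {'request_id': 1, 'r': 0.5},
    {'request_id': 2, 'r': 0.6},
]
df = pd.DataFrame(md).set_index('request_id')

# GroupBy.first() skips NaN by default, so each group collapses to
# one row holding the single non-null value of every column.
reduced = df.groupby(level=0).first()
print(reduced)
```

If a data element never arrives for some request_id, the corresponding cell in the reduced frame stays NaN.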

Upvotes: 2
