Reputation: 47
In a client/server application, data is requested from the server and the incoming replies are mapped to their requests using a request id:
--> Request data for item i using request_id 1
--> Request data for item j using request_id 2
:
<-- Data element i.p for request_id 1
<-- Data element j.p for request_id 2
<-- Data element i.q for request_id 1
<-- Data element j.q for request_id 2
<-- Data element i.r for request_id 1
<-- Data element j.r for request_id 2
The incoming chunks are added to a list of dictionaries, as simulated by the following code:
import pandas
import random
md = list()
md.append({'request_id': 1, 'p': random.random()})
md.append({'request_id': 2, 'p': random.random()})
md.append({'request_id': 1, 'q': random.random()})
md.append({'request_id': 2, 'q': random.random()})
md.append({'request_id': 1, 'r': random.random()})
md.append({'request_id': 2, 'r': random.random()})
df = pandas.DataFrame(md).set_index('request_id')
print(df)
When creating a dataframe from the list md, only one column per row has a value; all others are NaN. The code above produces the following result:
                   p         q         r
request_id
1           0.955755       NaN       NaN
2           0.920858       NaN       NaN
1                NaN  0.583634       NaN
2                NaN  0.456644       NaN
1                NaN       NaN  0.198991
2                NaN       NaN  0.774762
[6 rows x 3 columns]
How can I reduce the df to just one row per request_id? What I really need is the following:
                   p         q         r
request_id
1           0.955755  0.583634  0.198991
2           0.920858  0.456644  0.774762
[2 rows x 3 columns]
The chunks arrive in no particular order, and for each request an end-of-request message is received. Each chunk is guaranteed to be sent only once (if at all), so at most one data element per row is not NaN.
A typical application is the asynchronous data retrieval of an option chain with various data elements such as Price, IV, Delta, Gamma, Theta, Vega for every single option.
Upvotes: 2
Views: 930
Reputation: 879601
Perhaps instead of using the list of dicts, make md a dict of dicts:
import pandas
import random
import collections
md = collections.defaultdict(dict)
md['p'][1] = random.random()
md['p'][2] = random.random()
md['q'][2] = random.random()
md['q'][1] = random.random()
md['r'][1] = random.random()
md['r'][2] = random.random()
df = pandas.DataFrame(md)
df.index.name = 'request_id'
print(df)
yields something like
                   p         q         r
request_id
1           0.127898  0.565351  0.966917
2           0.983144  0.593652  0.617639
[2 rows x 3 columns]
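If it is more natural to key the data by request id as the chunks arrive, the same idea can be turned around. The following is only a sketch: the handler name on_chunk, the records dict, and the (request_id, field, value) chunk shape are assumptions for illustration, not part of the application described in the question.

```python
import collections
import random

import pandas as pd

# Hypothetical accumulator: request_id -> {field name -> value}
records = collections.defaultdict(dict)


def on_chunk(request_id, field, value):
    """Store one incoming data element under its request id (hypothetical handler)."""
    records[request_id][field] = value


# Simulate chunks for two requests arriving in arbitrary order
chunks = [(rid, field, random.random())
          for rid in (1, 2) for field in ('p', 'q', 'r')]
random.shuffle(chunks)
for rid, field, value in chunks:
    on_chunk(rid, field, value)

# orient='index' turns each outer key (request_id) into one row
df = pd.DataFrame.from_dict(dict(records), orient='index').sort_index()
df = df[sorted(df.columns)]  # arrival order is random, so fix the column order
df.index.name = 'request_id'
print(df)
```

A data element that never arrives simply stays NaN in that row, so nothing has to be combined afterwards.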
Creating a large DataFrame just to shrink it later is inefficient, but if you must use a list of dicts, you could combine the rows like this:
import pandas as pd
import random
md = list()
md.append({'request_id': 1, 'p': random.random()})
md.append({'request_id': 2, 'p': random.random()})
md.append({'request_id': 1, 'q': random.random()})
md.append({'request_id': 2, 'q': random.random()})
md.append({'request_id': 1, 'r': random.random()})
md.append({'request_id': 2, 'r': random.random()})
df = pd.DataFrame(md).set_index('request_id')
df = pd.concat([df[col].dropna() for col in df.columns], axis=1)
print(df)
This drops the NaNs from each column, then uses pd.concat to combine the list of Series into one DataFrame.
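Another option, offered as an alternative sketch rather than as the method above, is to group the sparse rows by the index and take the first non-null value in each column. GroupBy.first() skips NaN by default, which fits the guarantee that each element arrives at most once. The names sparse and compact are just illustrative.

```python
import random

import pandas as pd

# Same kind of simulated input: one dict per incoming chunk
md = [{'request_id': rid, col: random.random()}
      for col in ('p', 'q', 'r') for rid in (1, 2)]

sparse = pd.DataFrame(md).set_index('request_id')

# For every request_id, keep the first non-null entry of each column
compact = sparse.groupby(level='request_id').first()
print(compact)
```

If a data element never arrived for some request, that cell stays NaN in the result.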
Upvotes: 2