Kenan
Kenan

Reputation: 14094

Convert series of dict and concat to dataframe

My dataframe

                             Items  Count  ScannedCount  
0  {'comp': {'S': '2019-08-02'}...   1032          1032 
1  {'comp': {'S': '2019-08-27'}...   1032          1032  

The items series looks like this

{'comp': {'S': '2019-08-02T16:54:55.035196+03:00'}, 'ID': {'S': '336'}, 'dID': {'S': '1763523'}, 'fname': {'S': '558012'}}

Using the second answer from this post allows me to convert the series to a dataframe. The issues is how to scale that operation since it happens on each row,

Current approach:

Looping through each row and concat them into a series (very slow)

item_df = pd.DataFrame(df['Items'].iloc[i]) for i in range(df.shape[0])]).reset_index(drop=True), df], axis=1)

Concat the results with the original dataframe

df = pd.concat([temp, df], axis=1)

I believe the for loop in the first part is the bottleneck. Is there a faster way to convert a series to dataframe and concat it back to the original dataframe.

Expected output:

                 comp   ID  dID      fname   Count  ScannedCount  
0  2019-08-02T16:54:55  336 1763523  548012  1032   1032
1  2019-09-01T14:52:24  336 1763523  528012  1032   1032

Upvotes: 3

Views: 790

Answers (3)

Quant Christo
Quant Christo

Reputation: 1430

Try this more columnar solution. It works with assumptation that there is always 'S' key in dictionary.

df_tmp = df['Items'].apply(pd.Series)
for c in df_tmp.columns:
    df[c] = df_tmp[c].apply(lambda x: x.get('S'))
df = df.drop(columns='Items')

Upvotes: 1

Alex
Alex

Reputation: 1126

test data:

import pandas as pd
test_data = {'item' : [{'comp': {'S': '2019-08-02T16:54:55.035196+03:00'}, 'ID': {'S': '336'}, 'dID': {'S': '1763523'}, 'fname': {'S': '558012'}}, {'comp': {'S': '2019-09-02T16:54:55.035196+03:00'}, 'ID': {'S': '336'}, 'dID': {'S': '1763523'}, 'fname': {'S': '558012'}}], 'Count': [1032,1032], 'ScannedCount':[1032,1032]}

df = pd.DataFrame.from_dict(test_data)

Out[64]:  
                                                    item  Count  ScannedCount
    0  {'comp': {'S': '2019-08-02T16:54:55.035196+03:...   1032          1032
    1  {'comp': {'S': '2019-09-02T16:54:55.035196+03:...   1032          1032

As far as i understand, you try to get something like this:

def extract(row):

    item_series = pd.Series({k:v for k,v in row['item'].items()})
    result = row.append(item_series)

    return result

df = df.apply(extract, axis = 1)

This gives you:

Out[67]: 
                                            comp            ID               dID  \
    0  {'S': '2019-08-02T16:54:55.035196+03:00'}  {'S': '336'}  {'S': '1763523'}   
    1  {'S': '2019-09-02T16:54:55.035196+03:00'}  {'S': '336'}  {'S': '1763523'}   

             fname  
    0  {'S': '558012'}  
    1  {'S': '558012'}  

## skipped some columns for clarity

Upvotes: 0

89f3a1c
89f3a1c

Reputation: 1488

Looks like iterating over all the df is inevitable. I'm not sure if it's better, but I know the pandas way of doing so is by using iterrows().

In the docs they also mention itertuples() for specific use cases, but again, I'm no expert here.

Hope this helps!

Upvotes: 0

Related Questions