Reputation: 111
I have a pandas DataFrame with 5000 rows and 400+ columns: an identifier index column, with each of the remaining columns containing key/value pairs in the form of a dictionary.
eg.
Identifier | 0 | 1 |
---|---|---|
identifier1 | {'key':'value'} | {'key2':'value2'} |
I'm trying to get to a point where the data looks like this:
Identifier | key | key2 |
---|---|---|
identifier1 | value | value2 |
I know we can use the following in order to create a df from a nested dictionary:
df = pd.DataFrame(data).T
But that would require the data to look like this:
data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}
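For example, this minimal sketch (names are just illustrative) shows that working case:
import pandas as pd

data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}
df = pd.DataFrame(data).T
print(df)
# ->
#                key    key2
# Identifier1  value  value2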
When I use df.to_dict() on my current dataframe in order to get it to the above format, it looks like this:
{'Identifier1': {'0': {'key':'value'}, '1': {'key2':'value2'}, ...}}
Which I understand why. My question is twofold, I suppose:
How can I get my data to a place where df = pd.DataFrame(data).T works? And is there a better way to go about this?
EXAMPLE as requested:
Data looks like this:
FileID | 0 | 1 |
---|---|---|
file1 | {'key1':value1} | {'key3':value1} |
file2 | {'key2':value1} | {'key1':value2} |
I'd like it to look like this:
FileID | key1 | key2 | key3 |
---|---|---|---|
file1 | value1 | | value1 |
file2 | value2 | value1 | |
EDIT 2: There are NoneType values within the dataframe I am working with as well. In trying out the solutions myself, I've come to realize that. Apologies for not including that information initially.
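A minimal sketch of that shape (column and key names here are illustrative):
import pandas as pd

df = pd.DataFrame({
    'FileID': ['file1', 'file2'],
    0: [{'key1': 'value1'}, {'key2': 'value1'}],
    1: [{'key3': 'value1'}, None],  # some cells are None rather than dicts
})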
Upvotes: 2
Views: 2404
Reputation: 4548
A combination of melting, popitem(), and pivoting back.
This first code block just creates a test table with plenty of different types of null values.
import pandas as pd
import numpy as np
#example table
df = pd.DataFrame({
    'FileID': ['File_1','File_2','File_3','File_4'],
    'a': [{'key_5':9}, None, {}, {'key_9':13}],
    'b': [np.nan, {'key_9':1}, {'key_11':2}, {'key_3':9}],
    'c': [{'key_6':4}, {'key_3':8}, '', {'key_11':10}],
})
print(df)
Here's the code to clean and reshape the data:
#Melt to long form and extract the key/value pairs with popitem
df = df.melt(id_vars='FileID').dropna()

#popitem pulls a single (key, value) pair out of each dict;
#empty dicts and empty strings become (None, None) and are dropped below
df[['key','value']] = (
    df['value'].apply(lambda d: d.popitem() if d else (None,None)).to_list()
)

#pivot back to wide form; this will fail if the same key
#is used multiple times in the same row
df = df.dropna().pivot(index='FileID',columns='key',values='value').reset_index()
df.columns.name = ''
print(df)
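Note that popitem extracts only a single pair per dict. If your dicts can hold more than one key/value pair, here is a sketch of a variant (assuming the same overall shape, with hypothetical data) that explodes every pair instead:

import pandas as pd
import numpy as np

#hypothetical frame whose dicts may hold several pairs
df = pd.DataFrame({
    'FileID': ['File_1', 'File_2'],
    'a': [{'key_1': 1, 'key_2': 2}, None],
    'b': [np.nan, {'key_3': 3}],
})

long = df.melt(id_vars='FileID').dropna()
#turn each dict into a list of (key, value) tuples and explode,
#so every pair gets its own row instead of popping just one
long['value'] = long['value'].apply(lambda d: list(d.items()) if d else None)
long = long.explode('value').dropna()
long[['key', 'value']] = long['value'].to_list()
out = long.pivot(index='FileID', columns='key', values='value').reset_index()
out.columns.name = ''
print(out)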
Upvotes: 0
Reputation: 9379
UPDATED:
Here is my understanding of the question (as updated, including in the comments, by OP):
- The input dataframe has a FileID column and an arbitrary number of additional columns, with each value in each row containing either None or a dict with 0 or more key/value pairs
- The desired result is a dataframe with a FileID column and one additional column for each unique key in the union of all keys in the dicts contained in the original dataframe
- The result should have one row per unique FileID
Here is code to do what is asked:
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] + sorted(df2.columns[1:])]
Explanation:
- apply to create one dict per row which is the union of the dicts in each column, as well as a key/value pair for the row's FileID
- concat to have the leftmost column be unchanged as FileID
- set_index and stack to prepare the above dataframe to be used as a sequence of dicts to initialize a new dataframe using from_records, such that it will have a column for each unique key in the union of all dicts in the original dataframe
- agg with a custom function that, for each FileID, delivers the first non-NaN value for each key column, or NaN if that key is not found within the group for a given FileID
- reset_index to restore FileID as a column

Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'FileID': ['file1','file2'],
                   0: [{'key1':'value1'}, {'key2':'value1'}],
                   1: [{'key3':'value1'}, None],
                   2: [{}, {'key1':'value2'}]})
print(df)
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] + sorted(df2.columns[1:])]
print(df2)
Input:
FileID 0 1 2
0 file1 {'key1': 'value1'} {'key3': 'value1'} {}
1 file2 {'key2': 'value1'} None {'key1': 'value2'}
Output:
FileID key1 key2 key3
0 file1 value1 NaN value1
1 file2 value2 value1 NaN
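Note: the dict-union operator | used in the lambda requires Python 3.9+. On earlier versions, the same merge can be written with unpacking; a sketch that replaces only the apply step, with df as originally constructed in the test code above:
#Python < 3.9 equivalent of the | union in the apply above
row_dicts = df.apply(
    lambda x: {'FileID': x.FileID,
               **{k: v for col in df.columns[1:]
                  for k, v in (x[col] or {}).items()}},
    axis=1)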
Upvotes: 2