Reputation: 111
I have a pandas DataFrame with 5000 rows and 400+ columns: an identifier index column, with each of the remaining columns containing key/value pairs in the form of a dictionary.
eg.
Identifier | 0 | 1 |
---|---|---|
identifier1 | {'key':'value'} | {'key2':'value2'} |
I'm trying to get to a point where the data looks like this:
Identifier | key | key2 |
---|---|---|
identifier1 | value | value2 |
I know we can use the following in order to create a df from a nested dictionary:
df = pd.DataFrame(data).T
But that would require the data to look like this:
data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}
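For example, this minimal sketch (names are just illustrative) shows that working case:
import pandas as pd

data = {'Identifier1': {'key': 'value', 'key2': 'value2'}}
df = pd.DataFrame(data).T
print(df)
# ->
#                key    key2
# Identifier1  value  value2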
When I use df.to_dict() on my current dataframe in order to get it to the above format, it looks like this:
{'Identifier1': {'0': {'key':'value'}, '1': {'key2':'value2'}, ...}}
Which I understand why. My question is twofold, I suppose:
How can I get my data to a place where df = pd.DataFrame(data).T works? And is there a better way to go about this?
EXAMPLE as requested:
Data looks like this:
FileID | 0 | 1 |
---|---|---|
file1 | {'key1':value1} | {'key3':value1} |
file2 | {'key2':value1} | {'key1':value2} |
I'd like it to look like this:
FileID | key1 | key2 | key3 |
---|---|---|---|
file1 | value1 | | value1 |
file2 | value2 | value1 | |
EDIT 2: There are NoneType values within the dataframe I am working with as well. In trying out the solutions myself, I've come to realize that. Apologies for not including that information initially.
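A minimal sketch of that shape (column and key names here are illustrative):
import pandas as pd

df = pd.DataFrame({
    'FileID': ['file1', 'file2'],
    0: [{'key1': 'value1'}, {'key2': 'value1'}],
    1: [{'key3': 'value1'}, None],  # some cells are None rather than dicts
})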
Upvotes: 2
Views: 2404
Reputation: 4548
A combination of melting, popitem(), and pivoting back.
This first code block just creates a test table with plenty of different types of null values.
import pandas as pd
import numpy as np
#example table
df = pd.DataFrame({
    'FileID': ['File_1','File_2','File_3','File_4'],
    'a': [{'key_5':9}, None, {}, {'key_9':13}],
    'b': [np.nan, {'key_9':1}, {'key_11':2}, {'key_3':9}],
    'c': [{'key_6':4}, {'key_3':8}, '', {'key_11':10}],
})
print(df)
Here's the code to clean and reshape the data:
#Melt to long form and extract the key/value pairs with popitem
df = df.melt(id_vars='FileID').dropna()

#popitem pulls a single (key, value) pair out of each dict;
#empty dicts and empty strings become (None, None) and are dropped below
df[['key','value']] = (
    df['value'].apply(lambda d: d.popitem() if d else (None,None)).to_list()
)

#pivot back to wide form; this will fail if the same key
#is used multiple times in the same row
df = df.dropna().pivot(index='FileID',columns='key',values='value').reset_index()
df.columns.name = ''
print(df)
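Note that popitem extracts only a single pair per dict. If your dicts can hold more than one key/value pair, here is a sketch of a variant (assuming the same overall shape, with hypothetical data) that explodes every pair instead:

import pandas as pd
import numpy as np

#hypothetical frame whose dicts may hold several pairs
df = pd.DataFrame({
    'FileID': ['File_1', 'File_2'],
    'a': [{'key_1': 1, 'key_2': 2}, None],
    'b': [np.nan, {'key_3': 3}],
})

long = df.melt(id_vars='FileID').dropna()
#turn each dict into a list of (key, value) tuples and explode,
#so every pair gets its own row instead of popping just one
long['value'] = long['value'].apply(lambda d: list(d.items()) if d else None)
long = long.explode('value').dropna()
long[['key', 'value']] = long['value'].to_list()
out = long.pivot(index='FileID', columns='key', values='value').reset_index()
out.columns.name = ''
print(out)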
Upvotes: 0
Reputation: 9379
UPDATED:
Here is my understanding of the question (as updated, including in the comments, by OP):
- The input dataframe has a FileID column and an arbitrary number of additional columns, with each value in each row containing either None or a dict with 0 or more key/value pairs
- The desired result is a dataframe with a FileID column and one additional column for each unique key in the union of all keys in the dicts contained in the original dataframe
- The result should have one row per unique FileID
Here is code to do what is asked:
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] + sorted(df2.columns[1:])]
Explanation:
- apply to create one dict per row which is the union of the dicts in each column, as well as a key/value pair for the row's FileID
- concat to have the leftmost column be unchanged as FileID
- set_index and stack to prepare the above dataframe to be used as a sequence of dicts to initialize a new dataframe using from_records, such that it will have a column for each unique key in the union of all dicts in the original dataframe
- agg with a custom function that, for each FileID, delivers the first non-NaN value for each key column, or NaN if that key is not found within the group for a given FileID
- reset_index to restore FileID as a column

Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'FileID': ['file1','file2'],
                   0: [{'key1':'value1'}, {'key2':'value1'}],
                   1: [{'key3':'value1'}, None],
                   2: [{}, {'key1':'value2'}]})
print(df)
df = pd.concat([df.FileID.to_frame(), df.apply(lambda x: {'FileID':x.FileID} |
{k:v for col in df.columns[1:] for k, v in (x[col] or {}).items()}, axis=1)], axis=1)
def foo(x):
for y in x:
if y is not np.nan:
return y
return np.nan
df2 = pd.DataFrame.from_records(df.set_index('FileID').stack()).groupby('FileID').agg(foo).reset_index()
df2 = df2[['FileID'] + sorted(df2.columns[1:])]
print(df2)
Input:
FileID 0 1 2
0 file1 {'key1': 'value1'} {'key3': 'value1'} {}
1 file2 {'key2': 'value1'} None {'key1': 'value2'}
Output:
FileID key1 key2 key3
0 file1 value1 NaN value1
1 file2 value2 value1 NaN
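Note: the dict-union operator | used in the lambda requires Python 3.9+. On earlier versions, the same merge can be written with unpacking; a sketch that replaces only the apply step, with df as originally constructed in the test code above:
#Python < 3.9 equivalent of the | union in the apply above
row_dicts = df.apply(
    lambda x: {'FileID': x.FileID,
               **{k: v for col in df.columns[1:]
                  for k, v in (x[col] or {}).items()}},
    axis=1)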
Upvotes: 2