Alex Lenail
Alex Lenail

Reputation: 14450

How to transform list of tuples of dicts to a pandas dataframe

Problem Statement

I have a list of tuples of dicts: [(A, B), (A, B),...]. I wrote A and B for the dictionaries because the keys are the same across these "types".

I want a dataframe with some keys from A and some keys from B.

Some of the keys in A are also present in B. I'd like to keep the keys from A.

Ways of approaching it:

I can think of a couple ways, and I'm curious which will be more performant. I've listed them in the order of my best guess as to performance:

As a complete newbie to pandas, it seems to difficult to tell what each operation is, and when it's building a view or doing a copy, so I can't tell what is expensive and what isn't.

Specifics:

Here's some simplified example data,

[({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"}, 
  {"chrom": "chr1": "pStart": 1500, "pEnd": 2500, "drop": "this"}), 
 ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"}, 
  {"chrom": "chr2": "pStart": 7500, "pEnd": 9500, "drop": "me"}) ]

The result I'd like I think would be the outcome of:

 pd.DataFrame.from_records([ 
  {"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "pStart": 1500, "pEnd": 2500}, 
  {"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "pStart": 7500, "pEnd": 9500}  ] )

Pseudocode of the solution I'd like:

I think this would work if dictionaries had a nice, in-place select method:

A_fields = [...]
B_fields = [...]
A_B_merged = [a.select(A_fields).extend(b.select(B_fields)) for a, b in A_B_not_merged]

A_B_dataframe = pd.DataFrame.from_records(A_B_merged)

Upvotes: 3

Views: 646

Answers (2)

kuriouscoder
kuriouscoder

Reputation: 5582

Using plain old dictionary merge that merges the start and end dictionary via pythonic (Python 3.5+) way and then uses from_records to construct DataFrame.

pd.DataFrame.from_records([{**d[0],**d[1]} for d in k])

  chrom  gEnd  gStart  pEnd  pStart
0  chr1  2000    1000  2500    1500
1  chr2  8500    8000  9500    7500

Upvotes: 0

Zeugma
Zeugma

Reputation: 32095

You need to go down two levels of your input to get it processed. Your best friend is then chain.from_iterable:

import itertools as it
pd.DataFrame.from_records(it.chain.from_iterable(l))
Out[21]: 
  chrom    gEnd  gStart    pEnd  pStart
0  chr1  2000.0  1000.0     NaN     NaN
1  chr1     NaN     NaN  2500.0  1500.0
2  chr2  8500.0  8000.0     NaN     NaN
3  chr2     NaN     NaN  9500.0  7500.0

This requires classic and easy cleanup:

pd.DataFrame.from_records(it.chain.from_iterable(l)).set_index('chrom').stack().unstack()
Out[22]: 
         gEnd  gStart    pEnd  pStart
chrom                                
chr1   2000.0  1000.0  2500.0  1500.0
chr2   8500.0  8000.0  9500.0  7500.0

Upvotes: 1

Related Questions