Reputation: 14450
I have a list of tuples of dicts: [(A, B), (A, B), ...]. I wrote A and B for the dictionaries because the keys are the same across these "types". I want a dataframe with some keys from A and some keys from B. Some of the keys in A are also present in B; I'd like to keep the keys from A.
I can think of a couple of ways, and I'm curious which will be more performant. I've listed them in the order of my best guess as to performance:

1. A list comprehension, building new dictionaries (or extending A with parts of B), and then pd.DataFrame.from_records.
2. pd.DataFrame.from_records has an exclude parameter. Merge the larger dicts first and then exclude columns when building the dataframe.
3. Transpose the list of tuples (maybe zip(*)?), create two dataframes with .from_records, one each for A and B, remove unnecessary columns from each, and then glue the resulting dataframes together side by side.
4. Make each dict (row) a dataframe and then glue them on top of one another vertically (append or concat or something).
As a complete newbie to pandas, it seems too difficult to tell what each operation does, and when it's building a view versus doing a copy, so I can't tell what is expensive and what isn't.
Am I missing an approach to this?
Are my solutions in the correct order of performance?
If instead of dictionaries, A
and B
were dataframes, would concatenating them be faster? How much memory overhead does a dataframe have, and is it ever common practice to have a one-row dataframe?
Here's some simplified example data:
[({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
  {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
 ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
  {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"})]
The result I'd like would, I think, be the outcome of:
pd.DataFrame.from_records([
{"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "pStart": 1500, "pEnd": 2500},
{"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "pStart": 7500, "pEnd": 9500} ] )
I think this would work if dictionaries had a nice, in-place select
method:
A_fields = [...]
B_fields = [...]
A_B_merged = [a.select(A_fields).extend(b.select(B_fields)) for a, b in A_B_not_merged]
A_B_dataframe = pd.DataFrame.from_records(A_B_merged)
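Dicts don't have select/extend methods like that, but a dict comprehension does the same job. A minimal sketch using the sample data above (the field lists are assumptions); note that in a {**x, **y} merge the later dict wins on shared keys, so A is unpacked last to keep A's values:

```python
import pandas as pd

# Assumed field lists -- adjust to the keys you actually need.
A_fields = ["chrom", "gStart", "gEnd"]
B_fields = ["pStart", "pEnd"]

A_B_not_merged = [
    ({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
     {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
    ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
     {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"}),
]

# "select" is just a dict comprehension; A is unpacked last so its
# values win for any key present in both dicts.
A_B_merged = [
    {**{k: b[k] for k in B_fields}, **{k: a[k] for k in A_fields}}
    for a, b in A_B_not_merged
]
A_B_dataframe = pd.DataFrame.from_records(A_B_merged)
```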
Upvotes: 3
Views: 646
Reputation: 5582
Use a plain old dictionary merge that combines the two dictionaries the pythonic (Python 3.5+) way, then use from_records to construct the DataFrame:
pd.DataFrame.from_records([{**d[0],**d[1]} for d in k])
chrom gEnd gStart pEnd pStart
0 chr1 2000 1000 2500 1500
1 chr2 8500 8000 9500 7500
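One caveat: in {**d[0], **d[1]} the second dict's values overwrite the first's for shared keys, while the question asks to keep A's values. Swapping the unpack order, and using from_records's exclude parameter to drop the unwanted keys, gives (a sketch with the question's sample data):

```python
import pandas as pd

k = [
    ({"chrom": "chr1", "gStart": 1000, "gEnd": 2000, "other": "drop this"},
     {"chrom": "chr1", "pStart": 1500, "pEnd": 2500, "drop": "this"}),
    ({"chrom": "chr2", "gStart": 8000, "gEnd": 8500, "other": "unimportant"},
     {"chrom": "chr2", "pStart": 7500, "pEnd": 9500, "drop": "me"}),
]

# Unpack B first so A's values win on shared keys like "chrom",
# then drop the unwanted columns at DataFrame construction time.
df = pd.DataFrame.from_records(
    [{**b, **a} for a, b in k], exclude=["other", "drop"]
)
```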
Upvotes: 0
Reputation: 32095
You need to go down two levels of your input to get it processed, i.e. flatten the list of tuples into a flat sequence of dicts. Your best friend here is chain.from_iterable:
import itertools as it
pd.DataFrame.from_records(it.chain.from_iterable(l))
Out[21]:
chrom gEnd gStart pEnd pStart
0 chr1 2000.0 1000.0 NaN NaN
1 chr1 NaN NaN 2500.0 1500.0
2 chr2 8500.0 8000.0 NaN NaN
3 chr2 NaN NaN 9500.0 7500.0
This requires classic and easy cleanup:
pd.DataFrame.from_records(it.chain.from_iterable(l)).set_index('chrom').stack().unstack()
Out[22]:
gEnd gStart pEnd pStart
chrom
chr1 2000.0 1000.0 2500.0 1500.0
chr2 8500.0 8000.0 9500.0 7500.0
Upvotes: 1