aabbccddeeff

Reputation: 13

Converting both a dictionary's keys and values to columns in a pandas dataframe efficiently

I have a dictionary like so:

dict1 = {k1:v1,k2:v2,k3:v3}

and I want to turn this dictionary into a DataFrame. Other questions here suggest pd.Series(dict1), which puts the keys in the index and yields something like this:

  Index  col1
    k1    v1
    k2    v2
    k3    v3

But in my case, I want the DataFrame to be like:

Index  col1   col2  col3  col4   col5   col6
  0     k1     v1    k2    v2     k3     v3

So I want each key and each value to be its own column, with none of them used as the index, which is what the usually recommended ways of turning a dict into a DataFrame do. In this example, the DataFrame should be 1x6 rather than 2x3 or 3x2.

I also have a very large dictionary of N such dictionaries that I would like to apply this to, which would yield an Nx6 DataFrame in this case, so ideally the method should not take too long to run. Does anyone have an idea how to do this? Thanks

Upvotes: 1

Views: 54

Answers (1)

Diptangsu Goswami

Reputation: 5965

You can get the items of the dict and flatten them.
I've used itertools.chain to flatten the (key, value) pairs into a single sequence.
A DataFrame built from that sequence has one column, so take its transpose to get a single row.

>>> import pandas as pd
>>> from itertools import chain
>>> d = {i: i*i for i in range(1, 6)}  # example dict
>>> d
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
>>> df = pd.DataFrame(chain(*d.items())).T
>>> df
   0  1  2  3  4  5  6   7  8   9
0  1  1  2  4  3  9  4  16  5  25

With the dict in your question, it would look like this:

>>> dict1 = {'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}
>>> pd.DataFrame(chain(*dict1.items())).T
    0   1   2   3   4   5
0  k1  v1  k2  v2  k3  v3

If you want columns with different names, simply rename them.
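
As a rough sketch of the renaming step, assuming the col1 ... col6 names from the question are what you want: the snippet below is a variant of the approach above, not the exact same call. It wraps the flattened list in another list so pandas treats it as one row (no transpose needed) and passes the names through the columns argument. The names flat and columns are just illustrative.

```python
import pandas as pd
from itertools import chain

dict1 = {'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}

# Flatten the (key, value) pairs into one list: k1, v1, k2, v2, ...
flat = list(chain.from_iterable(dict1.items()))

# Generate col1, col2, ... to match the number of flattened entries
# (naming scheme assumed from the question).
columns = [f"col{i}" for i in range(1, len(flat) + 1)]

# A list containing one list is read as a single row.
df = pd.DataFrame([flat], columns=columns)
```

If you already have the transposed DataFrame from above, assigning a list of names to df.columns should work as well.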


Here are some timings for this method with input dictionaries of varying sizes:
100, 10,000 and 100,000 items respectively.

In [18]: d100_items = {i: i*i for i in range(100)}.items()

In [19]: d10_000_items = {i: i*i for i in range(10_000)}.items()

In [20]: d1_00_000_items = {i: i*i for i in range(1_00_000)}.items()

In [22]: %timeit pd.DataFrame(chain(*d100_items)).T
329 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit pd.DataFrame(chain(*d10_000_items)).T
4.62 ms ± 83.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [24]: %timeit pd.DataFrame(chain(*d1_00_000_items)).T
56.8 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
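
For the dictionary of N dictionaries mentioned in the question, one possible sketch is to flatten each inner dict into a row and build the DataFrame from the list of rows in a single call. This was not part of the timings above, and it assumes every inner dict has the same number of items (3 here, giving Nx6). The outer dict dicts and its contents are made up for illustration.

```python
import pandas as pd
from itertools import chain

# Hypothetical input: an outer dict whose values are the dicts to flatten.
dicts = {
    'a': {'k1': 'v1', 'k2': 'v2', 'k3': 'v3'},
    'b': {'k4': 'v4', 'k5': 'v5', 'k6': 'v6'},
}

# One flattened row (key, value, key, value, ...) per inner dict.
rows = [list(chain.from_iterable(inner.items())) for inner in dicts.values()]

# Building the frame once from all rows avoids creating N small DataFrames.
df_all = pd.DataFrame(rows)
```

The outer keys are dropped here; if you need them, passing index=list(dicts) to pd.DataFrame would keep them as row labels.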

Upvotes: 1
