Reputation: 13
I have a dictionary like so:
dict1 = {k1:v1,k2:v2,k3:v3}
and I want to turn this dictionary into a DataFrame. I have seen other questions here that use pd.Series(dict1), which yields a result like so:
Index col1
k1    v1
k2    v2
k3    v3
But in my case, I want the DataFrame to be like:
Index col1 col2 col3 col4 col5 col6
0     k1   v1   k2   v2   k3   v3
So I want every key and every value to be its own column, with none of them used as the index, which is what the usually recommended ways of turning a dict into a DataFrame do. In this example, I want the DataFrame to be 1x6 rather than 2x3 or 3x2.
I also have a very large dictionary of N such dictionaries that I would like to apply this to, which would yield an Nx6 DataFrame in this case, so ideally the method should not take too long to run. Does anyone have an idea how to do this? Thanks
Upvotes: 1
Views: 54
Reputation: 5965
You can get the items of the dict and flatten them; I've used itertools.chain for the flattening.
Then build a DataFrame from the flattened items and take its transpose.
>>> import pandas as pd
>>> from itertools import chain
>>> d = {i: i*i for i in range(1, 6)} # example dict
>>> d
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
>>> df = pd.DataFrame(chain(*d.items())).T
>>> df
   0  1  2  3  4  5  6   7  8   9
0  1  1  2  4  3  9  4  16  5  25
With the dict in your question, it would look like this:
>>> dict1 = {'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}
>>> pd.DataFrame(chain(*dict1.items())).T
    0   1   2   3   4   5
0  k1  v1  k2  v2  k3  v3
If you want columns with different names, simply rename them.
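As a rough sketch of the renaming step, something like the following should work. The names col1 to col6 are only placeholders taken from the layout in the question, and chain.from_iterable is used here as an equivalent alternative to chain(*...). Wrapping the flat list in an outer list makes it a single row directly, so the transpose is not needed in this variant.

```python
import pandas as pd
from itertools import chain

dict1 = {"k1": "v1", "k2": "v2", "k3": "v3"}

# Flatten the (key, value) pairs into one sequence: k1, v1, k2, v2, ...
flat = list(chain.from_iterable(dict1.items()))

# A list containing one list is read as a single row.
df = pd.DataFrame([flat])

# Placeholder column names col1..col6 (an assumption, pick whatever fits).
df.columns = [f"col{i}" for i in range(1, len(flat) + 1)]
print(df)
```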
Here are some timings for this method with input dictionaries of 100, 10,000 and 100,000 items respectively.
In [18]: d100_items = {i: i*i for i in range(100)}.items()
In [19]: d10_000_items = {i: i*i for i in range(10_000)}.items()
In [20]: d1_00_000_items = {i: i*i for i in range(1_00_000)}.items()
In [22]: %timeit pd.DataFrame(chain(*d100_items)).T
329 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: %timeit pd.DataFrame(chain(*d10_000_items)).T
4.62 ms ± 83.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]: %timeit pd.DataFrame(chain(*d1_00_000_items)).T
56.8 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
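For the dictionary of N dictionaries mentioned in the question, one possible approach is to flatten each inner dictionary into a row and build the DataFrame once from all the rows. This is only a sketch: the outer dictionary dicts and its keys are made up for illustration, and it assumes every inner dictionary has the same number of items.

```python
import pandas as pd
from itertools import chain

# Hypothetical outer dictionary holding N inner dictionaries (here N = 2).
dicts = {
    "a": {"k1": "v1", "k2": "v2", "k3": "v3"},
    "b": {"k4": "v4", "k5": "v5", "k6": "v6"},
}

# One flattened row (key, value, key, value, ...) per inner dictionary.
rows = [list(chain.from_iterable(d.items())) for d in dicts.values()]

# Building the frame once from all rows avoids creating N small DataFrames.
df_n = pd.DataFrame(rows)
print(df_n)
```

I have not timed this variant, so measure it on your own data before relying on it.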
Upvotes: 1