freebie

Reputation: 1957

pandas.DataFrame.from_dict faster alternative

I have a dictionary of equally sized numpy arrays that I want to get into a DataFrame, where each of the values in the numpy arrays becomes a column. For example:

import numpy as np
import pandas as pd

my_dict = {
    "key_1": np.arange(50_000),
    "key_2": np.arange(50_000),
}
df = pd.DataFrame.from_dict(my_dict, orient="index")

#        0      1      2      3      4      5      6      7      8      9      10     ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_1      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_2      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999

# [2 rows x 50000 columns]

If you run this, you'll see that creating my_dict is very quick, but the pandas.DataFrame.from_dict call is noticeably slow.

I'm looking for an equivalent way of building this DataFrame with better performance.

Just to give more context, in case this step can be skipped completely: the final intent is to reindex that pandas.DataFrame with an arbitrary list of key names:

occurrences = ["key_1", "key_1", "key_2", "key_1", "key_2"]
df = df.reindex(occurrences)

#        0      1      2      3      4      5      6      7      8      9      10     ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_1      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_1      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_2      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_1      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999
# key_2      0      1      2      3      4      5      6      7      8      9     10  ...  49989  49990  49991  49992  49993  49994  49995  49996  49997  49998  49999

# [5 rows x 50000 columns]

Maybe there's a faster way that goes directly from occurrences and my_dict to the final DataFrame.
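
For instance, something along these lines, stacking the arrays once and selecting rows by position (just a rough sketch of what I mean, untested and not benchmarked; pd.Index.get_indexer is only used here to keep the lookup vectorized):

rows = pd.Index(list(my_dict)).get_indexer(occurrences)  # row position for each occurrence
stacked = np.vstack(list(my_dict.values()))              # shape (n_keys, 50_000)
df = pd.DataFrame(stacked[rows], index=occurrences)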

EDIT: Sorry, I should have mentioned, the np.arange(50_000) is just an example to make a large array. I can assert the data always looks like that: arbitrary values, but all arrays of the same length. Also, I'm trying to avoid Python loops (such as comprehensions) and use pandas or numpy, as the size of these items can get very large.

Thanks.

Upvotes: 0

Views: 1612

Answers (1)

Sayandip Dutta

Reputation: 15872

You can do:

>>> occurrences = ["key_1", "key_2", "key_3", "key_4", "key_5"]
>>> df = pd.DataFrame(dict.fromkeys(occurrences, range(50_000))).T
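
Note that dict.fromkeys keeps only the first occurrence of each key, so this builds one row per unique name:

>>> list(dict.fromkeys(['key_1', 'key_1', 'key_2']))
['key_1', 'key_2']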

If indeed there are duplicate indices:

>>> occurrences = ['key_1', 'key_1', 'key_2', 'key_1', 'key_2']
>>> df = pd.DataFrame(
...     data=np.repeat([np.arange(50_000)], repeats=len(occurrences), axis=0),
...     index=occurrences,
... )

In case you already have a predefined dict, for example my_dict, use transpose:

>>> my_dict = {
...     "key_1": np.arange(50_000),
...     "key_2": np.arange(50_000),
... }
>>> %timeit pd.DataFrame.from_dict(my_dict, orient="index")
3.21 s ± 62.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.DataFrame.from_dict(my_dict).T
551 µs ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
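
Putting this together with the reindex step from the question, you could presumably do something like (a sketch, not benchmarked; it relies on the same duplicate-label reindex shown in the question):

>>> occurrences = ['key_1', 'key_1', 'key_2', 'key_1', 'key_2']
>>> df = pd.DataFrame.from_dict(my_dict).T.reindex(occurrences)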

Upvotes: 3
