Reputation: 1957
I have a dictionary of equally sized numpy arrays that I want to get into a DataFrame, where each one of the values in the numpy arrays becomes a column. For example:
import numpy as np
import pandas as pd
my_dict = {
"key_1": np.arange((50_000)),
"key_2": np.arange((50_000)),
}
df = pd.DataFrame.from_dict(my_dict, orient="index")
# 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_1 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_2 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# [2 rows x 50000 columns]
If you run this, you'll see that the my_dict creation is very quick, but the pandas.DataFrame.from_dict call is noticeably slow.
I'm looking for a way of getting something equivalent that has better performance.
Just to give more context, in case this step can be skipped completely: the final intent is to get that pandas.DataFrame reindexed with an arbitrary list of key names:
occurrences = ["key_1", "key_1", "key_2", "key_1", "key_2"]
df = df.reindex(occurrences)
# 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_1 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_1 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_2 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_1 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# key_2 0 1 2 3 4 5 6 7 8 9 10 ... 49989 49990 49991 49992 49993 49994 49995 49996 49997 49998 49999
# [5 rows x 50000 columns]
Maybe there is a faster way going directly from occurrences and my_dict.
EDIT: Sorry, I should have mentioned that the np.arange(50_000) is just an example to make a large array. I can guarantee the data has that shape: the values are arbitrary, but all of the arrays are the same length. Also, I'm trying to avoid Python loops (such as comprehensions) and use pandas or numpy instead, as the size of these items can get very large.
Thanks.
Upvotes: 0
Views: 1612
Reputation: 15872
You can do:
>>> occurrences = ["key_1", "key_2", "key_3", "key_4", "key_5"]
>>> df = pd.DataFrame(dict.fromkeys(occurrences, range(50_000))).T
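Note that dict.fromkeys collapses duplicate names, so this form only fits occurrences with unique keys:
>>> len(dict.fromkeys(["key_1", "key_1", "key_2"]))
2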
If indeed there are duplicate indices:
>>> occurrences = ['key_1', 'key_1', 'key_2', 'key_1', 'key_2']
>>> df = pd.DataFrame(
data=np.repeat([np.arange(50_000)], repeats=len(occurrences), axis=0),
index=occurrences
)
In case you already have a predefined dict, for example my_dict, use transpose:
>>> my_dict = {
"key_1": np.arange((50_000)),
"key_2": np.arange((50_000)),
}
>>> %timeit pd.DataFrame.from_dict(my_dict, orient="index")
3.21 s ± 62.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pd.DataFrame.from_dict(my_dict).T
551 µs ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
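For completeness, a sketch of the full pipeline from the question, combining the fast transposed construction with the reindex:
>>> df = pd.DataFrame.from_dict(my_dict).T.reindex(occurrences)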
Upvotes: 3