Reputation: 4322
So the issue is that I have a large dataframe (several million rows) and I need to split it into separate dfs based on the value of metric (which can have several thousand unique values in the df), then put all the individual dfs into a dictionary.
The data looks like this:
>>> df.sample(20)
Out[104]:
                        time       mhi                 metric
1953310  2020-09-26 09:57:59  0.364575   100004_uf7-15_l14-40
5748967  2020-11-15 14:50:27  0.430073  100004_uf11-15_l10-45
3124709  2020-10-17 23:32:50  1.000000   100004_uf5-21_l26-40
2201278  2020-10-01 12:30:26  0.020645  100004_uf09-27_l26-46
5515393  2020-11-14 03:48:50  1.000000   100004_uf9-18_l26-35
1813859  2020-09-25 00:48:42  0.572557   100004_uf7-24_l10-40
1656151  2020-09-24 00:39:28  0.673656  100004_uf07-24_l32-42
4796411  2020-11-10 09:21:54  1.000000   100004_uf5-15_l22-30
  92122  2020-07-06 07:20:37  1.000000   100004_uf5-21_l26-30
3690550  2020-10-25 23:40:57  0.268361  100004_uf09-18_l28-42
4946382  2020-11-11 01:58:22  1.000000   100004_uf5-18_l22-35
3899731  2020-11-01 11:48:08  1.000000   100004_uf7-15_l22-30
5996972  2020-11-17 10:55:22  1.000000  100004_uf07-21_l32-42
7471727  2021-01-01 11:52:45  1.000000  100004_uf07-27_l30-42
3669036  2020-10-25 20:10:33  1.000000   100004_uf5-21_l10-35
1166225  2020-09-17 11:58:21  1.000000   100004_uf7-15_l22-30
5832113  2020-11-16 02:52:32  0.349082  100004_uf07-21_l28-54
1458903  2020-09-21 21:04:32  0.524897  100004_uf07-18_l30-42
3094785  2020-10-17 15:46:02  1.000000   100004_uf5-24_l18-30
 674615  2020-08-05 02:31:14  0.401657  100004_uf11-18_l34-46
What I'm currently doing is this:
versions = df.metric.unique()
mhi_dict = {ver: df.loc[df.metric == ver] for ver in versions}
Yet this is proving to be very time-consuming: it takes over 5 minutes on average for ~1500 unique versions. Is there a way to speed it up?
Upvotes: 0
Views: 120
Reputation: 240
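groupby computes the row indices for every metric value in a single pass over the frame, so each group can then be pulled out cheaply instead of re-scanning all the rows once per unique value: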
# Group once, then pull each sub-frame out of the precomputed groups.
df_grouped = df.groupby('metric')
mhi_dict = {}
for key in df_grouped.groups:
    group = df_grouped.get_group(key)
    mhi_dict[key] = group
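Equivalently, iterating the GroupBy object yields (key, group) pairs directly, so the whole dictionary can be built in one comprehension; passing sort=False also skips sorting the group keys, which may help a little with thousands of groups. A minimal sketch, assuming the same df and column name as in the question:

# One pass over the frame: iterating a GroupBy yields (key, sub-DataFrame)
# pairs. sort=False avoids sorting the ~1500 unique keys.
mhi_dict = {key: group for key, group in df.groupby('metric', sort=False)}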
Upvotes: 1