SeasonedLeo

Reputation: 71

Resampling of categorical column in pandas data frame

I need some help figuring this out; I have been trying a few things but nothing works. I have a pandas data frame, shown below (at the end). The data is available at irregular intervals (the frequency is not fixed). I am looking to resample the data at a fixed frequency, e.g. every 1 minute. If the column is a float, then taking the mean every 1 minute works fine:

df1.resample('1T', base=1).mean()

but since the data is categorical, the mean doesn't make sense. I also tried sum, which also doesn't make sense for this kind of sampling. What I essentially need is the value with the maximum count in each 1-minute window. To do this, I used the following code to apply a custom function to the values that fall in each minute when resampling:

import numpy as np

def custome_mod(arraylike):
    vals, counts = np.unique(arraylike, return_counts=True)
    return np.argwhere(counts == np.max(counts))

df1.resample('1T', base=1).apply(custome_mod)

The output I am expecting is a data frame available at every 1 minute, holding the value with the maximum count among the data points that fall in that minute. For some reason it does not seem to work and gives me an error. I have been trying to debug it for a very long time. Can somebody please provide some input / check the code?

The error I get is the following:

ValueError: zero-size array to reduction operation maximum which has no identity

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    264             try:
--> 265                 return self._python_agg_general(func, *args, **kwargs)
    266             except (ValueError, KeyError):

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    935 
--> 936             result, counts = self.grouper.agg_series(obj, f)
    937             assert result is not None

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
    862         grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863         return grouper.get_result()
    864 

pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()

pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()

pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()

ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    358                 # Check if the function is reducing or not.
--> 359                 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
    360             else:

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   1171             try:
-> 1172                 result[item] = colg.aggregate(func, *args, **kwargs)
   1173 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    268                 #  see see test_groupby.test_basic
--> 269                 result = self._aggregate_named(func, *args, **kwargs)
    270 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
    453             if isinstance(output, (Series, Index, np.ndarray)):
--> 454                 raise ValueError("Must produce aggregated value")
    455             result[name] = output

ValueError: Must produce aggregated value

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
    283             how = func
    284             grouper = None
--> 285             result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
    286 
    287         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    380             # we have a non-reducing function
    381             # try to evaluate
--> 382             result = grouped.apply(how, *args, **kwargs)
    383 
    384         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    733         with option_context("mode.chained_assignment", None):
    734             try:
--> 735                 result = self._python_apply_general(f)
    736             except TypeError:
    737                 # gh-20949

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    749 
    750     def _python_apply_general(self, f):
--> 751         keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
    752 
    753         return self._wrap_applied_output(

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    204             # group might be modified
    205             group_axes = group.axes
--> 206             res = f(group)
    207             if not _is_indexed_like(res, group_axes):
    208                 mutated = True

<command-36984414005658> in custome_mod(arraylike)
      1 def custome_mod(arraylike):
      2   vals, counts = np.unique(arraylike, return_counts=True)
----> 3   return (np.argwhere(counts == np.max(counts)))

<__array_function__ internals> in amax(*args, **kwargs)

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
   2666     """
   2667     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668                           keepdims=keepdims, initial=initial, where=where)
   2669 
   2670 

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     88                 return reduction(axis=axis, out=out, **passkwargs)
     89 
---> 90     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
     91 
     92 

ValueError: zero-size array to reduction operation maximum which has no identity
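
While debugging, I noticed that the final NumPy error can be reproduced with an empty array, which makes me suspect the resampler is passing empty bins (minutes with no data in them) to my function:

import numpy as np

# np.unique on an empty bin returns an empty counts array, and np.max of an
# empty array raises exactly the error above
vals, counts = np.unique(np.array([]), return_counts=True)
np.max(counts)  # ValueError: zero-size array to reduction operation maximum ...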

Sample DataFrame and expected output

Sample Df

6/3/2021 1:19:05    0
6/3/2021 1:19:15    1
6/3/2021 1:19:26    1
6/3/2021 1:19:38    1
6/3/2021 1:20:06    0
6/3/2021 1:20:16    0
6/3/2021 1:20:36    1
6/3/2021 1:21:09    1
6/3/2021 1:21:19    1
6/3/2021 1:21:45    0
6/4/2021 1:19:15    0
6/4/2021 1:19:25    0
6/4/2021 1:19:36    0
6/4/2021 1:19:48    1
6/4/2021 1:22:26    1
6/4/2021 1:22:36    0
6/4/2021 1:22:46    0
6/5/2021 2:20:19    0
6/5/2021 2:20:21    1
6/5/2021 2:20:40    0

Expected Output

6/3/2021 1:19   1
6/3/2021 1:20   0
6/3/2021 1:21   1
6/4/2021 1:19   0
6/4/2021 1:22   0
6/5/2021 2:20   0

Notice that the original data frame has data available at an irregular frequency (sometimes every 5 seconds, sometimes every 20 seconds, etc.). The expected output is shown above: I need data every 1 minute (resampled to every minute instead of the original irregular seconds), and the categorical column should take the most frequent value during that minute. For example, in the original data, minute 1:19 has four data points and the most frequent value among them is 1. Similarly, minute 1:20 has three data points and the most frequent value is 0, and minute 1:21 has three data points and the most frequent value is 1. Also, the data I am working with has 20 million rows; this is an effort to reduce the data dimension. Hope this helps.

After getting the expected output, I would group by the column and count. Since each row then represents one minute, this count will be in minutes and I will be able to tell how long this column was 1 (in time); see the sketch below.
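
A minimal sketch of that follow-up step, assuming the per-minute result is a 0/1 Series called out (a hypothetical name, matching the answer below):

# one row per minute, so counting rows per value gives minutes per value
minutes_per_value = out.groupby(out).size()   # same as out.value_counts()

# how many minutes the column was 1 (0 if that value never occurs)
minutes_at_1 = int(minutes_per_value.get(1, 0))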

Upvotes: 2

Views: 1076

Answers (1)

Corralien

Reputation: 120391

Update after your edit:

# floor each timestamp to the minute, then keep the most frequent value per minute
out = df.set_index(pd.to_datetime(df.index).floor('T')) \
        .groupby(level=0)['category'] \
        .apply(lambda x: x.value_counts().idxmax())
print(out)

# Output
2021-06-03 01:19:00    1
2021-06-03 01:20:00    0
2021-06-03 01:21:00    1
2021-06-04 01:19:00    0
2021-06-04 01:22:00    0
2021-06-05 02:20:00    0
Name: category, dtype: int64
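
If you want to stay with resample instead, here is an untested sketch of an equivalent variant: resample emits a bin for every minute of the covered range, including empty ones (which is what made your original np.max call fail), so the aggregation has to guard against empty bins and drop them afterwards.

# assumes df has a DatetimeIndex, as after pd.to_datetime above
out = df['category'].resample('1T').agg(
    lambda x: x.value_counts().idxmax() if len(x) else None
).dropna()

The floor-plus-groupby version above only ever sees minutes that actually contain data, so it sidesteps the empty bins entirely, which also tends to be cheaper on 20 million rows.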

Old answer

# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
            category
2021-06-03         6
2021-06-04         2
2021-06-06         1
2021-06-08         1
2021-06-25         1
2021-06-29         6
2021-06-30         3

# OR

>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
            category
2021-06-03         2
2021-06-04         0
2021-06-06         1
2021-06-08         1
2021-06-25         0
2021-06-29         3
2021-06-30         1
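
Side note: since the category is binary (0/1), the two aggregations above already determine the per-day mode; a minimal sketch combining them:

grp = df.set_index(df.index.floor('D')).groupby(level=0)['category']

# the mode of a 0/1 column is 1 exactly when more than half the values are 1
mode_per_day = (grp.sum() > grp.count() / 2).astype(int)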

Upvotes: 1
