Reputation: 34155
I'm looking for a good transformation in pandas which would allow me to go from a time series of measurements into a count-per-bin-per-timebin list.
Let's say I've got my:
x = list(range(count))
y = [random.gauss(1, 0.1) for _ in range(count)]
I can transform this into binned intervals on both sides:
df = pandas.DataFrame.from_dict({'x': x, 'y': y})
df['x'].update(pandas.cut(df['x'], 20))
df['y'].update(pandas.cut(df['y'], 20))
I know I can get the value counts for the y
s using:
df['y'].value_counts()
But I'm having trouble putting the "run value_counts on y
grouped by unique x
values, then unroll, and return that" into a valid operation.
Example:
y = [1, 1, 2, 3, 4, 4]
x = [0, 1, 2, 3, 4, 5]
bin_count = 2
expected:
df: x y count
0-2 1 2
0-2 2 1
3-5 3 1
3-5 4 2
Upvotes: 1
Views: 129
Reputation: 862761
I believe you need SeriesGroupBy.value_counts
with reset_index
:
y = [1, 1, 2, 3, 4, 4]
x = [0, 1, 2, 3, 4, 5]
bin_count = 2
df = pd.DataFrame.from_dict({'x': x, 'y': y})
df['x'].update(pd.cut(df['x'], bin_count))
df1 = df.groupby('x')['y'].value_counts().reset_index(name='count')
print (df1)
x y count
0 (-0.005, 2.5] 1 2
1 (-0.005, 2.5] 2 1
2 (2.5, 5.0] 4 2
3 (2.5, 5.0] 3 1
For columns from y
use unstack
:
df1 = df.groupby('x')['y'].value_counts().unstack(fill_value=0)
print (df1)
y 1 2 3 4
x
(-0.005, 2.5] 2 1 0 0
(2.5, 5.0] 0 0 1 2
EDIT:
If need unique values for bins add parameter labels=False
to cut
:
df['x'].update(pd.cut(df['x'], bin_count, labels=False))
df1 = df.groupby('x')['y'].value_counts().unstack(fill_value=0)
print (df1)
y 1 2 3 4
x
0 2 1 0 0
1 0 0 1 2
Upvotes: 2