Reputation: 5822
This is what my data looks like:
id date rt dnm
101122 2017-01-24 0.0 70
101122 2017-01-08 0.0 49
101122 2017-04-13 0.02976 67
101122 2017-08-03 1.02565 39
101122 2016-12-01 0.0 46
101122 2017-01-25 0.0 69
101122 2017-01-02 0.0 76
101122 2017-07-18 0.02631 38
101122 2016-06-02 0.0 120
221344 2016-10-21 0.00182 176
221344 2016-09-21 0.47732 194
221344 2016-06-23 0.0 169
221344 2017-10-10 0.91391 151
221344 2017-04-29 0.0 33
221344 2017-02-05 0.0 31
221344 2017-10-16 0.0 196
221344 2016-09-25 0.0 33
221344 2016-07-17 0.0 21
221344 2016-07-21 0.0 46
615695 2017-07-12 0.0 21
615695 2017-07-05 0.0 18
615695 2016-07-11 0.0 38
615695 2016-07-19 0.03655 29
615695 2017-05-27 0.0 23
615695 2017-12-22 0.0 20
615695 2017-04-25 0.0 34
615695 2017-03-23 0.0 20
615695 2016-09-23 0.0 25
615695 2016-06-18 0.0 25
I'm trying to get the sum of the 'dnm' column for each 'id' and name this new column 'sum_values'. After that I need to get the ids whose 'sum_values' is higher than 300. The following code handles the first part:
import numpy as np
import pandas as pd

data = pd.read_csv(file_name, sep='\t', header=0,
                   parse_dates=[1], infer_datetime_format=True)
test = (data.assign(sum_values=data.groupby('id')['dnm'].transform(np.sum))
            .query('sum_values > 300'))
This adds a new column named 'sum_values', but the sum is repeated on every row of each id. I need a single row per 'id' with its 'sum_values', and I can't figure out how/where to add nunique().
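For what it's worth, one way to collapse the transform result to one row per id is drop_duplicates rather than nunique(). A minimal sketch, using made-up rows that reproduce the totals in the question:

```python
import pandas as pd

# Toy data: per-id totals are 574 and 1050, as in the question
df = pd.DataFrame({
    'id':  [101122, 101122, 101122, 221344, 221344],
    'dnm': [70, 49, 455, 600, 450],
})

result = (
    df.assign(sum_values=df.groupby('id')['dnm'].transform('sum'))
      .query('sum_values > 300')      # keep only ids whose total exceeds 300
      [['id', 'sum_values']]          # drop the per-row columns
      .drop_duplicates()              # one row per (id, sum_values) pair
      .reset_index(drop=True)
)
print(result)
```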
This is the desired outcome:
id sum_values(>300)
101122 574
221344 1050
Any ideas?
Upvotes: 0
Views: 61
Reputation: 294218
This is how I interpreted what you said:
df.assign(sum_values=df.groupby('id').dnm.transform('sum')).query('sum_values > 300')
id date rt dnm sum_values
0 101122 2017-01-24 0.00000 70 574
1 101122 2017-01-08 0.00000 49 574
2 101122 2017-04-13 0.02976 67 574
3 101122 2017-08-03 1.02565 39 574
4 101122 2016-12-01 0.00000 46 574
5 101122 2017-01-25 0.00000 69 574
6 101122 2017-01-02 0.00000 76 574
7 101122 2017-07-18 0.02631 38 574
8 101122 2016-06-02 0.00000 120 574
9 221344 2016-10-21 0.00182 176 1050
10 221344 2016-09-21 0.47732 194 1050
11 221344 2016-06-23 0.00000 169 1050
12 221344 2017-10-10 0.91391 151 1050
13 221344 2017-04-29 0.00000 33 1050
14 221344 2017-02-05 0.00000 31 1050
15 221344 2017-10-16 0.00000 196 1050
16 221344 2016-09-25 0.00000 33 1050
17 221344 2016-07-17 0.00000 21 1050
18 221344 2016-07-21 0.00000 46 1050
This is how I'd get your desired results.
import numpy as np

# factorize: f maps each id to a dense integer code, u holds the unique ids
f, u = pd.factorize(df.id)
# weighted bincount sums dnm per id code in a single pass
sums = np.bincount(f, df.dnm)
mask = sums > 300
pd.DataFrame(dict(
    sum_values=sums[mask],
    ID=u[mask]
))
ID sum_values
0 101122 574.0
1 221344 1050.0
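Note that np.bincount returns floats (hence the 574.0 above). A self-contained sketch of the same factorize/bincount idea, casting back to int, again with made-up rows that match the question's totals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':  [101122, 101122, 221344],
    'dnm': [500, 74, 1050],
})

# f: dense integer code per id; u: the unique ids in order of appearance
f, u = pd.factorize(df['id'])
# bincount with weights sums dnm per code; result is a float array
sums = np.bincount(f, weights=df['dnm'])
mask = sums > 300

out = pd.DataFrame({'ID': u[mask],
                    'sum_values': sums[mask].astype(int)})
print(out)
```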
Upvotes: 2
Reputation: 51165
Use groupby with sum:
d = df.groupby('id')['dnm'].sum()
Then filter with boolean indexing:
d[d > 300]
id
101122 574
221344 1050
Name: dnm, dtype: int64
If you want the column name in the output, just use d[d > 300].reset_index()
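A minimal sketch of that last step, renaming the Series first so the column comes out as 'sum_values' rather than 'dnm' (made-up rows matching the question's totals):

```python
import pandas as pd

df = pd.DataFrame({
    'id':  [101122, 101122, 221344, 615695],
    'dnm': [500, 74, 1050, 253],
})

d = df.groupby('id')['dnm'].sum()

# rename the Series before resetting the index so the
# value column is labelled 'sum_values' in the result
out = d[d > 300].rename('sum_values').reset_index()
print(out)
```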
Upvotes: 5