Reputation: 327
I have been working on formatting a log file and finally I have arrived to the following dataframe sample, where the categories and numbers I want to add are in the same column:
df = pd.DataFrame(dict(a=['Cat. A',1,1,3,'Cat. A',2,2,'Cat. B',3,5,2,6,'Cat. B',1,'Cat. C',4]))
>>> a
0 Cat. A
1 1
2 1
3 3
4 Cat. A
5 2
6 2
7 Cat. B
8 3
9 5
10 2
11 6
12 Cat. B
13 1
14 Cat. C
15 4
If I sum all the numbers below each category I want to obtain:
Cat. A= 1+1+3+2+2 = 9
Cat. B= 3+5+2+6+1 = 17
Cat. C= 4
I know how to do with going through all the files in the classic way, but I would like to know how to do it in a most pythonic way, considering that the numbers of rows for each category can be variable, and that the numbers of times that the category appears in each dataframe can be different too.
Upvotes: 6
Views: 1719
Reputation: 1459
In a very fiew lines, in a simple and direct way:
import pandas as pd
df = pd.DataFrame(dict(a=['Cat. A', 1, 1, 3, 'Cat. A', 2, 2, 'Cat. B', 3, 5, 2, 6, 'Cat. B', 1, 'Cat. C', 4]))
df['label'] = (df['a'].apply(lambda x: x if str(x)[0] == 'C' else None)).ffill()
df['value'] = df['a'].apply(lambda x: 0 if str(x)[0] == 'C' else int(x))
result = df.groupby('label').sum()
print(result)
output
value
label
Cat. A 9
Cat. B 17
Cat. C 4
Explanation
label
and 'value' columns are created and store labels and values, respectively
then use groupby
with sum
Upvotes: 0
Reputation: 20669
We can use pd.to_numeric
to mark non-numeric fields as nan
using Series.mask
and Series.notna
then use for group. Then use GroupBy.sum
a = pd.to_numeric(df['a'], errors='coerce')
g = df['a'].mask(a.notna()).ffill()
a.groupby(g).sum()
Cat. A 9.0
Cat. B 17.0
Cat. C 4.0
Name: a, dtype: float64
Upvotes: 3
Reputation: 788
This is also another way
df = pd.DataFrame(dict(a=['Cat. A',1,1,3,'Cat. A',2,2,'Cat. B',3,5,2,6,'Cat. B',1,'Cat. C',4]))
def coerce(x):
try:
int(x)
return np.nan
except:
return x
def safesum(x):
return x[x!=x.iloc[0]].astype(int).sum()
df['b'] = df['a'].apply(coerce).ffill()
df.groupby('b').agg(safesum)
Produces
a
b
Cat. A 9
Cat. B 17
Cat. C 4
Upvotes: 6
Reputation: 10624
Here is one way to do it. not very pythonic, but it does the work:
l=list(df.a)
d={i:0 for i in l if str(i)[:3]=='Cat'}
for i in l:
if str(i)[:3]=='Cat':
key=i
else:
d[key]+=i
>>> print(d)
{'Cat. A': 9, 'Cat. B': 17, 'Cat. C': 4}
Upvotes: 0