Reputation: 5791
I have a dataframe in which each transaction may appear more than once, and transactions are linked to stores. I want to find the mean transaction value per store. For that I need to sum each transaction first and then take the mean:
# preparing the dataset
import pandas as pd
txt_data = pd.read_csv("./TestDataSource/txn.csv", sep=';')
txt_data = txt_data.replace({',': '.'}, regex=True)
txt_data[['SALES']] = txt_data[['SALES']].apply(pd.to_numeric)
There are only 30 unique STORE values here, as len(txt_data.STORE.unique()) shows.
First of all I aggregate on transactions:
a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]
a1.head()
Everything seems to be okay. But then I aggregate on stores:
a2 = a1.groupby('STORE').mean()
But...
list(a2.shape)
returns [1137, 1], which is really confusing. Moreover, len(a1.STORE.unique()) also returns 1137.
What am I doing wrong?
Upvotes: 4
Views: 233
Reputation: 1399
I think your problem happens in this line:
a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()
When I get the unique values with txt_data['STORE'].unique(), it outputs:
array([22691, 20581, 1574, 1602, 1579, 29245, 19009, 21761, 17474,
1544, 1612, 1534, 958, 17096, 1094, 1596, 1594, 1609,
24605, 956, 961, 1122, 27220, 974, 1082, 25039, 1530,
999, 1053, 980])
But in the a1 DataFrame, the STORE values are different from txt_data, because groupby(...).sum() summed the values in STORE for each unique 'TXN'.
See: there is no STORE = 4328 in txt_data['STORE'].unique();
1082 * 4 = 4328
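This is easy to reproduce on a tiny made-up DataFrame (hypothetical values mirroring the question's single-transaction example): sum() is applied to every non-grouping column, so the store id is summed along with SALES.

```python
import pandas as pd

# Hypothetical four-row transaction: one TXN, all rows at STORE 1082
df = pd.DataFrame({
    'STORE': [1082, 1082, 1082, 1082],
    'TXN':   [5541359000] * 4,
    'SALES': [60.2, 15.6, 4.0, 103.0],
})

# sum() aggregates every non-grouping column, including the store id
a1 = df.groupby('TXN').sum()
print(a1.loc[5541359000, 'STORE'])  # 4328, a "store" that does not exist
```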
Upvotes: 1
Reputation: 863791
The problem is that you aggregate the STORE and SALES columns by sum per the TXN column:
a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]
Which is the same as:
a1 = txt_data.groupby('TXN')[['STORE', 'SALES']].sum()
But if you aggregate by the columns TXN and STORE, everything works nicely:
txt_data = pd.read_csv("txn.csv", sep = ';', decimal=',')
a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()
print (txt_data.STORE.nunique())
30
print (a1.STORE.nunique())
30
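From here the original goal, the mean transaction value per store, is one more groupby away. A minimal sketch with made-up toy values (not the question's real data):

```python
import pandas as pd

# Toy data: three transactions across two stores (hypothetical values)
txt_data = pd.DataFrame({
    'TXN':   [1, 1, 2, 3, 3],
    'STORE': [1082, 1082, 1082, 956, 956],
    'SALES': [10.0, 20.0, 40.0, 5.0, 15.0],
})

# Step 1: total of each transaction, keeping STORE as a grouping key
a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()

# Step 2: mean transaction value per store
a2 = a1.groupby('STORE')['SALES'].mean()
print(a2)  # STORE 956 -> 20.0, STORE 1082 -> 35.0
```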
Upvotes: 4
Reputation: 2702
In the line
a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]
you are grouping your dataframe by TXN, but telling pandas to sum all the other columns, so the store ids get summed and "new stores" are created, e.g.:
txt_data[txt_data['TXN']==5541359000]
DAY STORE ART TXN TIME SALES
1268877 2015-10-01 1082 15294488 5541359000 09:30:22 60.2
1269093 2015-10-01 1082 80439 5541359000 09:30:29 15.6
1269309 2015-10-01 1082 191452 5541359000 09:30:15 4.0
1269525 2015-10-01 1082 15317962 5541359000 09:30:17 103.0
a1.head()
STORE SALES
TXN
5541359000 4328 182.8
#1082 * 4 = 4328
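An alternative (not in the answer above, assuming each TXN belongs to exactly one store) is to keep the store id with 'first' while summing only SALES, via a per-column agg:

```python
import pandas as pd

# Same hypothetical four rows as in the question's example
df = pd.DataFrame({
    'STORE': [1082] * 4,
    'TXN':   [5541359000] * 4,
    'SALES': [60.2, 15.6, 4.0, 103.0],
})

# 'first' keeps the store id instead of summing it; SALES is still summed
a1 = df.groupby('TXN').agg({'STORE': 'first', 'SALES': 'sum'})
print(a1)
#              STORE  SALES
# TXN
# 5541359000    1082  182.8
```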
Upvotes: 1