Reputation: 1
I have a DataFrame like this (before doing groupby('A')) and want to see how the composition differs among the groups:
A B C
1 us air 35
2 us gas 34
3 us sun 30
4 us fog 23
5 us mud 33
6 fr air 45
7 fr bug 15
8 fr sun 35
9 fr mud 85
...
Could anyone please help me see how the groups vary? What I need would be something like:
us [air,gas,sun,fog,mud]
eu [air,gas,sun,fog]
fr [air,bug,sun,mud]
Thank you very much and I greatly appreciate your help.
--update
I would like to see if every day has the same composition.
It didn't work well when I tried groupby('MonitorDate')['ItemEngName'].unique().
Thanks in advance.
--update
Thanks to @perl. df.groupby('MonitorDate')['ItemEngName'].apply(list)
works, but it's not quite what I want. What I want is the unique combination of items within a day (like a set), similar to applying set() to ['ItemEngName'] after the groupby.
So I tried df.groupby('MonitorDate')['ItemEngName'].unique()
and got
MonitorDate
2010/1/1 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/2 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/3 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/4 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/5 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/6 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/7 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
Name: ItemEngName, dtype: object
It seems they are all the same except 1/7, but I have to spot the difference manually.
--update
Expected Output:
Thanks @perl again
MonitorDate
2010/1/1 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ..., WS_HR]
2010/1/7 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...]
Name: ItemEngName, dtype: object
With more data, a more complicated case would look like:
MonitorDate
2010/1/1 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ..., WS_HR] (full)
2010/1/7 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...] ..(lack WS_HR)
2010/1/8 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM2.5, ..., WS_HR] .(lack PM10)
2010/1/9 [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, , ...] .......(lack PM2.5)
...
Name: ItemEngName, dtype: object
I can only use df.groupby('MonitorDate')['ItemEngName'].nunique().unique()
to learn how many items each day has, but not which components differ.
Is there an effective way to do this?
Upvotes: 0
Views: 67
Reputation: 9941
You can groupby column A, then take the B column and convert the grouped values to lists with apply:
df.groupby('A')['B'].apply(list)
Output:
A
fr [air, bug, sun, mud]
us [air, gas, sun, fog, mud]
Name: B, dtype: object
If the lists may have duplicates and you're only interested in unique values, probably the best option is to go with the solution from @anky's comment:
df.groupby('A')['B'].unique()
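As a self-contained sketch of the two options side by side (sample data reconstructed from the question; with no duplicates in B, both give the same members, but unique() returns arrays of distinct values while apply(list) keeps every row):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': ['us', 'us', 'us', 'us', 'us', 'fr', 'fr', 'fr', 'fr'],
    'B': ['air', 'gas', 'sun', 'fog', 'mud', 'air', 'bug', 'sun', 'mud'],
    'C': [35, 34, 30, 23, 33, 45, 15, 35, 85],
})

# One list per group, keeping every row (including duplicates).
lists = df.groupby('A')['B'].apply(list)

# One array of distinct values per group.
uniques = df.groupby('A')['B'].unique()

print(lists['us'])         # ['air', 'gas', 'sun', 'fog', 'mud']
print(list(uniques['fr'])) # ['air', 'bug', 'sun', 'mud']
```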
Update: In your dataset the days are almost all the same, except 2010/1/7, which is missing WS_HR:
df[['MonitorDate', 'ItemEngName']].value_counts().unstack()
Output:
ItemEngName AMB_TEMP CO NO NO2 NOx O3 PM10 PM2.5 RH SO2 WD_HR \
MonitorDate
2010/1/1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2010/1/7 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
ItemEngName WIND_DIREC WIND_SPEED WS_HR
MonitorDate
2010/1/1 1.0 1.0 1.0
2010/1/2 1.0 1.0 1.0
2010/1/3 1.0 1.0 1.0
2010/1/4 1.0 1.0 1.0
2010/1/5 1.0 1.0 1.0
2010/1/6 1.0 1.0 1.0
2010/1/7 1.0 1.0 NaN
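To go beyond eyeballing the NaN cells, one option (a sketch of my own, not from the original answer) is to compare each day's items against the union of all items ever seen; the data below is a reduced, hypothetical sample with WS_HR absent on the second day:

```python
import pandas as pd

# Hypothetical reduced sample: WS_HR is missing on 2010/1/7.
df = pd.DataFrame({
    'MonitorDate': ['2010/1/1', '2010/1/1', '2010/1/1',
                    '2010/1/7', '2010/1/7'],
    'ItemEngName': ['CO', 'PM10', 'WS_HR',
                    'CO', 'PM10'],
})

# Union of all items observed on any day.
full = set(df['ItemEngName'])

# For each day, list the items it lacks relative to the full set.
missing = df.groupby('MonitorDate')['ItemEngName'].apply(
    lambda s: sorted(full - set(s)))

print(missing['2010/1/7'])  # ['WS_HR']
```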
Update 2: If you just want to visually check which days have which items, you can plot it. Here, for example, you can see that WS_HR is missing on 2010/1/7:
import numpy as np
import matplotlib.pyplot as plt

z = df[['MonitorDate', 'ItemEngName']].value_counts().unstack()
plt.pcolor(z, alpha=0.2)
plt.yticks(np.arange(0.5, len(z.index), 1), z.index)
plt.xticks(np.arange(0.5, len(z.columns), 1), z.columns, rotation=90)
plt.show()
Output: (heatmap image)
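If you would rather avoid the plot, a non-graphical sketch (my own suggestion, on hypothetical sample data) is to collapse each day to a frozenset of its items and keep only the first day of each distinct composition, which matches the asker's expected output of one row per unique combination:

```python
import pandas as pd

# Hypothetical sample: 2010/1/1 and 2010/1/2 share a composition,
# 2010/1/7 lacks WS_HR.
df = pd.DataFrame({
    'MonitorDate': ['2010/1/1'] * 3 + ['2010/1/2'] * 3 + ['2010/1/7'] * 2,
    'ItemEngName': ['CO', 'PM10', 'WS_HR',
                    'CO', 'PM10', 'WS_HR',
                    'CO', 'PM10'],
})

# One frozenset of items per day; frozensets are hashable,
# so duplicated() can detect repeated compositions.
comps = df.groupby('MonitorDate')['ItemEngName'].agg(frozenset)

# Keep only the first day of each distinct composition.
distinct = comps[~comps.duplicated()]
print(distinct)
```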
Upvotes: 1