Wayne Hwang

Get unique value by grouped row

I have a DataFrame like this and, after doing groupby('A'), want to see how the composition differs among the groups:

    A    B    C
1  us   air   35
2  us   gas   34
3  us   sun   30
4  us   fog   23
5  us   mud   33
6  fr   air   45
7  fr   bug   15
8  fr   sun   35
9  fr   mud   85
      ...

Could anyone help me see how the groups differ? What I need would look like:

us    [air,gas,sun,fog,mud]
eu    [air,gas,sun,fog]
fr    [air,bug,sun,mud]

Thank you very much and I greatly appreciate your help.

--update

data is here

I would like to check whether every day has the same composition. groupby('MonitorDate')['ItemEngName'].unique() didn't work well when I tried it.

Thanks in advance.

--update

Thanks to @perl. df.groupby('MonitorDate')['ItemEngName'].apply(list) works, but it's not quite what I want. What I want is the unique combination of items within a day (like a set), similar to applying set() to ['ItemEngName'] after the groupby.

So I tried df.groupby('MonitorDate')['ItemEngName'].unique() and got

MonitorDate 
2010/1/1    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/2    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/3    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/4    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/5    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/6    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
2010/1/7    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...
Name: ItemEngName, dtype: object

They all seem to be the same except 1/7, but I have to check the difference manually.

--update

Expected Output:

Thanks @perl again

MonitorDate 
2010/1/1    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ..., WS_HR]
2010/1/7    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...]
Name: ItemEngName, dtype: object

With more data, a more complicated case would look like:

MonitorDate 
2010/1/1    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ..., WS_HR] (full)
2010/1/7    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, PM2.5, ...] ..(lack WS_HR)
2010/1/8    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM2.5, ..., WS_HR] .(lack PM10)
2010/1/9    [AMB_TEMP, CO, NO, NO2, NOx, O3, PM10, , ...] .......(lack PM2.5)
...
Name: ItemEngName, dtype: object

I can only use df.groupby('MonitorDate')["ItemEngName"].nunique() to learn that the number of items differs, but not which components differ. Is there an effective way?

Upvotes: 0

Views: 67

Answers (1)

perl

You can group by column A, then take column B and convert the grouped values to lists with apply:

df.groupby('A')['B'].apply(list)

Output:

A
fr         [air, bug, sun, mud]
us    [air, gas, sun, fog, mud]
Name: B, dtype: object

If the lists may have duplicates and you're only interested in unique values, probably the best option is to go with the solution from @anky's comment:

df.groupby('A')['B'].unique()
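
If you want the difference of components directly rather than comparing lists by eye, here is a small sketch (sample data reconstructed from the table in the question) that reports, for each group, which items are missing relative to the union of items across all groups:

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'A': ['us', 'us', 'us', 'us', 'us', 'fr', 'fr', 'fr', 'fr'],
    'B': ['air', 'gas', 'sun', 'fog', 'mud', 'air', 'bug', 'sun', 'mud'],
})

# Union of all items observed in any group
all_items = set(df['B'])

# For each group, list the items it lacks relative to that union
missing = df.groupby('A')['B'].apply(lambda s: sorted(all_items - set(s)))
print(missing)
# missing['fr'] == ['fog', 'gas'], missing['us'] == ['bug']
```

A group that has the full set would show an empty list. Note that in this sample us also "lacks" bug, since bug appears only in fr.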

Update: In your dataset the days are almost the same; only 2010/1/7 is missing WS_HR:

df[['MonitorDate', 'ItemEngName']].value_counts().unstack()

Output:

ItemEngName  AMB_TEMP   CO   NO  NO2  NOx   O3  PM10  PM2.5   RH  SO2  WD_HR  \
MonitorDate                                                                    
2010/1/1          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/2          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/3          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/4          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/5          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/6          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   
2010/1/7          1.0  1.0  1.0  1.0  1.0  1.0   1.0    1.0  1.0  1.0    1.0   

ItemEngName  WIND_DIREC  WIND_SPEED  WS_HR  
MonitorDate                                 
2010/1/1            1.0         1.0    1.0  
2010/1/2            1.0         1.0    1.0  
2010/1/3            1.0         1.0    1.0  
2010/1/4            1.0         1.0    1.0  
2010/1/5            1.0         1.0    1.0  
2010/1/6            1.0         1.0    1.0  
2010/1/7            1.0         1.0    NaN 
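
To get the differences programmatically instead of scanning the table, here is a sketch (using a tiny made-up frame in place of the real data) that lists the missing (date, item) pairs from the same value_counts().unstack() table; the NaN cells are exactly the combinations with no measurements:

```python
import pandas as pd

# Tiny stand-in for the real data: 2010/1/7 lacks WS_HR
df = pd.DataFrame({
    'MonitorDate': ['2010/1/1'] * 3 + ['2010/1/7'] * 2,
    'ItemEngName': ['CO', 'PM10', 'WS_HR', 'CO', 'PM10'],
})

z = df[['MonitorDate', 'ItemEngName']].value_counts().unstack()

# NaN cells mark (date, item) pairs that never occur in the data
missing = z.isna().stack()
missing_pairs = missing[missing].index.tolist()
print(missing_pairs)  # [('2010/1/7', 'WS_HR')]
```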

Update 2: If you just want to visually check which days have which items, you can plot it. Here for example you can see that WS_HR is missing on 2010/1/7:

import numpy as np
import matplotlib.pyplot as plt

z = df[['MonitorDate', 'ItemEngName']].value_counts().unstack()
plt.pcolor(z, alpha=0.2)
plt.yticks(np.arange(0.5, len(z.index), 1), z.index)
plt.xticks(np.arange(0.5, len(z.columns), 1), z.columns, rotation=90)
plt.show()

Output:

[plot: heatmap of items per day; the WS_HR cell for 2010/1/7 is blank]

Upvotes: 1
