Reputation: 152
I have a DataFrame with three columns: unit, influencer, and value. There are a few different types of influencers and the values represent the count of the unit. I want to create a new table displaying the most and least frequent n units for each influencer and their respective values.
My df looks like this:
Unit Influencer Value
A foo 321
B foo 200
C foo 20
D foo 12
E foo 3
A bar 999
B bar 209
C bar 89
D bar 34
E bar 15
F bar 2
My output should look like (lets say we want top and bottom 2 units):
Unit Influencer Value
A foo 321
B foo 200
D foo 12
E foo 3
A bar 999
B bar 209
E bar 15
F bar 2
I have tried something similar to the solution found here, but I am getting the error "Index contains duplicate entries, cannot reshape," which I assume is because "Influencer" is the index of my df. If my df is multiindex, the new df is created, but it is incorrect.
def get_top3(counts, col1, col2):
top3 = (counts.groupby(col1))[col2].apply(lambda x: x.nlargest(3)).reset_index(level=1, drop=True).to_frame('VAL')
top3 = counts.set_index(np.arange(len(counts)) % 3, append=True)['value'].unstack().add_prefix('VAL')
return top3
However, this creates a Dataframe that looks like:
VAL1 VAL2 VAL3
321 NaN NaN
NaN 200 NaN
NaN NaN 20
12 NaN NaN
NaN 3 NaN
...
Any suggestions would be greatly appreciated! I am also open to feedback on how my output df could be formatted. Thanks!
Upvotes: 3
Views: 10222
Reputation: 323396
Using
#df=df.sort_values('Value')
g=df.groupby('Influencer')
pd.concat([g.head(2),g.tail(2)]).sort_index()
Out[693]:
Unit Influencer Value
0 A foo 321
1 B foo 200
3 D foo 12
4 E foo 3
5 A bar 999
6 B bar 209
9 E bar 15
10 F bar 2
Upvotes: 7
Reputation: 61930
You could try:
nlargest = df.groupby('Influencer')['Value'].nlargest(2).reset_index()['level_1'].values
nsmallest = df.groupby('Influencer')['Value'].nsmallest(2).reset_index()['level_1'].values
result = pd.concat([df.iloc[nlargest], df.iloc[nsmallest]]).sort_index()
print(result)
Output
Unit Influencer Value
0 A foo 321
1 B foo 200
3 D foo 12
4 E foo 3
5 A bar 999
6 B bar 209
9 E bar 15
10 F bar 2
Upvotes: 5
Reputation: 8826
try creating a function like:
def selc_df(df, x=2):
return df.head(x).append(df.tail(x))
selc_df(df,2)
Example :
>>> df
A B
0 1 345
1 2 366
2 3 299
3 3 455
4 4 879
5 5 321
6 5 957
7 6 543
Result:
>>> def selc_df(df, x=2):
... return df.head(x).append(df.tail(x))
...
>>> selc_df(df,2)
A B
0 1 345
1 2 366
6 5 957
7 6 543
Upvotes: 1