Reputation: 159
I am having some data of 13961 rows in pandas DF, I have 30+ of unique values for the category in X axis( City name ) and one more feature 'Retention Flag' in Y axis with only two levels(namely Retained/not retained).
While ploting using pd.crosstab ( it is showing all the 30+ cities-unique values in the X axis which is too clumsy and dense to understand). Instead can I show only the top 20/10 unique levels in X axis and leave the others ( or add them into the category others ) ? Please help, not mandate to deal only with pd.crosstab.
I have created pd.cross tab, X axis-BORROWER_CITY , Y axis-'Retention_Flag'
This shows all the 30+ cities in X axis, instead i need only the top n (20/30 ) to be shown in the X axis labels
df2=data.groupby("BORROWER_CITY") ['Retention_Flag'].value_counts().groupby(level=1).nlargest(4).unstack(fill_value=0)
df2.plot(kind='bar')
The o/p obtained is shown above:
Retention_Flag Non Retained Retained
Retention_Flag BORROWER_CITY
Non Retained Bangalore 837 0
Delhi 1477 0
Mumbai 2507 0
Pune 838 0
Retained Bangalore 0 52
Chennai 0 106
Mumbai 0 168
Pune 0 67
the plot is shown above with 'RetentionFlag,BORROWER_CITY'in X axis - 8
entries
instead of having observed pic 2, having two entries for city column in
Xaxis for ( retained/non retained ) can i have single entry alone with city
name since, already i have legend for the flag.
second try:
instead of nlargest, while trying with head(4) the picture is shown as
myself expected , but it is not giving the largest value_counts() instead
resulting the city name in alphabetical order. observed pic 3
df3=data.groupby("BORROWER_CITY")['Retention_Flag'].value_counts().groupby(level=1).head(4).unstack(fill_value=0)
print(df3)
Retention_Flag Non Retained Retained
BORROWER_CITY
Adilabad 2 0
Agra 17 0
Ahmedabad 434 21
Ahmednagar 19 1
Alappuzha 0 1
Ambala 0 2
df3.plot(kind='bar')
the plot is shown above with 'BORROWER_CITY'in X axis - 6 entries
Upvotes: 1
Views: 357
Reputation: 862671
You can count top values per both categories with SeriesGroupBy.value_counts
and GroupBy.head
, then reshape by Series.unstack
:
data = pd.DataFrame({
'BORROWER_CITY':list('abcdabaaadab'),
'Retention_Flag':['Ret', 'Non ret'] * 6,
})
print (data)
BORROWER_CITY Retention_Flag
0 a Ret
1 b Non ret
2 c Ret
3 d Non ret
4 a Ret
5 b Non ret
6 a Ret
7 a Non ret
8 a Ret
9 d Non ret
10 a Ret
11 b Non ret
df1 = pd.crosstab(data['BORROWER_CITY'],data['Retention_Flag'])
print (df1)
Retention_Flag Non ret Ret
BORROWER_CITY
a 1 5
b 3 0
c 0 1
d 2 0
N = 2
df2 = (data.groupby('BORROWER_CITY')['Retention_Flag']
.value_counts()
.groupby(level=1)
.head(N)
.unstack(fill_value=0))
print (df2)
Retention_Flag Non ret Ret
BORROWER_CITY
a 1 5
b 3 0
c 0 1
df2.plot(kind='bar')
EDIT:
Solution with nlargest
:
N = 2
df3 = (data.groupby('BORROWER_CITY')['Retention_Flag']
.value_counts()
.groupby(level=1)
.head(N)
.unstack(fill_value=0))
print (df3)
Upvotes: 1