showing the two categorical axis(X,Y) in plot for only top (n) , may be top 10 or 20 ( of X axis ) unique values

Question

I am having some data of 13961 rows in pandas DF, I have 30+ of unique values for the category in X axis( City name ) and one more feature 'Retention Flag' in Y axis with only two levels(namely Retained/not retained).

While ploting using pd.crosstab ( it is showing all the 30+ cities-unique values in the X axis which is too clumsy and dense to understand). Instead can I show only the top 20/10 unique levels in X axis and leave the others ( or add them into the category others ) ? Please help, not mandate to deal only with pd.crosstab.

I have created pd.cross tab, X axis-BORROWER_CITY , Y axis-'Retention_Flag'

This shows all the 30+ cities in X axis, instead i need only the top n (20/30 ) to be shown in the X axis labels

     df2=data.groupby("BORROWER_CITY") ['Retention_Flag'].value_counts().groupby(level=1).nlargest(4).unstack(fill_value=0)
    df2.plot(kind='bar')


The o/p obtained is shown above:

Retention_Flag                Non Retained  Retained
Retention_Flag BORROWER_CITY                        
Non Retained   Bangalore               837         0
               Delhi                  1477         0
               Mumbai                 2507         0
               Pune                    838         0
Retained       Bangalore                 0        52
               Chennai                   0       106
               Mumbai                    0       168
               Pune                      0        67

the plot is shown above with 'RetentionFlag,BORROWER_CITY'in X axis - 8 
entries

instead of having observed pic 2, having two entries for city column in 
Xaxis for ( retained/non retained ) can i have single entry alone with city 
name since, already i have legend for the flag. 

second try:
instead of nlargest, while trying with head(4) the picture is shown as 
myself expected , but it is not giving the largest value_counts() instead 
resulting the city name in alphabetical order. observed pic 3
df3=data.groupby("BORROWER_CITY")['Retention_Flag'].value_counts().groupby(level=1).head(4).unstack(fill_value=0)
print(df3)
    Retention_Flag  Non Retained  Retained
BORROWER_CITY                         
Adilabad                   2         0
Agra                      17         0
Ahmedabad                434        21
Ahmednagar                19         1
Alappuzha                  0         1
Ambala                     0         2
df3.plot(kind='bar')

the plot is shown above with 'BORROWER_CITY'in X axis - 6 entries

jezrael · Accepted Answer

You can count top values per both categories with SeriesGroupBy.value_counts and GroupBy.head, then reshape by Series.unstack:

data = pd.DataFrame({
        'BORROWER_CITY':list('abcdabaaadab'),
         'Retention_Flag':['Ret', 'Non ret'] * 6,

})

print (data)
   BORROWER_CITY Retention_Flag
0              a            Ret
1              b        Non ret
2              c            Ret
3              d        Non ret
4              a            Ret
5              b        Non ret
6              a            Ret
7              a        Non ret
8              a            Ret
9              d        Non ret
10             a            Ret
11             b        Non ret

df1 = pd.crosstab(data['BORROWER_CITY'],data['Retention_Flag'])
print (df1)
Retention_Flag  Non ret  Ret
BORROWER_CITY               
a                     1    5
b                     3    0
c                     0    1
d                     2    0

N = 2
df2 = (data.groupby('BORROWER_CITY')['Retention_Flag']
           .value_counts()
           .groupby(level=1)
           .head(N)
           .unstack(fill_value=0))
print (df2)
Retention_Flag  Non ret  Ret
BORROWER_CITY               
a                     1    5
b                     3    0
c                     0    1

df2.plot(kind='bar')

EDIT:

Solution with nlargest:

N = 2
df3 = (data.groupby('BORROWER_CITY')['Retention_Flag']
           .value_counts()
           .groupby(level=1)
           .head(N)
           .unstack(fill_value=0))
print (df3)

showing the two categorical axis(X,Y) in plot for only top (n) , may be top 10 or 20 ( of X axis ) unique values

Answers (1)

Related Questions