Acerace.py
Acerace.py

Reputation: 719

x- axis label not sorting in right order (matplotlib/pandas)

I am having a bizzare problem. I have a dataframe which looks like the one below. It has been read from a csv file.

AgeGroups     Factor Cancer  Frequency
0        0_5  wo-statin    Yes          0
1       6_10  wo-statin    Yes          0
2      11_15  wo-statin    Yes          1
3      16_20  wo-statin    Yes          1
4      21_25  wo-statin    Yes         23
5      26_30  wo-statin    Yes         50
6      31_35  wo-statin    Yes         70
7      36_40  wo-statin    Yes        107
8      41_45  wo-statin    Yes        168
9      46_50  wo-statin    Yes        412
10     51_55  wo-statin    Yes        503
11     56_60  wo-statin    Yes        646
12     61_65  wo-statin    Yes        635
13     66_70  wo-statin    Yes        725
14     71_75  wo-statin    Yes        771
15     76_80  wo-statin    Yes        421
16     81_85  wo-statin    Yes        181
17     86_90  wo-statin    Yes         57
18     91_95  wo-statin    Yes          4
19    96_100  wo-statin    Yes          4
..
60       0_5   w-statin    Yes          0
61      6_10   w-statin    Yes          0
62     11_15   w-statin    Yes          0
63     16_20   w-statin    Yes          0
64     21_25   w-statin    Yes          0
65     26_30   w-statin    Yes          0
66     31_35   w-statin    Yes          0
67     36_40   w-statin    Yes          0
68     41_45   w-statin    Yes          0
69     46_50   w-statin    Yes         10
70     51_55   w-statin    Yes         17
71     56_60   w-statin    Yes         24
72     61_65   w-statin    Yes         50
73     66_70   w-statin    Yes        113
74     71_75   w-statin    Yes        198
75     76_80   w-statin    Yes        105
76     81_85   w-statin    Yes         37
77     86_90   w-statin    Yes         18
78     91_95   w-statin    Yes          2
79    96_100   w-statin    Yes          0

I wanted to carry out some statistical analyses from the barchart that I got from the following code:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('file:///C:/Users/out.CSV')
ages= df.AgeGroups.unique()

grp = df.groupby(['AgeGroups','Factor','Cancer']).Frequency.sum()

counts = grp.unstack(level=[2])
by_factor = counts.groupby(level='Factor')


k = by_factor.ngroups

fig, axes = plt.subplots(1, k, sharex=True, sharey=True, figsize=(15, 8))
for i, (gname, grp) in enumerate(by_factor):
    grp.xs(gname, level='Factor').plot.bar(
        stacked=True, rot=45, ax=axes[i], title=gname)
_ = axes.set_xticklabels(grp['AgeGroups'])
#for ax in axes: ax.set_ylim([0,25000])
fig.tight_layout()

I was almost very happy that everything went out smoothly, until I realized the age group 6_10 on my plot came in the middle of the plot. enter image description here

If I change the table manually for instance 0_5 as 00-05 and 6_10 as 06-10 and so on, the plot shows the correct order. However the problem for me is, I have about 2k of such CSVs and it is almost impossible to edit the order of each table manually. I can provide more information if required and thank you so much for help. :)

Upvotes: 2

Views: 1002

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

You may want to add leading 0 to the AgeGroups column (as @ImportanceOfBeingErnest has already suggested).

But i would do it in Pandas way:

In [91]: df.AgeGroups = \
             df.AgeGroups.replace([r'^(\d{1})\_', r'_(\d{1})$'], 
                                  [r'0\1_',r'_0\1'],
                                  regex=True)

Result:

In [93]: df
Out[93]:
   AgeGroups     Factor Cancer  Frequency
0      00_05  wo-statin    Yes          0
1      06_10  wo-statin    Yes          0
2      11_15  wo-statin    Yes          1
3      16_20  wo-statin    Yes          1
4      21_25  wo-statin    Yes         23
5      26_30  wo-statin    Yes         50
6      31_35  wo-statin    Yes         70
7      36_40  wo-statin    Yes        107
8      41_45  wo-statin    Yes        168
9      46_50  wo-statin    Yes        412
..       ...        ...    ...        ...
70     51_55   w-statin    Yes         17
71     56_60   w-statin    Yes         24
72     61_65   w-statin    Yes         50
73     66_70   w-statin    Yes        113
74     71_75   w-statin    Yes        198
75     76_80   w-statin    Yes        105
76     81_85   w-statin    Yes         37
77     86_90   w-statin    Yes         18
78     91_95   w-statin    Yes          2
79    96_100   w-statin    Yes          0

[40 rows x 4 columns]

Upvotes: 1

ImportanceOfBeingErnest
ImportanceOfBeingErnest

Reputation: 339220

It's probably worthwhile to replace the strings in the files. This could be done as follows:

import glob as glob

files = glob.glob("data/*.csv")

for filename in files:
    # Read in the file
    with open(filename, 'r') as f :
        filedata = f.read()

    # Replace the target string
    filedata = filedata.replace('  0_5', '00-05')
    filedata = filedata.replace(' 6_10', '06_10')

    # Write the file out again
    with open(filename, 'w') as f:
        f.write(filedata)

Don't forget to backup the files beforehands in case you need the originals for some reason.

Upvotes: 1

Related Questions