Reputation: 719
I am having a bizzare problem. I have a dataframe which looks like the one below. It has been read from a csv file.
AgeGroups Factor Cancer Frequency
0 0_5 wo-statin Yes 0
1 6_10 wo-statin Yes 0
2 11_15 wo-statin Yes 1
3 16_20 wo-statin Yes 1
4 21_25 wo-statin Yes 23
5 26_30 wo-statin Yes 50
6 31_35 wo-statin Yes 70
7 36_40 wo-statin Yes 107
8 41_45 wo-statin Yes 168
9 46_50 wo-statin Yes 412
10 51_55 wo-statin Yes 503
11 56_60 wo-statin Yes 646
12 61_65 wo-statin Yes 635
13 66_70 wo-statin Yes 725
14 71_75 wo-statin Yes 771
15 76_80 wo-statin Yes 421
16 81_85 wo-statin Yes 181
17 86_90 wo-statin Yes 57
18 91_95 wo-statin Yes 4
19 96_100 wo-statin Yes 4
..
60 0_5 w-statin Yes 0
61 6_10 w-statin Yes 0
62 11_15 w-statin Yes 0
63 16_20 w-statin Yes 0
64 21_25 w-statin Yes 0
65 26_30 w-statin Yes 0
66 31_35 w-statin Yes 0
67 36_40 w-statin Yes 0
68 41_45 w-statin Yes 0
69 46_50 w-statin Yes 10
70 51_55 w-statin Yes 17
71 56_60 w-statin Yes 24
72 61_65 w-statin Yes 50
73 66_70 w-statin Yes 113
74 71_75 w-statin Yes 198
75 76_80 w-statin Yes 105
76 81_85 w-statin Yes 37
77 86_90 w-statin Yes 18
78 91_95 w-statin Yes 2
79 96_100 w-statin Yes 0
I wanted to carry out some statistical analyses from the barchart that I got from the following code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('file:///C:/Users/out.CSV')
ages= df.AgeGroups.unique()
grp = df.groupby(['AgeGroups','Factor','Cancer']).Frequency.sum()
counts = grp.unstack(level=[2])
by_factor = counts.groupby(level='Factor')
k = by_factor.ngroups
fig, axes = plt.subplots(1, k, sharex=True, sharey=True, figsize=(15, 8))
for i, (gname, grp) in enumerate(by_factor):
grp.xs(gname, level='Factor').plot.bar(
stacked=True, rot=45, ax=axes[i], title=gname)
_ = axes.set_xticklabels(grp['AgeGroups'])
#for ax in axes: ax.set_ylim([0,25000])
fig.tight_layout()
I was almost very happy that everything went out smoothly, until I realized the age group 6_10 on my plot came in the middle of the plot.
If I change the table manually for instance 0_5 as 00-05 and 6_10 as 06-10 and so on, the plot shows the correct order. However the problem for me is, I have about 2k of such CSVs and it is almost impossible to edit the order of each table manually. I can provide more information if required and thank you so much for help. :)
Upvotes: 2
Views: 1002
Reputation: 210842
You may want to add leading 0
to the AgeGroups
column (as @ImportanceOfBeingErnest has already suggested).
But i would do it in Pandas way:
In [91]: df.AgeGroups = \
df.AgeGroups.replace([r'^(\d{1})\_', r'_(\d{1})$'],
[r'0\1_',r'_0\1'],
regex=True)
Result:
In [93]: df
Out[93]:
AgeGroups Factor Cancer Frequency
0 00_05 wo-statin Yes 0
1 06_10 wo-statin Yes 0
2 11_15 wo-statin Yes 1
3 16_20 wo-statin Yes 1
4 21_25 wo-statin Yes 23
5 26_30 wo-statin Yes 50
6 31_35 wo-statin Yes 70
7 36_40 wo-statin Yes 107
8 41_45 wo-statin Yes 168
9 46_50 wo-statin Yes 412
.. ... ... ... ...
70 51_55 w-statin Yes 17
71 56_60 w-statin Yes 24
72 61_65 w-statin Yes 50
73 66_70 w-statin Yes 113
74 71_75 w-statin Yes 198
75 76_80 w-statin Yes 105
76 81_85 w-statin Yes 37
77 86_90 w-statin Yes 18
78 91_95 w-statin Yes 2
79 96_100 w-statin Yes 0
[40 rows x 4 columns]
Upvotes: 1
Reputation: 339220
It's probably worthwhile to replace the strings in the files. This could be done as follows:
import glob as glob
files = glob.glob("data/*.csv")
for filename in files:
# Read in the file
with open(filename, 'r') as f :
filedata = f.read()
# Replace the target string
filedata = filedata.replace(' 0_5', '00-05')
filedata = filedata.replace(' 6_10', '06_10')
# Write the file out again
with open(filename, 'w') as f:
f.write(filedata)
Don't forget to backup the files beforehands in case you need the originals for some reason.
Upvotes: 1