Reputation: 13
I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details: CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer) Below is an example of the layout of lung cancer.
dlung.describe(include='all')
dlung
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
5 1955 Lung Australia Male 20.65 0.65
6 1956 Lung Australia Male 22.05 0.67
7 1957 Lung Australia Male 23.93 0.69
8 1958 Lung Australia Male 23.77 0.68
9 1959 Lung Australia Male 26.12 0.71
10 1960 Lung Australia Male 27.08 0.72
I am interested in joining all cancer types into one dataframe based on shared columns (year, country). I have tried different methods, but they all seem to duplicate Year and Country (as below)
This one wasn't bad, but I have two columns for year and country
df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)
df_lung_colorectal
Year Cancer Country Gender ASR SE Year Cancer Country Gender ASR SE
If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.
Any ideas on how merge all values that are independent (Cancer type and associated ASR (standardized risk), and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?
Upvotes: 1
Views: 95
Reputation: 309
Concat with axis=0
to merge them row-wise.
with axis=1
you are asking it to Concat side-to-side.
Upvotes: 0
Reputation: 862661
Yes, it is possible if use DataFrame.set_index
, but then are duplicated another columns names:
print (dlung)
Year Cancer Country Gender ASR SE
0 1950 Lung Australia Male 13.89 0.56
1 1951 Lung Australia Male 14.84 0.57
2 1952 Lung Australia Male 17.19 0.61
3 1953 Lung Australia Male 18.21 0.62
4 1954 Lung Australia Male 19.05 0.63
print (dcolorectal)
Year Cancer Country Gender ASR SE
6 1950 colorectal Australia Male 22.05 0.67
7 1951 colorectal Australia Male 23.93 0.69
8 1952 colorectal Australia Male 23.77 0.68
9 1953 colorectal Australia Male 26.12 0.71
10 1954 colorectal Australia Male 27.08 0.72
df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']),
dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)
print (df_lung_colorectal)
Cancer ASR SE Cancer ASR SE
Year Country Gender
1950 Australia Male Lung 13.89 0.56 colorectal 22.05 0.67
1951 Australia Male Lung 14.84 0.57 colorectal 23.93 0.69
1952 Australia Male Lung 17.19 0.61 colorectal 23.77 0.68
1953 Australia Male Lung 18.21 0.62 colorectal 26.12 0.71
1954 Australia Male Lung 19.05 0.63 colorectal 27.08 0.72
But I think better is first concat all DataFrame together with axis=0
, what is default value, so should be removed and last reshape by DataFrame.set_index
and DataFrame.unstack
:
df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
Year Country Gender ASR_Lung ASR_colorectal SE_Lung SE_colorectal
0 1950 Australia Male 13.89 22.05 0.56 0.67
1 1951 Australia Male 14.84 23.93 0.57 0.69
2 1952 Australia Male 17.19 23.77 0.61 0.68
3 1953 Australia Male 18.21 26.12 0.62 0.71
4 1954 Australia Male 19.05 27.08 0.63 0.72
Upvotes: 1