Ryan Burke
Ryan Burke

Reputation: 13

Merging csv files with some columns same and others different in python

I am new to coding and I'm having an issue merging csv files. I have searched similar questions and haven't found a fix. Just to include some relevant details: CSV files are cancer types over the period of 1950 - 2017 for different countries (lung cancer, colorectal cancer, stomach cancer, liver cancer and breast cancer) Below is an example of the layout of lung cancer.

 dlung.describe(include='all')   
 dlung


    Year    Cancer  Country     Gender  ASR     SE
0   1950    Lung    Australia   Male    13.89   0.56
1   1951    Lung    Australia   Male    14.84   0.57
2   1952    Lung    Australia   Male    17.19   0.61
3   1953    Lung    Australia   Male    18.21   0.62
4   1954    Lung    Australia   Male    19.05   0.63
5   1955    Lung    Australia   Male    20.65   0.65
6   1956    Lung    Australia   Male    22.05   0.67
7   1957    Lung    Australia   Male    23.93   0.69
8   1958    Lung    Australia   Male    23.77   0.68
9   1959    Lung    Australia   Male    26.12   0.71
10  1960    Lung    Australia   Male    27.08   0.72

I am interested in joining all cancer types into one dataframe based on shared columns (year, country). I have tried different methods, but they all seem to duplicate Year and Country (as below)

  1. This one wasn't bad, but I have two columns for year and country

    df_lung_colorectal = pd.concat([dlung, dcolorectal], axis = 1)
    
    df_lung_colorectal 
    
    Year    Cancer  Country Gender  ASR SE  Year    Cancer  Country Gender  ASR SE
    

If I continue like this, I will end up with 5 identical columns for YEAR and 5 for COUNTRY.

Any ideas on how merge all values that are independent (Cancer type and associated ASR (standardized risk), and SE values) with only one column for YEAR, COUNTRY (and GENDER) if possible?

Upvotes: 1

Views: 95

Answers (2)

Ramesh Kamath
Ramesh Kamath

Reputation: 309

Concat with axis=0 to merge them row-wise.

with axis=1 you are asking it to Concat side-to-side.

Upvotes: 0

jezrael
jezrael

Reputation: 862661

Yes, it is possible if use DataFrame.set_index, but then are duplicated another columns names:

print (dlung)
   Year Cancer    Country Gender    ASR    SE
0  1950   Lung  Australia   Male  13.89  0.56
1  1951   Lung  Australia   Male  14.84  0.57
2  1952   Lung  Australia   Male  17.19  0.61
3  1953   Lung  Australia   Male  18.21  0.62
4  1954   Lung  Australia   Male  19.05  0.63

print (dcolorectal)
    Year      Cancer    Country Gender    ASR    SE
6   1950  colorectal  Australia   Male  22.05  0.67
7   1951  colorectal  Australia   Male  23.93  0.69
8   1952  colorectal  Australia   Male  23.77  0.68
9   1953  colorectal  Australia   Male  26.12  0.71
10  1954  colorectal  Australia   Male  27.08  0.72

df_lung_colorectal = pd.concat([dlung.set_index(['Year','Country','Gender']), 
                                dcolorectal.set_index(['Year','Country','Gender'])], axis = 1)

print (df_lung_colorectal)
                      Cancer    ASR    SE      Cancer    ASR    SE
Year Country   Gender                                             
1950 Australia Male     Lung  13.89  0.56  colorectal  22.05  0.67
1951 Australia Male     Lung  14.84  0.57  colorectal  23.93  0.69
1952 Australia Male     Lung  17.19  0.61  colorectal  23.77  0.68
1953 Australia Male     Lung  18.21  0.62  colorectal  26.12  0.71
1954 Australia Male     Lung  19.05  0.63  colorectal  27.08  0.72

But I think better is first concat all DataFrame together with axis=0, what is default value, so should be removed and last reshape by DataFrame.set_index and DataFrame.unstack:

df = pd.concat([dlung, dcolorectal]).set_index(['Year','Country','Gender','Cancer']).unstack()
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
   Year    Country Gender  ASR_Lung  ASR_colorectal  SE_Lung  SE_colorectal
0  1950  Australia   Male     13.89           22.05     0.56           0.67
1  1951  Australia   Male     14.84           23.93     0.57           0.69
2  1952  Australia   Male     17.19           23.77     0.61           0.68
3  1953  Australia   Male     18.21           26.12     0.62           0.71
4  1954  Australia   Male     19.05           27.08     0.63           0.72

Upvotes: 1

Related Questions