Merge data frames based on column with different rows

Question

I have multiple csv files that I read into individual data frames based on their name in the directory, like so

# ask user for path
path = input('Enter the path for the csv files: ')
os.chdir(path)

# loop over filenames and read into individual dataframes
for fname in os.listdir(path):
    if fname.endswith('Demo.csv'):
        demoRaw = pd.read_csv(fname, encoding = 'utf-8')
    if fname.endswith('Key2.csv'):
        keyRaw = pd.read_csv(fname, encoding = 'utf-8')

Then I filter to only keep certain columns

# filter to keep desired columns only
demo = demoRaw.filter(['Key', 'Sex', 'Race', 'Age'], axis=1)
key = keyRaw.filter(['Key', 'Key', 'Age'], axis=1)

Then I create a list of the above dataframes and use reduce to merge them on Key

# create list of data frames for combined sheet
dfs = [demo, key]

# merge the list of data frames on the Key
combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs)

Then I drop the auto generated column, create an Excel writer and write to a csv

# drop the auto generated index colulmn
combined.set_index('RecordKey', inplace=True)

# create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('final.xlsx', engine='xlsxwriter')

# write to csv
combined.to_excel(writer, sheet_name='Combined')
meds.to_excel(writer, sheet_name='Meds')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

The problem is some files have keys that aren't in others. For example

Demo file

Key   Sex   Race   Age
1      M     W     52
2      F     B     25
3      M     L     78

Key file

Key   Key2   Age
1      7325     52
2      4783     25
3      1367     78
4      9435     21
5      7247     65

Right now, it will only include rows if there is a matching key in each (in other words it just leaves out the rows with keys not in the other files). How can I combine all rows from all files, even if keys don't match? So the end result will look like this

Key   Sex   Race   Age   Key2   Age
 1      M     W     52    7325     52
 2      F     B     25    4783     25
 3      M     L     78    1367     78
 4                        9435     21
 5                        7247     65

I don't care if the empty cells are blanks, NaN, #N/A, etc. Just as long as I can identify them.

Bram van Hout · Accepted Answer

Replace combined = reduce(lambda left,right: pd.merge(left,right,on='Key'), dfs) With: combined=pd.merge(demo,key, how='outer', on='Key') You will have to specificy the 'outer' to join both the full table of Key and Demo

Merge data frames based on column with different rows

Answers (1)

Related Questions