AYE I
AYE I

Reputation: 423

Merging csv files with different headers with Pandas in Python

I'm trying to map a dataset to a blank CSV file with different headers, so I'm essentially trying to map data from one CSV file which has different headers to a new CSV with different amount of headers and called different things, the reason this question is different is since the column names aren't the same but there are no overlapping columns either. And I can't overwrite the data file with new headers since the data file has other columns with irrelevant data, I'm certain I'm overcomplicating this.

I've seen this example code but how do I change this since this example is using a common header to join the data.

a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")
#a.csv = ID TITLE
#b.csv = ID NAME
b = b.dropna(axis=1)
merged = a.merge(b, on='title')
merged.to_csv("output.csv", index=False)

Sample Data

a.csv (blank format file, the format must match this file):

Headers: TOWN NAME LOCATION HEIGHT STAR

b.csv:

Headers: COUNTRY WEIGHT  NAME  AGE MEASUREMENT
 Data:    UK,     150lbs, John, 6,  6ft

Expected output file:

Headers: TOWN    NAME   LOCATION  HEIGHT  STAR
Data:    (Blank) John,  UK,       6ft    (Blank)

Upvotes: 8

Views: 5928

Answers (3)

s3dev
s3dev

Reputation: 9681

Here is another approach using df.reindex to add the empty columns rather than merge.

import pandas as pd

dfa = pd.read_csv('./a.csv')
dfb = pd.read_csv('./b.csv')

# These columns will be renamed.
colmap = {'COUNTRY': 'LOCATION', 'MEASUREMENT': 'HEIGHT'}
# Extract all unique column names.
cols = list(set(dfa.columns.tolist() + dfb.columns.tolist()))
# Create a new dataset matching requirements.
dfb = dfb.rename(columns=colmap).reindex(columns=cols)[dfa.columns]
# Write to CSV.
dfb.to_csv('output.csv', index=False)

Output:

Content of the CSV as show from the console ...

(py35) user@host ~/Desktop/so
$ cat output.csv 

TOWN,NAME,LOCATION,HEIGHT,STAR
,John,UK,6ft

Source files:

Content of the source CSV files as show from the console ...

(py35) user@host ~/Desktop/so
$ cat a.csv 

TOWN,NAME,LOCATION,HEIGHT,STAR

(py35) user@host ~/Desktop/so
$ cat b.csv 

COUNTRY,WEIGHT,NAME,AGE,MEASUREMENT
UK,150lbs,John,6,6ft

Upvotes: 1

Shaido
Shaido

Reputation: 28322

From your example, it looks like you need to do some column renaming in addition to the merge. This is easiest done before the merge itself.

# Read the csv files
dfA = pd.read_csv("a.csv")
dfB = pd.read_csv("b.csv")

# Rename the columns of b.csv that should match the ones in a.csv
dfB = dfB.rename(columns={'MEASUREMENT': 'HEIGHT', 'COUNTRY': 'LOCATION'})

# Merge on all common columns
df = pd.merge(dfA, dfB, on=list(set(dfA.columns) & set(dfB.columns)), how='outer')

# Only keep the columns that exists in a.csv
df = df[dfA.columns]

# Save to a new csv
df.to_csv("output.csv", index=False)

This should give you what you are after.

Upvotes: 2

Barış Can Tayiz
Barış Can Tayiz

Reputation: 87

You should specify left_on column for a dataframe and right_on column for b dataframe. So you have common data type with header "data_1" on a and common data in b with header "data_2". If you don't create a series on these dataframes with simply a['series_1'] = [i for i in range(a.shape[0])] and for b ==> b['series_1] = [i for i in range(b.shape[0]). Right now you have a common column. If not you can use a.merge(b,left_on='a_column',right_on='b_column') this will work. You can use reset_index() for common column. Using a = a.reset_index() will create an index from 1 to a.shape[0] and use it for b = b.reset_index(). merge() function automatically understand the common data types with same headers.

Upvotes: 0

Related Questions