Reputation: 423
I'm trying to map a dataset to a blank CSV file with different headers, so I'm essentially trying to map data from one CSV file which has different headers to a new CSV with different amount of headers and called different things, the reason this question is different is since the column names aren't the same but there are no overlapping columns either. And I can't overwrite the data file with new headers since the data file has other columns with irrelevant data, I'm certain I'm overcomplicating this.
I've seen this example code but how do I change this since this example is using a common header to join the data.
a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")
#a.csv = ID TITLE
#b.csv = ID NAME
b = b.dropna(axis=1)
merged = a.merge(b, on='title')
merged.to_csv("output.csv", index=False)
Sample Data
a.csv (blank format file, the format must match this file):
Headers: TOWN NAME LOCATION HEIGHT STAR
b.csv:
Headers: COUNTRY WEIGHT NAME AGE MEASUREMENT
Data: UK, 150lbs, John, 6, 6ft
Expected output file:
Headers: TOWN NAME LOCATION HEIGHT STAR
Data: (Blank) John, UK, 6ft (Blank)
Upvotes: 8
Views: 5928
Reputation: 9681
Here is another approach using df.reindex
to add the empty columns rather than merge
.
import pandas as pd
dfa = pd.read_csv('./a.csv')
dfb = pd.read_csv('./b.csv')
# These columns will be renamed.
colmap = {'COUNTRY': 'LOCATION', 'MEASUREMENT': 'HEIGHT'}
# Extract all unique column names.
cols = list(set(dfa.columns.tolist() + dfb.columns.tolist()))
# Create a new dataset matching requirements.
dfb = dfb.rename(columns=colmap).reindex(columns=cols)[dfa.columns]
# Write to CSV.
dfb.to_csv('output.csv', index=False)
Content of the CSV as show from the console ...
(py35) user@host ~/Desktop/so
$ cat output.csv
TOWN,NAME,LOCATION,HEIGHT,STAR
,John,UK,6ft
Content of the source CSV files as show from the console ...
(py35) user@host ~/Desktop/so
$ cat a.csv
TOWN,NAME,LOCATION,HEIGHT,STAR
(py35) user@host ~/Desktop/so
$ cat b.csv
COUNTRY,WEIGHT,NAME,AGE,MEASUREMENT
UK,150lbs,John,6,6ft
Upvotes: 1
Reputation: 28322
From your example, it looks like you need to do some column renaming in addition to the merge
. This is easiest done before the merge
itself.
# Read the csv files
dfA = pd.read_csv("a.csv")
dfB = pd.read_csv("b.csv")
# Rename the columns of b.csv that should match the ones in a.csv
dfB = dfB.rename(columns={'MEASUREMENT': 'HEIGHT', 'COUNTRY': 'LOCATION'})
# Merge on all common columns
df = pd.merge(dfA, dfB, on=list(set(dfA.columns) & set(dfB.columns)), how='outer')
# Only keep the columns that exists in a.csv
df = df[dfA.columns]
# Save to a new csv
df.to_csv("output.csv", index=False)
This should give you what you are after.
Upvotes: 2
Reputation: 87
You should specify left_on column for a dataframe and right_on column for b dataframe. So you have common data type with header "data_1" on a and common data in b with header "data_2". If you don't create a series on these dataframes with simply a['series_1'] = [i for i in range(a.shape[0])] and for b ==> b['series_1] = [i for i in range(b.shape[0]). Right now you have a common column. If not you can use a.merge(b,left_on='a_column',right_on='b_column') this will work. You can use reset_index() for common column. Using a = a.reset_index() will create an index from 1 to a.shape[0] and use it for b = b.reset_index(). merge() function automatically understand the common data types with same headers.
Upvotes: 0