Arun
Arun

Reputation: 61

combine multiple csv files with different columns

I have a requirement to combine multiple csv files into one single file. Challenge is :- All these csv files differ in the number of columns present in them. So I was thinking if there is any way to read all the column names from these files and then while combining these csv files , I can read the respective column name from the incoming file , match it with the column name from the big combined file and accordingly save the value. In case the column name is not found , then store N/A or null in the combined file.

Does it look feasible or any other better approach ?

Regards, Arun

Upvotes: 1

Views: 1507

Answers (2)

Fariman Kashani
Fariman Kashani

Reputation: 1024

I did this using miller. It's a tool for working with CSV data. You can install it using brew, choco or yum and then head to the directory that your files exist. Then use the command below to save the result in result.csv file.

mlr --csv reshape -r "^A" -o item,value then reshape -s item,value \ then unsparsify --fill-with "" *.csv > result.csv

Some notes about the command:

  • reshape -r "^A" -o item,value, to transform the input CSVs from wide to long, applying this to all the fields whose name begins with "A";
  • reshape -s item,value, to transform the previous output from long to wide;
  • unsparsify --fill-with "", to manage field names over all input records. For field names absent in a given record but present in others, fills in the value "";

Upvotes: 1

Amin Gheibi
Amin Gheibi

Reputation: 748

I recommend reading CSV files into pandas DataFrames. Then you can combine them by using concat. Let's say you have two pandas dataframes, df1 and df2. This code will combine them and leaves NaN when it is not applicable:

import pandas as pd

df1 = pd.DataFrame({'column1': [1,2],
                    'column2': [3,4],
                    })

df2 = pd.DataFrame({'column1': [5,6],
                    'column3': [7,8],
                    })
pd.concat([df1,df2],ignore_index=True)

Output is like this:

   column1  column2  column3
0        1      3.0      NaN
1        2      4.0      NaN
2        5      NaN      7.0
3        6      NaN      8.0

Upvotes: 2

Related Questions