Reputation: 61
I have a requirement to combine multiple CSV files into a single file. The challenge is that the files differ in the number of columns they contain. My idea is to read the column names from each file and, while combining, match each incoming column name against the columns of the big combined file and store the value accordingly. If a column name is not found, store N/A or null in the combined file.
Does this look feasible, or is there a better approach?
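For reference, the two-pass approach described above can be sketched with Python's standard csv module. The file names and sample data here are made up for illustration:

```python
import csv

# Two small sample files created purely for illustration.
with open("a.csv", "w", newline="") as f:
    f.write("column1,column2\n1,3\n2,4\n")
with open("b.csv", "w", newline="") as f:
    f.write("column1,column3\n5,7\n6,8\n")

files = ["a.csv", "b.csv"]

# First pass: collect the union of all column names, in order of appearance.
fieldnames = []
for path in files:
    with open(path, newline="") as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every row; restval fills columns a file lacks with "N/A".
with open("combined.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="N/A")
    writer.writeheader()
    for path in files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                writer.writerow(row)
```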
Regards, Arun
Upvotes: 1
Views: 1507
Reputation: 1024
I did this using Miller, a tool for working with CSV data. You can install it with brew, choco, or yum. Then change to the directory containing your files and run the command below to save the result to result.csv:
mlr --csv reshape -r "^A" -o item,value then reshape -s item,value then unsparsify --fill-with "" *.csv > result.csv
Some notes about the command: unsparsify fills in keys that are absent from a given record, using the value passed to --fill-with (here, the empty string), so all output rows end up with the same set of columns.
Upvotes: 1
Reputation: 748
I recommend reading the CSV files into pandas DataFrames and combining them with concat. Say you have two DataFrames, df1 and df2. This code combines them, leaving NaN where a column is not present:
import pandas as pd
df1 = pd.DataFrame({'column1': [1, 2],
                    'column2': [3, 4]})
df2 = pd.DataFrame({'column1': [5, 6],
                    'column3': [7, 8]})
pd.concat([df1, df2], ignore_index=True)
Output is like this:
column1 column2 column3
0 1 3.0 NaN
1 2 4.0 NaN
2 5 NaN 7.0
3 6 NaN 8.0
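In practice you would read the DataFrames from disk rather than build them by hand. A sketch of the full round trip, using made-up file names and the same sample data as above:

```python
import pandas as pd

# Two small sample files created purely for illustration.
pd.DataFrame({'column1': [1, 2], 'column2': [3, 4]}).to_csv('df1.csv', index=False)
pd.DataFrame({'column1': [5, 6], 'column3': [7, 8]}).to_csv('df2.csv', index=False)

# Read each file and concatenate; columns absent from a file become NaN.
frames = [pd.read_csv(p) for p in ['df1.csv', 'df2.csv']]
combined = pd.concat(frames, ignore_index=True)

# na_rep controls how NaN is written in the output file.
combined.to_csv('merged.csv', index=False, na_rep='N/A')
```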
Upvotes: 2