Reputation: 2725
I have two CSV files that share similar headers:sample_scv_1.csv
is::
Transaction_date,Product,Payment_Type
1/2/09 6:17,Product1,Mastercard
1/2/09 4:53,Product1,Visa
1/2/09 13:08,Product1,Mastercard
1/3/09 14:44,Product1,Visa
1/4/09 12:56,Product2,Visa
1/4/09 13:19,Product1,Visa
Similarly, sample_scv_2.csv
is ::
Transaction_date,Product,Price,Name
1/1/09 11:05,Product1,1200,Janis
1/5/09 4:10,Product1,1200,Nicola
1/6/09 7:18,Product1,1200,asuman
1/2/09 1:11,Product1,1200,Lena
Here in these two files Columns/Fields Transaction_date
and Product
are common and I want to concatenate these files but it should consider the header information.
cat sample_scv_1.csv sample_scv_2.csv > combined_scv.csv
will concatenate the files but will not consider the header information. I can do so easily for smaller files in R
as:
sample_one <- read.csv("sample_scv_1.csv", header = TRUE, stringsAsFactors = FALSE)
sample_two <- read.csv("sample_scv_2.csv", header = TRUE, stringsAsFactors = FALSE)
common_headers <- intersect(colnames(sample_one), colnames(sample_two))
concat.samples <- rbind(sample_one[, common_headers, drop = FALSE], sample_two[, common_headers, drop = FALSE] )
write.csv(concat.samples, "combined_samples.csv")
My required output is ::
Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1
Since my files are big 20000 columns and 10000 fields, can this be done efficiently from command-line?
Upvotes: 1
Views: 94
Reputation: 785481
You can use cut
command with a tail -n +2
on 2nd file to remove duplicate header:
cut -d, -f1,2 file1.csv <(tail -n +2 file2.csv)
Based on comments below it seems OP doesn't want to specify common column numbers and wants this script to figure it out based on header row. These header fields can be in different order also.
Here is an awk script with extensive comments to make it work:
$> cat cmn.awk
function prnt(hdr) { # utilitiy function for printing using common header arrays
for (i=1; i<=length(hdr); i++)
printf "%s%s", (i>1?OFS:""), $(hdr[i])
print ""
}
BEGIN { FS=OFS="," }
NR == 1 { # for 1st record of 1st file in list
for(i=1; i<=NF; i++)
col[$i] = i # for 1st file save each col name and col num in array "col"
getline < ARGV[2] # read 1st record of 2nd file in list
for(i=1; i<=NF; i++) # for 1st record of 2nd file in list
if ($i in col) { # if this field exists in array "col"
hdr1[++k] = col[$i] # save common col num of 1st file in array "hdr1"
hdr2[k] = i # save common col num of 2nd file in array "hdr2"
# print header row
printf "%s%s", (k>1?OFS:""), $i
}
close(ARGV[2])
if (k) print ""
}
FNR == 1 {
next # skip first record from both files
}
FNR == NR {
prnt(hdr1) # print records from first file
next
}
{
prnt(hdr2) # print records from second file
}
Now use it as:
$> awk -f cmn.awk file1.csv file2.csv
Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1
Upvotes: 2
Reputation: 5298
Another solution with awk
:
awk -F, '!/^Transaction_date/ || NR==FNR {print $1, $2}' File1 File2
Transaction_date Product
1/2/09 6:17 Product1
1/2/09 4:53 Product1
1/2/09 13:08 Product1
1/3/09 14:44 Product1
1/4/09 12:56 Product2
1/4/09 13:19 Product1
1/1/09 11:05 Product1
1/5/09 4:10 Product1
1/6/09 7:18 Product1
1/2/09 1:11 Product1
Upvotes: 1
Reputation: 18391
awk 'BEGIN{FS=OFS=","; print "Transaction_date,Product" } NR==FNR && NR>1 {print $1,$2} NR!=FNR && FNR>1 {print $1,$2}' file1 file2
Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1
Explanation:
BEGIN{}
: This is used to make awk
aware of that fields are separated by ,
and print the header message.
NR==FNR
: Means first file's contents and NR>1
means skip the header of file1.
NF!=FNR
: Means content of files other then first file and NR>1
means , skip the header of file2.
{print $1,$2}
: Print column1 and colum2 of file1 and file2 ,which are separated by ,
.
Upvotes: 1