discipulus

Reputation: 2725

Concatenate files such that only the common columns are put together

I have two CSV files that share some of their headers. sample_scv_1.csv is:

Transaction_date,Product,Payment_Type
1/2/09 6:17,Product1,Mastercard
1/2/09 4:53,Product1,Visa
1/2/09 13:08,Product1,Mastercard
1/3/09 14:44,Product1,Visa
1/4/09 12:56,Product2,Visa
1/4/09 13:19,Product1,Visa

Similarly, sample_scv_2.csv is:

Transaction_date,Product,Price,Name
1/1/09 11:05,Product1,1200,Janis
1/5/09 4:10,Product1,1200,Nicola
1/6/09 7:18,Product1,1200,asuman
1/2/09 1:11,Product1,1200,Lena

In these two files the columns/fields Transaction_date and Product are common, and I want to concatenate the files in a way that takes the header information into account.

cat  sample_scv_1.csv  sample_scv_2.csv > combined_scv.csv 

will concatenate the files but will not take the header information into account. For smaller files I can do this easily in R:

sample_one <- read.csv("sample_scv_1.csv", header = TRUE, stringsAsFactors = FALSE)

sample_two <- read.csv("sample_scv_2.csv", header = TRUE, stringsAsFactors = FALSE)

common_headers <-  intersect(colnames(sample_one), colnames(sample_two))

concat.samples <- rbind(sample_one[, common_headers, drop = FALSE], sample_two[, common_headers, drop = FALSE] )

write.csv(concat.samples, "combined_samples.csv")

My required output is:

Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1

Since my files are big (20000 columns and 10000 fields), can this be done efficiently from the command line?

Upvotes: 1

Views: 94

Answers (3)

anubhava

Reputation: 785481

You can use the cut command, with tail -n +2 on the 2nd file to strip its duplicate header:

cut -d, -f1,2 file1.csv <(tail -n +2 file2.csv)
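For the sample files in the question that would be the following (note that <(...) is bash process substitution, so this needs bash rather than plain sh, and that cut requires the common columns to sit at known positions, here fields 1 and 2 in both files):

cut -d, -f1,2 sample_scv_1.csv <(tail -n +2 sample_scv_2.csv) > combined_scv.csv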

Based on the comments below, it seems the OP doesn't want to specify the common column numbers and wants the script to figure them out from the header row. The common header fields may also appear in a different order in each file.

Here is an awk script, with extensive comments, that does this:

$> cat cmn.awk

function prnt(hdr) {          # utility function: print the fields listed in a common-header array
   for (i=1; i<=length(hdr); i++)
      printf "%s%s", (i>1?OFS:""), $(hdr[i])
   print ""
}
BEGIN { FS=OFS="," }
NR == 1 {                     # for 1st record of 1st file in list
   for(i=1; i<=NF; i++)
        col[$i] = i           # for 1st file save each col name and col num in array "col"

   getline < ARGV[2]          # read 1st record of 2nd file in list

   for(i=1; i<=NF; i++)       # for 1st record of 2nd file in list
      if ($i in col) {        # if this field exists in array "col"    
         hdr1[++k] = col[$i]  # save common col num of 1st file in array "hdr1"    
         hdr2[k] = i          # save common col num of 2nd file in array "hdr2"
         # print header row
         printf "%s%s", (k>1?OFS:""), $i
      }

   close(ARGV[2])
   if (k) print ""

}
FNR == 1 {
   next                        # skip first record from both files
}
FNR == NR {
   prnt(hdr1)                  # print records from first file
   next
}
{
   prnt(hdr2)                  # print records from second file
}

Now use it as:

$> awk -f cmn.awk file1.csv file2.csv

Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1
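To save the merged result, redirect the output as usual:

$> awk -f cmn.awk file1.csv file2.csv > combined_scv.csv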

Upvotes: 2

Arjun Mathew Dan

Reputation: 5298

Another solution with awk:

awk -F, '!/^Transaction_date/ || NR==FNR {print $1, $2}' File1 File2

Transaction_date Product
1/2/09 6:17 Product1
1/2/09 4:53 Product1
1/2/09 13:08 Product1
1/3/09 14:44 Product1
1/4/09 12:56 Product2
1/4/09 13:19 Product1
1/1/09 11:05 Product1
1/5/09 4:10 Product1
1/6/09 7:18 Product1
1/2/09 1:11 Product1
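The columns come out space-separated because print $1, $2 joins them with the default output field separator. If you want comma-separated output like the sample in the question, the same command can set OFS as well (a small variation, shown here as a sketch):

awk -F, -v OFS=, '!/^Transaction_date/ || NR==FNR {print $1, $2}' File1 File2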

Upvotes: 1

P....

Reputation: 18391

awk  'BEGIN{FS=OFS=","; print "Transaction_date,Product" } NR==FNR && NR>1 {print $1,$2}  NR!=FNR && FNR>1 {print $1,$2}' file1 file2
Transaction_date,Product
1/2/09 6:17,Product1
1/2/09 4:53,Product1
1/2/09 13:08,Product1
1/3/09 14:44,Product1
1/4/09 12:56,Product2
1/4/09 13:19,Product1
1/1/09 11:05,Product1
1/5/09 4:10,Product1
1/6/09 7:18,Product1
1/2/09 1:11,Product1

Explanation:

BEGIN{}: Makes awk aware that the fields are separated by , and prints the header line.
NR==FNR : Matches the first file's contents, and NR>1 skips the header of file1.
NR!=FNR : Matches the contents of files other than the first, and FNR>1 skips the header of file2.
{print $1,$2} : Prints column 1 and column 2 of each file, separated by , (because OFS is set to a comma).

Upvotes: 1
