Caio Rocha

Reputation: 332

How can I merge two files by column with awk?

I have the two following text files:

file1

-7.7
-7.4
-7.3
-7.3
-7.3

file2

4.823
5.472
5.856
4.770
4.425

And I want to merge them side by side, separated by a comma:

file3

-7.7,4.823
-7.4,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425

I know this can be easily done with paste -d ',' file1 file2 > file3, but I want a solution that allows me to have control over each iteration, since my dataset is big and I also need to add other columns to the output file. E.g.:

A,-7.7,4.823,3
A,-7.4,5.472,2
B,-7.3,5.856,3
A,-7.3,4.770,1
B,-7.3,4.425,1

Here's what I got so far:

awk 'NR==FNR {a[$count]=$1; count+=1; next} {print a[$count] "," $1; count+=1;}' file1 file2 > file3

Output:

-7.3,4.823
-7.3,5.472
-7.3,5.856
-7.3,4.770
-7.3,4.425

I am new to bash and awk, so a detailed response would be appreciated :)

Edit:
Suppose I have a directory with pairs of files ending with two extensions, .ext1 and .ext2. Those files have parameters included in their names; for example, file_0_par1_par2.ext1 has its pair, file_0_par1_par2.ext2. Each file contains 5 values. I have a function to extract the serial number and the parameters from a file's name. My goal is to write, to a single CSV file (file_out.csv), the values present in the files along with the parameters extracted from their names.
Code:

for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's an O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
            do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv" 
            done
        fi
    done
done

Upvotes: 2

Views: 1723

Answers (5)

Thomas Blankenhorn

Reputation: 256

Two pieces of advice that might help:

First, I suspect an Awk script that does what you want will be too long for a one-liner. I would write a multi-line script that takes file1 and file2 as arguments, and store it in a file called mymerge.awk or whatever. Here is a skeleton:

#!/usr/bin/awk -f

BEGIN {
    file1=ARGV[1]; file2=ARGV[2]
}

# The guts of your script go here.

Then you can simply make your script executable (chmod +x mymerge.awk) and call it from the shell: mymerge.awk file1 file2. This approach has the advantage of making your script easy to read, reuse, and maintain.

The second piece of advice: Use Awk's getline < file1 to read data from file1 instead of stdin. Likewise for file2. To store the lines you just read in variables, you can say

getline var1 < file1; getline var2 < file2

The GNU Awk User's Guide has a comprehensive and readable description of getline and how to use it.
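Putting both pieces of advice together, a minimal sketch might look like the following (untested beyond the sample data from the question, which is recreated here for illustration):

```shell
# Recreate the question's sample files.
printf '%s\n' -7.7 -7.4 -7.3 -7.3 -7.3 > file1
printf '%s\n' 4.823 5.472 5.856 4.770 4.425 > file2

awk '
BEGIN {
    file1 = ARGV[1]; file2 = ARGV[2]
    # Pull one line from each file per iteration; stop when either runs out.
    while ((getline var1 < file1) > 0 && (getline var2 < file2) > 0)
        print var1 "," var2
    close(file1); close(file2)
}' file1 file2 > file3

cat file3
# -7.7,4.823
# -7.4,5.472
# -7.3,5.856
# -7.3,4.770
# -7.3,4.425
```

Because everything happens inside the while loop, you have a natural place to compute and print any extra columns on each iteration. The same program can of course live in an executable mymerge.awk file as suggested above.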

I won't be able to write and test a working script for you tonight, but I hope this will help you make some progress.

Upvotes: 0

Ed Morton

Reputation: 204558

The way to efficiently do what your updated question describes:

Suppose I have a directory with pairs of files ending with two extensions, .ext1 and .ext2. Those files have parameters included in their names; for example, file_0_par1_par2.ext1 has its pair, file_0_par1_par2.ext2. Each file contains 5 values. I have a function to extract the serial number and the parameters from a file's name. My goal is to write, to a single CSV file (file_out.csv), the values present in the files along with the parameters extracted from their names.

for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's an O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2;
            do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv" 
            done
        fi
    done
done

would be (untested):

for file1 in *.ext1; do
    base="${file1%.*}"
    file2="${base}.ext2"
    paste -d ',' "$file1" "$file2" |
    awk -v base="$base" '
        BEGIN { split(base,b,/_/); FS=OFS="," }
        { print b[3], b[4], $1, $2, b[2] }
    '
done > 'file_out.csv'

Doing base="${file1%.*}"; file2="${base}.ext2" is by itself N^2 times more efficient (given N pairs of files) than for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then. On top of that, | awk '...' is an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). So you can expect a huge improvement in performance over your existing script.
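For a base name like file_0_par1_par2, the split(base,b,/_/) call above yields b[1]="file", b[2]="0" (the serial number), b[3]="par1", and b[4]="par2", which is the indexing the print statement assumes. A quick way to check the indices against one of your own names (the name below is the question's example, not a real file):

```shell
# Show how awk's split() indexes an underscore-separated base name.
echo 'file_0_par1_par2' |
awk '{ n = split($0, b, /_/); for (i = 1; i <= n; i++) print i, b[i] }'
# 1 file
# 2 0
# 3 par1
# 4 par2
```

If your real names contain a different number of underscore-separated parts, adjust the b[...] indices in the print statement accordingly.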

Upvotes: 2

Walter A

Reputation: 20032

Your command fails:

awk 'NR==FNR {a[$count]=$1; count+=1; next} {print a[$count] "," $1; count+=1;}' file1 file2 > file3

Don't use $count (a field reference) but count, start count at 1, and reset it to 1 when you start reading file2. That reset can be written as FNR==1 {count=1}, or simply {count=FNR}.
But when count is always the same as FNR, why use count at all?

awk 'NR==FNR {a[FNR]=$1; next} {print a[FNR] "," $1; }' file1 file2

Upvotes: 2

terafl0ps

Reputation: 704

awk 'BEGIN {FS=","} {getline file2_line < "file2.txt"; print $1","file2_line }' file1.txt

The BEGIN block sets the field separator to a comma, but this only applies to the data in file1.txt.

The first statement in the main body of the script stores the value of this line from file2.txt into a variable named file2_line. This variable contains the entire line from file2.txt, and the data in the line is not split into fields in the usual fashion. This means that if file2.txt were also comma-delimited, you would probably want to use awk's split function to split the string into an array in order to work with the individual fields.

In awk, one concatenates by simply writing string values one after another, so print $1","file2_line writes the first field from the first file, a literal comma, and the string value for this line of file2.txt which we stored earlier.
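A sketch of that split approach, assuming file2.txt were comma-delimited (the two-column sample data below is invented for illustration):

```shell
# Invented sample data: file2.txt now carries two comma-separated columns.
printf '%s\n' '4.823,x' '5.472,y' > file2.txt
printf '%s\n' -7.7 -7.4 > file1.txt

awk '{
    getline file2_line < "file2.txt"
    split(file2_line, f, ",")   # f[1], f[2] hold the fields of file2_line,
                                # since getline var does no field splitting
    print $1 "," f[1] "," f[2]
}' file1.txt
# -7.7,4.823,x
# -7.4,5.472,y
```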

Upvotes: 0

DamianK

Reputation: 382

You can still use your paste solution. Just add a while loop, for example, to take control of each iteration.

paste -d ',' file1 file2 | while IFS="," read -r lineA lineB;
do
    # you can build new file here like you need
    echo "$lineA,$lineB"
done

Upvotes: 2
