Using matching entries (shell scripting)

Question

I have two .csv files of same structure

ENSTGUG00000000001,C2CD2L
ENSTGUG00000000002,DPAGT1
ENSTGUG00000000003,EGR1
ENSTGUG00000000004,REEP2
ENSTGUG00000000005,
ENSTGUG00000000006,PIAS2
ENSTGUG00000000007,KDM3B
ENSTGUG00000000008,
ENSTGUG00000000010,HMBS
ENSTGUG00000000011,

and

ENSTGUG00000000001,
ENSTGUG00000000002,DPAGT1
ENSTGUG00000000003,EGR1
ENSTGUG00000000004,REEP2
ENSTGUG00000000005,
ENSTGUG00000000006,PIAS2
ENSTGUG00000000007,KDM3B
ENSTGUG00000000008,LOC100218573
ENSTGUG00000000010,HMBS
ENSTGUG00000000011,

I want to find elements in the 1st column of the 1st file for which the 2nd column is empty, and search for the corresponding in the 2nd column of the 2nd file.

So that the output would be

ENSTGUG00000000001,C2CD2L
ENSTGUG00000000002,DPAGT1
ENSTGUG00000000003,EGR1
ENSTGUG00000000004,REEP2
ENSTGUG00000000005,
ENSTGUG00000000006,PIAS2
ENSTGUG00000000007,KDM3B
ENSTGUG00000000008,LOC100218573
ENSTGUG00000000010,HMBS
ENSTGUG00000000011,

This way I can have 1st file as complete as possible.

William Pursell · Accepted Answer

Assuming your files line up line by line:

awk '{getline s < "file2.csv"; split( s, a, ",")} 
      !$2 && a[2] { $2 = a[2] }1' FS=, file1.csv

This is not even remotely robust, and if your files don't match exactly will be completely worthless, but it gives you somewhere to start.

It's fairly self-explanatory... awk by nature reads the input file one line at a time. Each time it reads a line, it performs the commands in the script. In this case, it reads one line from file2.csv and splits it on the ,. Then it checks to see if the second field of the line read from file1.csv evaluates as a boolean to false (so if you have entries that are integer 0, they may get overwritten) and that the second field in the line from file 2 is non-zero. If that's true, it assigns the field from file2 to the 2nd field of the current input line. Then it outputs the line.

Using matching entries (shell scripting)

Answers (2)

Related Questions