Reputation: 1
I have two files. the first is FileA, it consists of three colums (tab-delimited), see content below
FileA
House1 dog blue
House2 cat yellow
House3 bird red
House4 cow orange
House5 duck black
House6 cow brown
The second is FileB, contains a two colunms (tab-delimited) FileB Column 1 contains a list of all unnique values in FileA column 2. FileB column 2 contains a list of values that I want their corresponding value in Colum 1 to be replaced by in FileA
FileB
dog 1
cat 2
cow 3
duck 4
bird 5
In other words I wish to find all of the values in FileA column 2 that match with FileB column 1 and replace them with their corresponding value in File B Column 2, and output the new FileA to a new file ("FileA2")
The output for the new FileA (FileA2) should look like this (i.e names column 2 have all been changed to numbers, with each change corresponding to find-replace criteria in FileB)
House1 1 blue
House2 2 yellow
House3 5 red
House4 3 orange
House5 4 black
House6 3 brown
Are there any awk, grep or sed one-liners that would do this?
Note: in real life my "FileA" has over 800,000 lines and over 4000 unique values in in colum 2, so doing it via my Linux terminal would be preferable.
Also, for future reference, are there any one-liners to extract a list of unique values from a single column in a given file?
Thank you in advance for your kind assistance.
Upvotes: 0
Views: 152
Reputation: 11237
The following should do the trick (bash
) :
#!/usr/bin/env bash
join -t $'\t' \
<(sort -t $'\t' -k 2 FileA) \
<(sort -t $'\t' -k 1 FileB) \
-1 2 \
-2 1 | \
sort -t $'\t' -k 2 | \
awk -F '\t' 'BEGIN { OFS="\t" } {print $2, $4, $3}'
First, files need to be sorted to use join
command :
sort -t $'\t' -k 2 FileA
sort -t $'\t' -k 1 FileB
Here, we are sorting FileA
according to the 2nd column (-k 2
) and FileB
according to the 1st column (-k 1
). Tab is used for both files as separators (-t $'\t'
).
Note : Sorting the inputs could be done in a preparation step (storing sorted results in temporary files) to avoid re-sorting the files each time the whole command is executed).
Next, the join
command takes as parameters :
-t $'\t'
)<(sort ...)
)-1 2
= 1st file, 2nd column-2 1
= 2nd file, 1st columnOnce joined, and to get the output you need (sorted against the 2nd column of the input, the one containing House
), we need to sort the output of the previous command by running sort -t $'\t' -k 2
.
Finally, awk
is used to get only needed columns (2nd column, 4th column and 3rd column, in that order), separated by \t
(BEGIN { OFS="\t" }
).
This command is tested on Linux host. You can test this command in same environment with Docker; in your current directory, you must have FileA
, FileB
and join.sh
, an executable script containing the first command.
Running :
docker run --rm -v $(pwd):/tmp -w /tmp debian:buster ./join.sh
returns :
House1 1 blue
House2 2 yellow
House3 5 red
House4 3 orange
House5 4 black
House6 3 brown
Upvotes: 0