user10234826
user10234826

Reputation: 1

Find and replace values in colum in File A, based on Find/Replace criteria in FileB

I have two files. the first is FileA, it consists of three colums (tab-delimited), see content below

FileA

House1   dog    blue
House2   cat    yellow
House3   bird   red
House4   cow   orange
House5   duck   black 
House6   cow    brown

The second is FileB, contains a two colunms (tab-delimited) FileB Column 1 contains a list of all unnique values in FileA column 2. FileB column 2 contains a list of values that I want their corresponding value in Colum 1 to be replaced by in FileA

FileB

dog     1
cat     2
cow     3
duck    4
bird    5 

In other words I wish to find all of the values in FileA column 2 that match with FileB column 1 and replace them with their corresponding value in File B Column 2, and output the new FileA to a new file ("FileA2")

The output for the new FileA (FileA2) should look like this (i.e names column 2 have all been changed to numbers, with each change corresponding to find-replace criteria in FileB)

House1   1  blue
House2   2  yellow
House3   5   red
House4   3  orange
House5   4   black 
House6   3   brown

Are there any awk, grep or sed one-liners that would do this?

Note: in real life my "FileA" has over 800,000 lines and over 4000 unique values in in colum 2, so doing it via my Linux terminal would be preferable.

Also, for future reference, are there any one-liners to extract a list of unique values from a single column in a given file?

Thank you in advance for your kind assistance.

Upvotes: 0

Views: 152

Answers (1)

norbjd
norbjd

Reputation: 11237

The following should do the trick (bash) :

#!/usr/bin/env bash

join -t $'\t' \
    <(sort -t $'\t' -k 2 FileA) \
    <(sort -t $'\t' -k 1 FileB) \
    -1 2 \
    -2 1 | \
sort -t $'\t' -k 2 | \
awk -F '\t' 'BEGIN { OFS="\t" } {print $2, $4, $3}'

Explanations

First, files need to be sorted to use join command :

sort -t $'\t' -k 2 FileA
sort -t $'\t' -k 1 FileB

Here, we are sorting FileA according to the 2nd column (-k 2) and FileB according to the 1st column (-k 1). Tab is used for both files as separators (-t $'\t').

Note : Sorting the inputs could be done in a preparation step (storing sorted results in temporary files) to avoid re-sorting the files each time the whole command is executed).

Next, the join command takes as parameters :

  • the separator (-t $'\t')
  • the sorted inputs (using <(sort ...))
  • the columns used for joining :
    • -1 2 = 1st file, 2nd column
    • -2 1 = 2nd file, 1st column

Once joined, and to get the output you need (sorted against the 2nd column of the input, the one containing House), we need to sort the output of the previous command by running sort -t $'\t' -k 2.

Finally, awk is used to get only needed columns (2nd column, 4th column and 3rd column, in that order), separated by \t (BEGIN { OFS="\t" }).

Test

This command is tested on Linux host. You can test this command in same environment with Docker; in your current directory, you must have FileA, FileB and join.sh, an executable script containing the first command.

Running :

docker run --rm -v $(pwd):/tmp -w /tmp debian:buster ./join.sh

returns :

House1  1       blue
House2  2       yellow
House3  5       red
House4  3       orange
House5  4       black
House6  3       brown

Upvotes: 0

Related Questions