Geroge

Reputation: 561

Replace string according to dictionary file in awk

cat input

aaa paul peter
bbb john mike
ccc paul mike 
bbb paul john

And my dictionary file dict:

cat dict

aaa OOO
bbb 111
ccc 222

I need to look up the first column of each line in input and, if it matches the first column in dict, replace it with the second column from dict. I could use sub and gsub, but my dict file has thousands of rows (with different letters).

cat output:

000 paul peter
111 john mike
222 paul mike 
111 paul john

Thank you for any help.

My solution:

  awk:

awk '{sub(/aaa/,"000",$1); sub(/bbb/,"111",$1); sub(/ccc/,"222",$1)}1' input

UPDATE:

If the first column of an input line has no match in dict, keep that word unchanged.

cat input

aaa paul peter
bbb john mike
ccc paul mike 
bbb paul john
ddd paul peter

cat dict

aaa OOO
bbb 111
ccc 222

cat output:

000 paul peter
111 john mike
222 paul mike 
111 paul john
ddd paul peter

Upvotes: 3

Views: 3182

Answers (4)

Josiah DeWitt

Reputation: 1802

awk was way faster for this operation, but here is a pure bash solution.

#!/bin/bash

typeset -A dict

function add_dict()
{
   dict[$1]=$2
}

add_dict aaa 000
add_dict bbb 111
add_dict ccc 222

while read row
do
   column=(${row//:/ })
   if [ "${dict[${column[0]}]}" ];then
      echo ${dict[${column[0]}]} ${column[1]} ${column[2]}
   else
      echo ${column[0]} ${column[1]} ${column[2]}
   fi 
done < /tmp/1M.txt

#1 Million lines processed in
#real   0m40.173s
#user   0m37.668s
#sys    0m2.462s

#time awk 'NR==FNR{a[$1]=$2;next}{if ($1 in a)print a[$1],$2,$3; else print $0}' dict 1M.txt > processed.txt

#real   0m0.281s
#user   0m0.242s
#sys    0m0.024s
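Since the question mentions thousands of dictionary rows, the hardcoded add_dict calls could be replaced by a loop that reads dict directly. This is only a minimal sketch, assuming bash 4 or later (for associative arrays) and whitespace-separated, two-column dict lines; the function name translate_first_column and the scratch files are hypothetical and not part of the original answer.

```shell
#!/bin/bash
# Sketch only: requires bash >= 4 for associative arrays.
# Recreate small sample files in a scratch directory (illustrative data).
tmp=$(mktemp -d)
printf '%s\n' 'aaa OOO' 'bbb 111' 'ccc 222' > "$tmp/dict"
printf '%s\n' 'aaa paul peter' 'bbb john mike' 'ddd paul peter' > "$tmp/input"

# translate_first_column DICT INPUT   (hypothetical helper name)
translate_first_column() {
   local -A dict
   local key value first rest
   # Load every "key value" pair from the dictionary file.
   while read -r key value; do
      if [ -n "$key" ]; then
         dict[$key]=$value
      fi
   done < "$1"
   # Split each input line into its first field and the remainder,
   # then swap the first field only when it is a known key.
   while read -r first rest; do
      if [ -n "$first" ] && [ -n "${dict[$first]+set}" ]; then
         echo "${dict[$first]}${rest:+ $rest}"
      else
         echo "$first${rest:+ $rest}"
      fi
   done < "$2"
}

output=$(translate_first_column "$tmp/dict" "$tmp/input")
echo "$output"
rm -rf "$tmp"
```

Reading the remainder of the line into a single variable avoids assuming exactly three columns, but it does not make the loop any faster than the timings above suggest.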

Upvotes: 1

Inian

Reputation: 85653

A more general approach, as suggested by fedorqui in the comments, also handles keys in input that have no entry in dict:

awk 'FNR==NR {dict[$1]=$2; next} {$1=($1 in dict) ? dict[$1] : $1}1' dict input

My original solution below works only when every key in input has a mapping in dict.

awk 'FNR==NR{hash[$2FS$3]=$1; next}{for (i in hash) if (match(hash[i],$1)){print $2, i} }' input dict
OOO paul peter
111 john mike
111 paul john
222 paul mike

The idea is to build a hash map from the input file with $2FS$3 as the index and $1 as the value, i.e. hash["paul peter"]="aaa", etc. Once it is built, each line of dict is read and every hash entry is checked to see whether its value matches $1 of that dict line. When a match is found, the second column of dict is printed followed by the index.

Upvotes: 6

JFS31

Reputation: 518

Changed my answer to:

awk 'NR==FNR{a[$1]=$2;next}{if ($1 in a)print a[$1],$2,$3; else print $0}' dict input

prints

OOO paul peter
111 john mike
222 paul mike
111 paul john
ddd paul peter

The condition NR==FNR is true only while the first file is being read, so the block that follows it is executed only for dict. Each line of dict is stored in the array a with $1 as the key and $2 as the value. For the second file, $1 in a checks whether the first field exists as a key in a. If it does, a[$1] prints the replacement, followed by $2 and $3 (the names). The additional else clause prints the whole line from input unchanged when no match is found.
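To see why NR==FNR selects only the first file, the following sketch prints both counters for every line and then runs the lookup itself. The file names dict and input come from the question; the scratch directory, the shortened sample data, and the variable names counters and result are assumptions made for illustration.

```shell
# Recreate small sample files in a scratch directory (illustrative data).
tmp=$(mktemp -d)
printf '%s\n' 'aaa OOO' 'bbb 111' 'ccc 222' > "$tmp/dict"
printf '%s\n' 'aaa paul peter' 'ddd paul peter' > "$tmp/input"

# NR counts lines across all files; FNR restarts at 1 for each file.
# They are equal only while the first file (dict) is being read.
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, (NR==FNR ? "first" : "later") }' dict input)
echo "$counters"

# First pass fills the array, second pass looks up $1 and falls back to $0.
result=$(cd "$tmp" && awk 'NR==FNR { a[$1]=$2; next }
   { if ($1 in a) print a[$1], $2, $3; else print $0 }' dict input)
echo "$result"
rm -rf "$tmp"
```

The first command reports "first" for the three dict lines and "later" for the two input lines, where NR has reached 4 and 5 while FNR is back at 1 and 2.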

Upvotes: 2

Aaron

Reputation: 24812

I think you could use GNU join for this:

sort input > sorted_input
sort dict > sorted_dict
join sorted_dict sorted_input -o 1.2,2.2,2.3

This gives the following output with your example data (note that sorting changes the line order, but join requires sorted input):

OOO paul peter
111 john mike
111 paul john
222 paul mike

All of this relies on the join field being the first of each file, otherwise you'll need to specify which field the files should be joined on.

The -o parameter is an output format specification that lists the fields of each file we want in the output: the second field of dict, followed by every field of input except the first.

You've mentioned that some keys might not be found in dict and that you want to keep the value from the first field of input in that case. There's a -a option to handle that, but it will interfere with our output specification, so I think the easiest approach is to run join twice: a first run which outputs the lines that have a correspondence in both files, and a second run which handles the lines without a correspondence in dict:

$ join sorted_dict sorted_input -o 1.2,2.2,2.3; join sorted_dict sorted_input -v 2
OOO paul peter
111 john mike
111 paul john
222 paul mike
ddd paul peter

If that adds too much overhead because of the size of the files, you should instead do a single execution with -a 2 and no output specification, then transform the result with sed, awk or something else to handle the lines that are missing a field.
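The single-pass variant described above might look like the following sketch. It assumes every line of input has exactly three fields, so a joined line has four fields and an unmatched one has three; the scratch directory and the variable name joined are illustrative, and LC_ALL=C is added here only to keep sort and join on the same collation.

```shell
# Recreate the question's sample files in a scratch directory (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'aaa OOO' 'bbb 111' 'ccc 222' > "$tmp/dict"
printf '%s\n' 'aaa paul peter' 'bbb john mike' 'ccc paul mike' \
   'bbb paul john' 'ddd paul peter' > "$tmp/input"

# join needs both files sorted on the join field with the same collation.
LC_ALL=C sort "$tmp/dict"  > "$tmp/sorted_dict"
LC_ALL=C sort "$tmp/input" > "$tmp/sorted_input"

# -a 2 also emits lines of input that have no partner in dict.
# Matched lines look like "key value f2 f3" (4 fields): drop the key.
# Unmatched lines look like "key f2 f3" (3 fields): print them unchanged.
joined=$(LC_ALL=C join -a 2 "$tmp/sorted_dict" "$tmp/sorted_input" |
   awk 'NF == 4 { print $2, $3, $4; next } { print }')
echo "$joined"
rm -rf "$tmp"
```

Counting fields is a fragile way to tell matched from unmatched lines; if input can have a varying number of columns, the awk two-file approach from the other answers is probably the safer choice.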

Upvotes: 2
