Reputation: 561
cat input
aaa paul peter
bbb john mike
ccc paul mike
bbb paul john
And my dictionary file dict:
cat dict
aaa OOO
bbb 111
ccc 222
For each line of input, I need to look up the first column in the first column of dict and, if it matches, replace it with the second column from dict. I can use sub and gsub, but dict has thousands of rows (with different letters).
cat output:
000 paul peter
111 john mike
222 paul mike
111 paul john
Thank you for any help.
My solution:
awk:
awk '{sub(/aaa/,"000",$1); sub(/bbb/,"111",$1); sub(/ccc/,"222",$1)}1' input
UPDATE:
If a key from input has no match in dict, keep the word in the first column unchanged.
cat input
aaa paul peter
bbb john mike
ccc paul mike
bbb paul john
ddd paul peter
cat dict
aaa OOO
bbb 111
ccc 222
cat output:
000 paul peter
111 john mike
222 paul mike
111 paul john
ddd paul peter
Upvotes: 3
Views: 3182
Reputation: 1802
awk was waaay faster for the operation, but here is a pure bash solution.
#!/bin/bash
# Associative array mapping first-column keys to their replacements
typeset -A dict
function add_dict()
{
    dict[$1]=$2
}
add_dict aaa 000
add_dict bbb 111
add_dict ccc 222
while read -r row
do
    # Split the row into fields (colons are treated as separators too)
    column=(${row//:/ })
    if [ "${dict[${column[0]}]}" ]; then
        echo "${dict[${column[0]}]} ${column[1]} ${column[2]}"
    else
        echo "${column[0]} ${column[1]} ${column[2]}"
    fi
done < /tmp/1M.txt
#1 Million lines processed in
#real 0m40.173s
#user 0m37.668s
#sys 0m2.462s
#time awk 'NR==FNR{a[$1]=$2;next}{if ($1 in a)print a[$1],$2,$3; else print $0}' dict 1M.txt > processed.txt
#real 0m0.281s
#user 0m0.242s
#sys 0m0.024s
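Since the question mentions thousands of dictionary rows, the hard-coded add_dict calls could be replaced by a loop that reads the dictionary file. The following is only a sketch under assumptions: it needs bash 4 or later for associative arrays, the function name translate_first_column and its two arguments are made up for illustration, and each dict line is assumed to be "key value". Given the timings above, awk is still likely to be much faster.

```shell
#!/bin/bash
# Sketch: same lookup as the script above, but the associative array is
# filled from a dictionary file instead of hard-coded add_dict calls.
# translate_first_column and its arguments are hypothetical names.
translate_first_column() {
    local dict_file=$1 input_file=$2
    local -A dict=()
    local key value rest

    # Each dict line is "key value"; any further fields are ignored.
    while read -r key value _; do
        [ -n "$key" ] && dict[$key]=$value
    done < "$dict_file"

    # Split off the first field and keep the remainder of the line as-is.
    while read -r key rest; do
        # Replace the key only when it exists in the dictionary.
        if [ -n "$key" ] && [ -n "${dict[$key]+set}" ]; then
            key=${dict[$key]}
        fi
        printf '%s\n' "$key${rest:+ $rest}"
    done < "$input_file"
}
```

It would be called as translate_first_column dict input, and it keeps the original line order because nothing is sorted.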
Upvotes: 1
Reputation: 85653
A more general approach, suggested by fedorqui in the comments, also handles keys in input that have no entry in dict:
awk 'FNR==NR {dict[$1]=$2; next} {$1=($1 in dict) ? dict[$1] : $1}1' dict input
My original solution below works only when every key in input has a mapping in dict.
awk 'FNR==NR{hash[$2FS$3]=$1; next}{for (i in hash) if (match(hash[i],$1)){print $2, i} }' input dict
OOO paul peter
111 john mike
111 paul john
222 paul mike
The idea is to build a hash map from input with $2FS$3 as the index and $1 as the value, i.e. hash["paul peter"]="aaa", etc. Once it is built, each line of the dictionary file is read and its $1 is matched against the hash values that came from input. When a match is found, the second column of dict is printed followed by the stored names.
Upvotes: 6
Reputation: 518
Changed my answer to:
awk 'NR==FNR{a[$1]=$2;next}{if ($1 in a)print a[$1],$2,$3; else print $0}' dict input
prints
OOO paul peter
111 john mike
222 paul mike
111 paul john
ddd paul peter
The condition NR==FNR is true only while the first file is being read, so the block that follows it runs only for dict. Each of its lines is stored in the array a with $1 as the key and $2 as the value. For the second file, $1 in a checks whether the first field exists as a key in a. If it does, a[$1] prints the number and $2 and $3 print the names. The additional else clause prints the whole line from input unchanged when no match is found.
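To see why NR==FNR singles out the first file, it can help to print both counters for every line. This is just an illustrative sketch; the function name show_counters is invented here and is not part of the answer.

```shell
#!/bin/sh
# Print awk's two record counters for every line of the files given.
# NR counts records across all files; FNR restarts at 1 for each file,
# so NR == FNR holds only while the first file is being read.
show_counters() {
    awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, (NR == FNR ? "first-file" : "later-file") }' "$@"
}
```

Running show_counters dict input with the sample files labels the three dict lines first-file and every input line later-file. One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom silently misbehaves in that case.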
Upvotes: 2
Reputation: 24812
I think you could effectively use GNU join:
sort input > sorted_input
sort dict > sorted_dict
join sorted_dict sorted_input -o 1.2,2.2,2.3
This gives the following output with your example data (note that sorting changes the line order, but it is necessary for join to work):
OOO paul peter
111 john mike
111 paul john
222 paul mike
All of this relies on the join field being the first of each file, otherwise you'll need to specify which field the files should be joined on.
The -o parameter is an output format specification that selects the fields of each file we want in the output: the second field of dict, followed by every field but the first of input.
You've mentioned that some keys might not be found in dict and that you want to keep the value from the first field of input in that case. There's a -a option to handle that, but it will mess with our output, so I think the easiest way is to do two executions: a first one that outputs the lines with a correspondence in each file, and a second one that handles the lines without a correspondence in dict:
$ join sorted_dict sorted_input -o 1.2,2.2,2.3; join sorted_dict sorted_input -v 2
OOO paul peter
111 john mike
111 paul john
222 paul mike
ddd paul peter
If that adds too much overhead because of the size of the files, you should instead do a single execution with -a 2, without an output specification, and then transform the result with sed, awk or something else to handle the lines with the missing field.
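The single-pass variant could look roughly like the sketch below. It assumes every input line has exactly three fields and every dict line exactly two, so a joined line with four fields is a match and anything else is an unpaired input line. The function name translate_with_join and the temporary files are illustrative only.

```shell
#!/bin/sh
# Sketch: one join pass with -a 2, then awk drops the key from matched lines.
# Assumes 3 fields per input line and 2 fields per dict line.
translate_with_join() {
    dict_file=$1
    input_file=$2
    sorted_dict=$(mktemp)
    sorted_input=$(mktemp)

    # Sort both files on the join field, using the same locale as join.
    LC_ALL=C sort -k 1b,1 "$dict_file" > "$sorted_dict"
    LC_ALL=C sort -k 1b,1 "$input_file" > "$sorted_input"

    # Matched lines come out as "key value f2 f3" (4 fields);
    # unpairable input lines come out unchanged (3 fields).
    LC_ALL=C join -a 2 "$sorted_dict" "$sorted_input" |
        awk 'NF == 4 { print $2, $3, $4; next } { print }'

    rm -f "$sorted_dict" "$sorted_input"
}
```

As with the two-pass version, the output follows the sorted order of the keys and not the original order of input.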
Upvotes: 2