Vensmira

Reputation: 13

Search for a string in file2 based on file1 and replace

I'm new to shell scripting and need guidance on what I assume is a typical requirement. I have two files: a master file and a pattern file. The master file contains many fields delimited by |, and only the 10th and 15th fields need to be updated based on the pattern file.

Master file:

H|20170101

123|field2|field3|...|field10|field11...|field15|....|field150

...

...

T|1000000

Pattern file:

Europe|EU

Australia|AU

China|CN

For example,

123|1|2|3|...|9|nice weather in europe today|11|.....

The line above needs to be changed to

123|1|2|3|...|9|nice weather in EU today|11|.....

I began with a simple sed command that rewrites the master file using values from the pattern file. But it's incomplete, as I'm not sure how to process a huge master file, or how to restrict the replacement to specific fields.

while read line

do

value1=$(echo $line | awk -F"|" '{print $1}')

value2=$(echo $line | awk -F"|" '{print $2}')

sed -i 's/ '${value1}' /'${value2}'/g' master.txt

done < pattern.txt

The script above is very slow even for a 10 MB file, whereas my master file is considerably larger (100 MB).

Please help.

Upvotes: 0

Views: 399

Answers (3)

cdarke

Reputation: 44344

The script is probably slow because of the number of child processes you are creating. Also, you are reading the larger file (master.txt) more times than the smaller one: once for every line of pattern.txt.

Note that the -i option to sed is non-standard.
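If portability matters, the usual workaround is to write the result to a temporary file and then copy it back over the original. A minimal sketch follows; the function name edit_in_place is hypothetical, and it assumes mktemp is available, which is common but not guaranteed on every system.

```shell
# Hypothetical helper: edit_in_place SED_SCRIPT FILE
# Emulates "sed -i" without relying on the non-standard option.
# Assumption: mktemp exists on the target system.
edit_in_place() {
    tmp=$(mktemp) || return 1
    if sed "$1" "$2" > "$tmp"; then
        # Copy the contents back instead of using mv, so the original
        # file keeps its permissions and inode.
        cat "$tmp" > "$2"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

It would be called as, for example, edit_in_place 's/ Europe / EU /g' master.txt. This still reads the whole file once per call, so it addresses portability only, not speed.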

You can get rid of the calls to the awk language interpreter and the sed editor by using bash:

# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"

done < pattern.txt 

# Set the option for case insensitive patterns
shopt -s nocasematch

while IFS= read -r line
do
    # Iterate through the patterns array
    for key in "${!patterns[@]}"
    do 
        line="${line//$key/${patterns[$key]}}"
    done  

    echo "$line"

done < master.txt

That version cannot restrict the edit to certain fields. This one can:

# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"

done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS is set here because localised setting for 'echo' does not work in bash
oldIFS="$IFS"
IFS='|'

# "line" is an array
while read -r -a line
do
    # Check there are at least 15 fields
    if (( ${#line[@]} >= 15 ))
    then
        # Iterate through the patterns array
        for key in "${!patterns[@]}"
        do
            # We are only interested in the 10th and 15th fields
            # (index 9 and 14 since arrays index from zero)
            val="${line[9]}"
            line[9]="${val//$key/${patterns[$key]}}"
            val="${line[14]}"
            line[14]="${val//$key/${patterns[$key]}}"
        done
    fi
    echo "${line[*]}"

done < master.txt

IFS="$oldIFS"

Upvotes: 1

George Vasiliou

Reputation: 6335

This is a sed alternative proposal, based on the fact that sed can read commands from a file.

First I create a sed command file from the contents of your pattern file:

$ cat file1
europe|EU
australia|AU
china|CN

$ while IFS="|" read -r a b;do 
> echo -e "s/((.[^|]*.){9})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> echo -e "s/((.[^|]*.){14})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> done<file1 >file11

$ cat file11
s/((.[^|]*.){9})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){14})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){9})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){14})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){9})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g
s/((.[^|]*.){14})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g

Then all we have to do is call sed and feed it the command file file11 shown above.

$ cat file2
1|2|3|4|5|europe|7|8|9|nice weather in europe today|11|12|europe|14|nice weather in europe today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|nice weather in china today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in australia today|11|12|australia|14|nice weather in australia today|16

I have populated file2 with various values for testing, to be sure that the sed regex replaces the 10th and 15th fields only, and only on a whole-word match (i.e. the word europe is replaced by EU but the word european is not).

These are the results, which seem to be pretty good. I expect this sed solution to be really fast with your big file.

$ sed -E -f file11 file2
1|2|3|4|5|europe|7|8|9|nice weather in EU today|11|12|europe|14|nice weather in EU today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|nice weather in CN today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in AU today|11|12|australia|14|nice weather in AU today|16

Upvotes: 1

James Brown

Reputation: 37394

Here's a shot in the dark, since your sample data didn't even have 10 fields and I didn't have time to create test sets. I hope it works; it uses awk. Next time, please be considerate enough to provide working data sets (enough fields, Europe =/= europe, etc.). Like I said, untested:

$ awk '
BEGIN { FS=OFS="|" }                      # delimiters
NR==FNR { a[$1]=$2; next }                # read patterns and hash them
{
    for(i=10;i<=NF;i+=5)                  # iterate every fifth field
        if(i%10==0||i%15==0){             # pick only mod 10 and mod 15
            n=split($i,b," ")             # split to b the chosen ones
            for(j=1;j<=n;j++)             # iterate thru the chosen ones
                if(b[j] in a)             # if word is found among patterns
                    sub(b[j],a[b[j]],$i)  # switch the matching pattern
        }
}1' pattern master
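Note that the loop above visits every field whose index is a multiple of 10 or 15 (10, 15, 20, 30, ...), compares words case-sensitively, and passes the matched word to sub() as a regex. A possible variant of the same idea, restricted to exactly fields 10 and 15 and matching whole space-separated words case-insensitively, might look like the sketch below. The function name replace_fields is hypothetical, and the sketch assumes words within a field are separated by spaces; runs of several spaces are collapsed to one when the field is rebuilt.

```shell
# Hypothetical helper: replace_fields PATTERN_FILE MASTER_FILE
# Rewrites only fields 10 and 15 of each "|"-delimited line.
replace_fields() {
    awk '
    BEGIN { FS = OFS = "|" }
    # First file: store lowercased key -> replacement code
    NR == FNR { map[tolower($1)] = $2; next }
    # Second file: only touch lines that have at least 15 fields
    NF >= 15 {
        n = split("10 15", cols, " ")
        for (c = 1; c <= n; c++) {
            i = cols[c]
            m = split($i, words, " ")
            out = ""
            for (j = 1; j <= m; j++) {
                w = words[j]
                k = tolower(w)
                if (k in map) w = map[k]      # whole-word, case-insensitive
                out = out (j > 1 ? " " : "") w
            }
            $i = out
        }
    }
    { print }
    ' "$1" "$2"
}
```

It would be called as replace_fields pattern master > master.new. Header and trailer lines such as H|20170101 have fewer than 15 fields and pass through unchanged. The whole job is a single pass over each file in one awk process.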

Upvotes: 1
