Reputation: 13
I'm new to shell scripting and need your guidance on a typical requirement. I have two files: a master file and a pattern file. The master file contains many fields delimited by |, and only the 10th and 15th fields need to be updated based on the pattern file.
Master file:
H|20170101
123|field2|field3|...|field10|field11...|field15|....|field150
...
...
T|1000000
Pattern file:
Europe|EU
Australia|AU
China|CN
For example,
123|1|2|3|...|9|nice weather in europe today|11|.....
the above line needs to be transformed into
123|1|2|3|...|9|nice weather in EU today|11|.....
I began with a simple sed command that replaces values in the master file using the values read from the pattern file, but it's incomplete, as I'm not sure how to process a huge master file while replacing only specific fields.
while read line
do
    value1=$(echo $line | awk -F"|" '{print $1}')
    value2=$(echo $line | awk -F"|" '{print $2}')
    sed -i 's/ '${value1}' /'${value2}'/g' master.txt
done < pattern.txt
The above script is already very slow on a 10 MB file, whereas my master file is quite large (100 MB).
Please help.
Upvotes: 0
Views: 399
Reputation: 44344
The script is probably slow because of the number of child processes you are creating. Also, you are reading the larger file (master.txt) more times than the smaller one.
Note that the -i option to sed is non-standard.
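If in-place editing is still wanted, a portable equivalent is to write to a temporary file and then move it over the original; a minimal sketch, where the substitution expression and file names are only placeholders:
# portable stand-in for sed -i: redirect to a temp file, then replace the original
sed 's/old/new/g' master.txt > master.txt.tmp && mv master.txt.tmp master.txt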
You can get rid of the calls to the awk language interpreter and the sed editor by using bash:
# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns
while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case-insensitive patterns
shopt -s nocasematch

while read -r line
do
    # Iterate through the patterns array
    for key in "${!patterns[@]}"
    do
        line="${line//$key/${patterns[$key]}}"
    done
    echo "$line"
done < master.txt
That does not limit the edits to particular fields. This does:
# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns
while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case-insensitive patterns
shopt -s nocasematch

# IFS is set here because a localised setting for 'echo' does not work in bash
oldIFS="$IFS"
IFS='|'

# "line" is an array
while read -r -a line
do
    # Check there are at least 15 fields
    if (( ${#line[@]} >= 15 ))
    then
        # Iterate through the patterns array
        for key in "${!patterns[@]}"
        do
            # We are only interested in the 10th and 15th fields
            # (index 9 and 14, since arrays index from zero)
            val="${line[9]}"
            line[9]="${val//$key/${patterns[$key]}}"
            val="${line[14]}"
            line[14]="${val//$key/${patterns[$key]}}"
        done
    fi
    echo "${line[*]}"
done < master.txt

IFS="$oldIFS"
Upvotes: 1
Reputation: 6335
This is an alternative proposal using sed, based on the fact that sed can read its commands from a file.
First, I create a sed command file from the contents of your pattern file:
$ cat file1
europe|EU
australia|AU
china|CN
$ while IFS="|" read -r a b;do
> echo -e "s/((.[^|]*.){9})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> echo -e "s/((.[^|]*.){14})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> done<file1 >file11
$ cat file11
s/((.[^|]*.){9})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){14})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){9})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){14})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){9})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g
s/((.[^|]*.){14})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g
Then the only thing we have to do is call sed and feed it the above command file file11.
$ cat file2
1|2|3|4|5|europe|7|8|9|nice weather in europe today|11|12|europe|14|nice weather in europe today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|nice weather in china today|16
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in australia today|11|12|australia|14|nice weather in australia today|16
I have filled file2 with various values for testing, to be sure that the sed regex provided replaces the 10th and 15th fields only, and only on a literal word match (i.e. the word europe is replaced by EU, but the word european is not replaced).
These are the results, which seem pretty good. I expect this sed solution to be really fast on your big file.
$ sed -E -f file11 file2
1|2|3|4|5|europe|7|8|9|nice weather in EU today|11|12|europe|14|nice weather in EU today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|nice weather in CN today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in AU today|11|12|australia|14|nice weather in AU today|16
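If the output looks right, the same command file can be applied to the real data in one pass; a sketch, assuming the command file was generated from your actual pattern file and the master file is named master.txt as in the question:
$ sed -E -f file11 master.txt > master_new.txt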
Upvotes: 1
Reputation: 37394
Here's one shot in the dark, as your sample data didn't even have 10 fields and I didn't have time to create test sets. Hope it works; it uses awk. Next time, please be considerate enough to provide working data sets (enough fields, Europe vs. europe, etc.). Like I said, untested:
$ awk '
BEGIN { FS=OFS="|" }                     # delimiters
NR==FNR { a[$1]=$2; next }               # read patterns and hash them
{
    for(i=10;i<=NF;i+=5)                 # iterate every fifth field
        if(i%10==0||i%15==0) {           # pick only mod 10 and mod 15
            n=split($i,b," ")            # split to b the chosen ones
            for(j=1;j<=n;j++)            # iterate thru the chosen ones
                if(b[j] in a)            # if word is found among patterns
                    sub(b[j],a[b[j]],$i) # switch the matching pattern
        }
}1' pattern master
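A run could look like the following sketch, assuming the awk program between the quotes is saved as fix.awk (a hypothetical name) and the input files are named as in the question:
$ awk -f fix.awk pattern.txt master.txt > master_new.txt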
Upvotes: 1