Reputation: 821
I have two files: masterlist.txt, which has hundreds of lines of URLs, and toupdate.txt, which has a smaller number of updated versions of lines from masterlist.txt that need to be replaced.
I'd like to automate this process using Bash, since these lists are already created and used in a Bash script.
The server part of the URL is the part that changes, so we could match on the unique part, /whatever/whatever_user.xml, but how do I find and replace those lines in masterlist.txt? That is, how do I go through each line of toupdate.txt and, since it ends in /f_SomeName/f_SomeName_user.xml, find that ending in masterlist.txt and replace the whole line with the new one?
So https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml, for example.
The rest of masterlist.txt needs to stay intact, so we must only find and replace lines that have different servers for the same line endings (IDs).
masterlist.txt looks like this:
https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]
toupdate.txt looks like this:
https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
The goal is to make masterlist.txt look like this:
https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]
I've looked at sed, but I don't know how to do the find and replace using lines from the two files.
Here's what I have so far, doing the file handling at least:
#!/bin/bash
#...
while read -r line; do
    # there's a new link on each line
    link="${line}"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "${grabID}" ]; then
        identifier=${grabID}
    else
        identifier="${line}"
    fi
    ## the find and replace here? ##
    # we're done when we've reached the end of the file
done < "masterlist.txt"
Upvotes: 1
Views: 906
Reputation: 22012
Would you please try the following:
#!/bin/bash
declare -A map
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    echo "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"
# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"
result:
https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
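As a quick check of the key extraction the script relies on, here is a minimal sketch that runs the same regex on one old and one new URL from the question. The function name extract_key is hypothetical and is not part of the answer; it only wraps the [[ ... =~ ... ]] test so the captured group can be printed.

```bash
#!/bin/bash
# Hypothetical helper (not in the answer above): print the "unique part"
# that the regex captures, or return non-zero if the URL does not match.
extract_key() {
    local url=$1
    if [[ $url =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Sample URLs taken from the question.
old_key=$(extract_key "https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml")
new_key=$(extract_key "https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml")

echo "$old_key"
echo "$new_key"
```

Both lines should print /f_SomeName/f_SomeName_user.xml, which is why the same key can be used to look up the replacement line for an entry in masterlist.txt.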
[Explanations]
- The associative array map maps the "unique part", such as /f_SomeName/f_SomeName_user.xml, to the "full path", such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.
- The regex (/[^/]+/[^/]*\.xml)$, if matched, stores in BASH_REMATCH[1] the substring from the second rightmost slash to the extension ".xml" at the end of the string.
[Alternative]
If the text files are large, bash may not be fast enough. In that case, an awk script will work more efficiently:
awk 'NR==FNR {
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"
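The key extraction used in both awk blocks can be tried on its own. The following is a small sketch, assuming a single sample URL from the question is fed in on standard input; the shell variable name key is only for illustration.

```bash
# Feed one sample URL to awk and print the substring that match() locates.
key=$(printf '%s\n' 'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' |
    awk '{
        # match() sets RSTART and RLENGTH when the pattern is found
        if (match($0, "/[^/]+/[^/]*\\.xml$")) {
            print substr($0, RSTART, RLENGTH)
        }
    }')

echo "$key"
```

This should print /f_SomeName/f_SomeName_user.xml, the same string the script uses as the array key.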
[Explanations]
- The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom to process each file differently. Because the NR==FNR condition holds only for the first file in the argument list, and the next statement skips the following block, BLOCK1 processes the file "toupdate.txt" only. Similarly, BLOCK2 processes the file "masterlist.txt" only.
- If match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0, the current record read from the file, and sets the variable RLENGTH to the length of the matched substring. We can then extract the matched substring, such as /f_SomeName/f_SomeName_user.xml, by using the substr() function.
- In the first block, the matched substring is used as a key of the array map, so that the substring (the unique part) is mapped to the whole URL in "toupdate.txt".
- In the second block, if the unique part of the record is found as a key in map, then the record ($0) is replaced with the value of the array indexed by that key.
Upvotes: 2
Reputation: 1801
Why not have sed write its own script, producing the desired output:
sed -e "$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' toupdate.txt)" masterlist.txt
where
- the command line contains an outer and an inner s (substitution) command
- the outer s command (s<...<...<), run by the inner sed on toupdate.txt, captures scheme://domain/N/ as \1 and rest-of-path \(.*\) as \2, and inserts them into a script for the outer sed command
- the generated sed script (\|\2$| s|.*|\1\2|) finds URLs in masterlist.txt ending in rest-of-path and substitutes (with the inner s) the new URL from toupdate.txt
- < and | are used as the delimiters of the two s commands, and \|...| is used in place of /.../ for the address
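To make the two stages easier to follow, the sketch below runs them separately on sample lines from the question instead of the real files. The variable names generated and updated are assumptions for illustration, and the behaviour of \$ in the replacement is what GNU sed does; other sed implementations may differ.

```bash
# Stage 1: the inner sed turns one line of toupdate.txt into one sed command.
generated=$(printf '%s\n' 'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' |
    sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<')

# Show the generated script line.
printf '%s\n' "$generated"

# Stage 2: the outer sed applies that script to two sample masterlist lines.
updated=$(printf '%s\n' \
    'https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml' \
    'https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml' |
    sed -e "$generated")

printf '%s\n' "$updated"
```

With GNU sed, the generated line should be \|path/f_SomeName/f_SomeName_user.xml$| s|.*|https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml|, and only the first sample line is rewritten. Note that the command in the answer prints to standard output and leaves masterlist.txt itself unchanged.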
Upvotes: 2