Reputation: 3
I need to merge 2 files wherever there is a match. The value to match is not static, it can be anything, but it always comes after one specific tag.
File 1
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
File 2
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
I need to produce a third file like this
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
I hope this is clear: if the value of the "site_id=" tag in file 1 matches the value of the "same_as=" tag in file 2, I need to merge the data.
Honestly, I have no idea how to get this result. I checked many posts, but they all merge the data onto the same line; I can't find anything that merges the data onto new lines.
I would like to use sed or awk if possible, but every suggestion is welcome.
Thank you in advance.
Upvotes: 0
Views: 328
Reputation: 58371
This might work for you (GNU sed):
sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#' file2 | sed -f - file1
Turn file2 into a sed script that appends each line when the value of its same_as matches file1's site_id. Then pipe the generated script through to a second invocation of sed, which is run against file1. Each time a line from file1 is read in, the matching lines from file2 are appended to it in sequence.
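To see what the first sed actually generates, the two steps can be run separately. This is only a sketch: it assumes GNU sed, uses a shortened copy of the sample data from the question, and the temporary directory and the `merge.sed` file name are made up for illustration.

```shell
# Sketch only: scratch directory and merge.sed are illustrative names.
dir=$(mktemp -d)

cat > "$dir/file1" <<'EOF'
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
EOF

cat > "$dir/file2" <<'EOF'
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
EOF

# Step 1: every file2 line becomes one "/site_id="KEY"/a<line>" command.
sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#' "$dir/file2" > "$dir/merge.sed"
cat "$dir/merge.sed"

# Step 2: run the generated script against file1.
sed -f "$dir/merge.sed" "$dir/file1" > "$dir/file3"
cat "$dir/file3"
```

Keeping the generated script in a file instead of piping it makes it easier to check when a key contains characters that are special in a sed address, such as `/`.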
To delete lines from file1 which do not have a match in file2, use:
sed -e 's#.*same_as=\("[^"]*"\).*#/site_id=\1/{a&\nx;s/^/x/;x}#' file2 |
sed -f - -e 'x;/x/{z;x;b};d' file1
This adds a flag to the hold space, which is set whenever a line from file2 is appended. When the flag is not set, the current record from file1 is deleted.
Upvotes: 0
Reputation: 67467
This assumes file2 is sorted (or at least grouped) by the key:
$ awk -F' |=' 'NR==FNR {for(i=1;i<NF;i++) if($i=="site_id") {a[$(i+1)]=$0; break}; next}
{k=""; for(i=1;i<NF;i++) if($i=="same_as") {k=$(i+1); break}
if(!p[k]++) print a[k]}1' file1 file2
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
P.S. This should be dramatically faster than the other solutions for large files, since it makes a single pass over each file and uses array lookups instead of rescanning.
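If file2 cannot be assumed to be grouped by key, one possible variation (a different technique from the command above, not a fix to it) is to read file2 first, buffer its lines per key, and then stream file1. This is a rough sketch: the `merge` function, its `drop` switch, the `attr` helper, the temporary directory and the extra `zzz.com` line are all made up for illustration, and it assumes every element sits on a single line with double-quoted attribute values.

```shell
# Sketch only: sample data, temp dir and helper names are illustrative.
dir=$(mktemp -d)

cat > "$dir/file1" <<'EOF'
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can update="x" site="merge-xml-04" site_id="zzz.com" xmltv_id="zzz@com">zzz@com</can>
EOF

# file2 is deliberately NOT grouped by key here.
cat > "$dir/file2" <<'EOF'
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
EOF

# usage: merge DROP FILE2 FILE1   (DROP=1 skips file1 lines with no match)
merge() {
    awk -v drop="$1" '
    # Return the value of attribute "name" on the current line, or "".
    function attr(name,    re) {
        re = " " name "=\"[^\"]*\""
        if (!match($0, re)) return ""
        return substr($0, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
    }
    # First file (file2): buffer lines per same_as key, in input order.
    NR == FNR { k = attr("same_as"); if (k != "") grp[k] = grp[k] $0 "\n"; next }
    # Second file (file1): print the line, then its buffered matches.
    {
        k = attr("site_id")
        if (k in grp) { print; printf "%s", grp[k] }
        else if (!drop) print
    }' "$2" "$3"
}

merge 0 "$dir/file2" "$dir/file1" > "$dir/file3"
cat "$dir/file3"
```

The trade-off is memory: all of file2 is held in the `grp` array, whereas the one-liner above only stores file1. Calling `merge 1 ...` instead drops file1 lines that have no match in file2.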
Upvotes: 1
Reputation: 5335
Read file1 line by line, extract the site_id value from each line, and search for it in file2.
while read -r line; do
echo "$line" >> file3
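url=$(sed 's/.*site_id="\([^"]\+\)".*/\1/' <<< "$line")   # quoted so whitespace in the line is preserved
grep -F "same_as=\"$url\"" file2 >> file3   # -F: fixed string, so "." in the key is literal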
done < file1
$ cat file3
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
Upvotes: 0
Reputation: 15248
If you know for sure that these formats are consistent and each element is always on a single line...
$: cat c   # file 1 is a, file 2 is b
#!/usr/bin/env bash
while read -r line
do pat="${line##* site_id=\"}"
pat="${pat%%\"*}"
echo "$line"
grep " same_as=[\"]$pat[\"] " b
done < a
$: c
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
Upvotes: 0