Tapiocapioca

Reputation: 3

Merge two files, line by line, after matching pattern in a new line

I need to merge two files wherever there is a match. The matched value is not fixed; it varies from line to line, but it always appears after one specific attribute.

File 1

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>

File 2

<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

I need to produce a third file like this:

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

I hope that is clear: if the value of "site_id=" in file 1 matches the value of "same_as=" in file 2, I need to merge the data.

Honestly, I have no idea how to get this result. I checked many posts, but they all merge data onto the same line; I can't find anything that merges data onto new lines.

I would like to use sed or awk if possible, but every suggestion is welcome.

Thank you in advance.

Upvotes: 0

Views: 328

Answers (4)

potong

Reputation: 58371

This might work for you (GNU sed):

sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#' file2 | sed -f - file1

This turns file2 into a sed script: each line becomes an append (a) command whose address matches file1's site_id against that line's same_as value. The generated script is then piped to a second invocation of sed, which runs it against file1. Each time a line from file1 is read, the matching lines from file2 are appended after it, in order.
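To make the intermediate step concrete, here is a small sketch that runs only the first sed on two sample lines taken from the question's file2 and prints the script it generates. The variable name script is just for illustration; the sed expression is the one from the command above.

```shell
#!/bin/sh
# Feed two sample file2 lines through the first sed only, and capture
# the sed script it generates (one append command per input line).
script=$(printf '%s\n' \
  '<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>' \
  '<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>' |
  sed 's#.*same_as=\("[^"]*"\).*#/site_id=\1/a&#')

# Show the generated script.
printf '%s\n' "$script"
# /site_id="foo.com"/a<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
# /site_id="bar.com"/a<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
```

Each generated line is an address (/site_id="..."/) followed by a one-line a command, which is the GNU sed form the second invocation relies on.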

To delete lines from file1 which do not have a match in file2, use:

sed -e 's#.*same_as=\("[^"]*"\).*#/site_id=\1/{a&\nx;s/^/x/;x}#' file2 |
sed -f - -e 'x;/x/{z;x;b};d' file1

This sets a flag in the hold space whenever a line from file2 is appended; when the flag is not set, the current line from file1 is deleted.

Upvotes: 0

karakfa

Reputation: 67467

This assumes file2 is sorted by the key (that is, all lines with the same same_as value are adjacent).

$ awk -F' |=' 'NR==FNR {for(i=1;i<NF;i++) if($i=="site_id") {a[$(i+1)]=$0; break}; next} 
                       {k=""; for(i=1;i<NF;i++) if($i=="same_as") {k=$(i+1); break}
                        if(!p[k]++) print a[k]}1' file1 file2

<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

P.S. This should be dramatically faster than the other solutions for large files.
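If file2 cannot be assumed to be sorted, one possible variation (a sketch, not part of the answer above) is to read file2 first, collect its lines per key, and then print each file1 line followed by its collected group. The helper function val and the array grp are names made up for this example, and the sample file2 is deliberately shuffled to show that the order of keys in it does not matter.

```shell
#!/bin/sh
# Scratch copies of the sample data from the question; file2 is shuffled on purpose.
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
EOF
cat > "$dir/file2" <<'EOF'
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
EOF

awk '
  # val(attr): return the text between the quotes of attr="..." on this line
  function val(attr) {
      if (!match($0, attr "=\"[^\"]*\"")) return ""
      return substr($0, RSTART + length(attr) + 2, RLENGTH - length(attr) - 3)
  }
  # First file (file2): collect lines per same_as value, keeping their order.
  NR == FNR { k = val("same_as"); grp[k] = grp[k] $0 ORS; next }
  # Second file (file1): print the line, then every collected line for its site_id.
  { print; printf "%s", grp[val("site_id")] }
' "$dir/file2" "$dir/file1" > "$dir/file3"

cat "$dir/file3"
```

Note the argument order: file2 comes first so that its lines are already stored when file1 is read. The trade-off is that all of file2 is held in memory.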

Upvotes: 1

Michael

Reputation: 5335

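Read file1 line by line, extract the site_id value (the URL) and search for it in file2.

while IFS= read -r line; do
        echo "$line"
        # extract the value of site_id="..."
        url=$(sed 's/.*site_id="\([^"]*\)".*/\1/' <<< "$line")
        # -F: treat the key as a fixed string, and match only the same_as attribute
        grep -F "same_as=\"$url\"" file2
done < file1 > file3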

$ cat file3
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>

Upvotes: 0

Paul Hodges

Reputation: 15248

If you know for sure that these formats are consistent and always on a single line...

$: cat c # file 1 is a, file 2 is b
#!/usr/bin/env bash

while read -r line
do pat="${line##* site_id=\"}"
   pat="${pat%%\"*}"
   echo "$line"
   grep " same_as=[\"]$pat[\"] " b
done < a

$: c
<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>
<can offset="u" same_as="foo.com" id="foo 01">foo 01</can>
<can offset="u" same_as="foo.com" id="foo 02">foo 02</can>
<can update="x" site="merge-xml-02" site_id="bar.com" xmltv_id="bar@com">bar@com</can>
<can offset="u" same_as="bar.com" id="bar 01">bar 01</can>
<can update="x" site="merge-xml-03" site_id="xxx.com" xmltv_id="xxx@com">xxx@com</can>
<can offset="u" same_as="xxx.com" id="xxx 01">xxx 01</can>
<can offset="u" same_as="xxx.com" id="xxx 02">xxx 02</can>
<can offset="u" same_as="xxx.com" id="xxx 03">xxx 03</can>
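For readers unfamiliar with the two parameter expansions used in the script, here is a minimal sketch of what they do to a single line. The sample line is the first line of the question's file 1; nothing here is new apart from applying the two steps in isolation.

```shell
#!/bin/sh
line='<can update="x" site="merge-xml-01" site_id="foo.com" xmltv_id="foo@com">foo@com</can>'

# Step 1: remove the longest prefix that ends with ' site_id="'.
pat="${line##* site_id=\"}"
# pat now starts at the attribute value: foo.com" xmltv_id=...

# Step 2: remove the longest suffix that starts at the first remaining '"'.
pat="${pat%%\"*}"

echo "$pat"   # foo.com
```

Because this uses only shell built-ins, no external process is started to extract the key; the only command run per line of file 1 is grep.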

Upvotes: 0
