Allan

Reputation: 12456

Extract XML content from a log file using Sed and dump each result to a different file

I have the following 10 GB log file that I need to analyze directly on a Unix server.

2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message1
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message2
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message3
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message4
2017-12-12 13:04:28,716 [ABC] [DEF] DEBUG some message5
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG some message6
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id> 
<!-- id is not unique since the XML data provides all the
information of an object X defined by its id at a specific point in time -->
some XML content on more than 500 lines
</xml>
2017-12-12 13:04:30,330 [ABC] [DEF] DEBUG some message8
2017-12-12 13:04:30,333 [ABC] [DEF] DEBUG some message9
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message10
2017-12-12 13:04:30,334 [ABC] [DEF] INFO some message11
2017-12-12 13:04:31,431 [ABC] [DEF] INFO some message12
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>2</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,432 [ABC] [DEF] DEBUG some message13
2017-12-12 13:04:31,476 [ABC] [DEF] INFO some message14
2017-12-12 13:04:31,476 [ABC] [DEF] DEBUG some message14
2017-12-12 13:04:31,490 [ABC] [DEF] DEBUG some message15
2017-12-12 13:04:28,732 [ABC] [DEF] DEBUG <xml>
<id>1</id>
some XML content on more than 500 lines 
</xml>
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message16
2017-12-12 13:04:31,491 [ABC] [DEF] DEBUG some message17
2017-12-12 13:04:31,496 [ABC] [DEF] DEBUG some message18
2017-12-12 13:04:31,996 [ABC] [DEF] INFO some message19

In order to do so, I would like to extract each XML message and dump it in a separate file.

For example: the first XML message would be stored in file1.xml, the second one in file2.xml, and so on.

If all the XML blocks had to be extracted into one single file, it would be quite straightforward with something like:

sed -n '/<xml>/,/<\/xml>/p' file.in > file.out # just a prototype

I thought about a solution in which I would use a back-reference to the <id> tag of the XML and use it to name the file the block is dumped into, but that does not work because the same <id> values appear at several places in the log file, so later blocks would overwrite the earlier extractions.

sed -r 's~(<xml>…<id>(.*)</id>…</xml>)~echo "\1" >> \2.out~e' file.in #just a prototype

With awk, if the XML content were on a single line, it would also be quite straightforward. However, that is not the case, and I don't know which record separator I should set RS to in order to treat each XML block as a single record and dump it into separate files.

With awk, what I thought feasible was:

If you have a better solution with awk, or a sed solution in which I could access a variable holding the position of the pattern currently being processed and reuse it to name the output files, that would be great (something like a current_pattern_position variable used to generate file_$current_pattern_position.out).

I have already received some pretty interesting solutions using awk and perl. I would still like a working sed solution for this case.

Upvotes: 0

Views: 5032

Answers (4)

Cy Rossignol

Reputation: 16867

Update: Here's a portable, simplified approach using Sed:

#!/bin/sed -nf

# Execute the following group of commands for each line in the XML node to
# generate a series of shell commands that we'll feed into an interpreter:
/<xml>/,/<\/xml>/ {
    # Extract the ID number to generate a command that changes the output file:
    /^<id>\([0-9]\+\)<\/id>$/ {
        # Using the same pattern as above, substitute the ID number into a
        # command that updates the current output file and increments a counter
        # for the ID that we'll append as the filename extension:
        s//c\1=$(( c\1 + 1 )); exec > "file\1.$c\1"/
        # Output the generated command:
        p
        # Then, proceed to the next line:
        n
    }
    # Output any remaining lines in the XML block except for the <xml> tags:
    /<xml>\|<\/xml>/ !{
        # Escape any single quotes in the XML content (so we can wrap it in a
        # shell command below):
        s/'/'"'"'/g
        #'# (...ignore or remove this line...)
        # Generate a command that will write the line to the current file:
        s/^.*$/echo '&'/
        # Output the generated command:
        p
    }
}

As we can see, the Sed program generates a series of shell commands from the input that we can pipe to a shell interpreter to write the output files:

$ sed -nf parse_log.sed < file.in | sh

This avoids excessive hold space buffering and GNU Sed's e flag, which is painfully slow (it spawns a child shell process every time we need to write out a line), and it lets us efficiently track the number of times we encounter each ID so we can increment the number in the filename. Sed also includes a w flag that we can append to a substitution command to write a file more quickly (instead of shelling out with e), but I'm not aware of any way to pass a variable filename to that flag.
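
For illustration, running the script without the final | sh shows the shell program it generates. For the second XML block in the sample log (the one with <id>2</id>), the generated commands would look roughly like this:

c2=$(( c2 + 1 )); exec > "file2.$c2"
echo 'some XML content on more than 500 lines'

The first line increments a per-ID counter and redirects the shell's standard output to the next file2.* file; each echo that follows then lands in that file until another <id> line switches the redirection again.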

Alternatively, we could include the contents of the program as an argument to Sed. Here's a squashed version that's easier to paste:

sed -n '/<xml>/,/<\/xml>/ {                             
    /^<id>\([0-9]\+\)<\/id>$/{s//c\1=$(( c\1 + 1 ));exec > "file\1.$c\1"/;p;n;}
    /<xml>\|<\/xml>/!{'"s/'/'\"'\"'/g;"'s/^.*$/echo '"'&'"'/;p;}                
}' < file.in | sh

It works, but we can probably tell that Sed isn't the best tool for this problem. Sed's simple language isn't designed for this kind of logic, so the code isn't pretty, and we rely on the shell to generate the files, which adds a bit of overhead. If you're dead set on using Sed, it may be okay for the job to take a little longer. For something performance-critical, consider using one of the tools described in the other answers.

Based on the information and examples in the question, I assume we don't want the opening and closing <xml> tags in the output, and the ID is always a number on its own line. The implementation writes filenames with a numeric extension that increments when it finds a duplicate ID (fileID.count, file1.1, file1.2, etc.). It should be easy enough to change these details if needed.
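
For the three XML blocks in the sample log (IDs 1, 2 and then 1 again), and assuming every <id> line contains only the tag and the number, a run should therefore leave something like:

$ ls file[0-9].*
file1.1  file1.2  file2.1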


Note: If needed, the revision history contains the two alternative implementations (one using GNU Sed, and another that uses a wrapper script) that I removed for brevity. They work but are unnecessarily slow or complex.

Upvotes: 4

thanasisp

Reputation: 5975

awk 'sub(/.*<xml>/,"<xml>") {out="file" ++i ".xml"; p=1}
     p {print > out}
     /<\/xml>/ {p=0; close(out)}
' file

If the log contains too many XML objects, you could get an error like "Too many open files", so I also close() each output file once its closing tag is reached.
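
Applied to the sample log above (three XML blocks), this should leave three files, each including the opening and closing <xml> tags:

$ ls file*.xml
file1.xml  file2.xml  file3.xml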

Upvotes: 2

RomanPerekhrest

Reputation: 92894

GNU Awk solution (gawk accepts a regular expression as the record separator, so with RS='<xml>|</xml>' every even-numbered record is the content between an opening and a closing tag):

awk -v RS='<xml>|</xml>' '!(NR%2){ 
           gsub(/^[[:space:]]*|[[:space:]]*$/, ""); 
           printf "<xml>\n%s\n</xml>\n",$0 > "file"++c".xml";
           close("file"c".xml")
       }' file

Viewing results:

$ head file*.xml
==> file1.xml <==
<xml>
<id>1</id> 
<!-- id is not unique since the XML data provides all the
information of an object X defined by its id at a specific point in time -->
some XML content on more than 500 lines
</xml>

==> file2.xml <==
<xml>
<id>2</id>
some XML content on more than 500 lines
</xml>

==> file3.xml <==
<xml>
<id>1</id>
some XML content on more than 500 lines
</xml>

Upvotes: 3

Nahuel Fouilleul

Reputation: 19335

perl one-liner

perl -ne 'if(s/.*(?=<xml>)//){$x++;open$fh,">file$x.xml"}if($fh){print$fh $_}if(/<\/xml>/){close$fh;undef$fh}' input.txt

how it works

  • -n : similar to sed -n, it reads the input (or argument files) line by line without printing

  • s/.*(?=<xml>)// : removes the part of the line before <xml> and evaluates to true when it matches, so a new file$x.xml is opened each time an opening <xml> tag is found

  • if($fh){print$fh $_} : while a file handle is open, the current line (now starting at <xml>) is written to it

  • if(/<\/xml>/){close$fh;undef$fh} : the closing tag ends the block, so the file handle is closed and undefined, which stops the writing

Upvotes: 2
