Reputation: 15109

Parse multiline with awk

I have a multiline output, like this:

foo: some text
    goes here
    and here
    and here
bar: more text
    goes here
    and here
xyz: and more...
    and more...
    and more...

The format is exactly as shown here. Each "group/section" of text I'm interested in starts at a line whose text begins in the first column and ends at the line before the next line whose text begins in the first column.

In this example the groups would be foo and all the text up to bar, then bar and all the text up to xyz, and finally xyz until the end.

Upvotes: 0

Answers (4)

Erik Kruus

Reputation: 161

First, if there's a single section, go with @Akshay Hegde. Otherwise, if you can change the RS, follow @sheltond. But for logfile processing I often need to extract some things line by line and some sections as multi-line blocks, so that the logfile summary ends up as short as possible.

Here I usually use some variation on a braindead pattern. For example, suppose I want to

print the first line of all non-bar sections, and
also print every bar section with extra detail (here join lines)

File print_bar_sections.awk:

function bar_may_end_here() { # This check might happen in several places
    if(bar_started){
        print(bar_out); bar_out=""; bar_started=0;
    }
}

# Here, any section-begin match might be terminating a bar section
/^[a-z]*:/ {bar_may_end_here();}
    
# Match start of interesting section, this line always included
/^bar:/ {bar_started=1; bar_out=$0; next;}
    
# Perhaps modify or skip interior lines?
#    bar_started==1 && /goes/ {bar_out = bar_out "GOES-LINE"; next;}
# Here, join lines
bar_started==1 {bar_out = bar_out $0; next;}

# Here we know we are not in a bar-section.
# For example, we might have single-line "interesting lines"
/error/ {print; next;}
/warning/ {print; next;}

# EOF might also terminate an active bar section
# (for logfiles you might know this is impossible)
END { bar_may_end_here(); }

Adjust this pattern as needed. In awk, uninitialized variables start out as the empty string (or 0 in a numeric context), so bar_started and bar_out need no explicit setup. The next statement is especially useful when creating such section extractors for log file processing.

Sometimes this approach of creating a state machine variable like bar_started and state info like a bar_out string can allow rather more complicated awk programs. For example, the state variable might need more values than 0 or 1, and the stored state info might be more complex (an array or several variables). Enjoy!
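
As a rough sketch of that last idea (not part of the script above): here the state is the name of the current section instead of a 0/1 flag, and the stored info is an array that counts the indented lines of each section. The names count_section_lines, current, lines and order are made up for this illustration, and a "section header" is assumed to be any line that starts in the first column with a name followed by a colon.

```shell
# Hypothetical helper: report how many continuation lines each section has.
count_section_lines() {
  awk '
    # Header line: non-blank text in column 1, followed by a colon.
    /^[^ \t:]+:/ {
        name = $0
        sub(/:.*/, "", name)              # keep only the text before the first colon
        if (!(name in lines)) {
            order[++n] = name             # remember first-seen order for the report
            lines[name] = 0
        }
        current = name                    # state: which section we are in now
        next
    }
    # Any other line belongs to the section we are currently in.
    current != "" { lines[current]++ }
    END {
        for (i = 1; i <= n; i++)
            print order[i] ": " lines[order[i]]
    }
  ' "$@"
}

printf 'foo: some text\n    goes here\n    and here\nbar: more text\n    goes here\n' |
  count_section_lines
```

On the two-section sample in the last command this should print foo: 2 and bar: 1. The separate order array is only there because awk does not guarantee any particular order for "for (k in array)".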

Upvotes: 1

sheltond

Reputation: 1937

As others have said, you haven't specified what you want to do with the data once you've parsed it.

If you just want to extract a particular chunk, the answer from Akshay Hegde should work fine.

If you want to process each record using some more awk functionality, such as transforming the output in some way (e.g. joining the lines together, etc), you probably need something a bit different.

There are a couple of fairly easy ways that you can do this, but I think the best approach is probably to change the record separator.

The ability to use a regular expression as the record separator is a gawk extension, but you're probably using gawk if you're on Linux.

Here are the contents of a gawk program file "prog.awk":

function process_group(name, body) {
    print "Got group with name '" name "'";
    print body;
}

BEGIN {
    RS="(\n|^)\\S+:"
    PREV=""
}

{
    if (PREV!="") {
        process_group(gensub(/\n?(\S+):/, "\\1", 1, PREV), $0);
    }
    PREV=RT
}

You can run this using

gawk -f prog.awk input.txt

Alternatively you can put the whole thing on the gawk command-line, but it's easier to read if it's nicely formatted.

The idea is that each time it sees the record separator it gives you the content since the last record separator or the beginning of the file. This means that the first time it sees the record separator it calls the bottom block with the record separator "foo:" and an empty body, the second time it sees the record separator it calls the block with "bar:" and the content between "foo:" and "bar:", etc.

This means that the record separator corresponding to each block is the previous one, not the current one. This is easy to handle by keeping track of the previous record separator in the "PREV" variable.

So, the BEGIN block sets the record separator RS, and initializes PREV to be empty.

The block at the bottom is called for each record delimited by RS, and once more at the end of the file.

If "PREV" is not empty, it calls the "process_group" function with the previous record separator and the current body data (stripping off the uninteresting bits from PREV on the way through using gensub). It then assigns the currently matched record separator (RT) to PREV for use next time.

In "process_group", you can do whatever processing you want with each group. In this case I'm just printing them out, but it should be easy to modify it to do whatever you want.
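
If gawk is not available, a similar result can probably be had without a regular-expression RS. The following is a sketch of a different technique, not the RS/RT approach above: it reads the input line by line, collects the body of each group in a variable, and calls process_group whenever the next header line or the end of the input is reached. The wrapper name print_groups and the variables have_group, name and body are made up for this example, and a header is assumed to be a line that starts in the first column with a name followed by a colon.

```shell
# Hypothetical line-by-line variant that should work in any POSIX awk.
print_groups() {
  awk '
    function process_group(name, body) {
        print "Got group with name \047" name "\047"   # \047 is a single quote
        print body
    }
    # Header line: flush the previous group, then start a new one.
    /^[^ \t:]+:/ {
        if (have_group) process_group(name, body)
        name = $0; sub(/:.*/, "", name)       # text before the first colon
        body = $0; sub(/^[^:]*:/, "", body)   # text after the first colon
        have_group = 1
        next
    }
    # Continuation line: append it to the body of the current group.
    have_group { body = body "\n" $0 }
    # End of input also terminates the last group.
    END { if (have_group) process_group(name, body) }
  ' "$@"
}

printf 'foo: some text\n    goes here\nbar: more text\n' | print_groups
```

As in the gawk version, the body keeps the space after the colon and the indentation of the continuation lines, so any trimming would have to happen inside process_group.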

Upvotes: 0

hansaplast

Reputation: 11573

If I'm interpreting your question correctly you want to simply remove the whitespace and put foo on a different line than the part after :. This awk script would do that:

awk 'BEGIN{RS="[:\n]"}{$1=$1}1' file

Output:

foo
some text
goes here
and here
and here
bar
more text
goes here
and here
xyz
and more...
and more...
and more...

Explanation:

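RS="[:\n]" says that records should be split at either : or \n (a regular expression as RS works in gawk and mawk; POSIX awk only guarantees a single-character RS)
$1=$1 makes awk rebuild $0 from its fields, which removes the whitespace at the beginning of the line
1 says that every record should be processed with the "default action", which is print $0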
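
To see the $1=$1 step on its own, here is a small sketch with the default RS. The function name trim_fields is just made up for this example.

```shell
# Hypothetical helper: rebuild each line from its fields.
trim_fields() {
  # Assigning to a field makes awk rebuild $0 from the fields joined by OFS
  # (a single space by default); the trailing 1 prints the rebuilt record.
  awk '{ $1 = $1 } 1' "$@"
}

printf '    goes   here  \n' | trim_fields
# prints: goes here
```

Note that this also squeezes runs of whitespace inside the line down to a single space, not only the leading indentation.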

Upvotes: 0

Akshay Hegde

Reputation: 16997

Input

$ cat file
foo: some text
    goes here
    and here
    and here
bar: more text
    goes here
    and here
xyz: and more...
    and more...
    and more...

Output

$ awk '/:/{f=/^foo/}f' file
foo: some text
    goes here
    and here
    and here

In case you want to skip the matched line:

$ awk '/:/{f=/^foo/;next}f' file
    goes here
    and here
    and here

Or even

# Just modify variable search value
# 1st approach
$ awk -v search="foo" '/:/{f=$0~"^"search}f' file
foo: some text
    goes here
    and here
    and here

# 2nd approach
$ awk -v search="foo" '/:/{f=$0~"^"search;next}f' file
    goes here
    and here
    and here
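
For readers new to the flag idiom, here is the first approach spelled out with comments. This is only an expanded sketch of the same one-liner; the wrapper name print_section is made up for this example.

```shell
# Hypothetical wrapper: print the section whose header starts with the given name.
print_section() {
  name=$1
  shift
  awk -v search="$name" '
    # On every line that contains a colon, set the flag f to 1 if the line
    # starts with the search string, and to 0 otherwise.
    /:/ { f = ($0 ~ ("^" search)) }
    # A pattern with no action prints the line whenever f is non-zero.
    f
  ' "$@"
}

printf 'foo: some text\n    goes here\nbar: more text\n    and here\n' |
  print_section bar
```

Two things to keep in mind: any line containing a colon is treated as a section start, so an indented line with a colon in it would switch the flag off, and "^" search is a prefix match, so searching for foo would also select a section named foobar.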

Upvotes: 2
