hmj6jmh

Reputation: 699

Bash script to extract information from a block of text spanning multiple lines

I am trying to extract track information from MKV files using mkvinfo from a bash script. The output is a long series of lines with repeating patterns as delimiters for various track properties of various track types. An example of a track is:

…
| + A track
|  + Track number: 6 (track ID for mkvmerge & mkvextract: 5)
|  + Track UID: 11555278830806058806
|  + Track type: subtitles
|  + (Unknown element: TrickTrackFlag; ID: 0xc6 size: 3)
|  + Enabled: 1
|  + Default flag: 0
|  + Forced flag: 0
|  + Lacing flag: 0
|  + MinCache: 0
|  + Timecode scale: 1
|  + Name: Spanish
|  + Language: spa
|  + Codec ID: S_TEXT/UTF8
|  + (Unknown element: TrackAttachmentLink; ID: 0x7446 size: 11)
|  + Codec decode all: 1
| + A track
|  + Track number: 7 (track ID for mkvmerge & mkvextract: 6)
…

There can be multiple instances of a given track type and the number of lines for a track is somewhat variable. I need to extract certain track properties from specific track types. For example, if I want to find all instances of the subtitles track type and extract the Track number and the Codec ID, I can pipe the results through grep:

mkvinfo "file.mkv" | grep "subtitles" -B 2 | grep "Track number"

This outputs the lines containing the track numbers for all subtitle tracks. I have to put the lines into an array and filter them to get the first number so I can use it with mkvpropedit, which requires the first number.

Similarly:

mkvinfo "file.mkv" | grep "subtitles" -A 10 | grep "Codec ID: " | sed 's/^.*: //'

outputs the codec IDs for all subtitle tracks.

This works fine IF I know exactly how many lines there are before/after the line containing subtitles. The problem is that the exact number of lines to include varies from file to file. So what I need to do is output the entire block of lines between | + A track and the next line beginning with |+ or | +, or EOF, whichever comes first. I also need to filter the block to extract the first Track number and the Codec ID.

I tried using | grep -Eo '[0-9]+' | head -1 to extract the first number of each track, but it only works on the first track found and then quits. If there's a way to make it work for all tracks in one line, that would be helpful. The second example I gave, using sed, works for the Codec ID.

The bottom line QUESTION is:

How can I extract specific properties of specific track types, such as the example given, and put them into an array or arrays for further processing?

I am hoping to be able to meet the following criteria:

  1. I want to use existing bash (GNU bash, version 4.3.30(1)-release (x86_64-apple-darwin12.5.0)) utilities like sed, awk, grep, …
  2. I don't want to have to create an 'intermediate file'
  3. I want to simply pipe the output of mkvinfo into the various utilities

I found lots of threads that show how to use sed to find a block of text between two words but I could not get the code to work with entire lines or strings containing spaces. Maybe there is a way to do that but I don't know enough about sed to be able to adapt the code to my situation.

Please explain in detail how your code works so I can 'learn how to fish' so next time I can do it myself.

Upvotes: 0

Views: 1452

Answers (1)

fferri

Reputation: 18940

When processing multiple lines in complex ways, my tool of choice is awk.

Each pattern rule saves the value it matches in a variable. When we reach the line that starts a new block (| + A track), or the end of the stream, we print the variables we are interested in (track number, codec ID), but only if the track type is subtitles.

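mkvinfo ... | gawk '
    # Remember the most recent value seen for each property of interest.
    match($0, /Track number: ([0-9]+)/, m) {TN=m[1]}
    match($0, /Codec ID: (.*)$/, m)        {CI=m[1]}
    /Track type: subtitles/                {SUB=1}
    # A new track starts: report the previous one if it was a subtitle track, then clear the flag.
    /^\| \+ A track$/ {if(SUB) print TN, CI; SUB=0}
    END               {if(SUB) print TN, CI}'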

You need gawk for this: the three-argument form of match(), which stores the parenthesized groups in an array, is a gawk extension and is not available in POSIX awk.
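Since the question also asks for the results in arrays, here is a minimal sketch of one way to do that without gawk, assuming the layout shown in the question. The function name extract_tracks, the array names and the sample text are made up for illustration (the video track and the closing |+ Cluster line are invented test data, not real output from your file). It uses sub() instead of the three-argument match(), and it ends a track at any line beginning with |+ or | +, or at EOF.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads mkvinfo-style text on stdin and prints
# "track_number<TAB>codec_id" for every track whose type equals $1.
extract_tracks() {
    awk -v want="$1" '
        # Print the track collected so far if it has the wanted type, then reset.
        function emit() {
            if (type == want) print tn "\t" ci
            type = ""; tn = ""; ci = ""
        }
        # A line starting with "|+" or "| +" closes the current block;
        # only "| + A track" opens a new one.
        /^\| ?\+/ { emit(); in_track = ($0 == "| + A track"); next }
        in_track && /Track number: / {
            s = $0
            sub(/^.*Track number: /, "", s)   # drop everything before the number
            sub(/[^0-9].*$/, "", s)           # keep only the first run of digits
            tn = s
        }
        in_track && /Track type: / { s = $0; sub(/^.*Track type: /, "", s); type = s }
        in_track && /Codec ID: /   { s = $0; sub(/^.*Codec ID: /, "", s);   ci = s }
        END { emit() }                        # the last track may end at EOF
    '
}

# Illustrative sample input; in real use this would come from mkvinfo.
sample=$(cat <<'EOF'
| + A track
|  + Track number: 1 (track ID for mkvmerge & mkvextract: 0)
|  + Track type: video
|  + Codec ID: V_MPEG4/ISO/AVC
| + A track
|  + Track number: 6 (track ID for mkvmerge & mkvextract: 5)
|  + Track type: subtitles
|  + Name: Spanish
|  + Codec ID: S_TEXT/UTF8
|+ Cluster
EOF
)

track_numbers=()
codec_ids=()
# Process substitution keeps the loop in the current shell, so the arrays
# survive; piping into "while" would fill them in a subshell and lose them.
while IFS=$'\t' read -r tn ci; do
    track_numbers+=("$tn")
    codec_ids+=("$ci")
done < <(printf '%s\n' "$sample" | extract_tracks subtitles)

for i in "${!track_numbers[@]}"; do
    printf 'track %s: %s\n' "${track_numbers[$i]}" "${codec_ids[$i]}"
done
```

With a real file, the last line of the loop would read done < <(mkvinfo "file.mkv" | extract_tracks subtitles) instead. No intermediate file is needed, and passing a different type such as audio reuses the same function.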

Upvotes: 2
