Sven Richter

Reputation: 423

How to use a regex for a multi-line pattern in a shell script

I want to write a bash script that finds a pattern spanning multiple lines in an HTML file.

Input file:

<td class="content">
  some content
</td>
<td class="time">
  13.05.2013  17:51
</td>
<td class="author">
  A Name
</td>

Now I want to extract the content of the <td> tag with class="time".

So in principle I need the following regex:

<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>

grep does not seem to be the right command, because...

  1. It returns only the complete line, or with -o the complete match, not just the part inside the parentheses (...).
  2. It only matches a pattern within a single line.

So how can I get just the string 13.05.2013 17:51?

Upvotes: 3

Views: 5843

Answers (3)

Scrutinizer

Reputation: 9926

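Try splitting records at < and fields at >, so that each tag starts a new record and the text following the tag is field 2: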

awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file

or

awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file

Upvotes: 0

SpaceDog

Reputation: 3269

How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:

$  sed -n '/<td *class="time">/{n;p}' test
  13.05.2013  17:51

You could add something to cover the case where it's on the same line as well. Alternatively, pre-process the file to strip all the newlines, maybe collapse the whitespace too (awkward in sed, which works line by line; tr is simpler for this) and then go from there.
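
The pre-processing idea could look roughly like the sketch below. It is only an illustration, not a tested solution for arbitrary HTML: the function name extract_time_collapsed is made up, and it assumes the file contains a single class="time" cell, because the greedy .* in the sed expression keeps only the last match once everything is on one line.

```shell
# Hypothetical helper: collapse all whitespace runs (including newlines)
# into single spaces, then pull out the text of the class="time" cell.
extract_time_collapsed() {
  tr -s '[:space:]' ' ' < "$1" |
    sed -n 's/.*<td class="time"> *\([^<]*[^< ]\) *<\/td>.*/\1/p'
}

# Usage (file name is an example):
#   extract_time_collapsed test
```

Because the whitespace is collapsed first, the double space between date and time comes out as a single space, and it no longer matters whether the value sits on its own line or on the same line as the tag.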

However, if it's an HTML file from somewhere else and you can't be sure of the format, I'd consider using a scripting language that has a library to parse HTML or XML; otherwise any solution is liable to break when the format changes.

Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html

Upvotes: 0

timss

Reputation: 10260

It's not quite there, it prints a leading newline for some reason, but maybe something like this?

$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file 

13.05.2013  17:51

Inspired by https://stackoverflow.com/a/13023643/1076493
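
The leading blank line most likely comes from the s command: it empties the opening-tag line, but the p at the end of the block still prints that now-empty line. A possible variant, sketched under the assumption that the value sits on its own line(s) between the two tags as in the sample, deletes both tag lines instead and trims the indentation. The function name extract_time_sed is hypothetical.

```shell
# Hypothetical helper: print only the lines between the opening
# <td class="time"> line and the next </td> line.
extract_time_sed() {
  sed -n '/<td class="time">/,/<\/td>/{
    /<td class="time">/d
    /<\/td>/d
    s/^[[:space:]]*//
    p
  }' "$1"
}

# Usage (file name is an example):
#   extract_time_sed file
```

Unlike the one-liner above, this will not handle a cell where tag and value share a line, since those lines are deleted as a whole.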

Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493

$ perl -0777 -ne 'print "$1\n" while /<td class="time">\n  (.*?)\n<\/td>/gs' regex.txt 
13.05.2013  17:51

Upvotes: 2
