Reputation: 423
I want to write a bash script that finds a pattern spanning multiple lines in an HTML file.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of the <td> tag with class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep does not seem to be the command I can use, because... even with -o it returns the whole match and not only the result inside the round brackets (...).
So how can I get only the string 13.05.2013 17:51?
Upvotes: 3
Views: 5843
Reputation: 9926
Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file
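Both one-liners rely on the same idea: with RS set to <, every tag starts a new awk record, and with FS set to >, $1 is the tag itself and $2 is the text that follows it. A minimal sketch of that idea as a shell function follows. The function name extract_time_awk and the explicit trimming step are illustrative assumptions, not part of the commands above.

```shell
# Hypothetical helper: print the text of every <td class="time"> cell in file $1.
extract_time_awk() {
  # RS='<' : each "<" starts a new record, so a record looks like: td class="time">TEXT
  # FS='>' : $1 is the tag (without "<"), $2 is the text up to the next "<"
  awk -v RS='<' -v FS='>' '
    /^td class="time">/ {
      value = $2
      # strip leading and trailing whitespace, including the embedded newlines
      gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", value)
      print value
    }
  ' "$1"
}
```

In a script the result can be captured with something like time_value=$(extract_time_awk file).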
Upvotes: 0
Reputation: 3269
How fixed is your format? If you're sure it's going to look like that, then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively, pre-process the file to strip all the newlines, maybe collapse the whitespace too (awkward with sed alone, since it works line by line), and then go from there.
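The pre-processing idea could look roughly like this. It is only a sketch: the function name extract_time_flat is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep do) and that the cell contains no nested tags.

```shell
# Hypothetical helper: flatten the file first, then pull out the time cells.
extract_time_flat() {
  # 1. turn every run of whitespace (newlines included) into a single space
  # 2. print each <td class="time">...</td> match on its own line
  # 3. drop the tags and trim the surrounding spaces
  tr -s '[:space:]' ' ' < "$1" |
    grep -o '<td class="time">[^<]*</td>' |
    sed 's/<[^>]*>//g; s/^ *//; s/ *$//'
}
```

Because the newlines are gone before matching, this handles the value being on the same line as the tag as well as on the next line.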
However, if it's an HTML file from somewhere else and you can't be sure of the format, I'd consider using some other scripting language that has a library to parse HTML or XML; otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
Upvotes: 0
Reputation: 10260
It's not quite there (it prints a leading blank line for some reason), but maybe something like this?
$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
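The leading blank line comes from the s command: it empties the line holding the opening tag, but that now-empty line is still inside the range and so still reaches p. A possible variant that deletes the opening tag line instead is sketched below. The function name extract_time_range is an illustrative assumption, and like the command above it expects the tags to sit alone on their lines.

```shell
# Hypothetical helper: print the lines between <td class="time"> and </td> in file $1.
extract_time_range() {
  # Inside the range, delete the opening and closing tag lines (d skips the p)
  # and print whatever is left.
  sed -n '/<td class="time">/,/<\/td>/{/^<td class="time">$/d;/^<\/td>$/d;p;}' "$1"
}
```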
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\s*(.*?)\s*<\/td>/gs' regex.txt
13.05.2013 17:51
Upvotes: 2