Reputation: 1197
I have this text file (it's really part of an HTML page):
<tr>
<td width="10%" valign="top"><P>Name:</P></td>
<td colspan="2"><P>
XXXXX
</P></td>
</tr>
<tr>
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
Mycity
</p></td>
</tr>
<tr>
<td width="10%" valign="top"><p>County:</p></td>
<td colspan="2"><p>
YYYYYY
</p></td>
</tr>
<tr>
<td width="10%" valign="top"><p>Map:</p></td>
<td colspan="2"><p>
ZZZZZZZZ
I've used this sed command to extract "Mycity":
$ tr -d '\n' < file.html | sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
As far as I know the regular expression works, but I get
Map:
instead of Mycity.
I've tested the regex with Rubular and it works there, but not with sed. Is sed not the right tool? What am I doing wrong?
PS: I'm using Linux
Upvotes: 0
Views: 255
Reputation: 203129
sed is always the wrong tool for anything that involves processing multiple lines. Just use awk, it's what it was invented to do:
$ awk 'c&&!--c; /City:/{c=2}' file.html
Mycity
See Printing with sed or awk a line following a matching pattern
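For readers new to the idiom, here is a commented sketch of the same countdown logic. It assumes the value always sits exactly two lines below the City: label, as in the question's sample; the sample variable and the city variable are only illustrative names.

```shell
# A cut-down copy of the rows from the question (illustrative data).
sample='<tr>
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
Mycity
</p></td>
</tr>'

# Same logic as the one-liner, spelled out rule by rule.
city=$(printf '%s\n' "$sample" | awk '
  c && !--c { print }   # a countdown is running and just reached zero: print this line
  /City:/   { c = 2 }   # label found: the wanted value is two lines further down
')
echo "$city"   # prints: Mycity
```

The first rule only fires on the line where the counter drops from 1 to 0, so the label line and the line right after it are skipped.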
Upvotes: 2
Reputation: 58375
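The problem you have right now is that .* is greedy: it matches as much as it possibly can.
's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
                     ^ // here!
So it matches everything up to the last <p>...</p></td> pair in the input, which is why you get Map:. Some regex engines (Ruby's, which Rubular uses, among them) let you make the quantifier lazy with .*?, but sed's POSIX regular expressions have no non-greedy quantifiers; in a basic regex the ? is just a literal character. Instead, restrict what each wildcard is allowed to match:
's/.*City:<\/p><\/td><td[^>]*><p>\([^<]*\)<\/p><\/td>.*/\1/p'
                        ^^^^^      ^^^^^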
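A small reproduction may make the greedy behaviour easier to see. This is only a sketch: it assumes the input looks like the rows in the question (truncated after the Map: row, as posted), and the names sample, flat, greedy and bounded are made up for the example. The bounded variant swaps the open-ended wildcards for negated bracket expressions, which work in any POSIX sed.

```shell
# Rows from the question, starting at the City: row (illustrative data).
sample='<tr>
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
Mycity
</p></td>
</tr>
<tr>
<td width="10%" valign="top"><p>County:</p></td>
<td colspan="2"><p>
YYYYYY
</p></td>
</tr>
<tr>
<td width="10%" valign="top"><p>Map:</p></td>
<td colspan="2"><p>
ZZZZZZZZ'

# Join everything onto one line, as the question does with tr.
flat=$(printf '%s' "$sample" | tr -d '\n')

# Original expression: the second .* runs on to the last <p>...</p></td> pair.
greedy=$(printf '%s\n' "$flat" |
  sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p')

# Bounded expression: [^>]* cannot leave the <td ...> tag, [^<]* cannot leave the cell text.
bounded=$(printf '%s\n' "$flat" |
  sed -n 's/.*City:<\/p><\/td><td[^>]*><p>\([^<]*\)<\/p><\/td>.*/\1/p')

echo "greedy:  $greedy"    # prints: greedy:  Map:
echo "bounded: $bounded"   # prints: bounded: Mycity
```

The bounded form depends on the value cell directly following the label cell, so it would need adjusting if other markup can appear between them.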
Upvotes: 2