PerseP

Reputation: 1197

Extract text with sed

I have this text file (it's actually part of an HTML page):

<tr>
              <td width="10%" valign="top"><P>Name:</P></td>
              <td colspan="2"><P>
                XXXXX
              </P></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>City:</p></td>
              <td colspan="2"><p>
                Mycity
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>County:</p></td>
              <td colspan="2"><p>
                YYYYYY
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>Map:</p></td>
              <td colspan="2"><p>
                ZZZZZZZZ

I've used this sed command to extract "Mycity":

$ tr -d '\n' < file.html | sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'

As far as I can tell the regular expression is correct, but I get

Map:

instead of Mycity.

I've tested the regex with Rubular and it works there, but not with sed. Is sed not the right tool? What am I doing wrong?

PS: I'm using Linux

Upvotes: 0

Views: 255

Answers (2)

Ed Morton

Reputation: 203129

sed is always the wrong tool for anything that involves processing multiple lines. Just use awk, it's what it was invented to do:

$ awk 'c&&!--c; /City:/{c=2}' file.html
                Mycity

See Printing with sed or awk a line following a matching pattern
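
For readers unfamiliar with the idiom, the one-liner is a countdown: a match on City: sets c to 2, and each later line decrements c and is printed when c reaches zero. Below is a rough, spelled-out sketch of the same logic. The temporary sample file and the whitespace trimming are assumptions added for illustration and are not part of the original command.

```shell
# Hypothetical sample file reproducing the City row from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
EOF

# Same countdown as the one-liner, written out step by step.
# The gsub() call trims surrounding blanks (an addition for illustration).
city=$(awk '
  c > 0 {                           # a countdown is running
    c--
    if (c == 0) {                   # reached the target line
      gsub(/^[ \t]+|[ \t]+$/, "")   # strip leading/trailing blanks
      print
    }
  }
  /City:/ { c = 2 }                 # print the 2nd line after this one
' "$tmp")

echo "$city"
rm -f "$tmp"
```

Changing c = 2 to another value selects a different line after the match, which is the main appeal of the countdown form.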

Upvotes: 2

jcuenod

Reputation: 58375

The problem you have right now is that regex quantifiers are greedy by default:

's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
                     ^ // here!

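So it's matching everything up to the last section. sed's regular expressions (POSIX BRE/ERE) don't support the non-greedy .*? syntax that Perl-style engines such as Rubular's do, so restrict what each part is allowed to match instead:

's/.*City:<\/p><\/td>[^<]*<td[^>]*><p>\([^<]*\)<\/p><\/td>.*/\1/p'
                     ^^^^^              ^^^^^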
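
The greedy behaviour is easy to reproduce on a single made-up line. The sketch below (the sample data and variable names are assumptions) compares the question's pattern with a variant that uses bracket expressions, which work in any POSIX sed, to stop each part of the match at the next tag.

```shell
# One-line sample, shaped like the output of tr -d '\n' (made-up data).
line='<td><p>City:</p></td><td><p>Mycity</p></td><td><p>Map:</p></td><td><p>ZZZ'

# Pattern from the question: the .* before <p> runs as far right as it can,
# so the capture ends up on the last complete <p>...</p></td> cell.
greedy=$(printf '%s\n' "$line" |
  sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p')

# Bounded variant: [^<]* and [^>]* cannot cross a tag boundary,
# so the capture stays in the cell right after City:.
bounded=$(printf '%s\n' "$line" |
  sed -n 's/.*City:<\/p><\/td>[^<]*<td[^>]*><p>\([^<]*\)<\/p><\/td>.*/\1/p')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

On the real file the captured text keeps the indentation around the value, so an extra whitespace-stripping step may be wanted.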

Upvotes: 2
