linux_beginner

Reputation: 111

sed'ing out HTML section

I have a long HTML table output consisting of dozens of records. An example looks like this:

<tr onclick="window.location='/team/90687/';" style="cursor: pointer;" class="">
  <td class="number">163124</td>
  <td class="img">3</td>
  <td class="user">
    <span class="name">Mosse John</span>
  </td>
  <td class="number">3332</td>
  <td class="number">497</td>
  <td class="number">20</td>
</tr>
<tr onclick="window.location='/team/342465/';" style="cursor: pointer;" class="">
  <td class="number">163124</td>
  <td class="img">2</td>
  <td class="user">
    <span class="name">Sus Peter</span>
  </td>
  <td class="number">3332</td>
  <td class="number">450</td>
  <td class="number">20</td>
</tr>

Now I want to extract the section that contains the user belonging to 90687, so I type:

sed my_html_file -e '/window.location.*90687/,/window.location/ !d'

Unfortunately, it also fetches the first line of the next section, which I would like to avoid. I went through 101 sed and awk tricks, but the only solution I found is

sed my_html_file -e '/window.location.*90687/,+9 !d'

which means I am interested in fetching the 9 lines after the pattern. The problem is that I cannot rely on "9" or any other number. Is there any way to solve this with sed? BTW, I am strongly interested in a sed solution.

Upvotes: 1

Views: 56

Answers (2)

Naoric

Reputation: 355

If you are not sure whether the closing </tr> might sit on the same line as the start of the following record, you can try this:

sed -n -E '/window\.location.*90687/,/<\/tr>/ {
/<\/tr>/! { p }
/<\/tr>/ { s/(.*)<\/tr>.*$/\1<\/tr>/ p } }
' input.txt

Though there are probably more elegant solutions, this also handles input like this:

<tr onclick="window.location='/team/90687/';" style="cursor: pointer;" class="">
  <td class="number">163124</td>
  <td class="img">3</td>
  <td class="user">
    <span class="name">Mosse John</span>
  </td>
  <td class="number">3332</td>
  <td class="number">497</td>

  <!-- Confusing Row -->
  <td class="number">20</td></tr> <tr onclick="window.location='/team/342465/';" style="cursor: pointer;" class="">

  <td class="number">163124</td>
  <td class="img">2</td>
  <td class="user">
    <span class="name">Sus Peter</span>
  </td>
  <td class="number">3332</td>
  <td class="number">450</td>
  <td class="number">20</td>
</tr>
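
The same idea can probably be written more compactly. The following is only a sketch, not part of the original answer: it assumes that cutting each line in the range at its first </tr> is acceptable, and it uses a throwaway temp file and shortened sample rows purely for illustration.

```shell
# Build a small sample in which </tr> shares a line with the next record.
# The temp file and the shortened rows are illustrative assumptions.
input=$(mktemp)
cat > "$input" <<'EOF'
<tr onclick="window.location='/team/90687/';" class="">
  <td class="number">497</td>
  <td class="number">20</td></tr> <tr onclick="window.location='/team/342465/';" class="">
  <td class="number">450</td>
</tr>
EOF

# Inside the range, drop everything after the first </tr>, then print.
# On lines without </tr> the substitution does nothing.
result=$(sed -n '/window\.location.*90687/,/<\/tr>/ {
  s/<\/tr>.*/<\/tr>/
  p
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Because the regex starts matching at the leftmost </tr>, the start of the next record on the same line is discarded rather than printed.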

Upvotes: 1

Andriy Makukha

Reputation: 8314

Simple solution for your data:

sed my_html_file -e '/window.location.*90687/,/<\/tr>/ !d'

This prints every line from the matching row up to and including the first line that contains the closing </tr> tag.
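
As a quick check, the range can also be written in the common -n ... p form, which should be equivalent to the !d form here. This is a minimal sketch on shortened sample rows in a temp file (both are assumptions for illustration). The file name is placed after the script because options following a file operand are accepted by GNU sed but are not guaranteed elsewhere.

```shell
# Shortened sample with each </tr> on its own line (illustrative data).
html=$(mktemp)
cat > "$html" <<'EOF'
<tr onclick="window.location='/team/90687/';" class="">
  <td class="number">497</td>
</tr>
<tr onclick="window.location='/team/342465/';" class="">
  <td class="number">450</td>
</tr>
EOF

# -n suppresses the default output; p prints only lines inside the range.
record=$(sed -n '/window\.location.*90687/,/<\/tr>/p' "$html")

printf '%s\n' "$record"
rm -f "$html"
```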

More complex solution:

sed my_html_file -n -e '/window.location.*90687/,/window.location/ { H;x; /window.location.*window.location/ !{ x;p }} '

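This prints all the lines up to, but not including, the line with the second window.location. It accumulates the range in the hold space and stops printing once that buffer contains two window.location matches.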
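
If the hold space feels heavy, a possible alternative is to delete the closing line of the range directly. This is a sketch rather than the answer's method, and it assumes the next record's window.location line does not itself contain 90687; the temp file and shortened rows are illustrative.

```shell
# Shortened sample rows (illustrative data).
rows=$(mktemp)
cat > "$rows" <<'EOF'
<tr onclick="window.location='/team/90687/';" class="">
  <td class="number">497</td>
</tr>
<tr onclick="window.location='/team/342465/';" class="">
  <td class="number">450</td>
</tr>
EOF

# The range ends on the next window.location line. Inside the range, a
# window.location line without 90687 is that end line, so drop it; every
# other line in the range is printed.
section=$(sed -n '/window\.location.*90687/,/window\.location/ {
  /window\.location/ {
    /90687/!d
  }
  p
}' "$rows")

printf '%s\n' "$section"
rm -f "$rows"
```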

Upvotes: 1
