kaidentity

Reputation: 608

Regex doesn't produce match when containing a new line

I'm trying to parse the page https://extensions.typo3.org/extension/tt_news/ for version numbers and their corresponding dates with sed or grep. More specifically, I'm interested in the following HTML section:

            <tr>
                <td class="align-middle">
                    <strong>3.6.0</strong> /
                    <span class="ter-ext-state-beta">beta</span>
                    <br />
                    <small>
                        April 06, 2014
                    </small>
                </td>
                <td class="align-middle">
                    tt_news for TYPO3 4.5 - 6.2 (compatibility update)
                </td>
                <td class="align-middle">

                        <strong>4.5.0 - 6.2.99</strong>

                </td>
                <td class="align-middle">

                            <a class="btn btn-primary" title="Size: 2.58MB" href="/extension/download/tt_news/3.6.0/zip/">
                                <strong>
                                    Download ZIP Archive
                                </strong>
                            </a>

                </td>
            </tr>

From each of these sections I would like to get the version (inside the first strong tag) and the date (inside the small tag). All my attempts have failed so far, and I can narrow the problem down to something very simple. On regex101.com I tested the following regex, which only tries to match a tr tag followed by whitespace and a td tag, and there it works perfectly fine:

<tr>\s*<td

It gives me 5 matches, which is correct. The following one also works fine:

 <tr[^>]*>\s*<td

It produces 38 results because it also includes the tr tags that have a CSS class attribute. However, I can't get this to work with either grep or sed. As soon as I include the \s, there are no matches anymore. Here is what it looks like:

cat tt_news_history | grep '<tr>\s*<td'

no hits.

cat tt_news_history | grep '<tr>'

6 hits.

cat tt_news_history | grep '<tr[^>]*>'

lots of hits (didn't count). Same thing with sed. What am I doing wrong? Why can't I use a \s? Thanks for any hint.

Upvotes: 0

Views: 65

Answers (1)

Dzienny

Reputation: 3417

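GNU grep has a -z option that treats the input as NUL-terminated records instead of newline-terminated lines. A file without NUL bytes is then matched as one single record, so \s can match the newlines in the input, e.g.: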

cat tt_news_history | grep -z '<tr>\s*<td'
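Because -z turns the whole file into one record, a match prints the entire file. As a minimal sketch, assuming GNU grep and a small made-up file (sample.html is a hypothetical stand-in for tt_news_history), adding -o prints only the matched parts, each terminated by a NUL byte, which makes them easy to count:

```shell
# Hypothetical sample standing in for tt_news_history
cat > sample.html <<'EOF'
<tr>
    <td class="align-middle">
        <strong>3.6.0</strong> /
    </td>
</tr>
<tr class="x">
    <td>other</td>
</tr>
<tr>
    <td class="align-middle">
        <strong>3.5.2</strong> /
    </td>
</tr>
EOF

# -z: read the file as one NUL-terminated record, so \s can span newlines
# -o: print only the matched text; with -z each match ends in a NUL byte
# tr -cd '\0' keeps only those NUL bytes, so wc -c counts the matches
match_count=$(grep -zo '<tr>\s*<td' sample.html | tr -cd '\0' | wc -c)
echo "$match_count"   # 2 for the sample above (the <tr class="x"> row is skipped)
```

To look at the matches themselves, piping the output through tr '\0' '\n' instead turns the NUL terminators back into newlines.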

The relevant fragments from the info documentation:

‘-z’ ‘--null-data’ Treat input and output data as sequences of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. Like the ‘-Z’ or ‘--null’ option, this option can be used with commands like ‘sort -z’ to process arbitrary file names.

(...)

  1. How can I match across lines?

Standard grep cannot do this, as it is fundamentally line-based. Therefore, merely using the ‘[:space:]’ character class does not match newlines in the way you might expect.

With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input “line” is terminated by a null byte; *note Other Options::. Thus, you can match newlines in the input, but typically if there is a match the entire input is output, so this usage is often combined with output-suppressing options like ‘-q’, e.g.:

printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'

If this does not suffice, you can transform the input before giving it to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other utilities that are designed to operate across lines.
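For the original goal of pulling out the version and the date, the awk route mentioned above could look roughly like the following. This is only a sketch under assumptions taken from the HTML fragment in the question: the version sits in a single-line strong tag before the small tag, and the date is on the line right after the opening small tag. The file name sample.html is a hypothetical stand-in for the downloaded page.

```shell
# Hypothetical sample mirroring the structure shown in the question
cat > sample.html <<'EOF'
<tr>
    <td class="align-middle">
        <strong>3.6.0</strong> /
        <small>
            April 06, 2014
        </small>
    </td>
    <td class="align-middle">
        <strong>4.5.0 - 6.2.99</strong>
    </td>
</tr>
EOF

result=$(awk '
  # Remember the text of the most recent single-line <strong>...</strong>;
  # 8 is the length of "<strong>", 17 the length of both tags together
  match($0, /<strong>[^<]*<\/strong>/) {
      version = substr($0, RSTART + 8, RLENGTH - 17)
  }
  # Assumption: the date is on the line following the opening <small> tag
  /<small>/ { in_small = 1; next }
  in_small {
      gsub(/^[ \t]+|[ \t]+$/, "")   # trim surrounding whitespace
      print version "\t" $0
      in_small = 0
  }
' sample.html)
echo "$result"
```

The compatibility range in the later strong tag is overwritten by the next row's version before the next small tag is reached, so it never gets printed. If the real page deviates from this layout, a proper HTML parser is the safer choice.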

Upvotes: 2
