user311509

Reputation: 2866

Simple Regular Expression on HTML Tags

Problem One:

</a>              

19-10-2011, 04:49 PM

             </td> <td class="thread" 

How can I fetch the date and time, i.e. 19-10-2011, 04:49 PM?

Note: the spacing in the snippet above is not stable and can vary, e.g. between </td> and <td class

My attempt:

preg_match("#</a>(.*?)</td> <td class=\"thread\"#", $page, $fetchContent);

Result: empty


Problem Two:

<div id="post_message_43345">ANY TYPE OF CONTENT INCLUDING SPACES</tr> <tr>

I need to fetch "ANY TYPE OF CONTENT".

Note: the spacing between tags such as </tr> <tr> can vary from one page to another.

My attempt:

preg_match("#<div id=\"post_message_[a-zA-Z0-9_]*\">(.*?)</tr> <tr>#", $page, $fetchedContent);

Result: empty

I'm looking for a rough, short, temporary snippet for a one-off task, which is why I didn't use an HTML parser.

Any help will be appreciated.

Upvotes: 0

Views: 116

Answers (2)

mathematical.coffee

Reputation: 56905

Problem 1

You need to use the s flag to have . match newline characters too:

preg_match("#</a>(.*?)</td> <td class=\"thread\"#s", $page, $fetchContent);
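To see the difference the s flag makes, here is a minimal sketch. The $page string is a hypothetical stand-in modelled on the snippet in the question, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample input modelled on the snippet in the question.
$page = "</a>              \n\n19-10-2011, 04:49 PM\n\n             </td> <td class=\"thread\" ";

// Without the s modifier, . does not match newlines, so (.*?) cannot span lines.
$withoutS = preg_match("#</a>(.*?)</td> <td class=\"thread\"#", $page, $m1);

// With the s modifier, . matches newlines too.
$withS = preg_match("#</a>(.*?)</td> <td class=\"thread\"#s", $page, $m2);

// The captured group still includes the surrounding whitespace, so trim it.
$dateTime = $withS ? trim($m2[1]) : null;

var_dump($withoutS, $withS, $dateTime);
```

Note that this still expects exactly one space between </td> and <td; it only fixes the newline problem.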

You'd probably be better off matching the date directly though:

preg_match("#([0123]?[0-9]-(?:0?[1-9]|1[012])-(?:[0-9]{4})),? ?((?:0[0-9]|1[012]):[0-5][0-9] ?[AP]M)#",...)

Edit: this date regex will be a little faster (word boundaries added on either side):

preg_match("#\\b([0123]?[0-9]-(?:0?[1-9]|1[012])-(?:[0-9]{4}))[, ]{1,3}((?:0[0-9]|1[012]):[0-5][0-9] ?[AP]M)\\b#",...)

For both, the date is in $results[1] and the time is in $results[2] (assuming $results is the variable passed as the third argument to preg_match).
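As a rough sketch of how the date pattern with boundaries could be used end to end: $page here is a hypothetical sample modelled on the question, and $results is simply the name assumed for the matches array.

```php
<?php
// Hypothetical sample input modelled on the snippet in the question.
$page = "</a>              \n\n19-10-2011, 04:49 PM\n\n             </td> <td class=\"thread\" ";

// Date pattern with word boundaries; no surrounding tags are needed.
$pattern = "#\\b([0123]?[0-9]-(?:0?[1-9]|1[012])-(?:[0-9]{4}))[, ]{1,3}((?:0[0-9]|1[012]):[0-5][0-9] ?[AP]M)\\b#";

$date = null;
$time = null;
if (preg_match($pattern, $page, $results)) {
    $date = $results[1]; // day-month-year part
    $time = $results[2]; // hour:minute plus AM/PM
}

var_dump($date, $time);
```

Because the pattern describes the date itself, it does not depend on the spacing or newlines around the tags.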

Problem 2

Again, use the s flag. To allow a varying number of spaces between </tr> and <tr>, follow the space with * (i.e. " *"):

preg_match("#<div id=\"post_message_[a-zA-Z0-9_]*\">(.*?)</tr> *<tr>#s", $page, $fetchedContent);

If you want to allow for newlines between the </tr> and <tr> as well, use \s* instead. The same applies to Problem 1.
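A minimal sketch of the \s* variant for Problem 2, assuming a hypothetical $page in which the content spans two lines and a newline plus spaces sit between the </tr> and <tr>:

```php
<?php
// Hypothetical sample input: newline inside the content and between the tags.
$page = "<div id=\"post_message_43345\">ANY TYPE OF CONTENT\nINCLUDING SPACES</tr>\n   <tr>";

// Single quotes keep \s literal for the regex engine.
// s flag: . matches newlines; \s*: any mix of spaces and newlines between the tags.
$pattern = '#<div id="post_message_[a-zA-Z0-9_]*">(.*?)</tr>\s*<tr>#s';

$content = null;
if (preg_match($pattern, $page, $fetchedContent)) {
    $content = $fetchedContent[1];
}

var_dump($content);
```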

Upvotes: 1

mario

Reputation: 145482

Note: the above snippet could have unstable spacing as you see above

You want it to match newlines as well, which the . doesn't do by default. That basically requires the s modifier, written after the closing # delimiter:

  preg_match('#</a>(.*?)</td> <td class="thread"#s', ...

But you could also just add \s* on either side of your (.*?) capture group, and also between the </td> and <td.
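Roughly, the \s* variant could look like the sketch below. The $page value is a hypothetical sample modelled on the question, and it assumes the date itself sits on a single line.

```php
<?php
// Hypothetical sample input modelled on the snippet in the question.
$page = "</a>              \n\n19-10-2011, 04:49 PM\n\n             </td> <td class=\"thread\" ";

// \s* absorbs spaces and newlines around the capture and between the tags,
// so the s modifier is not needed as long as the captured text is on one line.
$pattern = '#</a>\s*(.*?)\s*</td>\s*<td class="thread"#';

$captured = null;
if (preg_match($pattern, $page, $fetchContent)) {
    $captured = $fetchContent[1];
}

var_dump($captured);
```

A side effect is that the captured text arrives without the surrounding whitespace, so no trim() is needed afterwards.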

And then you could make your regex more specific, e.g. \d\d-\d\d-\d{4}, \d\d:\d\d, to capture only the date and time. That might make matching the tags somewhat redundant.
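A small sketch of that more specific approach, assuming a four-digit year as in the sample from the question. The optional AM/PM suffix is an addition for illustration, and $page is a hypothetical sample.

```php
<?php
// Hypothetical sample input modelled on the snippet in the question.
$page = "</a>              \n\n19-10-2011, 04:49 PM\n\n             </td> <td class=\"thread\" ";

// Digits only: dd-dd-dddd, dd:dd, optionally followed by AM or PM.
$pattern = '#\d\d-\d\d-\d{4}, \d\d:\d\d(?: ?[AP]M)?#';

$found = null;
if (preg_match($pattern, $page, $m)) {
    $found = $m[0]; // whole match, no capture groups needed
}

var_dump($found);
```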

Note: the spacing between tags such as </tr> <tr> could vary from page to another.

You can again just use \s* which matches spaces and newlines in any combination.

Upvotes: 1
