user1322720

regex match part of an optional subgroup

I'm trying to grab multiple events from a website. The events have a regular format of

... EVENT TITLE & LINK ... START DATE ... END DATE ... <span class="location">LOCATION</span> ...

where "..." stands for some HTML tags with style info and newlines. I want to extract LINK, START DATE, END DATE and LOCATION from these event strings. Since the format of the surrounding HTML "..." is completely regular, grabbing the four pieces of information is easy enough: I match the surrounding tags and capture the part that I want, e.g.:

'|...<abbr class="dtstart">(.{10}).*?</abbr>...|s'

where "(.{10})" is the START DATE.

The problem is the LOCATION: some events are listed with a location and others without, so the tag <span class="location">LOCATION</span> is present in some events and simply missing in others.

So my question is:

How can I match LOCATION?

If I try

preg_match_all('|...<span class="location">(.+?)</span>...|s', $contents, $matches, PREG_SET_ORDER);

on an event without location, it does not match that event (but I get a LOCATION for the events that have one). If, on the other hand, I try

preg_match_all('|...(?:<span class="location">(.+?)</span>)?...|s', $contents, $matches, PREG_SET_ORDER);

on any event, that code matches all events, but the LOCATION, even if it is present, is not part of my $matches.

So how can I match an irregular part of a regular but optional substring?

Thank you!

Edit (as answer to a question by zigdon):

The problem is that the LOCATION has to be matched to the other event data. Imagine this is what I want as a result: "Congress of Society of Regex (Link to Website), April 7th to April 10th, Berlin" and "Online Tutorial (Link to Website), May 9th". The second event has no location, but the location of the first event has to be matched to its title, link, and date. Here is a link to the page that I want to grab the events from; you can look at the source code to understand the problem: https://www.fs-psycho.uni-tuebingen.de/events/previous. At the moment I grab the events with

preg_match_all('|<dt class="vevent">\s*?<span class="summary">\s*?(<a href=".+?</a>)\s*?</span>\s*?<span class="documentByLine">\s*?<span>(?:von )?<abbr class="dtstart" title=".{0,30}">(.{10}).{0,6}</abbr>.{5,100}<abbr class="dtend" title=".+?">(.{0,10}).{5,6}</abbr></span>\s*?(?:<span>— <span class="location">(.*?)</span>,</span>)?\s*?</span>\s*?</dt>|', $contents, $matches, PREG_SET_ORDER);

This works, but I am unhappy with it, because, as mentioned in the answers, with "wild code" (from a site not my own) anything could happen between the tags. I would prefer a solution that matches only the immediate surroundings of the event parts and leaves whatever is in between very open, i.e. ".*?|s".
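
One possible way to keep the gaps open and still capture an optional LOCATION is to move the lazy filler inside the optional group, and to stop every filler from running past the end of the current event. With a pattern of the form .*?(?:<span class="location">(.+?)</span>)?, the lazy .*? first matches nothing, the optional group fails at that position and is skipped, and the rest of the pattern then succeeds without the group ever being tried where the span actually is. The following is only a sketch under assumptions: the sample markup, the dates, and the names $gap and $events are made up for illustration, and it assumes every event is wrapped in <dt class="vevent">...</dt> as in the pattern above.

```php
<?php
// Hypothetical sample markup; the first event has no location on purpose.
$contents = <<<HTML
<dl>
<dt class="vevent">
  <span class="summary"><a href="/events/tutorial">Online Tutorial</a></span>
  <span class="documentByLine">
    <abbr class="dtstart" title="2012-05-09T10:00">2012-05-09 10:00</abbr>
    <abbr class="dtend" title="2012-05-09T12:00">2012-05-09 12:00</abbr>
  </span>
</dt>
<dt class="vevent">
  <span class="summary"><a href="/events/congress">Congress</a></span>
  <span class="documentByLine">
    <abbr class="dtstart" title="2012-04-07T09:00">2012-04-07 09:00</abbr>
    <abbr class="dtend" title="2012-04-10T17:00">2012-04-10 17:00</abbr>
    <span class="location">Berlin</span>
  </span>
</dt>
</dl>
HTML;

// Lazy "anything" that is not allowed to step over the end of the event.
$gap = '(?:(?!</dt>).)*?';

$pattern = '~<dt class="vevent">'
         . $gap . '<a href="([^"]+)"'
         . $gap . '<abbr class="dtstart"[^>]*>(.{10})'
         . $gap . '<abbr class="dtend"[^>]*>(.{10})'
         // The filler sits INSIDE the optional group, so the group is
         // tried as a whole before the engine gives up on the location.
         . '(?:' . $gap . '<span class="location">(.*?)</span>)?'
         . $gap . '</dt>~s';

preg_match_all($pattern, $contents, $matches, PREG_SET_ORDER);

$events = [];
foreach ($matches as $m) {
    $events[] = [
        'link'     => $m[1],
        'start'    => $m[2],
        'end'      => $m[3],
        // Group 4 is missing (or empty) when the event has no location.
        'location' => (isset($m[4]) && $m[4] !== '') ? $m[4] : null,
    ];
}

print_r($events);
```

Without the (?!</dt>) guard, the filler inside the optional group could run on into the next event and attach that event's location to the wrong entry, so the guard is what keeps each LOCATION tied to its own link and dates.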

Upvotes: 0

Views: 374

Answers (1)

zigdon

Reputation: 15073

Using regular expressions to parse HTML (or any actual markup) is usually a really bad idea. Most languages provide a library that actually parses the HTML and lets you pick out the particular elements you want, without matching tags against regular expressions. Since it looks like you are using PHP, you could look at something like http://simplehtmldom.sourceforge.net/
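
As a rough illustration of the parser approach, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than the simplehtmldom library linked above. The sample markup, the dates, and the $events array are assumptions made up for this example, and it assumes the DOM extension is installed. A missing location simply yields null for that event, so each location stays attached to its own link and dates.

```php
<?php
// Hypothetical sample markup, loosely following the class names in the question.
$html = <<<HTML
<dl>
<dt class="vevent">
  <span class="summary"><a href="/events/congress">Congress</a></span>
  <span class="documentByLine">
    <abbr class="dtstart" title="2012-04-07T09:00">2012-04-07 09:00</abbr>
    <abbr class="dtend" title="2012-04-10T17:00">2012-04-10 17:00</abbr>
    <span class="location">Berlin</span>
  </span>
</dt>
<dt class="vevent">
  <span class="summary"><a href="/events/tutorial">Online Tutorial</a></span>
  <span class="documentByLine">
    <abbr class="dtstart" title="2012-05-09T10:00">2012-05-09 10:00</abbr>
    <abbr class="dtend" title="2012-05-09T12:00">2012-05-09 12:00</abbr>
  </span>
</dt>
</dl>
HTML;

// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$events = [];
foreach ($xpath->query('//dt[@class="vevent"]') as $dt) {
    // Each query is relative to the current event node ($dt).
    $link  = $xpath->query('.//span[@class="summary"]//a', $dt)->item(0);
    $start = $xpath->query('.//abbr[@class="dtstart"]', $dt)->item(0);
    $end   = $xpath->query('.//abbr[@class="dtend"]', $dt)->item(0);
    $loc   = $xpath->query('.//span[@class="location"]', $dt)->item(0);

    $events[] = [
        'link'     => $link  ? $link->getAttribute('href') : null,
        // Keep only the 10-character date, as in the question's (.{10}).
        'start'    => $start ? substr(trim($start->textContent), 0, 10) : null,
        'end'      => $end   ? substr(trim($end->textContent), 0, 10) : null,
        // item(0) is null when the event has no location span.
        'location' => $loc   ? trim($loc->textContent) : null,
    ];
}

print_r($events);
```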

See also RegEx match open tags except XHTML self-contained tags

Upvotes: 1
