regex match part of an optional subgroup

Question

I'm trying to grab multiple events from a website. The events have a regular format of

... EVENT TITLE & LINK ... START DATE ... END DATE ... LOCATION ...

where "..." are some HTML tags with style info and newlines. I want to extract LINK, START DATE, END DATE and LOCATION from these event strings. Since the surrounding HTML code "..." is completely regular, grabbing the four pieces of information is easy enough: I match the surrounding tags and capture the part that I want, e.g.:

'|...(.{10}).*?...|s'

where "(.{10})" is the START DATE.

The problem is the LOCATION: some events are listed with a location and others without, so the span tag containing LOCATION is present in some events and simply missing in others.

So my question is:

How can I match LOCATION?

If I try

preg_match_all('|...(.+?)...|s', $contents, $matches, PREG_SET_ORDER);

on an event without location, it does not match that event (but I get a LOCATION for the events that have one). If, on the other hand, I try

preg_match_all('|...(?:(.+?))...|s', $contents, $matches, PREG_SET_ORDER);

on any event, that code matches all events, but the LOCATION, even when it is present, is not part of my $matches.
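To make the two symptoms concrete, here is a minimal sketch that reproduces them. The markup, the class names "date" and "loc", and the variable names are all made up for illustration, since the real tags are not shown above. One likely cause of the second symptom, under these assumptions, is a lazy ".*?" sitting directly before the optional group: the group is then tried only at the current position, fails, and is allowed to match nothing.

```php
<?php
// Hypothetical markup; the real page's tags are not shown in the question.
$contents = <<<HTML
<div class="event"><a href="/a">Congress</a> <span class="date">2010-04-07</span> <span class="loc">Berlin</span></div>
<div class="event"><a href="/b">Online Tutorial</a> <span class="date">2010-05-09</span></div>
HTML;

// Attempt 1: LOCATION is required, so the event without one is skipped.
$requiredPattern = '|<a href="([^"]+)".*?<span class="date">(.{10})</span>'
                 . '.*?<span class="loc">(.+?)</span>|s';
preg_match_all($requiredPattern, $contents, $required, PREG_SET_ORDER);

// Attempt 2: LOCATION is optional, but the lazy .*? in front of the group
// consumes nothing, the group fails at that position and matches empty.
$optionalPattern = '|<a href="([^"]+)".*?<span class="date">(.{10})</span>'
                 . '.*?(?:<span class="loc">(.+?)</span>)?|s';
preg_match_all($optionalPattern, $contents, $optional, PREG_SET_ORDER);

echo count($required), " event(s) with required LOCATION\n";
echo count($optional), " event(s) with optional LOCATION\n";
foreach ($optional as $m) {
    // Trailing groups that did not take part in the match are left out of $m.
    echo $m[1], ': ', $m[3] ?? '(LOCATION not captured)', "\n";
}
```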

So how can I match an irregular part of a regular but optional substring?

Thank you!

Edit (as answer to a question by zigdon):

The problem is that the LOCATION has to be matched to the other event data. Imagine this to be what I want as a result: "Congress of Society of Regex (Link to Website), April 7th to April 10th, Berlin" and "Online Tutorial (Link to Website), May 9th". The second event has no location, but the location of the first event has to stay associated with its title, link, and date.

Here is a link to the page that I want to grab the events from; you can look at the source code to understand the problem: https://www.fs-psycho.uni-tuebingen.de/events/previous

At the moment I grab the events with

preg_match_all('|\s*?\s*?(\s*?(?:von )?(.{10}).{0,6}.{5,100}(.{0,10}).{5,6}\s*?(?:— (.*?),)?\s*?\s*?
|', $contents, $matches, PREG_SET_ORDER);

This works, but I am unhappy with it because, as mentioned in the answers, with "wild code" (from a site that is not my own) anything could appear between the tags. I would prefer a solution that matches only the immediate surroundings of each event part and leaves whatever is in between very open, i.e. ".*?|s".
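A rough sketch of what such an "open" pattern might look like, under assumptions: the markup, the class names "date" and "loc", and the closing </div> that ends each event are all hypothetical and would need to be replaced with the real tags. The idea is to move the open skip inside the optional group and to guard it with a negative lookahead, so that it can look ahead for a LOCATION but never runs past the end of the current event.

```php
<?php
// Hypothetical markup with arbitrary noise between the parts of an event.
$html = <<<HTML
<div class="event"><b><a href="/a">Congress</a></b><br>
  <i><span class="date">2010-04-07</span></i> <em>in</em> <span class="loc">Berlin</span></div>
<div class="event"><a href="/b">Online Tutorial</a> <span class="date">2010-05-09</span></div>
HTML;

$re = '|<a href="([^"]+)"'                      // LINK
    . '.*?<span class="date">(.{10})</span>'    // anything, then START DATE
    . '(?:(?:(?!</div>).)*?'                    // optional: skip ahead, but stop at the event end
    . '<span class="loc">(.+?)</span>)?|s';     // LOCATION, if the event has one

preg_match_all($re, $html, $events, PREG_SET_ORDER);

foreach ($events as $e) {
    // $e[3] is absent when the optional group did not take part in the match.
    $location = $e[3] ?? '';
    echo $e[1], ', ', $e[2], $location !== '' ? ', ' . $location : '', "\n";
}
```

Because the skip and the LOCATION tag sit in the same optional group, the engine first tries to find a location within the current event and only falls back to an empty match when none exists before the assumed </div>. Each row of $events then keeps the link, date and location of one event together.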
