Reputation: 1917
I am trying to build a kind of data miner in Python. The data I want to examine is a dictionary of the Greek language. The dictionary was originally in PDF format, and I converted it to a roughly corresponding HTML format so that it is easier to parse. I have also done some further formatting on it, since the data structure was heavily distorted.
My current task is to find the individual words and store them separately, along with their descriptions. My first thought was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I used that to write a regular expression that matches each one of them.
There is one problem, though. Despite the formatting I have done on the HTML so far, there are still many places where a run of logically contiguous data is interrupted by the sequence <br/> followed by a newline, at unpredictable positions. Is there any way to make my regular expression "ignore" that sequence, that is, treat it as non-existent when it is encountered, so that matches interrupted by it are still found?
That is, without putting a (<br/>\n)? in every part of my RE to cover every possible case.
The regular expression I use is the following:
(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>
and does a fine job with the matching, when the data is not interrupted by the sequence given above.
In case that was unclear: the interrupting sequence can occur anywhere within a match, so I am looking for a way to have it ignored when deciding whether there is a match, other than covering every single spot where it might occur, as I explained earlier.
Upvotes: 2
Views: 552
Reputation: 98559
What you're asking for is a different regular expression.
The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.
You could write something to transmute a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
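A rough sketch of such a transmuter, under loud assumptions: the helper names `br_tolerant` and `_class_end` are made up for illustration, the filler pattern is only a guess at the shape of the interruption, and only a subset of regex syntax is handled (literals, two-character escapes, character classes, plain and `(?:...)` groups, alternation, and the usual quantifiers). Constructs such as `\x41`, lookarounds, or named groups are not supported.

```python
import re

# Assumption: the interruption is a <br/> tag, optionally followed by a newline.
FILLER = r"(?:<br\s*/>\n?)?"

# A quantifier: *, +, ?, {m}, {m,}, {m,n} or {,n}, plus an optional lazy/possessive suffix.
QUANTIFIER = re.compile(r"(?:[*+?]|\{(?:\d+(?:,\d*)?|,\d*)\})[?+]?")


def _class_end(pattern, start):
    """Return the index just past the character class that opens at `start`."""
    j = start + 1
    if j < len(pattern) and pattern[j] == "^":
        j += 1
    if j < len(pattern) and pattern[j] == "]":
        j += 1  # a leading ] is a literal member of the class
    while j < len(pattern) and pattern[j] != "]":
        j += 2 if pattern[j] == "\\" else 1
    if j >= len(pattern):
        raise ValueError("unterminated character class")
    return j + 1


def br_tolerant(pattern, filler=FILLER):
    """Insert an optional filler after every atom of `pattern` (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            if pattern.startswith("(?:", i):
                out.append("(?:")
                i += 3
            elif pattern.startswith("(?", i):
                raise ValueError("unsupported group syntax at index %d" % i)
            else:
                out.append("(")
                i += 1
            continue
        if ch in ")|^$":
            out.append(ch)  # structural characters are copied through unchanged
            i += 1
            continue
        m = QUANTIFIER.match(pattern, i)
        if m:
            out.append(m.group())  # quantifier applied to a preceding group
            i = m.end()
            continue
        # Anything else is an atom: an escape, a character class, or one literal.
        if ch == "\\":
            end = i + 2
        elif ch == "[":
            end = _class_end(pattern, i)
        else:
            end = i + 1
        atom = pattern[i:end]
        q = QUANTIFIER.match(pattern, end)
        if q:
            # Wrap atom and filler together so the quantifier repeats both.
            out.append("(?:" + atom + filler + ")" + q.group())
            i = q.end()
        else:
            out.append(atom + filler)
            i = end
    return "".join(out)
```

Calling `re.compile(br_tolerant(original_pattern))` would then give a pattern that accepts the filler between any two atoms. The generated expression is long and hard to read, which is part of why cleaning the input is usually the simpler route.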
I think the easier thing to do is to preprocess the source document so that it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
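A minimal sketch of that substitution in Python, assuming the interruption is a `<br/>` tag (tolerating stray spaces and a slash on either side of the tag name) followed by an optional newline. `HEADER` is the regex from the question, copied unchanged; `BR_BREAK`, `strip_breaks`, and `find_headers` are names invented for this example, and the sample string is made up.

```python
import re

# The header regex from the question, copied unchanged.
HEADER = re.compile(
    r"(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>"
    r"(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*"
    r"( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>"
)

# Assumed shape of the interrupting sequence: a br tag in any of the
# spellings <br/>, < br/>, < /br>, <br>, plus an optional trailing newline.
BR_BREAK = re.compile(r"<\s*/?\s*br\s*/?\s*>\n?")


def strip_breaks(html):
    """Return the text with every interrupting br sequence removed."""
    return BR_BREAK.sub("", html)


def find_headers(html):
    """Clean the text first, then return all header matches as strings."""
    cleaned = strip_breaks(html)
    return [m.group(0) for m in HEADER.finditer(cleaned)]


if __name__ == "__main__":
    # Made-up sample: a header whose word is split by <br/> and a newline.
    sample = "ο <b>λό<br/>\nγος</b> ΟΥΣ.<b>:</b>"
    print(find_headers(sample))
```

Note that match offsets then refer to the cleaned text, not the original file. If some `<br/>` tags are meaningful line breaks elsewhere in the document, removing all of them may join text that should stay separate, so it may be worth keeping the original alongside the cleaned copy.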
If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.
Upvotes: 1