Reputation: 21
I'm having a small problem extracting the words that are in bold:
**Médoc**, Rouge
2ème Vin, **Margaux**, Rosé
2ème vin, **Pessac-Léognan**, Blanc
To clarify my question: I'm trying to extract some information from web pages. Each page gives me a string like the ones above, but I'm only interested in the part in bold. Here are the addresses of the web pages:
(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
Here is the regular expression I tried:
re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')
Any ideas?
Upvotes: 0
Views: 100
Reputation: 71568
It seems the term you want is always the second-to-last item in the comma-separated list. If so, you can split on the separator and take the second-to-last element, for example:
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]
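If it helps, here is a rough sketch of the same split idea wrapped in a function. The name second_to_last is made up for illustration, and returning None for strings with fewer than two terms is my own assumption about what you would want:

```python
def second_to_last(text, sep=","):
    """Return the second-to-last term of a separator-delimited string.

    Hypothetical helper for illustration: each term is stripped of
    surrounding whitespace, and None is returned when there are fewer
    than two terms.
    """
    parts = [part.strip() for part in text.split(sep)]
    return parts[-2] if len(parts) >= 2 else None


samples = [
    "Médoc, Rouge",
    "2ème Vin, Margaux, Rosé",
    "2ème vin, Pessac-Léognan, Blanc",
]
results = [second_to_last(s) for s in samples]
print(results)
```

Splitting on "," and stripping each part means it does not matter whether the page uses ", " or just "," between terms.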
Otherwise, if you want a regex-only solution, I'd suggest this:
>>> import re
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)
Then strip any surrounding whitespace if necessary, since the captured group can include a leading space.
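A minimal sketch of the regex variant with the stripping built in. The function name term_before_last is hypothetical, and returning None when the string contains no comma is an assumption on my part rather than something the question asks for:

```python
import re

# Same idea as the pattern above: capture the term that sits just
# before the final comma of the string.
BEFORE_LAST_COMMA = re.compile(r'([^,]+),[^,]+$')


def term_before_last(text):
    """Return the term before the last comma, stripped, or None if no match."""
    match = BEFORE_LAST_COMMA.search(text)
    return match.group(1).strip() if match else None


print(term_before_last("2ème vin, Pessac-Léognan, Blanc"))
```

Checking the match object before calling group() avoids an AttributeError on strings that do not fit the pattern.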
Upvotes: 1
Reputation: 474003
You can use a positive lookahead to check whether Rouge, Blanc, or Rosé follows the word we are looking for:
>>> import re
>>> l = ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print(re.search(r'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0))
...
Médoc
Margaux
Pessac-Léognan
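For reuse, the lookahead could be wrapped roughly like this. The names COLOURS and appellation are made up for the example, and I am assuming the colour is always one of Rouge, Blanc, or Rosé, as in the three sample strings:

```python
import re

# Assumed list of colours, taken from the three samples in the question.
COLOURS = ("Rouge", "Blanc", "Rosé")

# A run of word characters or hyphens that is followed, after at least
# one non-word character, by one of the colours. The lookahead checks
# for the colour without including it in the match.
WORD_BEFORE_COLOUR = re.compile(r'[\w-]+(?=\W+(?:%s))' % "|".join(COLOURS))


def appellation(text):
    """Return the word just before the colour, or None if nothing matches."""
    match = WORD_BEFORE_COLOUR.search(text)
    return match.group(0) if match else None


for s in ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]:
    print(appellation(s))
```

Unlike the split approach, this does not depend on the position of the term in the list, only on the colour coming right after it.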
Upvotes: 2