newang

Reputation: 189

Python Regex - Retrieving from XML File

I'm having trouble with regular expressions. I'm trying to extract the name of each item from an XML file: https://www.crimezappers.com/rss/catalog/category/cid/97/store_id/1/. I have found a pattern that works, but it is very clunky. Is there a way to make the expression shorter?

This is what I currently have (long way):

<item>\n<title>\n<!\[CDATA\[ ([A-Za-z].[^\]]+)|<item>\n<title>\n<!\[CDATA\[\n([A-Za-z].[^\]]+)

This is my attempt at shortening it:

<item>\n<title>\n<!\[CDATA\[|(?\n)| |([A-Za-z].[^\]]+)

[Image: the text that should be matched; the part underlined in blue should also be matched]

Upvotes: 0

Views: 1299

Answers (1)

falsetru

Reputation: 369064

Using regular expressions to parse XML is not a good idea.

Use an XML processing library such as lxml instead:

>>> import requests
>>> import lxml.etree
>>> 
>>> r = requests.get('https://www.crimezappers.com/rss/...')
>>> root = lxml.etree.fromstring(r.content)
>>> root.xpath('//item/title/text()')
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]
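
If lxml is not available, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under stated assumptions: SAMPLE_RSS is a made-up snippet that imitates the layout described in the question (it is not the real feed), and item_titles is a hypothetical helper name. The parser resolves the CDATA sections itself, so only the surrounding whitespace needs to be stripped.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the feed's layout; not the real feed content.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title><![CDATA[Catalog]]></title>
<item>
<title>
<![CDATA[ Electrical Box HD Hidden Camera with Built in DVR ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR
]]>
</title>
</item>
</channel>
</rss>"""


def item_titles(xml_text):
    """Return the stripped text of every <title> that sits directly under an <item>."""
    root = ET.fromstring(xml_text)
    # ".//item/title" skips the channel-level <title>, like //item/title in XPath.
    return [(title.text or "").strip() for title in root.iterfind(".//item/title")]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

Both CDATA layouts from the question (a space or a newline after the opening marker) come out the same, because whitespace is stripped after parsing rather than matched in a pattern.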

UPDATE: Using a regular expression.

You can use \s to match any whitespace character (including the newline character \n):

>>> import re
>>> re.findall(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>', r.text)
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]
  • Replaced [A-Za-z].[^\]]+ with (.*?)\s*\]\]> to match everything between <![CDATA[ and ]]>. The ? after .* makes the match non-greedy, so it stops at the first ]]>.
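
To try the pattern without fetching the feed, here is a small offline sketch. SAMPLE is an invented string that covers both layouts from the question (a space after the CDATA marker in one item, a newline in the other), and TITLE_RE is just a name chosen for this example.

```python
import re

# Hypothetical sample covering both layouts from the question; not the real feed.
SAMPLE = (
    "<item>\n<title>\n<![CDATA[ First Camera ]]>\n</title>\n</item>\n"
    "<item>\n<title>\n<![CDATA[\nSecond Camera\n]]>\n</title>\n</item>\n"
)

# \s* absorbs spaces and newlines alike, so one pattern handles both layouts.
# (.*?) is non-greedy and stops before the optional whitespace and first ]]>.
TITLE_RE = re.compile(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>')

found = TITLE_RE.findall(SAMPLE)
print(found)
```

Note that . does not match a newline by default, so a title that itself spans several lines would need the re.DOTALL flag; the titles shown above are single-line, so the plain pattern is enough here.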

Upvotes: 2
