Reputation: 189
I'm having trouble with a regular expression. I'm trying to extract the name of each item from an XML file: https://www.crimezappers.com/rss/catalog/category/cid/97/store_id/1/. I have found a pattern that works, but it is very clunky. Is there a way to make the expression shorter?
This is what I currently have (long way):
<item>\n<title>\n<!\[CDATA\[ ([A-Za-z].[^\]]+)|<item>\n<title>\n<!\[CDATA\[\n([A-Za-z].[^\]]+)
This is my attempt at shortening it:
<item>\n<title>\n<!\[CDATA\[|(?\n)| |([A-Za-z].[^\]]+)
Upvotes: 0
Views: 1299
Reputation: 369064
Using regular expressions to parse XML is not a good idea.
Use an XML processing library such as lxml:
>>> import requests
>>> import lxml.etree
>>>
>>> r = requests.get('https://www.crimezappers.com/rss/...')
>>> root = lxml.etree.fromstring(r.content)
>>> root.xpath('//item/title/text()')
['Electrical Box HD Hidden Camera with Built in DVR',
'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
...]
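If installing lxml is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree instead (a swapped-in parser, not the one used above). The inline SAMPLE string is a made-up stand-in for the feed that only mimics the <item>/<title>/CDATA layout from the question, so the snippet runs without network access:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates the feed's layout; the real feed has more fields.
SAMPLE = """<rss><channel>
<item>
<title>
<![CDATA[ Electrical Box HD Hidden Camera with Built in DVR ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR
]]>
</title>
</item>
</channel></rss>"""

root = ET.fromstring(SAMPLE)

# The parser unwraps CDATA itself, so only the surrounding whitespace needs stripping.
titles = [item.findtext("title").strip() for item in root.iter("item")]
print(titles)
```

The parser handles both CDATA layouts (title on the same line or on the next line) without any special casing, which is the main reason to prefer it over a pattern.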
UPDATE: Using a regular expression.
You can use \s to match any whitespace character (including the newline character \n):
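>>> import re
>>> # r.text is the decoded str; on Python 3 a str pattern cannot search the bytes in r.content
>>> re.findall(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>', r.text)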
['Electrical Box HD Hidden Camera with Built in DVR',
'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
...]
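To try the pattern without fetching the feed, here is a minimal offline sketch. FEED_SNIPPET is a hypothetical sample that only imitates the two CDATA layouts from the question (title on the same line, and title on the following line); TITLE_RE is just a name chosen for this illustration:

```python
import re

# Hypothetical sample covering both layouts the original two-branch pattern handled.
FEED_SNIPPET = """<item>
<title>
<![CDATA[ Electrical Box HD Hidden Camera with Built in DVR ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR
]]>
</title>
</item>"""

# \s* absorbs spaces and newlines alike, so one pattern replaces both alternatives.
# (.*?) is lazy: it stops at the first point where \s*\]\]> can match.
TITLE_RE = re.compile(r"<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>")

found = TITLE_RE.findall(FEED_SNIPPET)
print(found)
```

Because . does not match a newline by default, this assumes each title sits on a single line; a title that itself spans several lines would need the re.DOTALL flag.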
I replaced [A-Za-z].[^\]]+ with (.*?)\]\]> to match everything between <![CDATA[ and ]]>, non-greedily (?).
Upvotes: 2