Python Regex - Retrieving from XML File

Question

I'm having trouble with regular expressions. I'm trying to extract the name of each item from an XML file: https://www.crimezappers.com/rss/catalog/category/cid/97/store_id/1/. I have found a method that works, but it is very clunky. Is there a way to make the expression shorter?

This is what I currently have (long way):

<item>\n<title>\n<!\[CDATA\[ ([A-Za-z].[^\]]+)|<item>\n<title>\n<!\[CDATA\[\n([A-Za-z].[^\]]+)

This is my attempt at doing it:

<item>\n<title>\n<!\[CDATA\[|(?\n)| |([A-Za-z].[^\]]+)
Image of what should be found (the blue underline marks text that should also be found): https://i.sstatic.net/xraFZ.jpg

falsetru · Accepted Answer

Using regular expressions to parse XML is not a good idea.

Use an XML processing library like lxml:

>>> import requests
>>> import lxml.etree
>>> 
>>> r = requests.get('https://www.crimezappers.com/rss/...')
>>> root = lxml.etree.fromstring(r.content)
>>> root.xpath('//item/title/text()')
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]
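
If installing lxml is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not the lxml call shown above, and the SAMPLE_RSS string and the item_titles helper are hypothetical stand-ins for the real feed, assumed here to use the same item/title/CDATA layout as in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the layout described in the question
# (this is NOT the real feed content).
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title><![CDATA[Sample Category]]></title>
<item>
<title>
<![CDATA[ First Sample Item ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Second Sample Item
]]>
</title>
</item>
</channel>
</rss>"""


def item_titles(xml_text):
    """Return the whitespace-stripped title of every <item> element."""
    root = ET.fromstring(xml_text)
    # The parser unwraps CDATA sections into ordinary text, so no
    # CDATA-specific handling is needed; strip() removes the newlines
    # and spaces that surround each title.
    return [
        item.findtext("title", default="").strip()
        for item in root.iter("item")
    ]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

Only titles nested inside item elements are returned, so the channel-level title is skipped, just as with the //item/title XPath.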

UPDATE: Using a regular expression.

You can use \s to match any whitespace character (including the newline character \n):

>>> import re
>>> re.findall(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>', r.text)
['Electrical Box HD Hidden Camera with Built in DVR',
 'Mini Clip On Smiley Face Button Spy Hidden Camera with Built in DVR',
 ...]

Replaced [A-Za-z].[^\]]+ with (.*?)\]\]> to match everything between <![CDATA[ and ]]>, non-greedily (the ? after .* makes the match stop at the first ]]> instead of the last).
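
To see the effect of \s* and the non-greedy (.*?) without fetching the feed, here is a minimal sketch that runs the pattern on a made-up string. SAMPLE, TITLE_RE and the other names are hypothetical, and the sample only assumes the item/title/CDATA layout described in the question:

```python
import re

# Hypothetical sample that mimics the layout described in the question
# (this is NOT the real feed content).
SAMPLE = """<channel>
<title><![CDATA[Sample Category]]></title>
<item>
<title>
<![CDATA[ First Sample Item ]]>
</title>
</item>
<item>
<title>
<![CDATA[
Second Sample Item
]]>
</title>
</item>
</channel>"""

# \s* absorbs the spaces and newlines between the tags and around the
# title; (.*?) captures as little as possible before the closing ]]>.
TITLE_RE = re.compile(r'<item>\s*<title>\s*<!\[CDATA\[\s*(.*?)\s*\]\]>')
regex_titles = TITLE_RE.findall(SAMPLE)
print(regex_titles)

# Greedy versus non-greedy on a single line with two CDATA sections.
line = "<![CDATA[a]]><![CDATA[b]]>"
greedy = re.findall(r'<!\[CDATA\[(.*)\]\]>', line)   # runs to the LAST ]]>
lazy = re.findall(r'<!\[CDATA\[(.*?)\]\]>', line)    # stops at the FIRST ]]>
print(greedy)
print(lazy)
```

Because . does not match a newline by default, this pattern only captures titles that sit on a single line; a title spanning several lines would need the re.DOTALL flag, which is one more reason to prefer an XML parser.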
