user514310


Extract artist and music from text (regex)

I have written the following regex, but it's not working. Can you please help me? Thank you :-)

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)

The output should be:

Artist(s) David
Music: Ramana Gogula

Upvotes: 1

Views: 215

Answers (3)

Acorn

Reputation: 50517

import lxml.html as lh
import re

track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''

tree = lh.fromstring(track_desc)

# Strip the tags first, then match against the remaining plain text.
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))

Upvotes: 1

Bruce

Reputation: 7132

I see a few errors:

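  • the regex does not account for the line breaks: the text spans several lines, so the pattern has to match the newlines explicitly (for example with \s). Note that re.MULTILINE only changes how ^ and $ behave, so it is harmless here but not what makes the match work
  • spaces are not taken into account
  • Artist(s) is not immediately followed by : in the text (the colon comes after the name)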
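
To illustrate the points above, here is a small sketch. The `sample` string is a shortened stand-in for the question's `track_desc` (an assumption made for brevity). It suggests that `\s` already crosses line breaks without any flag, that `re.DOTALL` is the flag that lets `.` match a newline, and that `re.MULTILINE` only affects where `^` and `$` can match.

```python
import re

# Shortened stand-in for the question's track_desc (illustrative only).
sample = "<p> Artist(s) David: <br/>\n  Music: Ramana Gogula<br/>\n</p>"

# A single literal space does not match the newline plus indentation after <br/>.
strict = re.search(r"<br/> Music:", sample)

# \s+ matches any run of whitespace, including newlines; no flag is required.
loose = re.search(r"<br/>\s+Music:", sample)

# '.' stops at a newline unless re.DOTALL is given.
dot_plain = re.search(r"David.*Music", sample)
dot_all = re.search(r"David.*Music", sample, flags=re.DOTALL)

# re.MULTILINE only lets ^ and $ match at each line boundary.
anchored_plain = re.search(r"^\s*Music", sample)
anchored_multi = re.search(r"^\s*Music", sample, flags=re.MULTILINE)

print(strict, bool(loose), dot_plain, bool(dot_all), anchored_plain, bool(anchored_multi))
```

In this sketch the three "plain" searches find nothing, while their counterparts do, which is why handling whitespace in the pattern matters more than the choice of flag.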

As the web page is rather strangely formatted, relying on a regex is error-prone, and I wouldn't advise using it extensively.

That said, the following seems to work:

# Raw string avoids invalid-escape warnings; the colon is kept out of the artist group.
rx = r'Artist(?:\(s\))?\s+(.*?)\s*:\s*<br/>\s+Music:\s*(.*?)<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc, flags=re.MULTILINE).groups())

Upvotes: 0

Regexident

Reputation: 29552

You were ignoring the whitespace:

<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>

Output is:

[1] => "David"
[2] => "Ramana Gogula"

(note that the capture groups in your regex don't include the Artist(s) and Music: prefixes either)
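
For reference, a minimal Python sketch of how a whitespace-tolerant pattern like the one above might be applied to the question's input. It is only an illustration: the names `pattern`, `artist` and `music` are my own, and `[\s\n\r]*` is shortened to `\s*` because `\s` already covers `\n` and `\r`. The prefixes are added back when printing so the result has the shape the question asks for.

```python
import re

# The question's input, reproduced so the snippet is self-contained.
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''

# Every gap between tokens is allowed to contain arbitrary whitespace.
pattern = re.compile(
    r"<p>\s*Artist\(s\)\s*(.*?)\s*:\s*<br/>\s*Music:\s*(.*?)<br/>\s*</p>"
)

match = pattern.search(track_desc)
if match:
    artist, music = match.groups()
    # Re-attach the prefixes, since the groups capture only the values.
    print("Artist(s) " + artist)
    print("Music: " + music)
```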


However, for production code I would not rely on such a clumsy regex (or on an equally clumsily formatted HTML source).

Seriously though, ditch the idea of using regex for this if you aren't at all familiar with regex (which is what it looks like). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really is no alternative source).
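
As one possible alternative to a regex, here is a rough sketch that uses only Python's standard-library `html.parser`. The `TextCollector` class and the `fields` list are hypothetical helpers introduced for illustration, and the input is a trimmed copy of the question's markup. Treat it as a starting point under those assumptions rather than a robust scraper.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the non-empty text chunks between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only pieces.
        text = data.strip()
        if text:
            self.chunks.append(text)


# Trimmed copy of the question's markup (illustrative only).
html = '''<p>
</p>
<p> Artist(s) David: <br/>
  Music: Ramana Gogula<br/>
</p>'''

collector = TextCollector()
collector.feed(html)
collector.close()

# Tidy the trailing colon and spaces left on the artist line.
fields = [chunk.rstrip(": ") for chunk in collector.chunks]
print(fields)
```

With the markup shown, `fields` should hold the two lines the question asks for, and the parser takes care of tags and line breaks so the code never has to describe them in a pattern.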

Upvotes: 1
