Time1

Reputation: 13

Unwanted characters in regular expressions python

So, I have a site that has an XML string, and I'd like my program to return a list of strings that appear between two strings. Here's my code:

 import re
 import requests

 response = requests.get(url)
 artists = re.findall(re.escape('<name>') + '(.*?)' + re.escape('</name>'), str(response.content))
 print(artists)

This returns a list of strings. The problem is, some strings have unwanted characters in them. For example, one of the strings in the list is "Somethin\\' \\'Bout A Truck" and I'd like it to be 'Somethin' 'Bout A Truck'.

Thanks in advance.

Upvotes: 1

Views: 81

Answers (2)

Alex Martelli

Reputation: 881595

Those escapes (single backslashes, each displayed as \\) may be "unwanted" from your viewpoint, but they're no doubt "present" in the response you received. So if characters are present but unwanted, you can remove them, e.g. by using, in lieu of str(response.content),

str(response.content).replace('\\', '')

if what you actually want to do is remove all such escapes (if you want to do something different than that you'd better explain what it is:-).
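As a rough illustration of that replacement combined with the regular expression from the question, here is a minimal sketch. The `sample` string and the `extract_names` helper are made up for this example and merely stand in for `str(response.content)`; the real payload may look different.

```python
import re

# Hypothetical sample standing in for str(response.content).
# In Python source, "\\'" is one backslash followed by a single quote.
sample = "<name>Somethin\\' \\'Bout A Truck</name><name>Plain Title</name>"


def extract_names(xml_text):
    """Return the text between <name> and </name>, with all backslashes removed."""
    # Remove every backslash first, as suggested above.
    cleaned = xml_text.replace('\\', '')
    pattern = re.escape('<name>') + '(.*?)' + re.escape('</name>')
    return re.findall(pattern, cleaned)


artists = extract_names(sample)
print(artists)
```

Note that this strips every backslash in the text, including any that legitimately belong to a name, so it only fits if removing all of them is really what you want.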

BeautifulSoup4, as recommended in the accepted answer, though a nice package indeed, does not wantonly remove characters present in the input -- it can't read your mind, so it can't know what's "unwanted" to you. E.g.:

>>> import bs4
>>> s = '<name>Somethin\\\' \\\'Bout A Truck</name>'
>>> soup = bs4.BeautifulSoup(s, 'html.parser')
>>> print(soup)
<name>Somethin\' \'Bout A Truck</name>
>>> 

As you see, the escapes (backslashes) are still there before the single-quotes.
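Whether the backslashes are really part of the payload can be checked directly. The following is a minimal sketch, assuming Python 3 (where `response.content` would be a bytes object) and a made-up payload named `raw`: calling `str()` on bytes returns its repr, which may add escapes of its own, while decoding yields the characters themselves.

```python
# Hypothetical payload: bytes that contain both kinds of quotes and no backslashes.
raw = b'<?xml version="1.0"?><name>Somethin\' \'Bout A Truck</name>'

# str() on a bytes object gives its repr (it starts with b'), not the decoded text.
as_repr = str(raw)

# Decoding gives the actual characters; UTF-8 is an assumption here.
as_text = raw.decode('utf-8')

print('\\' in as_repr)
print('\\' in as_text)
```

If the decoded text has no backslashes, they came from the repr and not from the server. With requests, `response.text` is the already-decoded body, so it may be a better input for the regular expression than `str(response.content)`.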

Upvotes: 1

P_O_I_S_O_N

Reputation: 357

I think Beautiful Soup (bs4) will solve this problem, and it also supports newer versions of Python such as 3.4.

Upvotes: 1
