Reputation: 353
I have some user reviews that were previously scraped from a website, and I am trying to clean up the text to do some text analysis. There are several a href tags in the text that I would like to remove. For example, here is a portion of text contained in a paragraph:
'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow">restaurants.com</a> $25 gift certificate, so we visited this restaurant.'
I would like to remove this portion from the string:
<a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow">restaurants.com</a>
I am not an expert on regex, so the best I could do so far is:
import re
mytext = re.sub(r'<a href\S+', '', mytext)
But this removes only part of what I want to get rid of, as shown below:
print(mytext)
'We had a target="_blank" rel="nofollow">restaurants.com</a> $25 gift certificate, so we visited this restaurant.'
I searched a lot for a solution but could only find one for JavaScript, plus several posts that warn against using regex for parsing HTML, which I guess does not apply to my case as I am processing a string. I guess if I read more about regex I could get this done, but I am looking for a quick solution. I would really appreciate any help.
Upvotes: 0
Views: 760
Reputation: 590
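import re
# findall returns the captured groups; joining them rebuilds the matched link
link = ''.join(re.findall('(<a href)(.+?)(/a>)', st)[0])
st = st.replace(link, '')
That will work for your example. If you have multiple href links, you could use:
for entry in re.findall('(<a href)(.+?)(/a>)', st):
    st = st.replace(''.join(entry), '')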
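If the goal is to delete the links rather than collect them, `re.sub` with the same non-greedy idea may be simpler. This is only a minimal sketch: it assumes the anchors are not nested and always start with `<a href`, and `remove_links` is a hypothetical helper name, not something from the question.

```python
import re

# Non-greedy match from '<a href' up to the nearest closing '</a>'.
# DOTALL lets the match span line breaks inside a tag (an assumption
# about the data; drop the flag if every tag sits on one line).
ANCHOR_RE = re.compile(r'<a href.+?</a>', flags=re.DOTALL)

def remove_links(text):
    """Return text with every <a href ...>...</a> element deleted."""
    return ANCHOR_RE.sub('', text)

st = 'We had a <a href="/redir?url=x" rel="nofollow">restaurants.com</a> $25 gift certificate.'
print(remove_links(st))
```

Note that the whitespace on both sides of the removed link is kept, so a double space is left behind. Something like `' '.join(cleaned.split())` could normalize that if it matters for the analysis.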
Upvotes: 1
Reputation: 1242
As you are looking for a quick solution, just keep it basic and use string manipulation.
input_string = 'We had a <a href="/redir?url=http%3A%2F%2Frestaurants.com&amp;s=8b83bf0ff8b716aae84527dc95577a310f201b166dcca25c8ca3824b15703869" target="_blank" rel="nofollow">restaurants.com</a> $25 gift certificate, so we visited this restaurant.'
input_string = input_string.split('<a href')
first_part = input_string[0]  # text before the link
input_string = input_string[-1].split('</a>')
second_part = input_string[-1]  # text after the link
new_string = first_part + second_part
print(new_string) # We had a  $25 gift certificate, so we visited this restaurant. (a double space remains where the link was)
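The snippet above keeps only the text before the first link and after the last one, so with several links in one review the text between them would be lost. A rough sketch of the same string-only idea that walks through every link is below. `strip_anchor_tags` and its parameters are hypothetical names, and the choice to leave an unclosed tag untouched is an assumption.

```python
def strip_anchor_tags(text, start_marker='<a href', end_marker='</a>'):
    """Remove every start_marker...end_marker span using only str methods."""
    parts = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            # No more links: keep the remaining text and stop.
            parts.append(text[pos:])
            break
        end = text.find(end_marker, start)
        if end == -1:
            # Unclosed tag: keep the rest unchanged rather than guess.
            parts.append(text[pos:])
            break
        # Keep the text before the link, then skip past its closing tag.
        parts.append(text[pos:start])
        pos = end + len(end_marker)
    return ''.join(parts)

review = 'Try <a href="/a">this</a> or <a href="/b">that</a> place.'
print(strip_anchor_tags(review))
```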
Upvotes: 0