Reputation: 415
The Html I'm scraping from:
<tr>
<td align="left" bgcolor="#ffff99">
<font size="2">
<a href="some/link.htm">
<b>SomeStuff</b>
</a>
</font>
</td>
</tr>
<tr>
<td align="left" bgcolor="#ffff99">
<font size="2">
<a href="some/link2.htm">
<b>SomeMoreStuff</b>
</a>
</font>
</td>
</tr>
How I'm scraping the information:
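import re  # needed for re.compile below; soup is an existing BeautifulSoup object

my_list = []
for i in soup.find_all('a', href=re.compile('some/link')):
    my_list.append(str(i.find('b')))  # the tag rendered as a string, markup included
    my_list.append(i['href'])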
I need to remove the HTML tags from elements in a list.
However, the loops I write don't save any changes back to the list. My list looks something like this:
my_list = ['<br>SomeStuff</br>','some/link.htm',
'<br>SomeMoreStuf</br>', 'some/link2.htm',
'<br>EvenMoreStuff</br>', 'some/link3.htm']
I've tried this:
for i in my_list:
    i = i.replace('<br>','')
    i = i.replace('</br>','')
And I've tried this:
for i in my_list:
    if '<br>' in i:
        i = i.replace('<br>','')
    if '</br>' in i:
        i = i.replace('</br>','')
None of this is making any change in the original list. I can print out the corrections I want by not storing the changes in anything:
for i in my_list:
    i.replace('<br>','')
However, I need the change to be saved in the list.
Upvotes: 1
Views: 9595
Reputation: 415
So I ended up solving the problem by writing the two elements into an Excel file and then using 'find and replace' in Excel!
Upvotes: 0
Reputation: 444
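If every string has tags only at its beginning and end, you can slice the string to remove them. Try the code below:
# Caution: this appends and removes while iterating over my_list,
# so the cleaned strings end up at the end and the original order is lost.
for lst in my_list:
    if '<br>' in lst:
        my_list.append(lst[4:-5])
        my_list.remove(lst)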
Edit:
There is a more Pythonic way to do it, taken from @Vallentin's answer, which assigns the result back by index:
for i, lst in enumerate(my_list):
    if '<br>' in lst:
        my_list[i] = lst[4:-5]
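A slightly more defensive sketch of the same slicing idea, in case some strings only contain '<br>' somewhere in the middle. The names OPEN and CLOSE are just illustrative constants, not anything from the question:

```python
my_list = ['<br>SomeStuff</br>', 'some/link.htm',
           '<br>SomeMoreStuf</br>', 'some/link2.htm']

# Hypothetical helper constants holding the wrapper tags to strip.
OPEN, CLOSE = '<br>', '</br>'

for i, item in enumerate(my_list):
    # Slice only when the string really starts and ends with the tags,
    # so the href entries are left untouched.
    if item.startswith(OPEN) and item.endswith(CLOSE):
        my_list[i] = item[len(OPEN):-len(CLOSE)]

print(my_list)
```

Using len(OPEN) and len(CLOSE) instead of the literal 4 and 5 keeps the slice correct if the tags change.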
Edit:
Actually, you don't need to convert the result to a string in the first place. Instead of this code:
str(i.find('b'))
please try either
i.get_text()
or
i.b.get_text()
I think one of them should give you the text content directly, so you won't need to remove any tags afterwards.
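Here is a minimal sketch of that idea, assuming the bs4 package is installed and using a cut-down copy of the HTML from the question (the html string and the html.parser choice are my assumptions):

```python
import re
from bs4 import BeautifulSoup

# Simplified stand-in for the table rows shown in the question.
html = """
<table>
<tr><td><font size="2"><a href="some/link.htm"><b>SomeStuff</b></a></font></td></tr>
<tr><td><font size="2"><a href="some/link2.htm"><b>SomeMoreStuff</b></a></font></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

my_list = []
for a in soup.find_all("a", href=re.compile("some/link")):
    # get_text() returns only the text inside the tag; strip=True also
    # drops surrounding whitespace such as newlines in the markup.
    my_list.append(a.get_text(strip=True))
    my_list.append(a["href"])

print(my_list)
```

This way the list never contains markup, so no cleanup loop is needed.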
Hope it helps.
Upvotes: 0
Reputation: 26207
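All of the solutions work; you're just forgetting to update the list. Strings are immutable, so replace() returns a new string, and assigning it to the loop variable only rebinds that name. The element stored in the list stays as it was unless you write the result back by index: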
for i, element in enumerate(my_list):
    element = element.replace('<br>','')
    element = element.replace('</br>','')
    my_list[i] = element
Printing my_list now outputs:
['SomeStuff', 'some/link.htm', 'SomeMoreStuf', 'some/link2.htm', 'EvenMoreStuff', 'some/link3.htm']
You can also use a list comprehension, which will yield the same result:
my_list = [i.replace('<br>', '').replace('</br>', '') for i in my_list]
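If it helps to see the difference side by side, here is a small sketch comparing the loop from the question with the indexed version. The variable unchanged is only there for illustration:

```python
my_list = ['<br>SomeStuff</br>', 'some/link.htm']

# Rebinding the loop variable: replace() builds a new string and the name
# item points at it, but the list still holds the old strings.
for item in my_list:
    item = item.replace('<br>', '').replace('</br>', '')
unchanged = list(my_list)  # snapshot taken after the first loop

# Assigning by index stores the new string back into the list.
for i, item in enumerate(my_list):
    my_list[i] = item.replace('<br>', '').replace('</br>', '')

print(unchanged)  # still contains the tags
print(my_list)    # tags removed
```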
Upvotes: 2