Chace Mcguyer

Reputation: 415

How can I remove <br> and </br> tags from items in a list with Python?

The HTML I'm scraping from:

<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="some/link.htm">
                <b>SomeStuff</b>
            </a>
        </font>
    </td>
</tr>
<tr>
    <td align="left" bgcolor="#ffff99">
        <font size="2">
            <a href="some/link2.htm">
                <b>SomeMoreStuff</b>
            </a>
        </font>
    </td>
</tr>

How I'm scraping the information:

my_list = []
for i in soup.find_all('a',href=re.compile('some/link')):
    my_list.append(str(i.find('b')))
    my_list.append(i['href'])

I need to remove the HTML tags from the elements of a list.
However, when I loop over the list, none of the changes are saved in it. My list looks something like this:

my_list = ['<br>SomeStuff</br>','some/link.htm',
           '<br>SomeMoreStuf</br>', 'some/link2.htm',
           '<br>EvenMoreStuff</br>', 'some/link3.htm']

I've tried this:

for i in my_list:
    i = i.replace('<br>','')
    i = i.replace('</br>','')

And I've tried this:

for i in my_list:
    if '<br>' in i:
        i = i.replace('<br>','')
    if '</br>' in i:
        i = i.replace('</br>','')

Neither of these makes any change to the original list. I can display the corrected strings if I don't store the result in anything:

for i in my_list:
    i.replace('<br>','')

However, I need the change to be saved in the list.

Upvotes: 1

Views: 9595

Answers (3)

Chace Mcguyer

Reputation: 415

So I ended up solving the problem by writing the two elements into an Excel file and then using 'find and replace' in Excel!

Upvotes: 0

Wenlong Liu

Reputation: 444

If every string has the tags only at its beginning and end, you can slice the string to remove them. Try the code below:

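# Iterate over a copy so that removing items does not disturb the loop.
# Note: stripped items are re-appended, so they move to the end of the list.
for lst in list(my_list):
    if '<br>' in lst:
        my_list.append(lst[4:-5])
        my_list.remove(lst)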

Edit:

There is a more Pythonic way to do it, taken from @vallentin's answer, which also keeps the items in their original order:

for i, lst in enumerate(my_list):
    if '<br>' in lst:
        my_list[i] = lst[4:-5]
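
The slice assumes every tagged string starts with exactly <br> and ends with exactly </br>. A slightly more defensive sketch, assuming that is the only shape you need to handle, checks both ends before slicing. The helper name strip_br is made up for illustration:

```python
OPEN_TAG = '<br>'
CLOSE_TAG = '</br>'

def strip_br(text):
    """Return text without a leading <br> and trailing </br>, if both are present."""
    if text.startswith(OPEN_TAG) and text.endswith(CLOSE_TAG):
        return text[len(OPEN_TAG):-len(CLOSE_TAG)]
    return text  # leave anything else (e.g. the links) untouched

my_list = ['<br>SomeStuff</br>', 'some/link.htm',
           '<br>SomeMoreStuf</br>', 'some/link2.htm']

# Write each result back by index so the list itself is updated.
for i, item in enumerate(my_list):
    my_list[i] = strip_br(item)

print(my_list)
```

Strings that carry only one of the two tags are left as they are, which is probably safer than cutting off characters blindly.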

Edit:

Actually, you don't need to convert the result to a string in the first place. Instead of this code:

str(i.find('b'))

please try either

i.get_text()

or

i.b.get_text()

I think either of them should give you the text content directly, so you won't need to remove the tags afterwards.
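
As a rough sketch of how that fits into the scraping loop from the question: the HTML string below is a cut-down stand-in for the real page, and it assumes Beautiful Soup 4 is installed. Passing strip=True to get_text() drops the whitespace around the text.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Cut-down stand-in for the page in the question (an assumption, not the real page).
html = """
<table>
<tr><td><font size="2"><a href="some/link.htm"><b>SomeStuff</b></a></font></td></tr>
<tr><td><font size="2"><a href="some/link2.htm"><b>SomeMoreStuff</b></a></font></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

my_list = []
for a in soup.find_all('a', href=re.compile('some/link')):
    # get_text(strip=True) returns the text only, with surrounding whitespace removed
    my_list.append(a.get_text(strip=True))
    my_list.append(a['href'])

print(my_list)
```

With this the list never contains any tags, so there is nothing to clean up later.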

Hope it helps.

Upvotes: 0

vallentin

Reputation: 26207

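All of your attempts do the replacement correctly; you're just forgetting to update the list. replace() returns a new string, and assigning it to the loop variable doesn't change what the list holds, so you have to write it back by index: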

for i, element in enumerate(my_list):
    element = element.replace('<br>','')
    element = element.replace('</br>','')
    my_list[i] = element

Now printing my_list outputs:

['SomeStuff', 'some/link.htm', 'SomeMoreStuf', 'some/link2.htm', 'EvenMoreStuff', 'some/link3.htm']
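
To see why the original loops had no effect, here is a small sketch comparing the two approaches side by side. The names items and unchanged are just for illustration:

```python
items = ['<br>a</br>', 'b']

# Rebinding the loop variable: the list still holds the old strings.
for s in items:
    s = s.replace('<br>', '').replace('</br>', '')
unchanged = list(items)  # snapshot taken after the first loop

# Assigning by index: the list slot now refers to the new string.
for i, s in enumerate(items):
    items[i] = s.replace('<br>', '').replace('</br>', '')

print(unchanged)  # still contains the tags
print(items)      # tags removed
```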

You can also use a list comprehension, which will yield the same result:

my_list = [i.replace('<br>', '').replace('</br>', '') for i in my_list]
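
If you'd rather do it in a single pass, a regular expression is another option. This is a different technique from the replace() calls above, and the pattern assumes the tags are written exactly as <br> or </br>. Assigning to my_list[:] updates the existing list object instead of creating a new one, which matters if other names refer to the same list. The names BR_TAG and alias are only for illustration:

```python
import re

BR_TAG = re.compile(r'</?br>')  # matches <br> and </br> only

my_list = ['<br>SomeStuff</br>', 'some/link.htm']
alias = my_list  # second name for the same list object

# Slice assignment replaces the contents in place.
my_list[:] = [BR_TAG.sub('', item) for item in my_list]

print(my_list)
print(alias is my_list)
```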

Upvotes: 2
