retr0327

Reputation: 187

How to extract the first "src" attribute from an HTML tag

Let's say I have the HTML tag below:

target = '<tr src="./sound/6/4-1-1.mp3"><td class="code">(4-1)a.</td><td class="sound"><audio controls=""><source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio></td><td class="text"><p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p><p class="en">Subali is going to hit the plum.</p></td></tr>'

My ideal output:

<tr src="./sound/6/4-1-1.mp3">

I tried the following code:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(target, 'lxml')
soup.find(src=re.compile(r'\.\w'))

However, my output is:

<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/>

How can I get the ideal output as mentioned above?

Thanks for any help!!

Upvotes: 1

Views: 219

Answers (1)

I'mahdi

Reputation: 24049

You can first find the tr element, then apply a regex to its string form to extract only the opening tag, as shown below.

Try this:

from bs4 import BeautifulSoup
import re

html="""
<tr src="./sound/6/4-1-1.mp3">
    <td class="code">(4-1)a.</td>
    <td class="sound"><audio controls="">
        <source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio>
    </td>
    <td class="text">
        <p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p>
        <p class="en">Subali is going to hit the plum.</p>
    </td>
</tr>
"""
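soup = BeautifulSoup(html, "lxml")
# [^>]* stops at the first '>', so only the opening tag matches
# even when the whole HTML sits on a single line (a greedy '.*' would not stop there)
re.search(r'<tr[^>]*>', str(soup.find("tr"))).group()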

Output:

'<tr src="./sound/6/4-1-1.mp3">'
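
If you would rather not run a regex over the serialized element, one possible alternative is the standard library's html.parser, which can report the raw text of each start tag as it is encountered. The sketch below is only an illustration, not part of the original answer: the class name FirstSrcTag and the helper first_src_tag are made up for this example, and it assumes the markup is available as a plain string.

```python
from html.parser import HTMLParser


class FirstSrcTag(HTMLParser):
    """Hypothetical helper: remember the raw text of the first start tag with a src attribute."""

    def __init__(self):
        super().__init__()
        self.first = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if self.first is None and any(name == "src" for name, _ in attrs):
            # get_starttag_text() returns the source text of the most recently opened start tag
            self.first = self.get_starttag_text()


def first_src_tag(html):
    """Return the first opening tag that carries a src attribute, or None if there is none."""
    parser = FirstSrcTag()
    parser.feed(html)
    parser.close()
    return parser.first


# Shortened version of the markup from the question, kept on a single line
target = '<tr src="./sound/6/4-1-1.mp3"><td class="code">(4-1)a.</td><td class="sound"><audio controls=""><source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio></td></tr>'
print(first_src_tag(target))
```

Because the parser walks the tags in document order, the outer tr is seen before the nested source tag, so the tr opening tag is the one that gets stored.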

Upvotes: 1
