Reputation: 287

Python Beautiful Soup find string and extract following string

I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:

<tr class="odd-row">
        <td>xyz</td>
        <td class="numeric">5,00%</td>      
    </tr>
<tr class="even-row">
        <td>abc</td>
        <td class="numeric">50,00%</td>
    </tr>
<tr class="odd-row">
        <td>ghf</td>
        <td class="numeric">2,50%</td>

My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).

At the moment I am doing the following:

for c in soup.find_all("a", string=re.compile('abc')):
    abc=c.string

But of course it returns the string "abc" and not the number in the tag afterwards. So basically my question is how to address the string after class="numeric" conditional on the string beforehand.

Thanks for your help!!!

Upvotes: 2

Answers (3)

Padraic Cunningham

Reputation: 180550

Once you find the correct td, which I presume is what you meant to have in place of a, get the next sibling with the class you want:

h = """<tr class="odd-row">
        <td>xyz</td>
        <td class="numeric">5,00%</td>
    </tr>
<tr class="even-row">
        <td>abc</td>
        <td class="numeric">50,00%</td>
    </tr>
<tr class="odd-row">
        <td>ghf</td>
        <td class="numeric">2,50%</td>"""


from bs4 import BeautifulSoup

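# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(h, "html.parser")

# string= is the current name of the older text= argument
for td in soup.find_all("td", string="abc"):
    print(td.find_next_sibling("td", class_="numeric"))

If the numeric td is always next you can just call find_next_sibling():

for td in soup.find_all("td", string="abc"):
    print(td.find_next_sibling())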

For your input both would give you:

<td class="numeric">50,00%</td>
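
Since the question asks for the number in a variable rather than the whole tag, here is a minimal sketch of taking the text out of the matched cell. It assumes well-formed markup wrapped in a table and the standard-library "html.parser" backend; the names label, cell and abc_value are only illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="odd-row"><td>xyz</td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
<tr class="odd-row"><td>ghf</td><td class="numeric">2,50%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

abc_value = None  # stays None if the label or its numeric cell is missing
label = soup.find("td", string="abc")
if label is not None:
    cell = label.find_next_sibling("td", class_="numeric")
    if cell is not None:
        # get_text(strip=True) returns the cell's text without surrounding whitespace
        abc_value = cell.get_text(strip=True)

print(abc_value)  # 50,00%
```

The None checks are there because find() and find_next_sibling() return None when nothing matches, which would otherwise raise an AttributeError on the next call.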

Upvotes: 4

Janosch Gräf

Reputation: 155

So as I understand your question you want to iterate over the tuples ('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?

But I don't understand how your code produces any results, since you are searching for <a> tags.

Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.

html = """
<tr class="odd-row">
    <td>xyz</td>
    <td class="numeric">5,00%</td>      
</tr>
<tr class="even-row">
    <td>abc</td>
    <td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
    <td>ghf</td>
    <td class="numeric">2,50%</td>
</tr>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
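
To illustrate the whitespace point, the sketch below first shows that the direct next_sibling of the first td is a text node, and then collects the pairs with find_next_sibling("td"), which skips text nodes and so does not depend on how the markup is indented. It assumes well-formed rows wrapped in a table; the names first_td and pairs are made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<table>
<tr class="odd-row">
    <td>xyz</td>
    <td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
    <td>abc</td>
    <td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
    <td>ghf</td>
    <td class="numeric">2,50%</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The node right after the first <td> is the whitespace between the tags
first_td = soup.tr.td
print(isinstance(first_td.next_sibling, NavigableString))  # True

pairs = []
for tr in soup.find_all("tr"):
    label = tr.find("td")
    # Passing a tag name makes find_next_sibling ignore text nodes
    value = label.find_next_sibling("td")
    pairs.append((label.get_text(strip=True), value.get_text(strip=True)))

print(pairs)  # [('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%')]
```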

Upvotes: 0

xbb

Reputation: 2163

If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:

result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    # Use the cells of the current row (the original referred to an undefined name, tds)
    result[table_columns[0].text] = table_columns[1].text
print(result)  # {'xyz': '5,00%', 'abc': '50,00%', 'ghf': '2,50%'}

You end up with a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings from the class="numeric" cells.
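
As a self-contained sketch of the same idea, the example below builds the dictionary, skips rows that do not have two cells, and looks up one label. The parse_percent helper is hypothetical and not part of Beautiful Soup; it assumes the comma is a decimal separator, which the question does not state explicitly.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="odd-row"><td>xyz</td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
<tr class="odd-row"><td>ghf</td><td class="numeric">2,50%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

result = {}
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # skip rows without both a label cell and a value cell
    result[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)


def parse_percent(text):
    """Hypothetical helper: turn '50,00%' into 50.0, assuming a decimal comma."""
    return float(text.rstrip("%").replace(",", "."))


abc = result["abc"]              # the raw string from the numeric cell
abc_number = parse_percent(abc)  # the same value as a float
print(abc, abc_number)  # 50,00% 50.0
```

A dictionary is convenient here because each label is read once and any of them can be looked up afterwards without searching the document again.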

Upvotes: 0
