James Huang

Reputation: 876

How to get the content of a tag with Beautiful Soup?

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the plain text in the first <p> element and the LaTeX stored in the <img> inside that same <p> element.

My code so far:

import bs4
import requests

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')

This works for the LaTeX equation, but part of the question comes before it as plain text. Is there a way to get that other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.

Upvotes: 2

Views: 874

Answers (2)

John Henry 5

Reputation: 169

Well, BS4 seems to be a bit buggy here, and it took me a while to get this working. I don't think a pure BS4 approach is viable given the irregular spacing in the markup, so a regex would be your best option. I checked it on the first 2 questions and they worked fine. Some AMC geometry problems include diagram images, however, so I don't think it will work for those. Let me know if this is good.

import bs4
import requests
import re

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
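# Split the prettified <p> into non-empty lines, dropping the opening <p> line and the last two lines.
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
# Capture the alt attribute, which holds the LaTeX source of each <img>.
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        # Replace the whole <img> line with its LaTeX.
        elements[n] = mo.group(1)
    # Collapse runs of spaces and strip the indentation added by prettify().
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    # Pad LaTeX fragments so they do not run into the surrounding text.
    if elements[n].startswith("$"):
        elements[n] = " "+elements[n]+" "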

print(elements)
print("".join(elements))

Upvotes: 1

MendelG

Reputation: 20098

Try using zip():

import requests
from bs4 import BeautifulSoup

URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break

Output:

What is the value of  $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
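The reason zip() works here is that iterating over a Tag yields its direct children, so the first item is the text node that precedes the image. A minimal offline sketch of that pairing follows; SAMPLE_HTML is a made-up stand-in for the page's first paragraph, not the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's first paragraph (not the real markup).
SAMPLE_HTML = (
    '<div class="mw-parser-output">'
    '<p>What is the value of <img alt="$x+1$" src="a.png">?</p>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_p = soup.select_one(".mw-parser-output p")

# Iterating a Tag yields its direct children in document order:
# a text node, the <img> tag, then the trailing text node.
children = list(first_p)
images = soup.select("p img")

# zip() stops at the shorter sequence, so only the first child is paired
# with the single image found by the selector.
pairs = [(str(child), img.get("alt")) for child, img in zip(children, images)]
print(pairs)
```

Because zip() stops at the shorter input, any text that follows the image (the "?" in this sample) is not printed.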

Edit:

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
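If a paragraph mixes several text pieces and images, another option is to walk the paragraph's nodes in order and keep either the text or the image's alt. This is only a sketch under assumptions: paragraph_to_text is a hypothetical helper name, and SAMPLE_HTML is a made-up fragment used so the example runs without a network request.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment shaped like a problem paragraph (not the real markup).
SAMPLE_HTML = r'<p>What is the value of <img alt="$\frac{1}{2}$" src="a.png">?</p>'


def paragraph_to_text(p):
    """Join text nodes and <img> alt attributes in document order."""
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString):
            # Plain text between tags.
            parts.append(str(node))
        elif isinstance(node, Tag) and node.name == "img":
            # The alt attribute is assumed to hold the LaTeX source.
            parts.append(node.get("alt", ""))
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
problem_text = paragraph_to_text(soup.find("p"))
print(problem_text)
```

On the real page you would pass soup.select_one(".mw-parser-output p") instead of the sample paragraph. Images without an alt attribute, such as geometry diagrams, would contribute an empty string.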

Upvotes: 3
