How to get the content of a tag with Beautiful Soup?

Question

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the LaTeX in the first <img> element.

My code so far:

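import bs4
import requests

res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# The LaTeX source is stored in the alt attribute of the first <img> inside a <p>
latex_equation = soup.select('p img')[0].get('alt')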

This works for getting the LaTeX equation, but there are other parts of the question before it, such as the text "What is the value of". Is there a way to get that text as well? I'm thinking of using a regex, but I'd like to know whether Beautiful Soup has a feature that can do it for me.

John Henry 5 · Accepted Answer

BS4 seemed a bit buggy to me here, and it took me a while to get this working. I don't think BS4 alone is viable given the odd spacing on the page, so a regex is probably your best option. Let me know if this works for you. I checked it on the first two questions and it worked fine. Some AMC problems, such as the geometry ones, rely on images, so I don't expect it to work for those.

import bs4
import requests
import re

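# Fetch the page and keep only the first <p>, which holds the problem text
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
# One entry per non-empty prettified line, skipping the first line and the last two
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    # Replace an <img> line with the LaTeX stored in its alt attribute
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    # Collapse runs of spaces and drop the indentation added by prettify()
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    # Pad LaTeX fragments so they don't run into the surrounding words
    if elements[n].startswith("$"):
        elements[n] = " "+elements[n]+" "

print(elements)
print("".join(elements))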
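
For comparison, here is a minimal regex-free sketch that walks the paragraph's nodes with Beautiful Soup itself, keeping text nodes as they are and substituting each <img> with its alt attribute. The SAMPLE_HTML string and the helper name paragraph_text_with_latex are made up for illustration; the sample only approximates the wiki markup, so the real page may need adjustments.

```python
import bs4

# Hypothetical markup standing in for the wiki page's first <p>;
# the real page may be structured differently.
SAMPLE_HTML = (
    '<p>What is the value of '
    '<img src="a.png" class="latex" alt="$x+1$" /> '
    'when <img src="b.png" class="latex" alt="$x=2$" />?</p>'
)


def paragraph_text_with_latex(p):
    """Join a tag's text nodes with the alt text of its <img> tags, in document order."""
    parts = []
    for node in p.descendants:
        if isinstance(node, bs4.NavigableString):
            # Plain text between tags
            parts.append(str(node))
        elif node.name == "img" and node.get("alt"):
            # LaTeX source kept in the image's alt attribute
            parts.append(node["alt"])
    # Collapse any run of whitespace into a single space
    return " ".join("".join(parts).split())


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
problem = paragraph_text_with_latex(soup.find("p"))
print(problem)
```

To try it on the live page, the same helper could be passed the first <p> of the parsed response instead of the sample, assuming that paragraph really is the problem statement.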
