measure_theory

Reputation: 874

BeautifulSoup: Scraping answers from form

I need to scrape the answers to the questions from the following link, including the checkboxes.

Here's what I have so far:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'

driver = webdriver.Firefox()
driver.get(url)

soup = BeautifulSoup(driver.page_source, "lxml")    # name the parser explicitly

The following gives me all the written answers, if there are any:

soup.find_all('span', {'class':'PrintHistRed'})

and I think I can piece together all the checkbox answers from this:

soup.find_all('img')

but these aren't going to be ordered correctly, because this doesn't pick up the "No Information Filed" answers that aren't written in red.

I also feel like there's a much better way to be doing this. Ideally I want (for the first 6 questions) to return:

['APEX INVESTMENT FUND, V, L.P',
 '805-2054766781',
 'Delaware',
 'United States',
 'APEX MANAGEMENT V, LLC',
 'X',
 'O',
 'No Information Filed',
 'NO',
 'NO']

EDIT

Martin's answer below seems to do the trick. However, when I put it in a loop, the results begin to change after the 3rd iteration. Any ideas how to fix this?

from bs4 import BeautifulSoup
import requests
import re

for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "lxml")

    tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])       # 2: skip "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])

    output = []

    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)

    print output[:9] 

Upvotes: 1

Views: 241

Answers (2)

Martin Evans

Reputation: 46759

The website does not generate any of the required HTML via JavaScript, so I have chosen to use just requests to get the HTML (which should be faster than driving a browser).

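One approach to solving your problem is to store all the tags for your three different types in a single list. Sorting this list put the tags in tree order here. Note, though, that BeautifulSoup does not define an ordering between tags, so sorted() falls back on Python 2's default object comparison, and tree order is not guaranteed on every run.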

The first search simply uses your PrintHistRed to get the matching span tags. Secondly it finds all img tags that have alt text containing either the word Radio or Checkbox. Lastly it searches for all locations where No Information Filed is found and returns the parent tag.

The tags can now be sorted and a suitable output array built containing the information in the required format:

from bs4 import BeautifulSoup
import requests
import re

url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")

tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])       # 2: skip "are you an adviser" at the top
tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])

output = []

for entry in sorted(tags):
    if entry.name == 'img':
        alt = entry['alt']
        if 'Radio' in alt:
            output.append('NO' if 'not selected' in alt else 'YES')
        else:
            output.append('O' if 'not checked' in alt else 'X')
    else:
        output.append(entry.text)

print output[:9]        # Display the first 9 entries

Giving you:

[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', 'X', 'O', u'No Information Filed', 'NO', 'YES']
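
If the ordering from sorted() proves unstable (as the loop in the question's edit suggests), one alternative is to collect all three kinds of tag in a single find_all() pass, since find_all() returns its matches in document order. The following is only a minimal sketch for Python 3, not the code above: the names is_answer, to_answer and extract_answers are made up for illustration, the sample HTML and its alt texts are assumptions rather than the real page, and it does not skip the first two images the way the [2:] slice does.

```python
import re
from bs4 import BeautifulSoup, NavigableString

ALT_PATTERN = re.compile('Radio|Checkbox')

def is_answer(tag):
    """Return True for the three kinds of answer-bearing tags."""
    # Written answers: <span class="PrintHistRed">
    if tag.name == 'span' and 'PrintHistRed' in tag.get('class', []):
        return True
    # Radio buttons and checkboxes are rendered as images with alt text
    if tag.name == 'img' and ALT_PATTERN.search(tag.get('alt', '')):
        return True
    # Any tag that directly contains the literal "No Information Filed"
    return any(isinstance(child, NavigableString)
               and child.strip() == 'No Information Filed'
               for child in tag.children)

def to_answer(tag):
    """Convert one matched tag into the string wanted in the output list."""
    if tag.name == 'img':
        alt = tag['alt']
        if 'Radio' in alt:
            return 'NO' if 'not selected' in alt else 'YES'
        return 'O' if 'not checked' in alt else 'X'
    return tag.get_text(strip=True)

def extract_answers(html):
    """Return the answers in the order they appear in the document."""
    soup = BeautifulSoup(html, 'html.parser')
    return [to_answer(tag) for tag in soup.find_all(is_answer)]

# Hypothetical stand-in for the real page, only to exercise the functions
sample = '''
<table>
<tr><td><span class="PrintHistRed">APEX INVESTMENT FUND V, L.P.</span></td></tr>
<tr><td><img alt="Checkbox checked, changed"> <img alt="Checkbox not checked"></td></tr>
<tr><td>No Information Filed</td></tr>
<tr><td><img alt="Radio button not selected"></td></tr>
</table>
'''

print(extract_answers(sample))
```

Because no sorting is involved, the result should not depend on how Python happens to compare tag objects. To use it on the real page, pass html.text from requests.get(url) to extract_answers() and drop the leading entries you do not need.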

Upvotes: 1

Bill Bell

Reputation: 21643

I've looked fairly carefully at the HTML. I doubt there is an utterly simple way of scraping pages like this.

I would begin with an analysis, looking for similar questions. For instance, 11 through 16 inclusive can likely be handled in the same way. 19 and 21 appear to be similar. There may or may not be others.

I would work out how to handle each type of similar question as given by the rows containing them. For example, how would I handle 19 and 21? Then I would write code to identify the rows for the questions noting the question number for each. Finally I would use the appropriate code using the row number to winkle out information from it. In other words, when I encountered question 19 I'd use the code meant for either 19 or 21.
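
The dispatch idea described above might be sketched roughly as follows. Everything here is hypothetical: the handler names, the way a question number is recognised at the start of a row's text, and the sample rows are assumptions for illustration, not taken from the actual form. Only the groupings (11 through 16 together, 19 with 21) come from the analysis above.

```python
import re

# Assumes each question row's text starts with "<number>."
QUESTION_NUMBER = re.compile(r'^\s*(\d+)\.')

def handle_labelled_value(row_text):
    """Hypothetical handler for questions 11-16: text after the last colon."""
    return row_text.rsplit(':', 1)[-1].strip()

def handle_yes_no(row_text):
    """Hypothetical handler for questions 19 and 21: the trailing YES/NO token."""
    return row_text.split()[-1].upper()

# Map each question number to the code written for its group
HANDLERS = {number: handle_labelled_value for number in range(11, 17)}
HANDLERS.update({19: handle_yes_no, 21: handle_yes_no})

def parse_rows(rows):
    """Return {question number: answer} for every row that has a handler."""
    answers = {}
    for row_text in rows:
        match = QUESTION_NUMBER.match(row_text)
        if not match:
            continue                      # not a question row
        number = int(match.group(1))
        handler = HANDLERS.get(number)
        if handler is not None:           # skip questions not analysed yet
            answers[number] = handler(row_text)
    return answers

# Made-up row texts, standing in for text extracted from the table rows
rows = [
    '11. Example label: 12345',
    '19. Example yes/no question? no',
    '20. A question with no handler yet',
    '21. Another yes/no question? Yes',
    'A row without a question number',
]

print(parse_rows(rows))
```

New question types can then be supported by writing one more handler and adding its question numbers to HANDLERS, without touching the loop.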

Upvotes: 0
