Reputation: 874
I need to scrape the answers to the questions from the following link, including the check boxes.
Here's what I have so far:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
The following gives me all the written answers, if there are any:
soup.find_all('span', {'class':'PrintHistRed'})
and I think I can piece together all the checkbox answers from this:
soup.find_all('img')
but these aren't going to be ordered correctly, because this doesn't pick up the "No Information Filed" answers that aren't written in red.
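Piecing the checkbox answers together from the img alt text could look like this (the alt strings below are illustrative stand-ins, not copied from the actual page):

```python
from bs4 import BeautifulSoup

# Illustrative markup; the real page encodes checkbox state in each img's alt text.
html = '<img alt="Checkbox checked"><img alt="Checkbox not checked">'
soup = BeautifulSoup(html, "html.parser")

# Map each checkbox image to 'X' (checked) or 'O' (not checked).
states = ['O' if 'not checked' in img['alt'] else 'X'
          for img in soup.find_all('img')]
print(states)  # ['X', 'O']
```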
I also feel like there's a much better way to be doing this. Ideally I want (for the first 6 questions) to return:
['APEX INVESTMENT FUND, V, L.P',
'805-2054766781',
'Delaware',
'United States',
'APEX MANAGEMENT V, LLC',
'X',
'O',
'No Information Filed',
'NO',
'NO']
EDIT
Martin's answer below seems to do the trick; however, when I put it in a loop, the results begin to change after the 3rd iteration. Any ideas how to fix this?
from bs4 import BeautifulSoup
import requests
import re
for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "lxml")
    tags = list(soup.find_all('span', {'class': 'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])  # [2:] skips "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
    output = []
    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)
    print(output[:9])
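One likely cause of the instability: BeautifulSoup Tag objects don't define an ordering, so on Python 2 `sorted(tags)` falls back to comparing object identities (memory addresses), which is not document order and changes every time the page is re-parsed; on Python 3 it would raise a TypeError outright. A sketch of sorting by document position instead, shown on a small stand-in document:

```python
from bs4 import BeautifulSoup

# Small stand-in document; the real page mixes spans, imgs and
# "No Information Filed" cells in the same way.
html = """
<div>
  <span class="PrintHistRed">APEX INVESTMENT FUND V, L.P.</span>
  <img alt="Radio button not selected">
  <span>No Information Filed</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the tags in an arbitrary order, as the scraping code does.
tags = soup.find_all('img')
tags.extend(soup.find_all('span'))

# find_all(True) walks the tree in document order; use each tag's
# position in that walk as a stable, explicit sort key.
order = {id(t): i for i, t in enumerate(soup.find_all(True))}
tags.sort(key=lambda t: order[id(t)])

print([t.name for t in tags])  # ['span', 'img', 'span'] -- document order
```

The same key should work unchanged on the full tags list in the loop above, since the spans, the imgs, and the parents of the "No Information Filed" strings are all tags and therefore all appear in `soup.find_all(True)`.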
Upvotes: 1
Views: 241
Reputation: 46759
The website does not generate any of the required HTML via JavaScript, so I have chosen to use just requests to get the HTML (which should be faster).

One approach to solving your problem is to store the tags for all three of your different types in a single array. If this is then sorted, it will result in the tags being in tree order.

The first search simply uses your PrintHistRed class to get the matching span tags. Secondly, it finds all img tags whose alt text contains either the word Radio or Checkbox. Lastly, it searches for all the locations where No Information Filed is found and returns the parent tag.

The tags can now be sorted and a suitable output array built containing the information in the required format:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:]) # 2: skip "are you an adviser" at the top
tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])
output = []
for entry in sorted(tags):
if entry.name == 'img':
alt = entry['alt']
if 'Radio' in alt:
output.append('NO' if 'not selected' in alt else 'YES')
else:
output.append('O' if 'not checked' in alt else 'X')
else:
output.append(entry.text)
print(output[:9])  # Display the first 9 entries
Giving you:
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', 'X', 'O', u'No Information Filed', 'NO', 'YES']
Upvotes: 1
Reputation: 21643
I've looked fairly carefully at the HTML. I doubt there is an utterly simple way of scraping pages like this.
I would begin with an analysis, looking for similar questions. For instance, 11 through 16 inclusive can likely be handled in the same way. 19 and 21 appear to be similar. There may or may not be others.
I would work out how to handle each type of similar question, based on the rows that contain them. For example, how would I handle 19 and 21? Then I would write code to identify the rows for the questions, noting the question number for each. Finally, I would use the appropriate code, keyed by question number, to winkle the information out of each row. In other words, when I encountered question 19 I'd use the code meant for either 19 or 21.
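That dispatch idea can be sketched like this (the row markup and question numbers here are hypothetical placeholders, not taken from the actual IAPD page):

```python
from bs4 import BeautifulSoup

# Hypothetical rows standing in for the real form's table layout.
html = """
<table>
  <tr><td>19</td><td>Fund type: <b>Private Equity</b></td></tr>
  <tr><td>21</td><td>Fund type: <b>Hedge Fund</b></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def fund_type(row):
    # Handler shared by questions 19 and 21, which use the same layout.
    return row.find('b').text

# Map each question number to the handler that understands its row layout.
handlers = {'19': fund_type, '21': fund_type}

answers = {}
for row in soup.find_all('tr'):
    number = row.td.text
    if number in handlers:
        answers[number] = handlers[number](row)

print(answers)  # {'19': 'Private Equity', '21': 'Hedge Fund'}
```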
Upvotes: 0