Reputation: 1049
I wanted to use Python web scraping to feed an ML application I wrote, which produces a summary of summaries to ease my daily research work.
I am running into difficulties, even though I have tried a lot of suggestions from the web, such as this one:
Python Selenium accessing HTML source
I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content', depending on the attempt and the module used).
I need this page source to feed Beautiful Soup, which scrapes it and feeds my ML script.
My first attempt was to use requests:
from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())
page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"
This is my target page. I usually make about 20 requests a day, so it's not like I want to hammer the website, and since I need the summaries at the same moment, I wanted to automate the retrieval: the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also reasonable, since I respect some delay before loading the next page. I tried passing as a regular browser, since the site doesn't like robots (its robots.txt disallows /ProductRedirect and something with a number I could not find on Google).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page, headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")
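Independent of the network call, the BeautifulSoup step itself can be checked against a static snippet. This is a minimal sketch; the HTML string is made up and stands in for `current_page.content`, not the real GeneCards markup:

```python
from bs4 import BeautifulSoup as BS

# Stand-in for current_page.content (requests' .content returns bytes)
fake_content = b"<html><body><section id='_summaries'>Entrez summary text</section></body></html>"

# BeautifulSoup accepts bytes or str; "html.parser" is built in,
# "lxml" (used in the question) needs a separate install
soup = BS(fake_content, "html.parser")
section = soup.find("section", attrs={"id": "_summaries"})
print(section.get_text())  # Entrez summary text
```

If this works but the real page yields None, the section is not in the HTML that the server actually sent back.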
I always end up getting no content, even though requests gets code 200 and I can load the page myself in Firefox. So I tried Selenium:
from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())
browser = webdriver.Firefox()
current_page = browser.get(page)
time.sleep(10)
This works and loads a page. I added the delay to be sure not to spam the host and to let the page load fully. But then neither:
html=current_page.content
nor
html=current_page.page_source
nor
html=current_page
works as an input for:
soup=BS(html,"lxml")
It always ends up saying that the object doesn't have the page_source attribute (while it should, since the page loads correctly in the Selenium-invoked browser window).
I don't know what to try next. It's as if the User-Agent header weren't working for requests, and it is very strange that the Selenium-returned page has no source.
What could I try next? Thanks.
Note that I also tried:
browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)
but while it can get the source, it just fails at the BS stage with: AttributeError: 'NoneType' object has no attribute 'find'
Upvotes: 2
Views: 919
Reputation: 50
You shouldn't have to convert the html to a string object.
Try:
html = browser.page_source
soup = BS(html,"lxml")
Upvotes: 1
Reputation: 473753
The problem is that you are trying to iterate over the result of .find(). Instead you need .find_all():
for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)
Or, if there is a single element, don't use a loop:
summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
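The difference is easy to check on a made-up snippet (not the real page): .find() returns a single Tag, or None when nothing matches, while .find_all() returns a list that is always safe to iterate.

```python
from bs4 import BeautifulSoup as BS

html = "<section id='_summaries'><p>one</p><p>two</p></section>"
soup = BS(html, "html.parser")  # built-in parser; the question uses "lxml"

# find() -> the first matching Tag, or None if nothing matches
single = soup.find('section', attrs={'id': '_summaries'})
print(type(single).__name__)  # Tag

# find_all() -> a (possibly empty) list, safe to loop over
many = soup.find_all('section', attrs={'id': '_summaries'})
print(len(many))  # 1

# a miss returns None; looping over it is what raises the error
missing = soup.find('section', attrs={'id': 'no_such_id'})
print(missing)  # None
```

So if the section really is in browser.page_source, find_all() gives you a loopable result, and find() gives you the element directly.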
Upvotes: 2