Ando Jurai

Reputation: 1049

Webscraping in python: BS, selenium, and None error

I want to use Python web scraping to feed an ML application I wrote that produces a summary of summaries, to ease my daily research work. I am running into difficulties: although I have followed many suggestions from the web, such as this one:
Python Selenium accessing HTML source
I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content'), depending on the attempt and the module used. I need this source to feed Beautiful Soup, which scrapes it and passes the result to my ML script. My first attempt was to use requests:

from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())

page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"

This is my target page. I usually make about 20 requests a day, so I am not trying to hammer the website. Since I need them all at the same moment, I wanted to automate the retrieval, because the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also being reasonable, since I wait a while before loading another page. I tried passing as a regular browser since the site doesn't like robots (it disallows /ProductRedirect and a path with a number that I could not find on Google).

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page,  headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")

I always end up getting no content, even though the request returns status code 200 and I can load the page myself in Firefox. So I tried with Selenium:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())

browser = webdriver.Firefox()
current_page =browser.get(page)
time.sleep(10)

This works and loads the page. I added the delay to avoid spamming the host and to make sure the page is fully loaded. But then neither:

html=current_page.content

nor

html=current_page.page_source

nor

html=current_page

works as an input for:

soup=BS(html,"lxml")

It always ends up saying that the object doesn't have the page_source attribute (while it should, since the page loads correctly in the browser window that Selenium opens).

I don't know what to try next. It's as if the User-Agent header were not working for requests, and it is very strange that the page returned by Selenium has no source.

What could I try next? Thanks.

Note that I also tried:

browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)

but while it can get the source, it fails at the BS stage with: "AttributeError: 'NoneType' object has no attribute 'find'"

Upvotes: 2

Views: 919

Answers (2)

Brandon

Reputation: 50

You shouldn't have to convert the html to a string object.

Try:

html = browser.page_source
soup = BS(html,"lxml")
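
For context, browser.get() only navigates and returns None, which is why current_page = browser.get(page) leaves nothing to call .page_source on; the source lives on the driver object itself. Here is a minimal sketch of that pattern. The helper name get_soup, its parameters, and the use of the built-in "html.parser" instead of "lxml" are my own assumptions for illustration, not something from your script:

```python
import time

from bs4 import BeautifulSoup as BS


def get_soup(browser, url, wait_seconds=0, parser="html.parser"):
    """Load url in an existing Selenium driver and return the parsed page.

    get_soup is a hypothetical helper; `browser` is any object that exposes
    get(url) and page_source the way a Selenium WebDriver does.
    """
    # get() navigates to the page and returns None, so its result is not kept.
    browser.get(url)
    # Optional pause to give the page time to finish loading.
    time.sleep(wait_seconds)
    # The rendered HTML is read from the driver, not from get()'s return value.
    html = browser.page_source
    return BS(html, parser)
```

With a real driver this would be called as something like soup = get_soup(webdriver.Firefox(), page, wait_seconds=8), and you can pass parser="lxml" if you prefer to keep lxml.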

Upvotes: 1

alecxe

Reputation: 473753

The problem is that you are trying to iterate over the result of .find(). Instead you need .find_all():

for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)

Or, if there is a single element, don't use a loop:

summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
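
One thing to keep in mind with either form: .find() returns None when nothing matches, so calling a method on the result raises an AttributeError, while .find_all() just returns an empty list. A minimal sketch of guarding against that follows. The inline HTML is a made-up stand-in for browser.page_source (the real page markup may well differ), and summary_texts is a hypothetical helper name:

```python
from bs4 import BeautifulSoup as BS

# Hypothetical stand-in for browser.page_source; the real markup may differ.
html = """
<html><body>
<section id="_summaries"><p>First summary</p><p>Second summary</p></section>
</body></html>
"""
soup = BS(html, "html.parser")


def summary_texts(soup, section_id="_summaries"):
    """Return the text of each <p> inside the section, or [] if it is absent."""
    section = soup.find("section", attrs={"id": section_id})
    if section is None:
        # No matching element: return an empty result instead of failing later.
        return []
    return [p.get_text(strip=True) for p in section.find_all("p")]


found = summary_texts(soup)
missing = summary_texts(soup, section_id="no_such_id")
print(found)
print(missing)
```

If the guarded version returns an empty list on the real page, the section is probably not in the HTML you received, which is worth checking by printing the source first.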

Upvotes: 2
