Peter Lazarov

Reputation: 281

Parse from a JS generated site

I am trying to parse (623) 337-**** from a JavaScript-generated site. My code is:

from selenium import webdriver
import re
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:    
    print m_obj.group(0)

For some reason it doesn't print anything. Any help is appreciated.

Side note: is there a faster way to do this in Python?

Upvotes: 1

Views: 93

Answers (1)

alecxe

Reputation: 473833

The problem is that some of the content is loaded dynamically by AJAX requests that fire after the initial page load, so page_source can be read before the phone number is in the DOM.
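To illustrate that the regex itself is not at fault, here is a minimal offline sketch. The two HTML snippets are invented stand-ins for the page source before and after the dynamic content arrives (they are not the site's real markup), and find_phone is a hypothetical helper; only the pattern comes from the question.

```python
import re

# The pattern from the question: "(ddd) ddd-****"
PHONE_RE = re.compile(r"\(\d{3}\)\s\d{3}-\*{4}")

# Hypothetical snapshots of page_source; the real markup will differ.
source_before_ajax = '<div id="results"><span class="loading"></span></div>'
source_after_ajax = '<div id="results"><span>(623) 337-****</span></div>'


def find_phone(html):
    """Return the first masked phone number found in html, or None."""
    match = PHONE_RE.search(html)
    return match.group(0) if match else None


print(find_phone(source_before_ajax))  # None
print(find_phone(source_after_ajax))   # (623) 337-****
```

The same pattern fails on the early snapshot and succeeds on the later one, which is consistent with the original script reading the source too soon.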

You should wait explicitly until an element from the dynamically loaded section is present in the DOM (see the Selenium documentation on explicit waits), and only then get the source code of the page:

import re

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait


browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')

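# Block for up to 10 seconds until the profile header element is present in the DOM.
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "profile_details_section_header")))
content = browser.page_source

m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
    # print(...) with a single argument works on both Python 2 and Python 3
    print(m_obj.group(0))

browser.quit()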

This prints (623) 337-****.

Alternatively, you could call time.sleep() or browser.implicitly_wait(), but neither is as reliable as an explicit wait on a specific element.
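If no single element reliably marks the moment the data arrives, another option is to poll the page source for the pattern itself. The sketch below is only an illustration of that idea: wait_for_match is a hypothetical helper, not part of Selenium, and get_source is assumed to be any zero-argument callable such as lambda: browser.page_source.

```python
import re
import time


def wait_for_match(get_source, pattern, timeout=10.0, poll_interval=0.5):
    """Poll get_source() until pattern matches or timeout seconds elapse.

    Returns the matched text, or None if the timeout is reached first.
    """
    deadline = time.time() + timeout
    while True:
        match = re.search(pattern, get_source())
        if match:
            return match.group(0)
        if time.time() >= deadline:
            return None
        time.sleep(poll_interval)


# Assumed usage with the browser object from the code above:
# phone = wait_for_match(lambda: browser.page_source, r"\(\d{3}\)\s\d{3}-\*{4}")
```

WebDriverWait.until() also accepts any callable that takes the driver, so the same check could probably be passed to it directly instead of using a hand-written loop.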

Hope that helps.

Upvotes: 1
