Reputation: 16189
I'm trying to grab the business names from a Google local search results page such as this:
Given the following:
... I would have thought that the XPath //div[@class ="_rl"] or //*[@class ="_rl"] would suffice, but they each return nothing. I know I need to make the query more explicit/precise, but how exactly?
I'm using Python and lxml, if that is of relevance.
Upvotes: 0
Views: 161
Reputation: 5292
Below is the working code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0
from selenium.webdriver.common.by import By
from lxml import etree
import lxml.html
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl")
WebDriverWait(driver,1000).until(EC.presence_of_all_elements_located((By.TAG_NAME,"body")))
tree = etree.fromstring(driver.page_source)
print('Using pure python-----------' * 2)
d = driver.find_elements_by_xpath("//div[@class='_pl _ki']")
for i in d:
    print(i.text.split("\n")[0])
print('Using bs4-----------------' * 2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
raw = soup.find_all('div', class_='_rl')
for i in raw:
    print(i.text)
print('Using lxml---------------' * 2)
tree = lxml.html.fromstring(driver.page_source)
e = tree.cssselect("._rl")
for i in e:
    d = i.xpath('.//text()')
    print(''.join(d))
driver.close()
It prints:
Using pure python-----------Using pure python-----------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using bs4-----------------Using bs4-----------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using lxml---------------Using lxml---------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
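On the "more explicit/precise" part of the question: a predicate such as @class="_rl" compares the whole attribute string, so it would miss an element that carries additional classes, whereas a CSS selector like ._rl matches on a single class token. The following is only a minimal sketch of that difference on hypothetical markup (the extra and _rlx class names and the sample HTML are assumptions for illustration, not taken from the real Google page):

```python
import lxml.html

# Hypothetical markup for illustration only; the real page differs.
SAMPLE = """
<div>
  <div class="_rl">Esprit Wellness</div>
  <div class="_rl extra">JW Chiropractic</div>
  <div class="_rlx">not a business name</div>
</div>
"""

tree = lxml.html.fromstring(SAMPLE)

# Exact comparison: matches only when the class attribute is exactly "_rl".
exact = tree.xpath('//div[@class="_rl"]')

# Token comparison: matches "_rl" as one whitespace-separated class among
# several, without also matching look-alikes such as "_rlx".
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " _rl ")]'
)

exact_names = [el.text_content() for el in exact]
token_names = [el.text_content() for el in token]

print(exact_names)
print(token_names)
```

If the exact form returns nothing even though the element is visible in the browser, it may also be that the HTML handed to lxml does not contain the element at all, which is the case the Selenium approach above works around.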
Upvotes: 1
Reputation: 702
You're capturing the element enclosing the text, not the text enclosed in the element. You need to either read the text attribute of each returned object, or extend your XPath expression so that it selects the text specifically:
# from the object
list_of_elements = tree.xpath('//div[@class ="_rl"]')
for l in list_of_elements:
    print(l.text)
# capture the text
list_of_text = tree.xpath('//div[@class ="_rl"]/text()')
for l in list_of_text:
    print(l)
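The two variants above do not always return the same thing once an element has nested markup. As a small sketch on hypothetical HTML (the sample string and the nested span are assumptions, not the actual page), this compares .text, the /text() step, and lxml's text_content() method:

```python
import lxml.html

# Hypothetical markup: one plain name and one name with a nested <span>.
SAMPLE = """
<div>
  <div class="_rl">TAI Chiropractic</div>
  <div class="_rl"><span>Lamb</span> Chiropractic</div>
</div>
"""

tree = lxml.html.fromstring(SAMPLE)
elements = tree.xpath('//div[@class="_rl"]')

# .text holds only the text that precedes the first child element,
# so it is None when the div starts with a child tag.
direct_text = [el.text for el in elements]

# /text() selects only the text nodes that are direct children of the div.
text_nodes = tree.xpath('//div[@class="_rl"]/text()')

# text_content() joins the text of the element and all its descendants.
full_text = [el.text_content() for el in elements]

print(direct_text)  # ['TAI Chiropractic', None]
print(text_nodes)   # ['TAI Chiropractic', ' Chiropractic']
print(full_text)    # ['TAI Chiropractic', 'Lamb Chiropractic']
```

When the names may be wrapped in further tags, text_content() (or './/text()' joined together) is probably the safer choice.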
Upvotes: 1
Reputation: 3760
You mention Python, but based on your screenshot it seems that perhaps you just want to get the XPath from the browser?
In Chrome Developer Tools, you can right click on the element and select "Copy XPath."
Upvotes: 1