Pyderman

Reputation: 16189

Getting XPath for the text of a div of a certain class?

I'm trying to grab the business names from a Google local search results page such as this:

[image: Google local search results page]

Given the following:

[image: screenshot of the page markup]

... I would have thought that the XPath //div[@class ="_rl"] or //*[@class ="_rl"] would suffice, but they each return nothing. I know I need to make the query more explicit/precise, but how exactly?

I'm using Python and lxml, if that is of relevance.

Upvotes: 0

Views: 161

Answers (3)

Learner

Reputation: 5292

Below is working code that gets the names three ways: with Selenium alone, with BeautifulSoup, and with lxml.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0
from selenium.webdriver.common.by import By
import lxml.html
from bs4 import BeautifulSoup

# Load the page in a real browser and wait until the body is present.
driver = webdriver.Chrome()
driver.get("https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl")
WebDriverWait(driver, 1000).until(EC.presence_of_all_elements_located((By.TAG_NAME, "body")))

# 1. Selenium only: the first line of each result block is the business name.
print('Using pure python-----------' * 2)
results = driver.find_elements(By.XPATH, "//div[@class='_pl _ki']")
for result in results:
    print(result.text.split("\n")[0])

# 2. BeautifulSoup on the rendered page source.
print('Using bs4-----------------' * 2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for div in soup.find_all('div', class_='_rl'):
    print(div.text)

# 3. lxml on the rendered page source (cssselect() needs the cssselect package).
print('Using lxml---------------' * 2)
tree = lxml.html.fromstring(driver.page_source)
for element in tree.cssselect("._rl"):
    # Join all text nodes below the element, including those in child tags.
    print(''.join(element.xpath('.//text()')))

driver.close()

It prints:

Using pure python-----------Using pure python-----------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using bs4-----------------Using bs4-----------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
Using lxml---------------Using lxml---------------
TAI Chiropractic
Body in Balance Chiropractic
Lamb Chiropractic
Esprit Wellness
Jamie H Bassel DC PC
Madison Avenue Chiropractic Center
Howard Benedikt DC
44'Th Street Chiropractic
Rockefeller Health & Medical Chiropractic
Frank J. Valente, DC, PC
Dr. Robert Shire
5th Avenue Chiropractic
Peterson Chiropractic
NYC Chiropractic Solutions
20 East Chiropractic of Midtown
GRAND CENTRAL CHIROPRACTIC WELLNESS CENTER
Park Avenue Chiropractic Center - Dr Nancy Jacobs
Murray Hill Chiropractic PC
Empire Sports & Spine
JW Chiropractic
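
Note that the CSS selector ._rl matches any element whose class list contains _rl, whereas the XPath test @class="_rl" only matches when the whole attribute is exactly that string. Below is a minimal sketch of an XPath equivalent of the CSS selector. The sample markup (including the _extra and _rlx class names) is made up for illustration and is not taken from the real page.

```python
import lxml.html

# Made-up sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <div class="_rl">TAI Chiropractic</div>
  <div class="_rl _extra">Lamb Chiropractic</div>
  <div class="_rlx">Not a business name</div>
</body></html>
"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# Exact match: only elements whose class attribute is exactly "_rl".
exact = tree.xpath('//div[@class="_rl"]')

# Token match: pad the class list with spaces and look for " _rl ".
# This behaves like the CSS selector "div._rl".
token = tree.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " _rl ")]'
)

# Naive substring match: also picks up unrelated classes such as "_rlx".
substring = tree.xpath('//div[contains(@class, "_rl")]')

exact_names = [el.text for el in exact]
token_names = [el.text for el in token]
substring_names = [el.text for el in substring]

print(exact_names)      # ['TAI Chiropractic']
print(token_names)      # ['TAI Chiropractic', 'Lamb Chiropractic']
print(substring_names)  # all three divs
```

The token form is longer, but it avoids both the missed matches of the exact comparison and the false positives of a plain contains().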

Upvotes: 1

tlastowka

Reputation: 702

You're capturing the element that encloses the text, not the text itself. You need to either read the text attribute of each returned element, or extend your XPath expression so that it selects the text nodes directly:

# `tree` is the lxml tree you have already parsed from the page.

# From the element objects: read each element's .text attribute
list_of_elements = tree.xpath('//div[@class ="_rl"]')
for element in list_of_elements:
    print(element.text)

# From the XPath itself: select the text nodes directly
list_of_text = tree.xpath('//div[@class ="_rl"]/text()')
for text in list_of_text:
    print(text)
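
One caveat with both forms above: .text and /text() only return text that sits directly inside the div, so anything wrapped in a child tag is left out. Here is a small sketch of the difference, using lxml.html and its text_content() method. The sample markup is invented for illustration; whether the real page nests tags inside the div is an assumption.

```python
import lxml.html

# Invented sample markup; the second div wraps part of its text in <b>.
SAMPLE_HTML = """
<html><body>
  <div class="_rl">TAI Chiropractic</div>
  <div class="_rl">Lamb <b>Chiropractic</b></div>
  <div class="other">Not a business name</div>
</body></html>
"""

tree = lxml.html.fromstring(SAMPLE_HTML)
elements = tree.xpath('//div[@class="_rl"]')

# .text: only the text before the first child element.
element_text = [el.text for el in elements]

# /text(): only the text nodes that are direct children of the div.
direct_text = [str(t) for t in tree.xpath('//div[@class="_rl"]/text()')]

# text_content(): all text below the element, child tags included.
full_text = [el.text_content() for el in elements]

print(element_text)  # ['TAI Chiropractic', 'Lamb ']
print(direct_text)   # ['TAI Chiropractic', 'Lamb ']
print(full_text)     # ['TAI Chiropractic', 'Lamb Chiropractic']
```

If the names on the page are plain text inside the div, all three give the same result.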

Upvotes: 1

duffn

Reputation: 3760

You mention Python, but based on your screenshot it seems that perhaps you just want to get the XPath from the browser?

In Chrome Developer Tools, you can right click on the element and select "Copy XPath."

[image: Chrome Copy XPath]

Upvotes: 1
