Why does the XPath copied from Chrome not work?

Question

I am trying to scrape data from Web of Science.

And here is the specific page I am going to work with.

Below is the code I use to extract the abstract:

from lxml import etree
import requests

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)

And here is the XPath I got through "Copy XPath" in Chrome:

//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()

So I tried to get the abstract like this:

path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'   
print(soup1.xpath(path))

However, I just got an empty list! Then I tried another way to test the XPath.

First, I saved the page as a local HTML file:

with open('1.html', 'w', encoding='UTF-8') as f:
    f.write(d.text)

Then I opened the file:

# FileAdapter comes from the requests-file package
from requests_file import FileAdapter

s.mount('file://', FileAdapter())
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')

And it gives me the abstract I want! Could anyone tell me why that happens?

Weirdly, when I repeat the same steps (saving to a local file) with another page, it returns an empty list again!

I checked that the xpath given by Chrome is the same for these two pages.

So could anyone tell me what's wrong with my code and how to fix it?

vold · Accepted Answer

Full XPaths copied from the browser are usually unhelpful. You should use relative, more targeted expressions based on attributes (such as id or class) or on other identifying features, like contains(@href, 'image').
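To illustrate the difference, here is a minimal sketch. The two HTML snippets are made up for the example (they are not the real Web of Science markup), and it uses the standard library's xml.etree.ElementTree, which supports a subset of XPath, so that it runs without extra packages. It assumes the only difference between the pages is an extra banner div in the second one.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layouts. PAGE_B has an extra banner <div> before the
# record blocks, which shifts every positional index by one.
PAGE_A = """
<html><body>
  <form id="records_form">
    <div class="block-record-info"><p>Title text</p></div>
    <div class="block-record-info"><p>Abstract text</p></div>
  </form>
</body></html>
"""

PAGE_B = """
<html><body>
  <form id="records_form">
    <div class="banner"><p>Notice</p></div>
    <div class="block-record-info"><p>Title text</p></div>
    <div class="block-record-info"><p>Abstract text</p></div>
  </form>
</body></html>
"""

# Position-based path, in the style of what a browser's "Copy XPath" produces.
POSITIONAL = ".//form[@id='records_form']/div[2]/p"
# Attribute-based path: select the blocks by class, regardless of position.
BY_CLASS = ".//div[@class='block-record-info']/p"


def texts(page, path):
    """Return the text of every element matched by `path` in `page`."""
    root = ET.fromstring(page.strip())
    return [p.text for p in root.findall(path)]


# The positional path picks a different element once the layout shifts.
print(texts(PAGE_A, POSITIONAL))
print(texts(PAGE_B, POSITIONAL))

# The class-based path finds the same blocks on both pages;
# index 1 is the second matching block.
print(texts(PAGE_A, BY_CLASS)[1])
print(texts(PAGE_B, BY_CLASS)[1])
```

With lxml, the same idea applies through tree.xpath(), which also supports text() and contains().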

You could try a more specific XPath expression, (//div[@class="block-record-info"])[2]/p/text(), and rewrite your code like this:

import requests
from lxml import html

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)

Output:
