Kai Xie

Reputation: 125

Why does an XPath copied from Chrome not work?

I am trying to scrape data from Web of Science.

And here is the specific page I am going to work with.

Below is the code I use to extract the abstract:

from lxml import etree
import requests

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)

And here is the XPath I got from "Copy XPath" in Chrome:

//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()

So I tried to get the abstract like this:

path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'   
print(soup1.xpath(path))

However, I just got an empty list! Then I tried another way to test the XPath.

First, I saved the page as a local HTML file:

with open('1.html', 'w', encoding='utf-8') as f:
    f.write(d.text)  # the with block closes the file automatically

Then I opened the file:

from requests_file import FileAdapter  # from the third-party requests-file package

s.mount('file://', FileAdapter())
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')

And it gives me the abstract I want! Could anyone tell me why that happens?

Weirdly, when I repeat these steps with another page saved as a local file, it returns an empty list again!

I checked that the xpath given by Chrome is the same for these two pages.

So could anyone tell me what's wrong with my code and how to fix it?

Upvotes: 1

Views: 4435

Answers (1)

vold

Reputation: 1549

The full XPaths that browsers generate are usually unhelpful because they depend on the exact position of every element. Prefer relative expressions based on attributes (such as id or class) or on other identifying features, like contains(@href, 'image').

You could try a more specific XPath expression, (//div[@class="block-record-info"])[2]/p/text(), and rewrite your code like this:
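To illustrate the point, here is a minimal sketch using two made-up pages (the markup, the banner div, and the shortened positional path are assumptions for demonstration, not the real Web of Science HTML). The content is the same in both, but one extra element shifts the positions, so the positional path misses while the class-based one still matches:

```python
from lxml import html

# Hypothetical pages: page_b has one extra <div> before the title.
page_a = """
<html><body><form id="records_form">
  <div><div class="title">Title A</div>
       <div class="block-record-info"><p>Abstract A</p></div></div>
</form></body></html>
"""
page_b = """
<html><body><form id="records_form">
  <div><div class="banner">Notice</div>
       <div class="title">Title B</div>
       <div class="block-record-info"><p>Abstract B</p></div></div>
</form></body></html>
"""

# Position-based path (in the style of a browser-copied XPath)
positional = '//*[@id="records_form"]/div/div[2]/p/text()'
# Attribute-based path
by_class = '//div[@class="block-record-info"]/p/text()'

results = {}
for name, page in (("a", page_a), ("b", page_b)):
    tree = html.fromstring(page)
    results[name] = (tree.xpath(positional), tree.xpath(by_class))
    print(name, results[name])
```

On page_b the positional path lands on the title div, which has no p child, so it returns an empty list even though the abstract is still on the page.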

import requests
from lxml import html

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)

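A note on the parentheses in that expression, shown as a small sketch on invented markup (the ids "left" and "right" are assumptions): (//div[...])[2] selects the second match in the whole document, whereas //div[...][2] selects only divs that are the second matching child of their own parent.

```python
from lxml import html

# Hypothetical markup: each matching div is the only one under its parent.
doc = html.fromstring("""
<html><body>
  <div id="left"><div class="block-record-info"><p>Title</p></div></div>
  <div id="right"><div class="block-record-info"><p>Abstract</p></div></div>
</body></html>
""")

# Parenthesized: index applies to the full, document-ordered result set.
grouped = doc.xpath('(//div[@class="block-record-info"])[2]/p/text()')

# Unparenthesized: index applies per parent, so nothing qualifies here.
ungrouped = doc.xpath('//div[@class="block-record-info"][2]/p/text()')

print(grouped)
print(ungrouped)
```

If the expression returns an empty list on a real page, dropping the index and printing all matches is a quick way to see which block actually holds the abstract.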

Upvotes: 3
