Reputation: 125
I am trying to scrape data from Web of Science.
Here is the specific page I am working with.
Below is the code I use to extract the abstract:
from lxml import etree
import requests
url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
soup1 = etree.HTML(d.text)
And here is the XPath I got from Chrome's "Copy XPath" option:
//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()
So I tried to get the abstract like this
path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'
print(soup1.xpath(path))
However, I just got an empty list! Then I tried another way to test the XPath.
First, I saved the page as a local HTML file:
with open('1.html', 'w', encoding='UTF-8') as f:
    f.write(d.text)
Then I opened the file:
from requests_file import FileAdapter  # assumed to come from the requests-file package
s.mount('file://', FileAdapter())
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')
And it gives me the abstract I want! Could anyone tell me why that happens?
Weirdly, when I repeat the same steps with another page saved as a local file, it returns an empty list again!
I checked that the xpath given by Chrome is the same for these two pages.
So could anyone tell me what's wrong with my code and how to fix it?
Upvotes: 1
Views: 4435
Reputation: 1549
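Full XPaths copied from the browser are usually unhelpful, because they depend on the exact nesting of the DOM the browser rendered, which may differ from the raw HTML that requests receives. You should use relative, smarter expressions based on attributes (such as id or class) or other identifying features, for example contains(@href, 'image').
You could try a more specific XPath expression: (//div[@class="block-record-info"])[2]/p/text()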
and rewrite your code like this:
import requests
from lxml import html
url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)
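To illustrate why the attribute-based expression tends to be more robust, here is a minimal sketch that runs both styles of XPath against two small, made-up HTML snippets. PAGE_A, PAGE_B, the "wrapper" class, and the extract helper are hypothetical and only stand in for the real Web of Science markup, which may look different. The only difference between the two snippets is one extra wrapper div.

```python
from lxml import html

# Hypothetical markup: two versions of the "same" page. PAGE_B has one extra
# wrapper <div>, as can happen when a site serves slightly different HTML.
PAGE_A = """
<html><body><form id="records_form">
  <div>
    <div class="block-record-info"><p>Title text</p></div>
    <div class="block-record-info"><p>Abstract text</p></div>
  </div>
</form></body></html>
"""

PAGE_B = """
<html><body><form id="records_form">
  <div><div class="wrapper">
    <div class="block-record-info"><p>Title text</p></div>
    <div class="block-record-info"><p>Abstract text</p></div>
  </div></div>
</form></body></html>
"""

# Positional path in the style of a browser-copied XPath.
POSITIONAL = '//*[@id="records_form"]/div/div[2]/p/text()'
# Attribute-based path, as suggested in this answer.
BY_CLASS = '(//div[@class="block-record-info"])[2]/p/text()'


def extract(page, xpath):
    """Parse an HTML string and return the list of XPath matches."""
    return html.fromstring(page).xpath(xpath)


if __name__ == "__main__":
    for name, page in (("PAGE_A", PAGE_A), ("PAGE_B", PAGE_B)):
        print(name, "positional:", extract(page, POSITIONAL))
        print(name, "by class:  ", extract(page, BY_CLASS))
```

The positional path matches only in PAGE_A and returns an empty list for PAGE_B, while the class-based path finds the abstract paragraph in both. If the live response still yields an empty list, it may be worth checking whether r.text contains the element at all, since the server might return a different page to a script than to the browser.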
Upvotes: 3