Reputation: 708
I am scraping Google Scholar and am having trouble getting the right XPath expression. When I inspect the elements I want, the inspector gives me expressions like these:
//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]
I ended up with the generic expression:
//*[@id="gs_res_ccl_mid"]//a[3]
I also tried this alternative, with similar results:
//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]
The output is something like this (I cannot post the entire result set because I don't have 10 points of reputation):
[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]
The problem is that the output contains 3 extra, unwanted elements, and they all contain the text citations?user. How can I get rid of the unwanted elements?
My code:
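from selenium.webdriver.common.by import By

def paperOthers(exp, atr=None):
    # `browser` is the Selenium WebDriver instance created elsewhere
    thread = browser.find_elements(By.XPATH, exp)
    xArray = []
    for t in thread:
        # atr selects what to extract: 0 = id, 1 = href, anything else = text
        if atr == 0:
            xThread = t.get_attribute('id')
        elif atr == 1:
            xThread = t.get_attribute('href')
        else:
            xThread = t.text
        xArray.append(xThread)
    return xArray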
Which I call with:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)
Upvotes: 0
Views: 71
Reputation: 14135
Change the XPath to exclude the links whose href contains citations?user:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(@href,'citations?user'))]", 1)
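The same filtering can also be done in Python after the hrefs have been collected, which is sometimes easier to debug than a long XPath predicate. The following is only a minimal sketch: the helper name keep_cited_by_links is hypothetical, and it assumes that the wanted links always use the /scholar path with a cites query parameter, as they do in the sample output from the question.

```python
from urllib.parse import urlparse, parse_qs

def keep_cited_by_links(hrefs):
    """Return only the hrefs that look like 'cited by' links (hypothetical helper)."""
    kept = []
    for href in hrefs:
        if not href:
            # get_attribute('href') returns None when the attribute is missing
            continue
        parts = urlparse(href)
        # Assumption: wanted links have path /scholar and a 'cites' parameter,
        # while the unwanted author-profile links use the /citations path
        if parts.path == "/scholar" and "cites" in parse_qs(parts.query):
            kept.append(href)
    return kept

# Sample URLs taken from the output shown in the question
hrefs = [
    'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
    'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
    'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
]
for url in keep_cited_by_links(hrefs):
    print(url)
```

With the function from the question, this could be applied as keep_cited_by_links(paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)).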
Upvotes: 1