powerPixie

Reputation: 708

XPath getting a specific set of elements within a class

I am scraping Google Scholar and am having trouble finding the right XPath expression. When I inspect the elements I want, the browser gives me expressions like these:

//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]

I ended up with the generic expression:

//*[@id="gs_res_ccl_mid"]//a[3]

I also tried this alternative, with similar results:

//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]

The output is something like this (I cannot post the entire result set because I don't have 10 reputation points):

[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]

The problem with the output is that it contains 3 unwanted extra elements, and they all have the text citations?user in them. How can I get rid of the unwanted elements?

My code:

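from selenium.webdriver.common.by import By

def paperOthers(exp, atr=None):
    # `browser` is the Selenium WebDriver instance created elsewhere in the script.
    thread = browser.find_elements(By.XPATH, exp)

    xArray = []
    for t in thread:
        if atr == 0:
            # atr == 0: collect the element's id attribute
            xThread = t.get_attribute('id')
        elif atr == 1:
            # atr == 1: collect the element's href attribute
            xThread = t.get_attribute('href')
        else:
            # any other value: collect the element's visible text
            xThread = t.text
        xArray.append(xThread)

    return xArray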

Which I call with:

rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)

Upvotes: 0

Views: 71

Answers (1)

supputuri

Reputation: 14135

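Change the XPath to exclude the links whose href contains citations?user. That text is in the href attribute rather than in the link's visible text, so the predicate should test @href instead of the node itself:

rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(@href,'citations?user'))]", 1)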
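If you would rather leave the XPath alone, the same filtering can be done on the Python side after the hrefs have been collected. The following is only a minimal sketch: the helper names is_author_profile and drop_author_profiles are made up for illustration, and it assumes the unwanted links are exactly the ones whose path is /citations with a user query parameter, as in the sample output from the question.

```python
from urllib.parse import urlsplit, parse_qs

def is_author_profile(url):
    # Hypothetical helper: treat a link as an author profile when its path is
    # /citations and it carries a "user" query parameter (assumption based on
    # the sample output in the question).
    parts = urlsplit(url)
    return parts.path == "/citations" and "user" in parse_qs(parts.query)

def drop_author_profiles(urls):
    # Hypothetical helper: skip missing hrefs (None or empty) and profile links.
    return [u for u in urls if u and not is_author_profile(u)]

# Three of the URLs quoted in the question, used here as sample input.
sample = [
    "https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es",
    "https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra",
    "https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es",
]

cites_only = drop_author_profiles(sample)
```

With the function from the question this would be used as rcites = drop_author_profiles(paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)).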

Upvotes: 1
