Amistad

Reputation: 7410

Normalize space in XPath with Python Scrapy

I am trying to extract content from the Stanford website using Scrapy and XPath. The following line gets me what I want:

response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()

However, the output of the list is as follows:

[' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tFinance (FINANCE)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t',
 '\n\t\t\t']

As is evident, the output is littered with extra whitespace (\n and \t). I don't want to iterate over the list again to remove these characters, since the list is huge (truncated here for readability). I tried using XPath's normalize-space to fix this, but it did not work.

>>> response.xpath('normalize-space(//h2[@class="schoolName"]/following-sibling::ul//text())').getall()
['']

What am I doing wrong?

Upvotes: 2

Views: 388

Answers (3)

Kleber Noel

Reputation: 353

Indexing a little deeper into your target node, e.g. ./ul/li/a/text() rather than ./ul//text(), fixes the empty-item issue. (I visited the page you want to scrape and tried a few XPath expressions against it.)

Then all you have to do is apply the strip logic JaSON mentioned with something like:

list(map(str.strip, response.xpath('//h2[@class="schoolName"]/following-sibling::ul/li/a/text()').getall()))

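Also, whether normalize-space works over many nodes depends on the XPath version. Scrapy's selectors are built on lxml, which implements XPath 1.0, where normalize-space() takes a single string and a node-set argument is reduced to its first node. Your first text node is whitespace-only, which is why you got ['']. In that respect your post is a duplicate of "Is it possible to apply normalize-space to all nodes XPath expression finds?"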
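
To make the first-node behaviour concrete, here is a minimal plain-Python sketch that mimics the XPath 1.0 rules rather than calling a real XPath engine. The helper names normalize_space and string_of_node_set are made up for illustration, and str.split() treats a few more characters as whitespace than XPath does, so take it as an approximation only.

```python
def normalize_space(value):
    """Approximate XPath normalize-space(): trim the ends, collapse inner runs."""
    return " ".join(value.split())


def string_of_node_set(nodes):
    """Approximate XPath 1.0 string(): a node-set becomes its FIRST node's text."""
    return nodes[0] if nodes else ""


# Shortened stand-in for the text nodes returned by ...ul//text()
text_nodes = [
    " \n\t\t\t\n\t\t\t\t",
    "\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t",
    "\n\t\t\t",
]

# normalize-space(<node-set>): only the first, whitespace-only node is used.
whole_set = [normalize_space(string_of_node_set(text_nodes))]

# normalize-space applied to each node separately.
per_node = [normalize_space(node) for node in text_nodes]

print(whole_set)  # ['']
print(per_node)   # ['', 'Accounting (ACCT)', '']
```

In Scrapy itself, the per-node variant corresponds to selecting the elements first and then calling .xpath('normalize-space(.)').get() on each one.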

Upvotes: 1

JaSON

Reputation: 4869

You need to use the strip method to get rid of the tab and newline characters:

[text for text in [text.strip() for text in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()] if text]

Upvotes: 0

mcz

Reputation: 30

You can use split() as an alternative to normalize-space():

texts = [' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tFinance (FINANCE)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
 '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t',
 '\n\t\t\t']

for x in texts:
    print(x.split())

My output:

[]
['Accounting', '(ACCT)']
[]
['Action', 'Learning', 'Programs', '(ALP)']
[]
['Economic', 'Analysis', '&', 'Policy', '(MGTECON)']
[]
['Finance', '(FINANCE)']
[]
['GSB', 'General', '&', 'Interdisciplinary', '(GSBGEN)']
[]
['Human', 'Resource', 'Management', '(HRMGT)']
[]

You can then store only the values that have content in a separate list, like this:

Final Code:

...

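# Collect every text node under the list.
texts = response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()

output = []

for x in texts:
    words = x.split()  # splits on any whitespace; empty for whitespace-only nodes
    if words:
        output.append(" ".join(words))

print(output)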

Output:

['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance (FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']

Single-line solution (based on JaSON's idea):

output = [data.strip() for data in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall() if data.strip()]

print(output)

Output:

['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance (FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']
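
One difference between the two approaches is worth a quick sketch: strip() only trims the ends of each string, whereas split() followed by " ".join() also collapses whitespace inside it. The sample string below is invented for illustration; it is not taken from the Stanford page.

```python
# Hypothetical text node with a line break in the middle of the label.
raw = "\n\t\t\tFinance \n\t\t\t(FINANCE)\n\t\t"

stripped = raw.strip()             # trims leading/trailing whitespace only
collapsed = " ".join(raw.split())  # also collapses the inner whitespace run

print(repr(stripped))   # 'Finance \n\t\t\t(FINANCE)'
print(repr(collapsed))  # 'Finance (FINANCE)'
```

If the labels never contain inner line breaks, both give the same result and strip() is enough.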

Upvotes: 0
