Reputation: 7410
I am trying to extract content from the Stanford website using Scrapy and XPath. The following line gets me what I want:
response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()
However, the output of the list is as follows:
[' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tFinance (FINANCE)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t',
'\n\t\t\t']
As is evident, the output is littered with extra whitespace (\n and \t). I don't want to iterate over the list again to remove these unwanted characters, since the list is huge (truncated here for readability). I tried using XPath's normalize-space() to fix this, but it did not work.
>>> response.xpath('normalize-space(//h2[@class="schoolName"]/following-sibling::ul//text())').getall()
['']
What am I doing wrong?
Upvotes: 2
Views: 388
Reputation: 353
Indexing a little deeper into your target node, e.g. ./ul/li/a/text() rather than ./ul//text(), fixes the empty item issue. Note that I visited the webpage you want to scrape and tried some XPaths.
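Then all you have to do is apply the strip logic JaSON mentioned, with something like the following (note the .getall(), which turns the selectors into plain strings so that strip() can be called on them):
list(map(lambda x: x.strip(), response.xpath('//h2[@class="schoolName"]/following-sibling::ul/li/a/text()').getall()))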
Also, whether normalize-space works over many nodes depends on the XPath version used in your version of Scrapy. In that respect, your post is a duplicate of "Is it possible to apply normalize-space to all nodes XPath expression finds?"
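For what it's worth, the [''] result in the question is consistent with XPath 1.0 rules, which is what lxml (the library underneath Scrapy's selectors) implements: when a string function such as normalize-space() is handed a node-set, only the first node in document order is converted to a string. The sketch below emulates that behaviour in plain Python on a few strings copied from the question. xpath1_normalize_space is a hypothetical helper written for illustration, not a Scrapy or lxml function.

```python
def xpath1_normalize_space(nodes):
    """Emulate XPath 1.0 normalize-space() applied to a node-set.

    Only the first node is used; surrounding whitespace is removed and
    internal whitespace runs collapse to a single space.
    Note: str.split() treats a slightly wider set of characters as
    whitespace than XPath does, which makes no difference for this data.
    """
    first = nodes[0] if nodes else ""
    return " ".join(first.split())


# A few text nodes copied from the question's output.
text_nodes = [
    " \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t",
    "\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t",
    "\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t",
    "\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t",
]

# Whole node-set at once: the first node is whitespace only, so the result is empty.
whole_set = xpath1_normalize_space(text_nodes)

# One node at a time: each node is normalized on its own, blanks are dropped.
per_node = [xpath1_normalize_space([node]) for node in text_nodes]
per_node = [text for text in per_node if text]

print(repr(whole_set))
print(per_node)
```

In Scrapy terms, the per-node variant should correspond to selecting the elements first and then evaluating normalize-space() relative to each one, e.g. response.xpath('//h2[@class="schoolName"]/following-sibling::ul/li/a').xpath('normalize-space()').getall(). I have not run that against the live page, so treat it as a starting point.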
Upvotes: 1
Reputation: 4869
You need to use the strip() method to get rid of tab/newline characters:
[text for text in [text.strip() for text in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()] if text]
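If the concern is looping over a huge list a second time, the same strip-and-filter step can also be written as a generator, so each string is cleaned once, as it is consumed, and no intermediate list of stripped strings is built. A minimal sketch, assuming raw_texts stands in for the list returned by .getall(); iter_clean is a hypothetical name:

```python
def iter_clean(raw_texts):
    """Yield each text stripped of surrounding whitespace, skipping blanks."""
    for raw in raw_texts:
        text = raw.strip()
        if text:
            yield text


# Stand-in for response.xpath(...).getall(), using strings from the question.
raw_texts = [
    "\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t",
    "\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t",
    "\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t",
    "\n\t\t\t",
]

cleaned = list(iter_clean(raw_texts))
print(cleaned)
```

Note that .getall() has already materialized the raw list by this point, so this only saves the extra pass and the temporary list, not the original extraction.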
Upvotes: 0
Reputation: 30
You can use split() as an alternative to normalize-space():
list = [' \n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAccounting (ACCT)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tAction Learning Programs (ALP)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tEconomic Analysis & Policy (MGTECON)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t\tFinance FINANCE)\n\t\t\t\t\t', '\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t',
'\n\t\t\t\t\t\tGSB General & Interdisciplinary (GSBGEN)\n\t\t\t\t\t',
'\n\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t', '\n\t\t\t\t\t\tHuman Resource Management (HRMGT)\n\t\t\t\t\t', '\n\t\t\t']
for x in list:
    print(x.split())
My output:
[]
['Accounting', '(ACCT)']
[]
['Action', 'Learning', 'Programs', '(ALP)']
[]
['Economic', 'Analysis', '&', 'Policy', '(MGTECON)']
[]
['Finance', 'FINANCE)']
[]
['GSB', 'General', '&', 'Interdisciplinary', '(GSBGEN)']
[]
['Human', 'Resource', 'Management', '(HRMGT)']
[]
And then you can simply store the output values that have content in an extra list, like this:
Final Code:
...
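# "texts" is used instead of "list" to avoid shadowing the built-in
texts = response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall()
output = []
for x in texts:
    i = x.split()  # whitespace-only strings become an empty list
    if i:
        output.append(" ".join(i))
print(output)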
Output:
['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']
Single line solution: (based on JaSON's idea)
output = [data.strip() for data in response.xpath('//h2[@class="schoolName"]/following-sibling::ul//text()').getall() if data.strip()]
print(output)
Output:
['Accounting (ACCT)', 'Action Learning Programs (ALP)', 'Economic Analysis & Policy (MGTECON)', 'Finance FINANCE)', 'GSB General & Interdisciplinary (GSBGEN)', 'Human Resource Management (HRMGT)']
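One difference between the two approaches that may matter: strip() only trims the ends of each string, while split() followed by " ".join() also collapses whitespace inside it, which is closer to what normalize-space() does. A small illustration, using a made-up string (not taken from the page) with a line break in the middle:

```python
# Hypothetical text node with a newline and tabs inside the label.
raw = "\n\t\t\tHuman Resource\n\t\t\tManagement (HRMGT)\n\t\t"

stripped = raw.strip()             # trims the ends only
collapsed = " ".join(raw.split())  # also collapses internal whitespace runs

print(repr(stripped))
print(repr(collapsed))
```

For the data shown in the question both give the same result, since the labels contain no internal line breaks.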
Upvotes: 0