Reputation: 265
I am trying to scrape the article titles, but I cannot figure out how to extract the title text. Could you please take a look at my code below and suggest a solution?
I am new to Scrapy. I appreciate the help!
Screenshot of the page in the browser's developer tools: https://i.sstatic.net/bPn4W.jpg
import scrapy


class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://www.mckinsey.com/search?q=Agile&start=1']

    def parse(self, response):
        for quote in response.css('div.text-wrapper'):
            item = {
                'text': quote.css('h3.headline::text').extract(),
            }
            print(item)
            yield item
Upvotes: 0
Views: 1278
Reputation: 3717
Looks good for someone new to Scrapy! I would change only the selector in your parse function:
for quote in response.css('div.block-list div.item'):
    yield {
        'text': quote.css('h3.headline::text').get(),
    }
UPD: Hm, it looks like the website makes an additional request to load the data.
Open the developer tools and check the request to https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search with the params {"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}.
You can make a scrapy.Request with these params and the appropriate headers and get JSON with the data back. It is easy to parse with the json library.
UPD2: As I can see from this curl command:
curl 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search' -H 'content-type: application/json' --data-binary '{"q":"Agile","page":1,"app":"","sort":"default","ignoreSpellSuggestion":false}' --compressed
we need to make the request this way:
from scrapy import Request
import json

data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
headers = {"content-type": "application/json"}
url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"
# This line belongs inside a spider callback. curl sends a POST when
# --data-binary is given, so the method is set explicitly here.
yield Request(url, method='POST', headers=headers, body=json.dumps(data), callback=self.parse_api)
and then, in the parse_api function, just parse the response:
def parse_api(self, response):
    data = json.loads(response.body)
    # and then extract what you need
This way you can increment the page parameter in the request and fetch all pages.
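As a rough sketch of that pagination idea, the request bodies for successive pages can be generated with the standard library alone. The helper names build_search_body and iter_search_bodies, and the fixed page cap, are assumptions made for illustration; the payload fields are the ones shown in the request above.

```python
import json


def build_search_body(query, page):
    """Return the JSON request body for one page of search results."""
    payload = {
        "q": query,
        "page": page,
        "app": "",
        "sort": "default",
        "ignoreSpellSuggestion": False,
    }
    return json.dumps(payload)


def iter_search_bodies(query, first_page=1, last_page=3):
    """Yield (page, body) pairs; last_page is an arbitrary cap chosen for this sketch."""
    for page in range(first_page, last_page + 1):
        yield page, build_search_body(query, page)


if __name__ == "__main__":
    for page, body in iter_search_bodies("Agile"):
        print(page, body)
```

In a spider, each body would go into its own POST Request. In practice you would probably stop when a page comes back with no results rather than at a fixed cap.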
UPD3: Working solution:
from scrapy import Spider, Request
import json


class BrickSetSpider(Spider):
    name = "brickset_spider"
    data = {"q": "Agile", "page": 1, "app": "", "sort": "default", "ignoreSpellSuggestion": False}
    headers = {"content-type": "application/json"}
    url = "https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search"

    def start_requests(self):
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': 1})

    def parse(self, response):
        data = json.loads(response.body)
        results = data.get('data', {}).get('results')
        if not results:
            return
        for row in results:
            yield {'title': row.get('title')}
        page = response.meta['page'] + 1
        self.data['page'] = page
        yield Request(self.url, headers=self.headers, method='POST',
                      body=json.dumps(self.data), meta={'page': page})
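The title extraction in parse can also be pulled out into a plain function, which makes it possible to check it without running a crawl. This is only a sketch: extract_titles is a hypothetical helper, and the sample payload is made up to match the data/results/title keys used in the spider, not a real response from the site.

```python
import json


def extract_titles(body):
    """Return the non-empty titles from one search response body (bytes or str)."""
    payload = json.loads(body)
    # "or {}" and "or []" also cover keys that are present but null.
    results = (payload.get("data") or {}).get("results") or []
    return [row["title"] for row in results if row.get("title")]


if __name__ == "__main__":
    # Made-up sample; a real response would carry more fields.
    sample = b'{"data": {"results": [{"title": "First"}, {"title": null}, {"title": "Second"}]}}'
    print(extract_titles(sample))
    print(extract_titles(b'{"data": null}'))
```

Inside the spider, the loop could then become `for title in extract_titles(response.body): yield {'title': title}`.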
Upvotes: 5
Reputation: 2027
If you just want to select the text of an h1 tag, all you have to do is
[tag.css('::text').extract_first(default='') for tag in response.css('.attr')]
Alternatively, you can use XPath, which might be easier:
//h1[@class='state']/text()
Also, I would recommend checking out BeautifulSoup for Python. It makes it very easy to read the entire HTML of a page and extract the text. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
A very simple example would look like this:
from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
# Name the parser explicitly; without it bs4 picks one and emits a warning.
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text())
Upvotes: 0