Reputation: 339
I'm trying to parse a JSON response from the New York Times API with Scrapy into CSV, so that I can have a summary of all articles related to a particular query. I'd like to output a CSV with link, publication date, summary, and title so that I can run a few keyword searches on the summary description. I'm new to both Python and Scrapy, but here's my spider (I'm getting an HTTP 400 error). I've xx'ed out my API key in the spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from nytimesAPIjson.items import NytimesapijsonItem
import json
import urllib2

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["http://api.nytimes.com/svc/search/v2/articlesearch"]
    req = urllib2.urlopen('http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx')

    def json_parse(self, response):
        jsonresponse = json.loads(response)
        item = NytimesapijsonItem()
        item["pubDate"] = jsonresponse["pub_date"]
        item["description"] = jsonresponse["lead_paragraph"]
        item["title"] = jsonresponse["print_headline"]
        item["link"] = jsonresponse["web_url"]
        items.append(item)
        return items
If anybody has any ideas/suggestions, including ones outside of Scrapy, please let me know. Thanks in advance.
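Since the asker is open to approaches outside Scrapy: the CSV step itself doesn't need a crawler at all. In the v2 Article Search response, the articles are nested under response -> docs, and the per-article fields (web_url, pub_date, lead_paragraph, headline) live on each doc, not at the top level. A minimal standard-library sketch, using an inline sample whose structure is assumed from that format (verify the field names against a live response):

```python
import csv
import json

# A trimmed sample of the v2 Article Search response shape
# (structure assumed; double-check against a real API response).
sample = json.loads("""
{
  "response": {
    "docs": [
      {
        "web_url": "http://www.nytimes.com/2013/09/16/example.html",
        "lead_paragraph": "Example summary paragraph.",
        "pub_date": "2013-09-16T00:00:00Z",
        "headline": {"main": "Example Headline", "print_headline": "Example Headline"}
      }
    ]
  }
}
""")

def docs_to_rows(data):
    """Flatten the nested docs into (link, pubDate, description, title) rows."""
    for doc in data["response"]["docs"]:
        yield [
            doc["web_url"],
            doc["pub_date"],
            doc["lead_paragraph"],
            doc["headline"]["print_headline"],
        ]

# Write the flattened rows out as the CSV the question describes.
with open("articles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["link", "pubDate", "description", "title"])
    writer.writerows(docs_to_rows(sample))
```

In a real run you'd replace `sample` with the decoded body of the actual API request; the flattening logic stays the same.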
Upvotes: 1
Views: 3087
Reputation: 473763
You should set start_urls and use the parse method:
from scrapy.spider import BaseSpider
import json

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["api.nytimes.com"]
    start_urls = ['http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx']

    def parse(self, response):
        jsonresponse = json.loads(response.body)
        print jsonresponse
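From there, parse should walk response -> docs to build one item per article; the question's json_parse indexes jsonresponse["pub_date"] on the top-level object, which would fail because those keys live on each doc. A sketch of just the extraction step, kept free of Scrapy so it runs standalone (the doc field names are assumed from the v2 Article Search format; in the spider you'd call this from parse on the decoded body and yield one NytimesapijsonItem per dict):

```python
import json

def extract_items(body):
    """Pull the per-article fields out of a raw JSON body string.

    Returns one dict per article under response -> docs, keyed the
    same way as the question's NytimesapijsonItem fields.
    """
    data = json.loads(body)
    items = []
    for doc in data["response"]["docs"]:
        items.append({
            "link": doc["web_url"],
            "pubDate": doc["pub_date"],
            "description": doc["lead_paragraph"],
            "title": doc["headline"]["print_headline"],
        })
    return items

# Hypothetical one-article body, shaped like the API response.
body = ('{"response": {"docs": [{"web_url": "http://example.com", '
        '"pub_date": "2013-01-01", "lead_paragraph": "Summary.", '
        '"headline": {"print_headline": "Title"}}]}}')
print(extract_items(body))
```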
Upvotes: 2