Reputation: 13
I am trying to scrape an event website, and the code below scrapes each event's name and location. I write the output to a CSV file, but all the event names end up appended to each other on a single line.
For example, suppose there are two events, Bruno Mars and Maroon 5, located in San Jose and Santa Clara respectively. The current output is:
event_name event_location
Bruno Mars, Maroon 5 San Jose, Santa Clara
But I was hoping to see:
event_name event_location
Bruno Mars San Jose
Maroon 5 Santa Clara
Can someone please let me know why the output is formatted this way? The code is below. I run it with:
scrapy crawl event_spider -o output.csv -t csv
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from event_test.items import EventItem


class EventSpider(BaseSpider):
    name = "event_spider"
    allowed_domains = ["eventful.com"]
    start_urls = [
        "http://eventful.com/sanjose/events"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        events = hxs.select("/html/body[@id='events']/div[@id='outer-container']/div[@id='mid-container']/div[@id='inner-container']/div[@id='content']/div[@class='cols-2-1']/div[@class='alpha']/div[@id='top-events']/div[@class='section top-events cage-dbl-border cage-bdr-mdgrey']/div[@id='events-scroll']/div[@id='events-scroll-items']/ul[@id='events-scroll-items-list']/li[@class='top-events-item ']")
        items = []
        for event in events:
            item = EventItem()
            item['event_name'] = event.select("//h2/a/span/text()").extract()
            item['event_locality'] = event.select("//span[@class='locality']/text()").extract()
            items.append(item)
        return items
Upvotes: 1
Views: 1667
Reputation: 473753
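The XPaths inside your loop start with //, which searches the whole document rather than the current li, so every item ends up with the names and localities of all events. On top of that, extract() returns a list, and the list is written into a single CSV cell. Starting the expressions with a dot (.//) makes them relative to each event, and [0] takes the first match as a plain string. I've simplified the code and xpaths in your spider accordingly: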
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from event_test.items import EventItem


class EventSpider(BaseSpider):
    name = "event_spider"
    allowed_domains = ["eventful.com"]
    start_urls = ["http://eventful.com/sanjose/events"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One selector per event entry in the list
        events = hxs.select("//li[contains(@class, 'top-events-item')]")
        for event in events:
            item = EventItem()
            # The leading dot keeps each search inside the current <li>;
            # [0] takes the first match as a string instead of a list
            item['event_name'] = event.select(".//h2/a/span/text()").extract()[0]
            item['event_locality'] = event.select(".//span[@class='locality']/text()").extract()[0]
            yield item
Here's what you'll get in the csv file:
event_name,event_locality
Under the Influence of Music Tour,Mountain View
Bruno Mars,San Jose
John Mayer: Born & Raised Tour 2013,Mountain View
New Kids on the Block with 98 Degrees and ...,San Jose
Amy Grant,San Jose
Styx,Saratoga
Bob Dylan with Wilco,Mountain View
Kenny Chesney with Eli Young Band,Mountain View
Smash Mouth \/ Sugar Ray \/ Gin Blossoms \...,Saratoga
Creedence Clearwater Revisited \/ 38 Special,Saratoga
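If it helps to see the difference without running a crawl, here is a minimal sketch using only the standard library. It is an analogy, not Scrapy itself: the markup in PAGE is a made-up, simplified stand-in for the event list, the helpers texts and to_csv are hypothetical, and joining list values with a comma is an assumption meant to mimic what the CSV export appears to do with multi-valued fields. Searching from the document root plays the role of the original "//..." expressions, and searching from each li plays the role of ".//...".

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the event list markup.
PAGE = """
<html><body><ul>
  <li class="top-events-item"><h2><a><span>Bruno Mars</span></a></h2>
      <span class="locality">San Jose</span></li>
  <li class="top-events-item"><h2><a><span>Maroon 5</span></a></h2>
      <span class="locality">Santa Clara</span></li>
</ul></body></html>
""".strip()

root = ET.fromstring(PAGE)
events = root.findall(".//li[@class='top-events-item']")


def texts(node, path):
    """Return the text of every element matching path below node."""
    return [el.text for el in node.findall(path)]


# Document-wide search, repeated once per event: every item gets all values.
unscoped = [
    {"event_name": texts(root, ".//h2/a/span"),
     "event_locality": texts(root, ".//span[@class='locality']")}
    for _ in events
]

# Search relative to each <li>, keeping only the first match as a string.
scoped = [
    {"event_name": texts(event, ".//h2/a/span")[0],
     "event_locality": texts(event, ".//span[@class='locality']")[0]}
    for event in events
]


def to_csv(rows):
    """Write rows as CSV; list values are joined with commas into one cell."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["event_name", "event_locality"])
    for row in rows:
        values = (row["event_name"], row["event_locality"])
        writer.writerow([",".join(v) if isinstance(v, list) else v for v in values])
    return buf.getvalue()


print(to_csv(unscoped))  # every row repeats all names and all localities
print(to_csv(scoped))    # one event per row
```

The first table reproduces the symptom from the question (all names in one cell, on every row), and the second has one event per row.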
Hope that helps.
Upvotes: 1