Reputation: 21
My goal is to scrape the comics in order of day of the week and save them to an Excel spreadsheet. My source is https://comic.naver.com/webtoon/weekday.nhn.
I have had success scraping the data directly in the terminal and would like to write a proper script for the entire process, but have not had much success.
Scraping the data directly in the terminal with response.xpath("//div[@class='list_area daily_all']/div[1]/div/h4/span/text()").extract()
properly yields the data. The weekdays are ordered from div[1] to div[7], and this expression returns "Monday".
The following code returns a list of Monday comics.
response.xpath("//div[@class='list_area daily_all']/div[1]/div//ul/li/a[@class='title']/text()").extract()
However, the following code does not return the desired results.
def parse(self, response):
    for webtoon in response.xpath("//div[@class='list_area daily_all']/div/div"):
        yield {
            'Day': webtoon.xpath('/h4/span/text()').extract(),
            'Title': webtoon.xpath("/ul/li/a[@class='title']/text()").extract(),
        }
The expected result would be seven lines of the following output, one per day, in order of day of the week:
{'Day': [day], 'Title': [title1, title2, title3]}
However, my code is returning
{'Day': [], 'Title': []}
I hope this all makes sense.
Upvotes: 0
Views: 55
Reputation: 431
You need to start your "Day" and "Title" XPath expressions with a . (dot).
When you write the expression below, it doesn't matter that you are calling xpath on webtoon instead of response: a path that starts with / is absolute, so you are still trying to get an h4 element at the root of the document, not an h4 tag under the list_area daily_all div.
webtoon.xpath('/h4/span/text()').extract()
The correct way to do this is to add a . before the /h4; this dot references the current position of your previous xpath selector.
webtoon.xpath('./h4/span/text()').extract()
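The same dot is needed in the "Title" expression. Because the working shell expression in the question uses //ul, the list may not be a direct child of the inner div, so .//ul is probably the safer relative form there. Here is a minimal sketch of the idea using only the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The SAMPLE markup, the col and col_inner class names, and the rows_per_day helper are all made up for illustration; the real page will differ, and ElementTree returns elements (read with .text) instead of supporting text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the structure described in the
# question; the real page is larger and is not guaranteed to look like this.
SAMPLE = """
<html><body>
<div class="list_area daily_all">
  <div class="col">
    <div class="col_inner">
      <h4><span>Monday</span></h4>
      <ul>
        <li><a class="title" href="#">Title A</a></li>
        <li><a class="title" href="#">Title B</a></li>
      </ul>
    </div>
  </div>
  <div class="col">
    <div class="col_inner">
      <h4><span>Tuesday</span></h4>
      <ul>
        <li><a class="title" href="#">Title C</a></li>
      </ul>
    </div>
  </div>
</div>
</body></html>
"""


def rows_per_day(markup):
    """Return one dict per day column, using paths relative to each column."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Select every inner day column, as the loop in the question does.
    for column in root.findall(".//div[@class='list_area daily_all']/div/div"):
        # The leading dot makes each path start at the current column
        # instead of at the document root.
        day = [span.text for span in column.findall("./h4/span")]
        titles = [a.text for a in column.findall(".//ul/li/a[@class='title']")]
        rows.append({"Day": day, "Title": titles})
    return rows


ROWS = rows_per_day(SAMPLE)
for row in ROWS:
    print(row)
```

In the spider itself, the equivalent change would be webtoon.xpath('./h4/span/text()') for "Day" and, under the assumption above, webtoon.xpath(".//ul/li/a[@class='title']/text()") for "Title".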
Upvotes: 1