Reputation: 167
I basically have a list of titles, stored in a csv, that I want to search for on a website.
I'm extracting those values and then trying to append them to the search link in the
start_urls attribute.
However, when I run the script, it only uses the last value of the list. Is there any particular reason why this happens?
import pandas as pd

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["example.com"]

    df = pd.read_csv('test.csv')
    saved_column = df.ProductName
    for a in saved_column:
        start_urls = ["http://www.example.com/search?noOfResults=20&keyword=" + str(a)]

    def parse(self, response):
        ...
Upvotes: 1
Views: 1283
Reputation: 5814
There is a conceptual error in your code. The loop rebinds start_urls on every iteration instead of accumulating values, so by the time the spider starts, start_urls holds only the last value of the loop.
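To see why, here is a minimal sketch of the same pattern outside Scrapy (names is a stand-in for your df.ProductName column): each pass through the loop rebinds start_urls to a fresh one-element list, so after the loop only the final keyword is left.

```python
names = ["alpha", "beta", "gamma"]  # stand-in for df.ProductName

for a in names:
    # Rebinds start_urls to a brand-new list each time; nothing is appended.
    start_urls = ["http://www.example.com/search?keyword=" + str(a)]

print(start_urls)  # only the last keyword survives
```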
A better approach would be to override the 'start_requests' method of the spider:
    # requires: from scrapy import Request
    def start_requests(self):
        df = pd.read_csv('test.csv')
        saved_column = df.ProductName
        for keyword in saved_column:
            # The column holds product names, so build the search url first.
            url = "http://www.example.com/search?noOfResults=20&keyword=" + str(keyword)
            yield Request(url, self.parse)
Idea taken from here: How to generate the start_urls dynamically in crawling?
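Alternatively, if you prefer to keep using start_urls, you can build the whole list in one go with a list comprehension (a minimal sketch; saved_column stands in for df.ProductName, and quote_plus is an extra precaution for keywords containing spaces or other special characters):

```python
from urllib.parse import quote_plus

# Stand-in for df.ProductName; in the real spider this comes from the csv.
saved_column = ["red widget", "blue gadget"]

# Build ALL the urls at once instead of rebinding start_urls each iteration.
base = "http://www.example.com/search?noOfResults=20&keyword="
start_urls = [base + quote_plus(str(a)) for a in saved_column]

print(start_urls)
```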
Upvotes: 1