Reputation: 111
I am trying to get data (the title) from this page, but my code doesn't work. What am I doing wrong?
scrapy shell https://www.indiegogo.com/projects/functional-footwear-run-pain-free#/
response.css('.t-h3--sansSerif::text').getall()
Upvotes: 0
Views: 612
Reputation: 836
Always check the page source with view-source first. In this case the source does not contain the element you are looking for; it is created dynamically with JavaScript.
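To illustrate the check, here is a minimal sketch using only the standard library. The two HTML strings are hypothetical stand-ins (not the real Indiegogo markup) for the raw page source and for the DOM after JavaScript has run; `ClassFinder` and `has_class` are names made up for this example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the fed HTML carries the target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name == "class" and value and self.target_class in value.split():
                self.found = True


def has_class(html, target_class):
    finder = ClassFinder(target_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-in for the unrendered source: an empty mount point plus a script
raw_html = '<html><body><div id="main"></div><script src="app.js"></script></body></html>'
# Hypothetical stand-in for the DOM after JavaScript has run
rendered_html = '<div id="main"><div class="t-h3--sansSerif">Title</div></div>'

print(has_class(raw_html, "t-h3--sansSerif"))       # class absent from raw source
print(has_class(rendered_html, "t-h3--sansSerif"))  # class present after rendering
```

If the class is missing from the raw source, as in the first call, a plain Scrapy response will not contain it either, and the selector returns an empty list.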
You can use Selenium to scrape such sites, but it comes with caveats: it is synchronous, so it does not fit well with Scrapy's asynchronous request handling.
Since you are using Scrapy, a better option is the scrapy-splash package. Splash renders JavaScript and returns the fully rendered HTML page, which you can then scrape with XPath or CSS selectors. Note that you need to run a Splash server in a Docker container and use it like a proxy server to render JavaScript.
docker pull scrapinghub/splash
docker run -d -p 8050:8050 --memory=1.5G --restart=always scrapinghub/splash --maxrss 1500 --max-timeout 3600 --slots 10
Here's a link to the documentation. https://splash.readthedocs.io/en/stable/
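Before wiring Splash into Scrapy, it can help to confirm that the container renders the page by calling its HTTP API directly. The sketch below builds a request URL for Splash's render.html endpoint. It assumes Splash is reachable at localhost:8050 (the port published by the docker run command above); `render_url` and `fetch_rendered` are helper names made up for this example, and the 2-second wait is an arbitrary choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050"  # assumption: port published by the docker run command


def render_url(page_url, wait=2.0):
    # render.html returns the page HTML after JavaScript has run;
    # wait is the number of seconds Splash waits after the page loads
    return SPLASH + "/render.html?" + urlencode({"url": page_url, "wait": wait})


def fetch_rendered(page_url, wait=2.0):
    # Needs a running Splash container, so it is defined but not called here
    with urlopen(render_url(page_url, wait)) as resp:
        return resp.read().decode("utf-8")


print(render_url("https://example.com/"))
```

Opening the printed URL in a browser, or calling `fetch_rendered`, should return HTML that includes the JavaScript-generated elements if Splash is running.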
Your script would look something like this. Instead of scrapy.Request, you can make requests like this:
from scrapy_splash import SplashRequest
yield SplashRequest(url=url, callback=self.parse, meta={})
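For SplashRequest to work, the scrapy-splash middlewares also need to be enabled in the project's settings.py. A sketch of that fragment, following the scrapy-splash README and assuming Splash listens on localhost:8050 as in the docker command above (adjust SPLASH_URL if your container runs elsewhere):

```python
# settings.py fragment for scrapy-splash (values as given in the scrapy-splash README)

# Assumption: Splash container started with -p 8050:8050 on the same machine
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering aware of Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```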
And then you are good to go.
Upvotes: 1
Reputation: 504
I think the problem is that the element is added dynamically through JavaScript, which is why Scrapy cannot extract it. You could try using Selenium.
Here is Selenium code to get the element:
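from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# driver is assumed to be an existing WebDriver that has already loaded the page
titles = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main .is-12-touch+ .is-12-touch"))
)
for title in titles:
    t = title.text
    print("t = ", t)  # print the extracted text, not the WebElement object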
Upvotes: 1