SPFort

Reputation: 63

Python Scrapy is not getting all html elements from a webpage

I am trying to use Scrapy to get the names of all current WWE superstars from the following URL: http://www.wwe.com/superstars However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that Scrapy is not finding all of the HTML elements on the page. I also tried requests and Beautiful Soup, and the HTML that requests fetched was missing important parts of the HTML that I was seeing in my browser's inspector. The HTML containing the names looks like this:

<div class="superstars--info">
    <span class="superstars--name">name here</span>
</div>

My code is posted below. Is there something that I am doing wrong that is causing this not to work?

import scrapy

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name' : star.css(NAME_SELECTOR).extract_first(),
            }

Upvotes: 0

Views: 1892

Answers (2)

notorious.no

Reputation: 5107

It sounds like the site has dynamic content that may be loaded via JavaScript and/or XHR calls. Look into Splash; it's a JavaScript rendering engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is very simple to set up. Once you have Splash running, you'll have to integrate it with Scrapy using the scrapy-splash plugin.
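A minimal scrapy-splash sketch might look like the following (this assumes a Splash container is listening on localhost:8050; the spider names are taken from the question):

```python
# settings.py -- plug scrapy-splash into the crawl pipeline
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider -- route requests through Splash so the javascript runs
# before parse() sees the response
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 2})
```

With this in place, the original `parse` method and its `.superstars--info` selector should work unchanged, because the response Splash returns contains the rendered DOM.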

Upvotes: 2

drec4s

Reputation: 8077

Since the content is generated by JavaScript, you have two options: use something like Selenium to mimic a browser and parse the rendered HTML, or, if you can, query an API directly.
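For the Selenium route, a minimal sketch might look like this (it assumes chromedriver is installed and on your PATH; the `.superstars--name` selector is taken from the question):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver on PATH
try:
    driver.get("http://www.wwe.com/superstars")
    # Once the browser has executed the page's javascript,
    # the name spans exist in the DOM and can be selected
    names = [el.text for el in
             driver.find_elements(By.CSS_SELECTOR, ".superstars--name")]
    print(names)
finally:
    driver.quit()
```

This is heavier than calling the API, since it spins up a real browser, but it works on pages where no convenient JSON endpoint exists.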

In this case, this simple solution works:

import requests

# Undocumented JSON endpoint that backs the superstars page
URL = "http://www.wwe.com/api/superstars"

with requests.Session() as s:
    # Some sites reject requests that lack a browser-like user agent
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:
        print(x['name'])

Output (first 10 records):

Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
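The parsing step can also be exercised offline. Assuming the endpoint returns a JSON object with a `talent` list of records that each carry a `name` field (which is what the code above implies), the extraction reduces to:

```python
import json

# Sample payload mirroring the assumed shape of the API response
sample = '{"talent": [{"name": "AJ Styles"}, {"name": "Akira Tozawa"}]}'

data = json.loads(sample)
names = [record['name'] for record in data['talent']]
print(names)  # ['AJ Styles', 'Akira Tozawa']
```

Testing the extraction against a canned payload like this is handy when the real endpoint is slow or rate-limited.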

Upvotes: 1
