SPFort

Reputation: 63

Python Scrapy is not getting all html elements from a webpage

I am trying to use Scrapy to get the names of all current WWE superstars from the following URL: http://www.wwe.com/superstars However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that Scrapy is not finding all of the HTML elements on the page. I also tried requests and Beautiful Soup, and the HTML that requests fetched was missing important parts of the HTML that I was seeing in my browser's inspector. The HTML containing the names looks like this:

<div class="superstars--info">
    <span class="superstars--name">name here</span>
</div>

My code is posted below. Is there something that I am doing wrong that is causing this not to work?

import scrapy

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name' : star.css(NAME_SELECTOR).extract_first(),
            }

Upvotes: 0

Views: 1892

Answers (2)

notorious.no

Reputation: 5107

It sounds like the site has dynamic content that may be loaded via JavaScript and/or XHR calls. Look into Splash; it's a JavaScript rendering engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is very simple to set up. Once you have Splash running, you'll have to integrate it with Scrapy using the scrapy-splash plugin.
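A minimal scrapy-splash sketch might look like the following (this assumes a Splash container is listening on localhost:8050; the spider names are taken from the question):

```python
# settings.py -- plug scrapy-splash into the crawl pipeline
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider -- route requests through Splash so the javascript runs
# before parse() sees the response
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 2})
```

With this in place, the original `parse` method and its `.superstars--info` selector should work unchanged, because the response Splash returns contains the rendered DOM.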

Upvotes: 2

drec4s

Reputation: 8077

Since the content is generated by JavaScript, you have two options: use something like Selenium to mimic a browser and parse the rendered HTML, or, if you can, query an API directly.
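For the Selenium route, a minimal sketch might look like this (it assumes chromedriver is installed and on your PATH; the `.superstars--name` selector is taken from the question):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver on PATH
try:
    driver.get("http://www.wwe.com/superstars")
    # Once the browser has executed the page's javascript,
    # the name spans exist in the DOM and can be selected
    names = [el.text for el in
             driver.find_elements(By.CSS_SELECTOR, ".superstars--name")]
    print(names)
finally:
    driver.quit()
```

This is heavier than calling the API, since it spins up a real browser, but it works on pages where no convenient JSON endpoint exists.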

In this case, this simple solution works:

import requests

# Undocumented JSON endpoint that backs the superstars page
URL = "http://www.wwe.com/api/superstars"

with requests.Session() as s:
    # Some sites reject requests that lack a browser-like user agent
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:
        print(x['name'])

Output (first 10 records):

Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
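The parsing step can also be exercised offline. Assuming the endpoint returns a JSON object with a `talent` list of records that each carry a `name` field (which is what the code above implies), the extraction reduces to:

```python
import json

# Sample payload mirroring the assumed shape of the API response
sample = '{"talent": [{"name": "AJ Styles"}, {"name": "Akira Tozawa"}]}'

data = json.loads(sample)
names = [record['name'] for record in data['talent']]
print(names)  # ['AJ Styles', 'Akira Tozawa']
```

Testing the extraction against a canned payload like this is handy when the real endpoint is slow or rate-limited.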

Upvotes: 1
