dwismer

Reputation: 15

Scrapy is returning content from a different webpage

I am trying to scrape fight data from Tapology.com, but Scrapy is returning content from a completely different web page. For example, I want to pull the fighter names from the following link:

https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii

So I open scrapy shell with:

scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'

I then try to pull the fighter names with the following code:

response.css('.fighterNames ::text').getall()

I get this as a result:

['\n', '\n', '\n', 'Billy Ayash', '\n', '\n', '\n', 'Dennis Reed', '\n', '\n', '\n', '\n', '"The Punisher"', '\n', '\n', '\n']

As you can see on the webpage, and if you inspect the HTML, the names returned should be 'Robbie Lawler' and 'Rory MacDonald'. Even odder, Scrapy returns different content every time I fetch this page in the shell; it is not always the fight page for Billy Ayash and Dennis Reed.

Is something wrong with Scrapy? Is something wrong with Tapology.com? Any help would be appreciated! I've used Scrapy on ufcstats.com with no issues whatsoever, both before and after this test.

Here's the full shell session:

(base) davidwismer@Davids-MacBook-Pro ~ % scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2021-03-03 17:18:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-03 17:18:03 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2021-03-03 17:18:03 [scrapy.extensions.telnet] INFO: Telnet Password: b44d20b5d1bbeb73
2021-03-03 17:18:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-03 17:18:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-03 17:18:04 [scrapy.core.engine] INFO: Spider opened
2021-03-03 17:18:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii> (referer: None)
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc4d97c5730>
[s]   item       {}
[s]   request    <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   response   <200 https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s]   settings   <scrapy.settings.Settings object at 0x7fc4d97c5e50>
[s]   spider     <DefaultSpider 'default' at 0x7fc4d9e26100>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
In [1]: response.css('.fighterNames ::text').getall()
Out[1]: 
['\n',
 '\n',
 '\n',
 'Billy Ayash',
 '\n',
 '\n',
 '\n',
 'Dennis Reed',
 '\n',
 '\n',
 '\n',
 '\n',
 '"The Punisher"',
 '\n',
 '\n',
 '\n']

Upvotes: 0

Views: 132

Answers (1)

gribvirus74

Reputation: 765

I tested it with requests + BeautifulSoup4 and got the same results.

However, when I set the User-Agent header to a browser-like value (taken from my web browser in the example below), I got valid results. Here's the code:

from requests import get
from bs4 import BeautifulSoup


def get_names(with_user_agent: bool):
    """Fetch the bout page and print the text of each fighter-name span."""
    if with_user_agent:
        # Browser-like User-Agent (copied from Firefox)
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
    else:
        # No override: requests sends its default python-requests User-Agent
        headers = {}

    r = get('https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii', headers=headers, timeout=10)
    r.raise_for_status()

    soup = BeautifulSoup(r.text, features='html.parser')
    names = soup.select('.fighterNames span')

    print('Names:')
    for n in names:
        print(n.text.strip())
    print('---')


if __name__ == '__main__':
    # Three requests per mode, to show whether the returned page varies
    print('Without user agent:')
    for _ in range(3):
        get_names(False)

    print('\nWith user agent:')
    for _ in range(3):
        get_names(True)

Output:

Without user agent:
Names:
Jared Downing
Danny Tims
"Demon Eyes"

---
Names:
Allen Hope
Mike Kent
"Bunzy"

---
Names:
Paweł Sikora
Patryk Domke
"Ponczek"
"Patrykos"
---

With user agent:
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---

Upvotes: 1
