XIII

Reputation: 2044

Scraping CoinMarketCap returns only the first 10 results per page; why aren't the other 90 returned?

I have no trouble scraping the site, for any number of pages I specify, but I only get the first 10 results of each page.

import requests
from bs4 import BeautifulSoup

def scrape_pages(page_num):
    headers = {'User-Agent':
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

    for page in range(1, page_num + 1):
        url = "https://coinmarketcap.com/?page={}".format(page)
        page_tree = requests.get(url, headers=headers)
        pageSoup = BeautifulSoup(page_tree.content, 'html.parser')

        print("Page {} parsed successfully!".format(url))

Upvotes: 1

Views: 1000

Answers (1)

baduker

Reputation: 20022

It's because only the first ten results are in the HTML you get back. The rest are added dynamically by JavaScript after the page loads, so BeautifulSoup never sees them: they're simply not in the response.
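To illustrate the point, here is a minimal sketch using only the standard library. The HTML below is a made-up stand-in, not the real CoinMarketCap markup, and `RowCounter` / `count_rows` are hypothetical names. A parser only counts the rows that are literally in the markup; the rows the script would add exist only once a browser runs it.

```python
from html.parser import HTMLParser

# Hypothetical page: two rows are in the markup, the rest would be
# appended by the script once a browser executes it.
SAMPLE_HTML = """
<table id="coins">
  <tr><td>Bitcoin</td></tr>
  <tr><td>Ethereum</td></tr>
</table>
<script>
  var table = document.getElementById("coins");
  table.appendChild(document.createElement("tr"));
</script>
"""


class RowCounter(HTMLParser):
    """Counts <tr> start tags found in the static markup."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


print(count_rows(SAMPLE_HTML))  # prints 2: the script is never executed
```

The same thing happens with requests and BeautifulSoup: neither runs JavaScript, so only the rows shipped in the initial HTML are visible.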

However, there's an API you can use to get the table data (for all the pages too, if you feel like it).

Here's how:

from urllib.parse import urlencode

import requests
from tabulate import tabulate

query_string = [
    ('start', '1'),
    ('limit', '100'),
    ('sortBy', 'market_cap'),
    ('sortType', 'desc'),
    ('convert', 'USD'),
    ('cryptoType', 'all'),
    ('tagType', 'all'),
]

base = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?"
response = requests.get(f"{base}{urlencode(query_string)}").json()

results = [
    [
        currency["name"],
        round(currency["quotes"][0]["price"], 4),
    ]
    for currency in response["data"]["cryptoCurrencyList"]
]

print(tabulate(results, headers=["Currency", "Price"], tablefmt="pretty"))

Output:

+-----------------------+------------+
|       Currency        |   Price    |
+-----------------------+------------+
|        Bitcoin        | 46204.9211 |
|       Ethereum        | 1488.0481  |
|        Tether         |   0.9995   |
|     Binance Coin      |  212.8729  |
|        Cardano        |    0.93    |
|       Polkadot        |  31.1603   |
|          XRP          |   0.4464   |
|       Litecoin        |  167.2676  |
|       Chainlink       |  25.1752   |
|     Bitcoin Cash      |  488.9875  |
|        Stellar        |   0.3724   |
|       USD Coin        |   0.9998   |
|                       |            |
|     and many more     |   values   |
+-----------------------+------------+

EDIT: To loop over the pages you might want to try this:

from urllib.parse import urlencode

import requests

query_string = [
    ('start', '1'),
    ('limit', '100'),
    ('sortBy', 'market_cap'),
    ('sortType', 'desc'),
    ('convert', 'USD'),
    ('cryptoType', 'all'),
    ('tagType', 'all'),
]

base = "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?"

with requests.Session() as session:
    response = session.get(f"{base}{urlencode(query_string)}").json()
    total = int(response["data"]["totalCount"])
    # 'start' is a 1-based offset, so the pages begin at 1, 101, 201, ...
    all_pages = list(range(1, total + 1, 100))

    for page in all_pages[:2]:  # Get the first two pages; remove the slice to get all pages.
        query_string = [
            ('start', str(page)),
            ('limit', '100'),
            ('sortBy', 'market_cap'),
            ('sortType', 'desc'),
            ('convert', 'USD'),
            ('cryptoType', 'all'),
            ('tagType', 'all'),
        ]
        response = session.get(f"{base}{urlencode(query_string)}").json()
        results = [
            [
                currency["name"],
                round(currency["quotes"][0]["price"], 4),
            ]
            for currency in response["data"]["cryptoCurrencyList"]
        ]
        print(results)

NOTE: I'm limiting this example to two pages by adding [:2] to the for loop. If you want all the pages, just remove the [:2] so the loop looks like this:

for page in all_pages:
    #  the rest of the body ...
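A note on the offsets: the `start` parameter appears to be a 1-based position rather than a page number, so with a limit of 100 the pages should begin at 1, 101, 201 and so on. Here's a minimal sketch of computing them, assuming `totalCount` is the total number of listings; `page_starts` is a hypothetical helper, not part of any library.

```python
def page_starts(total_count, page_size=100):
    """Return the 1-based 'start' value for every page of results."""
    # range() stops before total_count + 1, so a final partial page
    # is still included.
    return list(range(1, total_count + 1, page_size))


print(page_starts(250))  # [1, 101, 201]
```

Each value can be passed as ('start', str(value)) in the query string.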

Upvotes: 4
