chitown2015

Reputation: 43

How to scrape the historical snapshot table from CoinMarketCap using BeautifulSoup

I am trying to scrape historical snapshot data from CoinMarketCap using Python:

https://coinmarketcap.com/historical/20201227/

I tried using BeautifulSoup. It works fine up to row 20, but after that the returned rows look very different.

import pandas as pd
import requests
from bs4 import BeautifulSoup

date = '20211219/'
URL = 'https://coinmarketcap.com/historical/' + date
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.text, 'lxml')    # 'html.parser'
tr = soup.find_all('tr', attrs={'class': 'cmc-table-row'})

The first twenty elements of tr contain all the columns from the webpage.

Starting with the 21st element, the rows look very different and don't include what is actually shown in the table on the webpage.

So I have not been able to scrape the data after the 20th row. How can I access this part of the table?

Upvotes: 1

Views: 804

Answers (2)

sw.

Reputation: 3231

The problem you have is potentially an anti-scraping technique from CoinMarketCap. To work with the CoinMarketCap weekly historical snapshots under https://coinmarketcap.com/historical/ you should first render the page (e.g. with a headless browser) and save the resulting DOM to a file. Your script will then work on that saved document.

This works because the page you see in the browser (for example, in the Google Chrome inspector) is not the same as the original HTML. The original HTML includes scripts that run and modify the DOM, and only after they have run is the table fully parseable with BeautifulSoup.

If you prefer, instead of driving a headless browser directly you can use it through a tool such as Playwright. There are Q&As here on SO about this, for example: Get entire Playwright page in html and Text
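
As a rough sketch of the render-then-parse approach described above, the following uses Playwright's synchronous API to load the page in headless Chromium and then reuses the cmc-table-row selector from the question. The helper names (snapshot_url, render_html, extract_rows), the scrolling loop and its timings are assumptions made for illustration; whether scrolling is needed, and whether the class name still matches, depends on the current state of the page.

```python
from typing import List


def snapshot_url(date: str) -> str:
    """Build the historical snapshot URL for a YYYYMMDD date string."""
    return f"https://coinmarketcap.com/historical/{date}/"


def render_html(url: str, scroll_steps: int = 20) -> str:
    """Load the page in headless Chromium and return the rendered DOM as HTML."""
    # Imported here so the other helpers can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Assumption: rows are only filled in as they scroll into view,
        # so scroll down in steps and pause briefly after each one.
        for _ in range(scroll_steps):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(300)
        html = page.content()
        browser.close()
    return html


def extract_rows(html: str) -> List[List[str]]:
    """Return the cell texts of every table row with class 'cmc-table-row'."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr", attrs={"class": "cmc-table-row"})
    return [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
```

Under these assumptions, extract_rows(render_html(snapshot_url('20201227'))) would return one list of cell texts per table row. Playwright needs its browser binaries installed first (playwright install chromium).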

Upvotes: 1

Barry the Platipus

Reputation: 10460

In case you haven't found a solution by now: that page pulls its data from an API, and the following code will get you the data you're after:

import pandas as pd
import requests

my_date = '2020-12-27'

r = requests.get(f'https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical?convert=USD,USD,BTC&date={my_date}&limit=5000&start=1')
# json_normalize flattens the nested 'quote' dicts into columns such as quote.USD.price
df = pd.json_normalize(r.json()['data'])
print(df)

This returns a rather large dataframe [4048 rows x 33 columns]:

id name symbol slug num_market_pairs date_added tags max_supply circulating_supply total_supply platform cmc_rank self_reported_circulating_supply self_reported_market_cap tvl_ratio last_updated quote.BTC.price quote.BTC.volume_24h quote.BTC.percent_change_1h quote.BTC.percent_change_24h quote.BTC.percent_change_7d quote.BTC.market_cap quote.BTC.fully_diluted_market_cap quote.BTC.tvl quote.BTC.last_updated quote.USD.price quote.USD.volume_24h quote.USD.percent_change_1h quote.USD.percent_change_24h quote.USD.percent_change_7d quote.USD.market_cap quote.USD.tvl quote.USD.last_updated
0 1 Bitcoin BTC bitcoin 9712 2013-04-28T00:00:00.000Z ['mineable', 'pow', 'sha-256', 'store-of-value', 'state-channel', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio', 'binance-labs-portfolio', 'blockchain-capital-portfolio', 'boostvc-portfolio', 'cms-holdings-portfolio', 'dcg-portfolio', 'dragonfly-capital-portfolio', 'electric-capital-portfolio', 'fabric-ventures-portfolio', 'framework-ventures-portfolio', 'galaxy-digital-portfolio', 'huobi-capital-portfolio', 'alameda-research-portfolio', 'a16z-portfolio', '1confirmation-portfolio', 'winklevoss-capital-portfolio', 'usv-portfolio', 'placeholder-ventures-portfolio', 'pantera-capital-portfolio', 'multicoin-capital-portfolio', 'paradigm-portfolio'] 2.1e+07 1.85828e+07 1.85828e+07 1 2020-12-27T23:00:00.000Z 1 2.53042e+06 0 0 0 1.85828e+07 2020-12-27T23:59:41.000Z 26272.3 6.64799e+10 -0.910864 -0.623152 11.9051 4.88213e+11 2020-12-27T23:00:00.000Z
1 1027 Ethereum ETH ethereum 5916 2015-08-07T00:00:00.000Z ['mineable', 'pow', 'smart-contracts', 'ethereum-ecosystem', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio', 'binance-labs-portfolio', 'blockchain-capital-portfolio', 'boostvc-portfolio', 'cms-holdings-portfolio', 'dcg-portfolio', 'dragonfly-capital-portfolio', 'electric-capital-portfolio', 'fabric-ventures-portfolio', 'framework-ventures-portfolio', 'hashkey-capital-portfolio', 'kenetic-capital-portfolio', 'huobi-capital-portfolio', 'alameda-research-portfolio', 'a16z-portfolio', '1confirmation-portfolio', 'winklevoss-capital-portfolio', 'usv-portfolio', 'placeholder-ventures-portfolio', 'pantera-capital-portfolio', 'multicoin-capital-portfolio', 'paradigm-portfolio', 'injective-ecosystem', 'bnb-chain'] nan 1.1401e+08 1.1401e+08 2 2020-12-27T23:00:00.000Z 0.0259834 993197 -0.514148 7.36142 6.94848 2.96236e+06 2020-12-27T23:59:41.000Z 682.642 2.60936e+10 -0.514148 7.36142 6.94848 7.78281e+10 2020-12-27T23:00:00.000Z
2 825 Tether USDT tether 9666 2015-02-25T00:00:00.000Z ['payments', 'stablecoin', 'asset-backed-stablecoin', 'avalanche-ecosystem', 'solana-ecosystem', 'arbitrum-ecosytem', 'moonriver-ecosystem', 'injective-ecosystem', 'bnb-chain', 'usd-stablecoin'] nan 2.07532e+10 2.12833e+10 3 2020-12-27T23:00:00.000Z 3.80193e-05 3.62606e+06 -0.00446154 0.0374141 -0.0789107 789021 2020-12-27T23:59:41.000Z 0.998854 9.52649e+10 -0.00446154 0.0374141 -0.0789107 2.07294e+10 2020-12-27T23:00:00.000Z
3 52 XRP XRP xrp 683 2013-08-04T00:00:00.000Z ['medium-of-exchange', 'enterprise-solutions', 'binance-chain', 'arrington-xrp-capital-portfolio', 'galaxy-digital-portfolio', 'a16z-portfolio', 'pantera-capital-portfolio'] 1e+11 4.5404e+10 9.99908e+10 4 2020-12-27T23:00:00.000Z 1.07733e-05 352094 -1.1233 -3.96119 -49.0989 489151 2020-12-27T23:59:41.000Z 0.283039 9.25033e+09 -1.1233 -3.96119 -49.0989 1.28511e+10 2020-12-27T23:00:00.000Z
4 2 Litecoin LTC litecoin 747 2013-04-28T00:00:00.000Z ['mineable', 'pow', 'scrypt', 'medium-of-exchange', 'binance-chain', 'bnb-chain'] 8.4e+07 6.61837e+07 6.61837e+07 5 2020-12-27T23:00:00.000Z 0.00485367 536813 -0.325724 -1.50027 11.2073 321234 2020-12-27T23:59:41.000Z 127.517 1.41033e+10 -0.325724 -1.50027 11.2073 8.43955e+09 2020-12-27T23:00:00.000Z

[...]
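
The page URL in the question takes the date as YYYYMMDD, while the API call above takes YYYY-MM-DD. Below is a minimal sketch of a helper that converts between the two and builds the same request. The function names are made up for this example, and the endpoint is simply the one quoted in this answer, which may change or stop responding at any time. pd.json_normalize is used so that the nested quote fields become flat columns like quote.USD.price.

```python
from datetime import datetime
from urllib.parse import urlencode

API_URL = "https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical"


def to_api_date(page_date: str) -> str:
    """Convert '20201227' or '20201227/' (page URL form) to '2020-12-27'."""
    return datetime.strptime(page_date.strip("/"), "%Y%m%d").strftime("%Y-%m-%d")


def build_api_url(page_date: str, limit: int = 5000) -> str:
    """Build the request URL, reusing the query parameters from the answer."""
    params = {
        "convert": "USD,USD,BTC",  # copied unchanged from the request above
        "date": to_api_date(page_date),
        "limit": limit,
        "start": 1,
    }
    # safe=',' keeps the commas in 'convert' unescaped, as in the original URL.
    return f"{API_URL}?{urlencode(params, safe=',')}"


def fetch_snapshot(page_date: str):
    """Download one snapshot and flatten it into a pandas DataFrame."""
    # Imported here so the helpers above work without third-party packages.
    import pandas as pd
    import requests

    r = requests.get(build_api_url(page_date), timeout=30)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return pd.json_normalize(r.json()["data"])
```

With this, fetch_snapshot('20211219/') would request the same snapshot date the question's code was trying to scrape.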

Upvotes: -1
