Reputation: 43
I am trying to scrape historical snapshot data from CoinMarketCap using Python:
https://coinmarketcap.com/historical/20201227/
I tried BeautifulSoup. It works fine for the first 20 rows, but after that the returned rows look very different.
import pandas as pd
import requests
from bs4 import BeautifulSoup
date = '20211219/'
URL = 'https://coinmarketcap.com/historical/' + date
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.text, 'lxml') # 'html.parser'
tr = soup.find_all('tr', attrs={'class': 'cmc-table-row'})
The first twenty elements of tr contain all the columns from the webpage.
Starting with the 21st element, the rows look quite different and don't include what is actually shown in the table on the webpage.
So I cannot scrape the data after the 20th row. How can I access this part of the table?
Upvotes: 1
Views: 804
Reputation: 3231
What you are seeing is possibly an anti-scraping measure by CoinMarketCap. To work with the CoinMarketCap weekly historical snapshots under https://coinmarketcap.com/historical/ you should first render the page (e.g. with a headless browser) and save the resulting DOM to a file. Your script should then work on the saved file.
This works because the page you see in the browser (for example, in the Google Chrome inspector) is not the same as the original HTML. The original HTML includes scripts that run and modify the DOM, and only after they have run is the table fully parseable with BeautifulSoup.
If you prefer, instead of driving a headless browser directly you can use it through a tool such as Playwright. There are Q&As here on SO about this, for example "Get entire Playwright page in html and Text".
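A minimal sketch of the render-then-save step with Playwright's synchronous API follows. The function names, the output file name, and the scroll loop are assumptions for illustration. In particular, it assumes the remaining rows are filled in as they scroll into view, which may not match how the page actually behaves.

```python
from pathlib import Path


def historical_url(date: str) -> str:
    """Build the snapshot URL for a date given as 'YYYYMMDD' (hypothetical helper)."""
    return f"https://coinmarketcap.com/historical/{date.strip('/')}/"


def render_to_file(url: str, out_path: str, scrolls: int = 20) -> Path:
    """Render `url` in headless Chromium and save the resulting DOM as HTML."""
    # Imported here so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Assumption: rows are rendered lazily, so scroll down in steps
        # to give the page's scripts a chance to fill them in.
        for _ in range(scrolls):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(300)  # milliseconds
        html = page.content()  # serialized DOM after the scripts have run
        browser.close()

    out = Path(out_path)
    out.write_text(html, encoding="utf-8")
    return out
```

Calling `render_to_file(historical_url("20201227"), "snapshot.html")` would write the rendered page to disk. The saved file can then be read back and passed to `BeautifulSoup(..., 'lxml')` with the same `find_all('tr', attrs={'class': 'cmc-table-row'})` call used in the question.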
Upvotes: 1
Reputation: 10460
In case you haven't found a solution by now: that page pulls its data from an API, and the following code will get you the data you're after:
import pandas as pd
import requests
my_date = '2020-12-27'
r = requests.get(f'https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical?convert=USD,USD,BTC&date={my_date}&limit=5000&start=1')
df = pd.json_normalize(r.json()['data'])  # flatten the nested quote fields into dotted columns
print(df)
This returns a rather large DataFrame [4048 rows x 33 columns]:
| | id | name | symbol | slug | num_market_pairs | date_added | tags | max_supply | circulating_supply | total_supply | platform | cmc_rank | self_reported_circulating_supply | self_reported_market_cap | tvl_ratio | last_updated | quote.BTC.price | quote.BTC.volume_24h | quote.BTC.percent_change_1h | quote.BTC.percent_change_24h | quote.BTC.percent_change_7d | quote.BTC.market_cap | quote.BTC.fully_diluted_market_cap | quote.BTC.tvl | quote.BTC.last_updated | quote.USD.price | quote.USD.volume_24h | quote.USD.percent_change_1h | quote.USD.percent_change_24h | quote.USD.percent_change_7d | quote.USD.market_cap | quote.USD.tvl | quote.USD.last_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bitcoin | BTC | bitcoin | 9712 | 2013-04-28T00:00:00.000Z | ['mineable', 'pow', 'sha-256', 'store-of-value', 'state-channel', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio', 'binance-labs-portfolio', 'blockchain-capital-portfolio', 'boostvc-portfolio', 'cms-holdings-portfolio', 'dcg-portfolio', 'dragonfly-capital-portfolio', 'electric-capital-portfolio', 'fabric-ventures-portfolio', 'framework-ventures-portfolio', 'galaxy-digital-portfolio', 'huobi-capital-portfolio', 'alameda-research-portfolio', 'a16z-portfolio', '1confirmation-portfolio', 'winklevoss-capital-portfolio', 'usv-portfolio', 'placeholder-ventures-portfolio', 'pantera-capital-portfolio', 'multicoin-capital-portfolio', 'paradigm-portfolio'] | 2.1e+07 | 1.85828e+07 | 1.85828e+07 | | 1 | | | | 2020-12-27T23:00:00.000Z | 1 | 2.53042e+06 | 0 | 0 | 0 | 1.85828e+07 | | | 2020-12-27T23:59:41.000Z | 26272.3 | 6.64799e+10 | -0.910864 | -0.623152 | 11.9051 | 4.88213e+11 | | 2020-12-27T23:00:00.000Z |
| 1 | 1027 | Ethereum | ETH | ethereum | 5916 | 2015-08-07T00:00:00.000Z | ['mineable', 'pow', 'smart-contracts', 'ethereum-ecosystem', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio', 'binance-labs-portfolio', 'blockchain-capital-portfolio', 'boostvc-portfolio', 'cms-holdings-portfolio', 'dcg-portfolio', 'dragonfly-capital-portfolio', 'electric-capital-portfolio', 'fabric-ventures-portfolio', 'framework-ventures-portfolio', 'hashkey-capital-portfolio', 'kenetic-capital-portfolio', 'huobi-capital-portfolio', 'alameda-research-portfolio', 'a16z-portfolio', '1confirmation-portfolio', 'winklevoss-capital-portfolio', 'usv-portfolio', 'placeholder-ventures-portfolio', 'pantera-capital-portfolio', 'multicoin-capital-portfolio', 'paradigm-portfolio', 'injective-ecosystem', 'bnb-chain'] | nan | 1.1401e+08 | 1.1401e+08 | | 2 | | | | 2020-12-27T23:00:00.000Z | 0.0259834 | 993197 | -0.514148 | 7.36142 | 6.94848 | 2.96236e+06 | | | 2020-12-27T23:59:41.000Z | 682.642 | 2.60936e+10 | -0.514148 | 7.36142 | 6.94848 | 7.78281e+10 | | 2020-12-27T23:00:00.000Z |
| 2 | 825 | Tether | USDT | tether | 9666 | 2015-02-25T00:00:00.000Z | ['payments', 'stablecoin', 'asset-backed-stablecoin', 'avalanche-ecosystem', 'solana-ecosystem', 'arbitrum-ecosytem', 'moonriver-ecosystem', 'injective-ecosystem', 'bnb-chain', 'usd-stablecoin'] | nan | 2.07532e+10 | 2.12833e+10 | | 3 | | | | 2020-12-27T23:00:00.000Z | 3.80193e-05 | 3.62606e+06 | -0.00446154 | 0.0374141 | -0.0789107 | 789021 | | | 2020-12-27T23:59:41.000Z | 0.998854 | 9.52649e+10 | -0.00446154 | 0.0374141 | -0.0789107 | 2.07294e+10 | | 2020-12-27T23:00:00.000Z |
| 3 | 52 | XRP | XRP | xrp | 683 | 2013-08-04T00:00:00.000Z | ['medium-of-exchange', 'enterprise-solutions', 'binance-chain', 'arrington-xrp-capital-portfolio', 'galaxy-digital-portfolio', 'a16z-portfolio', 'pantera-capital-portfolio'] | 1e+11 | 4.5404e+10 | 9.99908e+10 | | 4 | | | | 2020-12-27T23:00:00.000Z | 1.07733e-05 | 352094 | -1.1233 | -3.96119 | -49.0989 | 489151 | | | 2020-12-27T23:59:41.000Z | 0.283039 | 9.25033e+09 | -1.1233 | -3.96119 | -49.0989 | 1.28511e+10 | | 2020-12-27T23:00:00.000Z |
| 4 | 2 | Litecoin | LTC | litecoin | 747 | 2013-04-28T00:00:00.000Z | ['mineable', 'pow', 'scrypt', 'medium-of-exchange', 'binance-chain', 'bnb-chain'] | 8.4e+07 | 6.61837e+07 | 6.61837e+07 | | 5 | | | | 2020-12-27T23:00:00.000Z | 0.00485367 | 536813 | -0.325724 | -1.50027 | 11.2073 | 321234 | | | 2020-12-27T23:59:41.000Z | 127.517 | 1.41033e+10 | -0.325724 | -1.50027 | 11.2073 | 8.43955e+09 | | 2020-12-27T23:00:00.000Z |
[...]
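If the snapshot date comes from the page URL (e.g. `20201227`), it has to be converted to the `YYYY-MM-DD` form the API call above uses. The sketch below wraps that conversion and the request in small helpers. The helper names, the timeout value, and the use of `pd.json_normalize` to get the dotted `quote.*` columns are assumptions, and it presumes the endpoint still responds as shown above.

```python
from datetime import datetime

API_URL = "https://web-api.coinmarketcap.com/v1/cryptocurrency/listings/historical"


def to_api_date(page_date: str) -> str:
    """Convert 'YYYYMMDD' (as used in the page URL) to 'YYYY-MM-DD'."""
    return datetime.strptime(page_date.strip("/"), "%Y%m%d").strftime("%Y-%m-%d")


def build_params(page_date: str, limit: int = 5000) -> dict:
    """Query parameters mirroring the request in the answer above."""
    return {
        "convert": "USD,USD,BTC",  # copied unchanged from the original request
        "date": to_api_date(page_date),
        "limit": limit,
        "start": 1,
    }


def fetch_snapshot(page_date: str):
    """Fetch one snapshot and flatten nested fields into dotted column names."""
    # Third-party imports are local so the helpers above work on their own.
    import pandas as pd
    import requests

    r = requests.get(API_URL, params=build_params(page_date), timeout=30)
    r.raise_for_status()  # raise on HTTP errors instead of parsing an error body
    # Assumption: prices are nested under a 'quote' key in each record.
    return pd.json_normalize(r.json()["data"])
```

With these helpers, `fetch_snapshot("20201227")` would request the same date as the example above, and `df.to_csv(...)` could be used to keep a local copy of each snapshot.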
Upvotes: -1