Able Archer

Reputation: 569

How to parse Historical BTC Data from Coinmarketcap?

I am trying to learn how to web scrape BTC historical data from Coinmarketcap.com using Python, requests, and BeautifulSoup.

I would like to parse the following:

1) Date
2) Close
3) Volume
4) Market Cap

Here is my code so far:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}
response = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/', headers=header)

# 'lxml' is faster; 'html.parser' also works if lxml is not installed
soup = BeautifulSoup(response.content, 'lxml')

tags = soup.find_all('td')
print(tags)

I am able to scrape the data I need but I am not sure how to parse it correctly. I would prefer to have the dates go back as far as possible ('All Time'). Any advice would be greatly appreciated. Thanks in advance!

Upvotes: 3

Views: 4355

Answers (3)

niko

Reputation: 5281

EDIT

It seems CoinMarketCap changed their DOM, so here is an update:

import lxml.html
import requests
from typing import Dict, List


def coinmarketcap_get_btc(start_date: str, end_date: str) -> List[Dict]:
    # Build the url
    url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    # Make the request and parse the tree
    response = requests.get(url, timeout=5)
    tree = lxml.html.fromstring(response.text)
    # Extract table and raw data
    table = tree.find_class('cmc-table')[0]
    xpath_0, xpath_1 = 'div[3]/div/table/thead/tr', 'div[3]/div/table/tbody/tr/td[%d]/div'
    cols = [_.text_content() for _ in table.xpath(xpath_0 + '/th')]
    dates = (_.text_content() for _ in table.xpath(xpath_1 % 1))
    m = map(lambda d: (float(_.text_content().replace(',', '')) for _ in table.xpath(xpath_1 % d)),
            range(2, 8))
    return [{k: v for k, v in zip(cols, _)} for _ in zip(dates, *m)]

Getting a df instead is as simple as using pd.DataFrame.from_dict.
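For illustration, a list of dicts in the shape returned above converts to a DataFrame directly (the rows below are made-up sample values, not real market data):

```python
import pandas as pd

# Hypothetical sample of the list-of-dicts shape coinmarketcap_get_btc returns;
# the values are invented for the example.
rows = [
    {'Date': 'Oct 19, 2019', 'Close**': 7988.56, 'Volume': 1.379783e10},
    {'Date': 'Oct 18, 2019', 'Close**': 7973.21, 'Volume': 1.565159e10},
]

# pandas aligns on the dict keys, so no explicit column list is needed
df = pd.DataFrame(rows)
print(df.columns.tolist())  # ['Date', 'Close**', 'Volume']
```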


Original

You can use requests and lxml for this:

Here is a function coinmarketcap_get_btc that takes the start and end dates as parameters and gathers the relevant data:

import lxml.html
import pandas
import requests


def float_helper(string):
    try:
        return float(string)
    except ValueError:
        return None


def coinmarketcap_get_btc(start_date: str, end_date: str) -> pandas.DataFrame:
    # Build the url
    url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    # Make the request and parse the tree
    response = requests.get(url, timeout=5)
    tree = lxml.html.fromstring(response.text)
    # Extract table and raw data
    table = tree.find_class('table-responsive')[0]
    raw_data = [_.text_content() for _ in table.find_class('text-right')]
    # Process the data
    col_names = ['Date'] + raw_data[:6]
    row_list = []
    for x in raw_data[6:]:
        _, date, _open, _high, _low, _close, _vol, _m_cap, _ = x.replace(',', '').split('\n')
        row_list.append([date, float_helper(_open), float_helper(_high), float_helper(_low),
                         float_helper(_close), float_helper(_vol), float_helper(_m_cap)])
    return pandas.DataFrame(data=row_list, columns=col_names)

You can always leave out the columns that are not of interest and add further functionalities (e.g. accepting datetime.datetime objects as dates).

Attention, the f-string used to build the URL requires Python 3.6 or later, so if you are on an older version use either the '{var}'.format(var=var) or the '%s' % var notation instead.
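For reference, all three notations build the same URL string:

```python
start_date, end_date = '20130428', '20191020'
base = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/'

# Python 3.6+ f-string
url_f = f'{base}?start={start_date}&end={end_date}'
# str.format, works on any Python 3
url_fmt = '{b}?start={s}&end={e}'.format(b=base, s=start_date, e=end_date)
# %-formatting, works everywhere
url_pct = '%s?start=%s&end=%s' % (base, start_date, end_date)

assert url_f == url_fmt == url_pct
```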

Example

df = coinmarketcap_get_btc(start_date='20130428', end_date='20191020')
df
#              Date    Open*     High      Low  Close**        Volume    Market Cap
# 0     Oct 19 2019  7973.80  8082.63  7944.78  7988.56  1.379783e+10  1.438082e+11
# 1     Oct 18 2019  8100.93  8138.41  7902.16  7973.21  1.565159e+10  1.435176e+11
# 2     Oct 17 2019  8047.81  8134.83  8000.94  8103.91  1.431305e+10  1.458540e+11
# 3     Oct 16 2019  8204.67  8216.81  7985.09  8047.53  1.607165e+10  1.448240e+11
# 4     Oct 15 2019  8373.46  8410.71  8182.71  8205.37  1.522041e+10  1.476501e+11
# ...           ...      ...      ...      ...      ...           ...           ...
# 2361  May 02 2013   116.38   125.60    92.28   105.21           NaN  1.168517e+09
# 2362  May 01 2013   139.00   139.89   107.72   116.99           NaN  1.298955e+09
# 2363  Apr 30 2013   144.00   146.93   134.05   139.00           NaN  1.542813e+09
# 2364  Apr 29 2013   134.44   147.49   134.00   144.54           NaN  1.603769e+09
# 2365  Apr 28 2013   135.30   135.98   132.10   134.21           NaN  1.488567e+09
# 
# [2366 rows x 7 columns]

Upvotes: 3

QHarr

Reputation: 84455

You could write a function which takes the number of months to return (you could use another unit, but months is a good enough example), then use pandas read_html to grab the table and subset the columns. This is currently set up to work backwards from today's date.

import requests
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

def get_date_range(number_of_months:int):
    now = datetime.now()
    dt_end = now.strftime("%Y%m%d")
    dt_start = (now - relativedelta(months=number_of_months)).strftime("%Y%m%d")
    return f'start={dt_start}&end={dt_end}'

number_of_months = 3

table = pd.read_html(f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?{get_date_range(number_of_months)}')[0]
table = table[['Date', 'Close**', 'Volume','Market Cap']]
print(table)
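Note that read_html returns the Date column as plain strings; if you need real datetimes, pandas can parse them. A minimal sketch on made-up sample values:

```python
import pandas as pd

# Hypothetical sample of the date strings the table contains
dates = pd.Series(['Oct 19, 2019', 'Oct 18, 2019'])

# Let pandas infer the 'Mon DD, YYYY' format
parsed = pd.to_datetime(dates)
print(parsed.dt.year.tolist())  # [2019, 2019]
```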

Upvotes: 2

SIM

Reputation: 22440

This is one of the ways you can get your aforementioned fields out of that table using the BeautifulSoup library. I used .select() instead of .find_all() to locate the desired items.
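To see the two styles side by side, here is a toy snippet (invented to mimic one row of the historical-data table, not fetched from the site) where a CSS selector and the equivalent find_all call pick out the same rows:

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking one row of the table; values are made up
html = '''
<table class="table">
  <tr class="text-right">
    <td class="text-left">Oct 19, 2019</td>
    <td>7988.56</td>
    <td data-format-market-cap>13797830000</td>
  </tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: tag, class and nesting expressed in one string ...
rows = soup.select('table.table tr.text-right')
# ... versus the equivalent find_all with keyword filters
rows_alt = soup.find_all('tr', class_='text-right')

assert rows == rows_alt
print(rows[0].select_one('td.text-left').get_text(strip=True))  # Oct 19, 2019
```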

Working solution:

import pandas
import requests
from bs4 import BeautifulSoup

link = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={}&end={}'

def get_coinmarketcap_info(url,s_date,e_date):
    response = requests.get(url.format(s_date,e_date))
    soup = BeautifulSoup(response.text,"lxml")

    for items in soup.select("table.table tr.text-right"):
        date = items.select_one("td.text-left").get_text(strip=True)
        close = items.select_one("td[data-format-market-cap]").find_previous_sibling().get_text(strip=True)
        volume = items.select_one("td[data-format-market-cap]").get_text(strip=True)
        marketcap = items.select_one("td[data-format-market-cap]").find_next_sibling().get_text(strip=True)
        yield date,close,volume,marketcap

if __name__ == '__main__':
    records = get_coinmarketcap_info(link, s_date='20130428', e_date='20191020')
    df = pandas.DataFrame(records, columns=['Date', 'Close', 'Volume', 'Market Cap'])
    print(df)

Upvotes: 2
