Reputation: 569
I am trying to learn how to web scrape BTC historical data from Coinmarketcap.com using Python, requests, and BeautifulSoup.
I would like to parse the following:
1) Date
2) Close
3) Volume
4) Market Cap
Here is my code so far:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Spoof a Chrome user agent so the request is less likely to be blocked
ua = UserAgent()
header = {'user-agent': ua.chrome}

response = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/', headers=header)

# 'html.parser' would also work here in place of 'lxml'
soup = BeautifulSoup(response.content, 'lxml')

# Grab every table cell on the page
tags = soup.find_all('td')
print(tags)
I am able to scrape the data I need but I am not sure how to parse it correctly. I would prefer to have the dates go back as far as possible ('All Time'). Any advice would be greatly appreciated. Thanks in advance!
Upvotes: 3
Views: 4355
Reputation: 5281
It seems CoinMarketCap changed their DOM, so here is an update:
import lxml.html
import requests
from typing import Dict, List

def coinmarketcap_get_btc(start_date: str, end_date: str) -> List[Dict]:
    # Build the url
    url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    # Make the request and parse the tree
    response = requests.get(url, timeout=5)
    tree = lxml.html.fromstring(response.text)
    # Extract table and raw data
    table = tree.find_class('cmc-table')[0]
    xpath_0, xpath_1 = 'div[3]/div/table/thead/tr', 'div[3]/div/table/tbody/tr/td[%d]/div'
    cols = [_.text_content() for _ in table.xpath(xpath_0 + '/th')]
    dates = (_.text_content() for _ in table.xpath(xpath_1 % 1))
    m = map(lambda d: (float(_.text_content().replace(',', '')) for _ in table.xpath(xpath_1 % d)),
            range(2, 8))
    # Zip the header names against each row's values, one dict per row
    return [{k: v for k, v in zip(cols, _)} for _ in zip(dates, *m)]
Getting a df instead is as simple as using pd.DataFrame.from_dict.
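For example, a minimal usage sketch (the dates are the same YYYYMMDD strings the URL expects):

import pandas as pd

# Each returned record is one row of the historical table
rows = coinmarketcap_get_btc(start_date='20130428', end_date='20191020')
df = pd.DataFrame.from_dict(rows)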
You can use requests and lxml for this:
Here is a function coinmarketcap_get_btc that takes the start and end dates as parameters and gathers the relevant data:
import lxml.html
import pandas
import requests

def float_helper(string):
    try:
        return float(string)
    except ValueError:
        return None

def coinmarketcap_get_btc(start_date: str, end_date: str) -> pandas.DataFrame:
    # Build the url
    url = f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={start_date}&end={end_date}'
    # Make the request and parse the tree
    response = requests.get(url, timeout=5)
    tree = lxml.html.fromstring(response.text)
    # Extract table and raw data
    table = tree.find_class('table-responsive')[0]
    raw_data = [_.text_content() for _ in table.find_class('text-right')]
    # Process the data
    col_names = ['Date'] + raw_data[:6]
    row_list = []
    for x in raw_data[6:]:
        _, date, _open, _high, _low, _close, _vol, _m_cap, _ = x.replace(',', '').split('\n')
        row_list.append([date, float_helper(_open), float_helper(_high), float_helper(_low),
                         float_helper(_close), float_helper(_vol), float_helper(_m_cap)])
    return pandas.DataFrame(data=row_list, columns=col_names)
You can always leave out the columns that are not of interest and add further functionality (e.g. accepting datetime.datetime objects as dates), as sketched below.
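For instance, a minimal sketch of such a wrapper (the helper name coinmarketcap_get_btc_dt is hypothetical; it just formats datetime.datetime objects into the YYYYMMDD strings the URL expects):

import datetime

def coinmarketcap_get_btc_dt(start: datetime.datetime, end: datetime.datetime) -> pandas.DataFrame:
    # Turn the datetime objects into the YYYYMMDD strings used in the URL
    return coinmarketcap_get_btc(start.strftime('%Y%m%d'), end.strftime('%Y%m%d'))

df = coinmarketcap_get_btc_dt(datetime.datetime(2013, 4, 28), datetime.datetime.now())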
Attention: the f-string used to build the URL requires Python 3.6 or later, so if you are using an older version, fall back to either the 'string{var}'.format(var=var) or 'string%s' % var notation.
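For example, the same URL built without an f-string, using the start_date and end_date parameters from the function above:

base = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/'
# str.format notation
url = base + '?start={start}&end={end}'.format(start=start_date, end=end_date)
# or %-formatting
url = base + '?start=%s&end=%s' % (start_date, end_date)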
Example
df = coinmarketcap_get_btc(start_date='20130428', end_date='20191020')
df
# Date Open* High Low Close** Volume Market Cap
# 0 Oct 19 2019 7973.80 8082.63 7944.78 7988.56 1.379783e+10 1.438082e+11
# 1 Oct 18 2019 8100.93 8138.41 7902.16 7973.21 1.565159e+10 1.435176e+11
# 2 Oct 17 2019 8047.81 8134.83 8000.94 8103.91 1.431305e+10 1.458540e+11
# 3 Oct 16 2019 8204.67 8216.81 7985.09 8047.53 1.607165e+10 1.448240e+11
# 4 Oct 15 2019 8373.46 8410.71 8182.71 8205.37 1.522041e+10 1.476501e+11
# ... ... ... ... ... ... ... ...
# 2361 May 02 2013 116.38 125.60 92.28 105.21 NaN 1.168517e+09
# 2362 May 01 2013 139.00 139.89 107.72 116.99 NaN 1.298955e+09
# 2363 Apr 30 2013 144.00 146.93 134.05 139.00 NaN 1.542813e+09
# 2364 Apr 29 2013 134.44 147.49 134.00 144.54 NaN 1.603769e+09
# 2365 Apr 28 2013 135.30 135.98 132.10 134.21 NaN 1.488567e+09
#
# [2366 rows x 7 columns]
Upvotes: 3
Reputation: 84455
You could have a function which takes the number of months to return (you could alter this, but months is a good enough example), then use pandas' read_html to grab the table and subset the columns. This is currently set up to work from today's date.
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

def get_date_range(number_of_months: int):
    now = datetime.now()
    dt_end = now.strftime("%Y%m%d")
    dt_start = (now - relativedelta(months=number_of_months)).strftime("%Y%m%d")
    return f'start={dt_start}&end={dt_end}'

number_of_months = 3
table = pd.read_html(f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?{get_date_range(number_of_months)}')[0]
table = table[['Date', 'Close**', 'Volume', 'Market Cap']]
print(table)
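If you want the dates to go back as far as possible ('All Time', as asked in the question), you can swap the month offset for a fixed early start date; a minimal sketch, assuming 20130428 (the earliest date shown in the example output of the answer above) is still the first listed day:

# From the first listed day up to today
all_time = f'start=20130428&end={datetime.now().strftime("%Y%m%d")}'
table = pd.read_html(f'https://coinmarketcap.com/currencies/bitcoin/historical-data/?{all_time}')[0]
table = table[['Date', 'Close**', 'Volume', 'Market Cap']]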
Upvotes: 2
Reputation: 22440
This is one of the ways you can get your aforementioned fields out of that table using the BeautifulSoup library. I used .select() instead of .find_all() to locate the desired items.
Working solution:
import pandas
import requests
from bs4 import BeautifulSoup

link = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start={}&end={}'

def get_coinmarketcap_info(url, s_date, e_date):
    response = requests.get(url.format(s_date, e_date))
    soup = BeautifulSoup(response.text, "lxml")
    for items in soup.select("table.table tr.text-right"):
        date = items.select_one("td.text-left").get_text(strip=True)
        close = items.select_one("td[data-format-market-cap]").find_previous_sibling().get_text(strip=True)
        volume = items.select_one("td[data-format-market-cap]").get_text(strip=True)
        marketcap = items.select_one("td[data-format-market-cap]").find_next_sibling().get_text(strip=True)
        yield date, close, volume, marketcap

if __name__ == '__main__':
    dataframe = (elem for elem in get_coinmarketcap_info(link, s_date='20130428', e_date='20191020'))
    df = pandas.DataFrame(dataframe)
    print(df)
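If you would rather have named columns than the default integer labels, you can pass them to the DataFrame constructor; a small sketch (the column names here are just illustrative):

rows = get_coinmarketcap_info(link, s_date='20130428', e_date='20191020')
# Name the four fields yielded by the generator
df = pandas.DataFrame(rows, columns=['Date', 'Close', 'Volume', 'Market Cap'])
print(df)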
Upvotes: 2