Reputation: 51
I am a beginner developing code to visualize the global spread of the coronavirus. I want to read the .csv files from the GitHub repo (csse_covid_19_data), where a new .csv file is uploaded every 2 days. Instead of downloading the file manually, is it possible to import the latest CSV file into my notebook automatically?
I have tried scraping the data, but it doesn't help:
import requests
url = 'https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv'
response = requests.get(url)
print(response.text)
Upvotes: 1
Views: 5581
Reputation: 814
This solution is specific to your use case:
Install the PyGithub package with the pip command below:
!pip install PyGithub
Generate a GitHub API token by clicking Generate new token in your GitHub token settings, then pass that token as a string in place of token in the code below to connect to GitHub:
from github import Github
token = "YOUR_TOKEN"  # placeholder: paste your own token string here
g = Github(token)
You are now connected to GitHub with your credentials and can access the contents of your own repos as well as other public repos.
Load the repo in which the CSV files are stored:
repo = g.get_repo("CSSEGISandData/COVID-19")
Get the list of file objects in the directory where the CSV files are stored:
file_list = repo.get_contents("csse_covid_19_data/csse_covid_19_daily_reports")
That directory also contains a .gitignore file and a README.md file, and the CSV files are named in the "mm-dd-yyyy" format, so README.md comes last in the listing and the second-to-last entry is the most recently added file. This holds only while all files fall within one year, because mm-dd-yyyy names do not sort chronologically across years. To build the path of that file, run the code below:
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
# .name is the bare file name of the second-to-last entry (a name like 'mm-dd-yyyy.csv')
file_path = github_dir_path + file_list[-2].name
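Picking the entry by position breaks if another non-CSV file is added to the directory or once the file names span more than one year. A minimal sketch of a more defensive selection is to parse each name as a date and take the maximum. The helper name latest_report_name is hypothetical, and the sample names are made up for illustration:

```python
from datetime import datetime

def latest_report_name(names):
    """Return the mm-dd-yyyy.csv name with the most recent date.

    Hypothetical helper: entries that are not dated CSV files
    (for example .gitignore or README.md) are skipped.
    """
    dated = []
    for name in names:
        stem, _, ext = name.rpartition(".")
        if ext != "csv":
            continue
        try:
            dated.append((datetime.strptime(stem, "%m-%d-%Y"), name))
        except ValueError:
            continue  # a CSV whose name is not a date
    if not dated:
        raise ValueError("no dated CSV files found")
    # Compare by parsed date, not by the raw string
    return max(dated)[1]

# Made-up sample listing that spans two years
sample = [".gitignore", "01-22-2020.csv", "12-31-2020.csv", "01-01-2021.csv", "README.md"]
print(latest_report_name(sample))
```

With PyGithub this could be called as latest_report_name([f.name for f in file_list]), assuming file_list is the listing returned by get_contents above.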
Load the data from the specified path using the read_csv() method of pandas:
import pandas as pd
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(file_path, on_bad_lines='skip')
Try this code if you want to specify the path manually:
Get the URL of your CSV file on GitHub by right-clicking the Raw button on the file's page and copying the link address, then assign it to file_path:
file_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv'
Load the data from the specified path using the read_csv() method of pandas:
import pandas as pd
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(file_path, on_bad_lines='skip')
Try this code if you want to specify the path automatically:
Decide when you want your code to refresh, and combine that schedule with the solution below.
Since you know the directory where the latest files are stored and how often new files are added to it, you can build the file name dynamically from the current date in the mm-dd-yyyy format:
from datetime import date
file_date = date.today().strftime('%m-%d-%Y')  # strftime already returns a str
file_date
Output: 03-11-2020
Similarly, subtract one day from file_date if you want to run your code for yesterday's date:
from datetime import date, timedelta
file_date = (date.today() - timedelta(days=1)).strftime('%m-%d-%Y')
file_date
Output: 03-10-2020
At the moment, the last file uploaded to that directory is dated 9 March 2020, so we go back two days to use that date:
from datetime import date, timedelta
file_date = (date.today() - timedelta(days=2)).strftime('%m-%d-%Y')
file_date
Output: 03-09-2020
Generate file_path dynamically:
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
file_path = github_dir_path + file_date + '.csv'
Load the data from the specified path using the read_csv() method of pandas:
import pandas as pd
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv(file_path, on_bad_lines='skip')
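Hard-coding the number of days to go back only works while the upload schedule stays the same. One possible sketch, under the assumption that a missing file simply returns an error status, is to start from today and step back one day at a time until a file is found. The names BASE_URL, url_exists and find_latest_url, and the max_days_back limit, are hypothetical and not part of the answer above:

```python
from datetime import date, timedelta
import urllib.request

BASE_URL = ('https://github.com/CSSEGISandData/COVID-19/raw/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def url_exists(url):
    """Return True if a HEAD request to url succeeds (needs network access)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError and timeouts
        return False

def find_latest_url(exists=url_exists, start=None, max_days_back=7):
    """Return the URL of the newest dated file, or None if none is found."""
    start = start or date.today()
    for offset in range(max_days_back + 1):
        day = start - timedelta(days=offset)
        url = BASE_URL + day.strftime('%m-%d-%Y') + '.csv'
        if exists(url):
            return url
    return None

# Offline demonstration: pretend only the 9 March 2020 file exists
available = {BASE_URL + '03-09-2020.csv'}
print(find_latest_url(available.__contains__, start=date(2020, 3, 11)))
```

In a notebook, file_path = find_latest_url() would use the real network check, and the result can then be passed to read_csv() as in the step above.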
Upvotes: 7
Reputation: 768
Use:
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv
(This is the 'raw' text of the file rather than the HTML page that wraps it.)
Example:
import requests
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv'
resp = requests.get(url)
print(resp.text)
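The URL in the question points at the github.com "blob" page, which returns HTML rather than the CSV itself. As a rough sketch, such a page URL can be rewritten into the raw form shown above with plain string handling. The function name blob_to_raw is hypothetical, and it assumes the URL follows the https://github.com/owner/repo/blob/branch/path layout:

```python
def blob_to_raw(url):
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com form.

    Hypothetical helper; assumes the layout
    https://github.com/<owner>/<repo>/blob/<branch>/<path>.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        raise ValueError("not a github.com URL")
    # Split into owner, repo, the literal 'blob', and the remaining branch/path
    owner, repo, kind, rest = url[len(prefix):].split("/", 3)
    if kind != "blob":
        raise ValueError("not a blob page URL")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"

page_url = ('https://github.com/CSSEGISandData/COVID-19/blob/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/03-08-2020.csv')
print(blob_to_raw(page_url))
```

The resulting URL can be fetched with requests.get() as in the example above, or passed directly to pandas read_csv(), which accepts URLs.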
Upvotes: 2