Fomite

Reputation: 2273

How to download and write a file from GitHub using Requests

Let's say there's a file that lives in a GitHub repo at:

https://github.com/someguy/brilliant/blob/master/somefile.txt

I'm trying to use requests to fetch this file and write its content to disk in the current working directory, where it can be used later. Right now, I'm using the following code:

import requests
from os import getcwd

url = "https://github.com/someguy/brilliant/blob/master/somefile.txt"
directory = getcwd()
filename = directory + 'somefile.txt'
r = requests.get(url)

f = open(filename,'w')
f.write(r.content)

It's undoubtedly ugly and, more importantly, not working. Instead of the expected text, I get:

<!DOCTYPE html>
<!--

Hello future GitHubber! I bet you're here to remove those nasty inline styles,
DRY up these templates and make 'em nice and re-usable, right?

Please, don't. https://github.com/styleguide/templates/2.0

-->
<html>
  <head>
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <title>Page not found &middot; GitHub</title>
    <style type="text/css" media="screen">
      body {
        background: #f1f1f1;
        font-family: "HelveticaNeue", Helvetica, Arial, sans-serif;
        text-rendering: optimizeLegibility;
        margin: 0; }

      .container { margin: 50px auto 40px auto; width: 600px; text-align: center; }

      a { color: #4183c4; text-decoration: none; }
      a:visited { color: #4183c4 }
      a:hover { text-decoration: none; }

      h1 { letter-spacing: -1px; line-height: 60px; font-size: 60px; font-weight: 100; margin: 0px; text-shadow: 0 1px 0 #fff; }
      p { color: rgba(0, 0, 0, 0.5); margin: 20px 0 40px; }

      ul { list-style: none; margin: 25px 0; padding: 0; }
      li { display: table-cell; font-weight: bold; width: 1%; }
      #error-suggestions { font-size: 14px; }
      #next-steps { margin: 25px 0 50px 0;}
      #next-steps li { display: block; width: 100%; text-align: center; padding: 5px 0; font-weight: normal; color: rgba(0, 0, 0, 0.5); }
      #next-steps a { font-weight: bold; }
      .divider { border-top: 1px solid #d5d5d5; border-bottom: 1px solid #fafafa;}

      #parallax_wrapper {
        position: relative;
        z-index: 0;
      }
      #parallax_field {
        overflow: hidden;
        position: absolute;
        left: 0;
        top: 0;
        height: 370px;
        width: 100%;
      }

etc etc.

That is content from GitHub, but not the content of the file. What am I doing wrong?

Upvotes: 44

Views: 83504

Answers (4)

Martijn Pieters

Reputation: 1124758

The content of the file in question is included in the returned data. You are getting the full GitHub view of that file, not just the contents.

If you want to download just the file, you need to use the Raw link at the top of the page, which will be (for your example):

https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt

Note the change in domain name, and the blob/ part of the path is gone.

To demonstrate this with the requests GitHub repository itself:

>>> import requests
>>> r = requests.get('https://github.com/kennethreitz/requests/blob/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/html; charset=utf-8'
>>> r = requests.get('https://raw.githubusercontent.com/kennethreitz/requests/master/README.rst')
>>> 'Requests:' in r.text
True
>>> r.headers['Content-Type']
'text/plain; charset=utf-8'
>>> print(r.text)
Requests: HTTP for Humans
=========================


.. image:: https://travis-ci.org/kennethreitz/requests.png?branch=master
[... etc. ...]
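
The URL change described above (a different domain, and no blob/ segment in the path) could be sketched as a small helper. This is only an illustration: blob_to_raw_url is a hypothetical name, not part of requests or any GitHub library, and it assumes the page URL has the user/repo/blob/branch/path shape used in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_raw_url(url):
    """Turn a github.com '/blob/' page URL into its raw.githubusercontent.com form.

    Hypothetical helper; assumes the path looks like user/repo/blob/branch/file.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: user / repo / "blob" / branch / path...
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    user, repo, _blob, *rest = segments
    # Drop the "blob" segment and switch to the raw-content domain.
    raw_path = "/".join([user, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", "/" + raw_path, "", ""))
```

The result can then be passed to requests.get() in place of the original page URL.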

Upvotes: 50

Burhan Khalid

Reputation: 174708

You need to request the raw version of the file, from https://raw.githubusercontent.com.

See the difference:

https://raw.githubusercontent.com/django/django/master/setup.py vs. https://github.com/django/django/blob/master/setup.py

Also, you should probably add a / between your directory and the filename:

>>> getcwd()+'foo.txt'
'/Users/burhanfoo.txt'
>>> import os
>>> os.path.join(getcwd(),'foo.txt')
'/Users/burhan/foo.txt'
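
Putting the raw URL and the path fix together, a corrected version of the question's script might look like the sketch below. The names target_path and download_file are hypothetical, the 30-second timeout is an arbitrary choice, and requests is imported inside the function only so that the path helper can be used without it installed. Note that the HTML in the question is GitHub's "Page not found" page, so checking the status before writing would have surfaced the problem earlier.

```python
import os


def target_path(filename, directory=None):
    """Join the directory (default: current working directory) and the filename."""
    return os.path.join(directory or os.getcwd(), filename)


def download_file(url, filename, directory=None):
    """Fetch url and write the response body to disk; return the path written."""
    import requests  # third-party; imported here so target_path works without it

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # a 404 raises an error instead of being saved as the file
    path = target_path(filename, directory)
    # r.content is bytes, so open in binary mode; 'with' closes the file for us.
    with open(path, "wb") as f:
        f.write(r.content)
    return path
```

For the file in the question, the call would be download_file("https://raw.githubusercontent.com/someguy/brilliant/master/somefile.txt", "somefile.txt").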

Upvotes: 15

Rotem jackoby

Reputation: 22218

Adding a working example ready for copy+paste:

import requests
from requests.structures import CaseInsensitiveDict

url = "https://raw.githubusercontent.com/organization/repo/branch/folder/file"

# If repo is private - we need to add a token in header:
headers = CaseInsensitiveDict()
headers["Authorization"] = "token TOKEN"

resp = requests.get(url, headers=headers)
print(resp.status_code)

(*) If the repo is not private, remove the headers part.


Bonus:
Check out this curl <-> Python requests online converter.

Upvotes: 3

Tom

Reputation: 8800

Just as an update, https://raw.github.com was migrated to https://raw.githubusercontent.com. So the general format is:

url = "https://raw.githubusercontent.com/user/repo/branch/[subfolders]/file"

E.g. https://raw.githubusercontent.com/earnestt1234/seedir/master/setup.py. Still use requests.get(url) as in Martijn's answer.
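
As a rough sketch, that general format could be built from its parts like this. raw_url is a hypothetical helper name, and percent-encoding each component with quote() is an assumption about what is wanted for names containing spaces or similar characters.

```python
from urllib.parse import quote


def raw_url(user, repo, branch, path):
    """Build a raw.githubusercontent.com URL from its components (hypothetical helper)."""
    # quote() percent-encodes spaces and similar characters but keeps "/" intact,
    # so subfolders in path survive unchanged.
    tail = "/".join(quote(part) for part in (user, repo, branch, path.lstrip("/")))
    return "https://raw.githubusercontent.com/" + tail
```

For example, raw_url("earnestt1234", "seedir", "master", "setup.py") gives the URL shown above.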

Upvotes: 10
