Reputation: 6302
I use GitHub to store the text of one of my web sites, but the problem is that Google indexes the text on GitHub as well, so the same text shows up both on my site and on GitHub. For example, in this search the top hit is my site and the second hit is the GitHub repository.
I don't mind if people see the sources, but I don't want Google to index them (and maybe penalize me for duplicate content). Is there any way, besides making the repository private, to tell Google to stop indexing it?
What happens in the case of GitHub Pages? Those are sites whose source lives in a GitHub repository. Do they have the same duplication problem?
Take this search: the topmost hit leads to the Marpa site, but I don't see the source listed in the search results. How?
Upvotes: 82
Views: 33338
Reputation: 6302
This answer is no longer correct, since GitHub has changed the default branch from "master" to "main" and has also changed its robots.txt file.
The https://github.com/robots.txt file of GitHub allows the indexing of the blobs in the 'master' branch, but restricts all other branches. So if you don't have a 'master' branch, Google is not supposed to index your pages.
How to remove the 'master' branch:
In your clone create a new branch - let's call it 'main' and push it to GitHub
git checkout -b main
git push -u origin main
On GitHub, change the default branch (see the Settings section of your repository, or https://github.com/blog/421-pick-your-default-branch). A command-line alternative is sketched after these steps.
Then remove the master branch from your clone and from GitHub:
git branch -d master
git push origin :master
Get other people who might have already forked your repository to do the same.
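For completeness: the default-branch switch can also be done from the command line with the GitHub CLI instead of the Settings page. A minimal sketch, assuming a recent gh that is installed and authenticated (the CLI did not exist when this answer was written):
gh repo edit --default-branch main
After that, delete the master branch as shown above.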
Alternatively, if you'd like to financially support GitHub, you can go private https://help.github.com/articles/making-a-public-repository-private
Upvotes: 92
Reputation: 96
I can think of two solutions that work at the present time:
Solution 1: Prefix your repo name with tags. So for example, instead of my-repo, rename it to tags-my-repo.
Solution 2: Move everything you don't want crawled off the default branch and into another branch.
Why I think the older solutions in this thread no longer work: https://github.com/robots.txt has changed since then. At the time of the original question in 2013, robots.txt looked like this:
User-agent: Googlebot
Allow: /*/*/tree/master
Allow: /*/*/blob/master
Disallow: /ekansa/Open-Context-Data
Disallow: /ekansa/opencontext-*
Disallow: /*/*/pulse
Disallow: /*/*/tree/*
...
whereas now there are no Allow rules, only Disallow rules:
User-agent: *
Disallow: /*/pulse
Disallow: /*/tree/
Disallow: /gist/
Disallow: /*/forks
...
Disallow: /*/branches
Disallow: /*/tags
...
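Since this file has changed before and may change again, it is worth checking what it says right now; a quick sketch using curl (any HTTP client will do):
curl -s https://github.com/robots.txt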
If you simply create a new branch, make it the default, and delete the old one, the URL https://github.com/user-name/repo-name will simply show your new default branch and remain crawlable under the current robots.txt.
How my solutions above work (they are based on how Google currently interprets robots.txt):
Solution 1 makes your repo's URL match Disallow: /*/tags, thereby excluding it from crawling. In fact, you can prefix your repo name with any single word from a Disallow path of the form /*/word without a trailing slash (so tree doesn't work, since Disallow: /*/tree/ ends with a slash).
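If you prefer the command line for the rename in solution 1, the GitHub CLI can do it as well; a minimal sketch, assuming gh is installed and authenticated and you run it inside the repository's clone (tags-my-repo is just the example name from above):
gh repo rename tags-my-repo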
Solution 2 simply ensures that the default branch, which is the only branch crawled, doesn't contain stuff that you don't want crawled. In other words, it moves all the relevant content to another branch, so it lives under https://github.com/user-name/repo-name/tree/branch-name, which won't be crawled thanks to Disallow: /*/tree/.
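A minimal git sketch of solution 2, assuming the default branch is main, the non-crawled branch is called content, and site-text/ is the directory you want kept out of the index (all of these names are placeholders):
git checkout -b content        # branch that will hold the text; /tree/content is not crawled
git push -u origin content
git checkout main              # back on the default (crawled) branch
git rm -r site-text            # remove the text from the crawled branch
git commit -m "move site text to the content branch"
git push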
Disclaimer: all of the above is based on what robots.txt looks like at any given point in time.
Upvotes: 8
Reputation: 4319
Simple answer: make your repo private.
https://help.github.com/articles/making-a-public-repository-private
Upvotes: 0
Reputation: 16012
If you want to stick to the master branch, there seems to be no way around using a private repo (and upgrading your GitHub account) or using another service that offers private repos for free, like Bitbucket.
Upvotes: 0
Reputation: 67
Short answer: yes, you can, with robots.txt.
If you want to prevent Googlebot from crawling content on your site, you have a number of options, including using robots.txt to block access to files and directories on your server.
You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
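For a site whose root you control, a minimal robots.txt of the kind Google describes could look like the following (the path is just an example; note that this doesn't apply to content hosted under github.com, whose robots.txt is controlled by GitHub):
User-agent: *
Disallow: /sources/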
Sources:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
Upvotes: -6