JamesFaix

Reputation: 8655

GitHub shows 4% of repository is in a language that is not used in the repository

I have a personal repository on GitHub that is completely written in C#, with a few XML configuration files, and some PowerShell files from included NuGet packages. On the main repository page, GitHub shows a colored bar to display the breakdown of different languages used in the repository.
[screenshot of the language bar]

If you click this bar, it shows the language names and actual percentages. [screenshot of the language percentages]

This particular language breakdown seems a bit odd to me, since I am the only contributor, and I have never used Smalltalk.

If you click a language name, it will show you a list of the files using that language. [screenshot of the file list for the selected language]

In this last image, you can see on the left side that the repository really only contains C#, XML, PowerShell, text and markdown files.

So why does GitHub think I'm using Smalltalk? And why doesn't the color bar mention that I'm using XML?

Upvotes: 1

Views: 634

Answers (3)

pchaigno

Reputation: 13063

As Philip and VonC noted, GitHub uses Linguist to compute the language statistics.

So why does GitHub think I'm using Smalltalk?

Linguist relies first on the file extension to determine the language of a file. It then uses a set of refinement strategies for conflicting extensions (e.g., .cs is used by both Smalltalk and C#). These refinement strategies are not 100% accurate (it can even get pretty bad for small files). Thus, files with conflicting extensions may be classified incorrectly.
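To make the two-step idea concrete, here is a minimal Python sketch of "extension first, then look at the content when the extension is ambiguous". It is not Linguist's code: the CANDIDATES table, the RULES patterns and the guess_language function are all hypothetical and only illustrate why a small file can end up undecided.

```python
import os
import re

# Hypothetical table of extension -> candidate languages (not Linguist's real data).
CANDIDATES = {
    ".cs": ["C#", "Smalltalk"],
    ".ps1": ["PowerShell"],
    ".xml": ["XML"],
}

# Hypothetical content rules, consulted only when an extension is ambiguous.
RULES = {
    "C#": re.compile(r"^\s*(using\s+[\w.]+\s*;|namespace\s+[\w.]+)", re.MULTILINE),
    "Smalltalk": re.compile(r"![\w\s]+methodsFor:\s*'"),
}


def guess_language(filename, content):
    """Return a language name, or None when the sketch cannot decide."""
    ext = os.path.splitext(filename)[1].lower()
    candidates = CANDIDATES.get(ext, [])
    if not candidates:
        return None  # unknown extension
    if len(candidates) == 1:
        return candidates[0]  # the extension alone is enough
    # Conflicting extension: keep only candidates whose rule matches the content.
    matches = [lang for lang in candidates if RULES[lang].search(content)]
    if len(matches) == 1:
        return matches[0]
    return None  # still ambiguous; a real tool has to guess at this point


print(guess_language("Program.cs", "using System;\nnamespace Demo {}\n"))
print(guess_language("Tiny.cs", "class A {}"))
```

The second call is the interesting one: a tiny .cs file with no using or namespace line gives the rules nothing to match on, so whatever fallback guess a real tool makes there is the kind of case that can go wrong.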

How can I fix it?

You can use Linguist overrides to tell Linguist that all .cs files in your repository are C# by adding this line to a .gitattributes file:

*.cs linguist-language=C#

And why doesn't the color bar mention that I'm using XML?

Linguist only counts programming and markup languages in the statistics. XML is classified as a data language.
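As a rough illustration of why XML drops out of the bar, the sketch below computes a percentage breakdown while ignoring languages whose type is not counted. It assumes the breakdown is weighted by file size in bytes; the LANGUAGE_TYPE table, the language_breakdown function and the byte counts are made up for the example and are not taken from your repository or from Linguist itself.

```python
from collections import defaultdict

# Hypothetical type table for this example; XML as "data" follows the answer above.
LANGUAGE_TYPE = {
    "C#": "programming",
    "PowerShell": "programming",
    "Smalltalk": "programming",
    "XML": "data",
}

# Only these types contribute to the statistics in this sketch.
COUNTED_TYPES = {"programming", "markup"}


def language_breakdown(files):
    """files: iterable of (language, size_in_bytes). Returns {language: percent}."""
    totals = defaultdict(int)
    for language, size in files:
        if LANGUAGE_TYPE.get(language) in COUNTED_TYPES:
            totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


# Made-up sizes: the 5000 bytes of XML are skipped entirely.
files = [("C#", 9000), ("Smalltalk", 400), ("PowerShell", 600), ("XML", 5000)]
print(language_breakdown(files))
```

Note that the XML bytes do not just disappear from the bar, they are also left out of the total, so the remaining percentages still add up to 100.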

Why doesn't Smalltalk appear in the search results?

The search results are cached to avoid computing them every time you visit the page. They probably weren't up-to-date when you took the screenshot.

Upvotes: 2

Philip

Reputation: 1532

GitHub uses a heuristic to identify the language(s) of your repository. The underlying library is linguist. Misclassification is common enough that it's the top Troubleshooting section: My repository is detected as the wrong language.

Upvotes: 1

VonC

Reputation: 1323753

Since GitHub is using linguist to detect languages, you can open a PR to report some files incorrectly tagged as "Smalltalk".

For instance, issue 2012 is still active (even though it is closed).

Upvotes: 0
