Reputation: 6052
FFmpeg is considered by many to be the work of Fabrice Bellard, and maybe even his magnum opus, but since he stopped contributing to the project (under the pseudonym Gérard Lantau) in 2004, I wondered how much of it can actually still be said to be his. By comparison, Linus Torvalds' Wikipedia page states:
As of 2006, approximately 2% of the Linux kernel was written by Torvalds himself.[28] Because thousands have contributed to it, his percentage is still one of the largest. However, he said in 2012 that his own personal contribution is now mostly merging code written by others, with little programming.
This is despite the fact that Torvalds is still an active contributor to the Linux kernel, whereas Bellard hasn't been an active contributor to FFmpeg for almost two decades.
FFmpeg being an open-source project tracked with Git, it seems like the question should be technically and objectively answerable, but as someone who hates mailing lists and the generally archaic ways that big open-source projects like to do things, I wouldn't know where to start in doing so.
Just how much of the modern FFmpeg codebase is Fabrice Bellard actually responsible for, in comparison to the other FFmpeg devs?
Upvotes: 6
Views: 4502
Reputation: 12193
Using git blame, you can conclude that Bellard is the person who last touched 8851 of the 1942819 lines in the code base, or 0.46% of them.
With some 8000 files in the repo containing a total of nearly 2 million lines, running git blame on each file will take a long time, but it lets you see how many lines still in the repo were contributed by Bellard/Lantau. As @Gyan says, though, this only reports lines that are exactly as he wrote them; any change in whitespace or style is attributed to the person who made that trivial change.
That being said, here's the loop:
git clone https://github.com/FFmpeg/FFmpeg
cd FFmpeg
git ls-tree -r --name-only HEAD | while IFS= read -r f ; do git blame -- "$f" ; done > blame
That loop will take a long time to run (it took about 5 hours on my computer), but eventually you'll be able to extract the author from each line with something like this:
cat blame | sed -e 's/ *20[012][0-9].*//' -e 's/^[^(]*(//' > blame-author
that's based on parsing lines from the blame output that look like this:
f1ab71b0463 (Timo Rothenpieler 2017-05-11 22:53:41 +0200 26) *.ptx.c
6bcd3e05998 (Federico Tomassetti 2015-08-13 20:13:48 +0200 11) compiler:
5d3049559af COPYING.GPL (Diego Biurrun 2007-07-12 20:27:07 +0000 187) the Program or works based on it.
My crude parser is not perfect, but it's enough to get statistics out of a crude tool like blame.
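A less fragile way to pull the author out of each line can be sketched with git blame --line-porcelain, assuming your git supports that option. It prints a full header for every blamed line, including an "author NAME" record, so no date-based pattern matching is needed. The function names are made up for illustration and were not part of the analysis above.

```shell
# Hypothetical helpers, not part of the original analysis.

# Read `git blame --line-porcelain` output on stdin and print one author
# name per blamed line. Header records start at column 0, while the source
# line itself is prefixed with a TAB, so file content cannot match.
extract_authors() {
    sed -n 's/^author //p'
}

# Print the author of every line of every file tracked at HEAD.
# Must be run from the top of the clone.
blame_all_authors() {
    git ls-tree -r --name-only HEAD |
        while IFS= read -r f; do
            git blame --line-porcelain -- "$f"
        done | extract_authors
}
```

Running blame_all_authors > blame-author inside the clone should give a file with one author name per line. It is not expected to be any faster than the plain loop, since the cost is in git blame itself.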
Let's count lines by authors, now:
cat blame-author | sort | uniq -c | sort -nr | less -N
shows the list of contributors to the code base, ranked from high to low by the number of lines last touched by that contributor according to the commit logs. Here's the top 50 list:
1 209136 Paul B Mahol
2 121248 Michael Niedermayer
3 114289 Anton Khirnov
4 109653 Andreas Rheinhardt
5 75457 Diego Biurrun
6 54739 Ronald S. Bultje
7 48739 James Almer
8 48571 Kostya Shishkov
9 48096 Shivraj Patil
10 44086 Martin Storsjö
11 41019 Mark Thompson
12 40305 Clément Bœsch
13 37204 Stefano Sabatini
14 34637 Vittorio Giovara
15 26003 Luca Barbato
16 21898 Justin Ruggles
17 20845 Mans Rullgard
18 20403 Lynne
19 20172 Nicolas George
20 19849 Vitor Sessak
21 18044 Kaustubh Raste
22 17297 Aurelien Jacobs
23 16258 Måns Rullgård
24 15242 Hao Chen
25 14281 Peter Ross
26 13971 Mike Melanson
27 13943 Marton Balint
28 11798 Guillaume Martres
29 11284 Rostislav Pehlivanov
30 11013 Shiyou Yin
31 10836 foo86
32 9895 Baptiste Coudurier
33 9375 Derek Buitenhuis
34 9367 Janne Grunau
35 9214 Matthieu Bouron
36 9160 Carl Eugen Hoyos
37 9065 wm4
38 8851 Fabrice Bellard
39 8813 Zhou Xiaoyong
40 8625 Timo Rothenpieler
41 8410 Reimar Döffinger
42 8361 Steven Liu
43 7409 Timothy Gu
44 7147 Thilo Borgmann
45 6886 Lukasz Marek
46 6667 Martin Vignali
47 6445 Ben Avison
48 6274 Limin Wang
49 6213 rcombs
50 6138 Daniel Kang
In this list, you can find Bellard in position 38, with 8851 lines, or 0.46% of the 1942819 lines that wc -l blame-author says were analyzed.
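The percentage is just the author's line count divided by the total line count. A minimal sketch of that arithmetic, assuming blame-author holds exactly one author name per line; percent_of and author_share are hypothetical helper names:

```shell
# Hypothetical helper: print PART as a percentage of TOTAL, two decimals.
percent_of() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f%%\n", 100 * part / total }'
}

# Hypothetical helper: share of lines in FILE that are exactly NAME.
# Usage: author_share FILE NAME
author_share() {
    lines=$(grep -c -x -F -- "$2" "$1")   # lines matching the whole name
    total=$(wc -l < "$1")                 # all lines in the file
    percent_of "$lines" "$total"
}

# Example with the figures quoted above:
#   percent_of 8851 1942819    prints 0.46%
```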
I should have removed tests/ref and tests/reference.pnm from my processing, since those are a lot of binary files, but without them there are still 1.8M lines, so the answer remains around 0.4 to 0.5%.
Even better, I should have identified and filtered out all binary files. My blame-author file has some binary lines due to them. Again, I believe it's a minor error, but it's there nonetheless.
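One way this filtering could be done is sketched below. It assumes a grep that supports -I (GNU and BSD grep both do), which makes grep treat a binary file as containing no match. The helper names are hypothetical, and this filter was not used for the numbers above.

```shell
# Hypothetical helpers for narrowing the list of files before blaming.

# Succeeds if grep considers the file to be text. The empty pattern matches
# every line of a text file; with -I a binary file never matches.
is_text() {
    grep -Iq '' -- "$1"
}

# Read paths on stdin; print those outside tests/ref/ and
# tests/reference.pnm that also look like text files.
blameable_paths() {
    grep -v -e '^tests/ref/' -e '^tests/reference\.pnm$' |
        while IFS= read -r f; do
            if is_text "$f"; then printf '%s\n' "$f"; fi
        done
}
```

From the top of the clone, git ls-tree -r --name-only HEAD | blameable_paths would then give the list of files to feed to git blame. Empty files are dropped too, which is harmless because they have no lines to attribute.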
The four COPYING.*GPL* files are included, but were obviously not written by whoever committed them. That's only 1680 lines, but credit is given for committing something, not for actually writing it. git blame is a crude tool.
492 of those lines are attributed to Bellard himself, so leaving the license files out would reduce the estimate of his surviving contribution to about 0.43% of the code base (8359 of the remaining 1941139 lines).
git blame can accept an --ignore-revs-file FILENAME option, where the file lists commits that blame should skip over, such as ones that only apply style changes. E.g., I use that in my repos to exclude the commits where I am just reformatting Python code with black, or you could use it to ignore commits that only change CRLF to LF line endings in some files. I did not try to find style-only commits in FFmpeg, but one could improve the significance of these statistics by doing so.
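A rough sketch of how such an ignore list could be used follows. The file name .git-blame-ignore-revs is only a common convention, the helper names are hypothetical, and which FFmpeg commits would belong in the list is left open. It assumes a git recent enough to have --ignore-revs-file.

```shell
# Hypothetical file name; one full commit hash per line.
IGNORE_FILE=${IGNORE_FILE:-.git-blame-ignore-revs}

# Append a commit (resolved to its full hash) to the ignore list.
ignore_commit() {
    git rev-parse --verify "$1^{commit}" >> "$IGNORE_FILE"
}

# Print one author per line of a file, skipping over the listed commits so
# that a reformatted line is credited to whoever touched it before that.
blame_ignoring() {
    git blame --ignore-revs-file "$IGNORE_FILE" --line-porcelain -- "$1" |
        sed -n 's/^author //p'
}
```

The same list can be made the default for plain git blame with git config blame.ignoreRevsFile .git-blame-ignore-revs.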
I didn't see the name Lantau anywhere, so I assume all of Bellard's contributions are recorded under his own name.
For future reference, should anyone actually care, my analysis is based on this commit, which is the HEAD of the master branch at the moment of writing:
commit 8ad988ac37d4d92dbb60796e26c3ad558a3eebeb (HEAD -> master, origin/master, origin/HEAD)
Author: Saliev, Rafik F <[email protected]>
Date: Fri Dec 16 09:37:27 2022 +0000
Upvotes: 14
Reputation: 6052
This was simpler to do than I expected; it turns out it could all be done in Git.
First I cloned FFmpeg from its Git server and waited a few minutes for Git to download the several hundred megabytes that make up the FFmpeg codebase:
git clone https://git.ffmpeg.org/ffmpeg.git
Since git shortlog -sne --all prints a full list of contributors by number of commits, I did:
$ git shortlog -sne --all | grep fabrice
613 Fabrice Bellard <[email protected]>
Interestingly, git shortlog -sne --all | grep lantau doesn't return anything, despite "Gerard Lantau" being widely cited as the pseudonym that he wrote FFmpeg under.
I then got a list of all 613 of Bellard's commits with:
git log --author="Fabrice Bellard"
This shows that the last of these commits was made in 2015.
Doing:
git log --author="Fabrice Bellard" --reverse
...shows that the first one was made in December 2000, via Subversion:
commit 9aeeeb63f7e1ab7b0b7bb839a5f258667a2d2d78
Author: Fabrice Bellard <[email protected]>
Date: Wed Dec 20 00:02:47 2000 +0000
Initial revision
Originally committed as revision 2 to svn://svn.ffmpeg.org/ffmpeg/trunk
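As a naive answer to the question, I can calculate the number of commits Fabrice Bellard made as a percentage of all the commits ever made to FFmpeg. git log --all | wc -l prints 1412173 (1.4 million), and git shortlog -sne --all | wc -l lists 2,368 contributor identities. Note, however, that git log --all | wc -l counts lines of log output rather than commits, and every commit takes up several lines, so 1.4 million overstates the real commit count; git rev-list --all --count reports the number of commits directly.
613 as a percentage of 1,412,173 is about 0.043%, so by this measure Bellard's commits come to roughly 0.04% of the total, with the other ~2,000 contributors responsible for the rest. Because the denominator is inflated, his true share of commits is several times higher than that, and in any case a share of commits is not the same thing as a share of the codebase.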
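Commit counts can be taken straight from git rev-list instead of counting lines of log output. A minimal sketch, where commit_share is a hypothetical helper name and the author pattern is whatever name or email fragment you want to match:

```shell
# Hypothetical helper: share of all commits reachable from any ref whose
# author name or email matches the pattern in $1.
commit_share() {
    total=$(git rev-list --all --count)
    mine=$(git rev-list --all --count --author="$1")
    awk -v m="$mine" -v t="$total" \
        'BEGIN { printf "%d of %d commits (%.2f%%)\n", m, t, 100 * m / t }'
}
```

Inside the clone, commit_share "Fabrice Bellard" would print his commit count, the total, and the ratio. This still measures activity rather than surviving code.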
This is interesting, but commits as a metric don't seem like they would paint a realistic picture - to me a more interesting but more complex metric would be how many lines of code that Fabrice Bellard wrote still exist in the FFmpeg codebase. I don't know if this is possible with Git, and if it is, I definitely don't know how to do it accurately.
Upvotes: 3