Hashim Aziz

Reputation: 6052

How much of modern FFmpeg is written by Fabrice Bellard?

FFmpeg is considered by many to be the work of Fabrice Bellard, and maybe even his magnum opus, but since he stopped contributing to the project (under the pseudonym Gérard Lantau) in 2004, I wondered how much of it can actually still be said to be his. By comparison, Linus Torvalds' Wikipedia page states:

As of 2006, approximately 2% of the Linux kernel was written by Torvalds himself.[28] Because thousands have contributed to it, his percentage is still one of the largest. However, he said in 2012 that his own personal contribution is now mostly merging code written by others, with little programming.

This despite the fact that Torvalds is still an active contributor to the Linux kernel, whereas Bellard hasn't been an active contributor to FFmpeg for almost two decades.

FFmpeg being an open-source project tracked with Git, it seems like the question should be technically and objectively answerable, but as someone who hates mailing lists and the generally archaic ways that big open-source projects like to do things, I wouldn't know where to start in doing so.

Just how much of the modern FFmpeg codebase is Fabrice Bellard actually responsible for, in comparison to the other FFmpeg devs?

Upvotes: 6

Views: 4502

Answers (2)

joanis

Reputation: 12193

TL;DR

Using git blame, you can conclude that Bellard is the person who last touched 8851 of the 1942819 lines in the code base, or 0.46% of them.

Details

With some 8000 files in the repo containing a total of nearly 2 million lines, running git blame on each file will take a long time, but it lets you see how many of the lines still in the repo were contributed by Bellard/Lantau. As @Gyan says, though, this only reports lines that are exactly as he wrote them: any change in whitespace or style is attributed to the person who made that trivial change.

That being said, here's the loop:

git clone https://github.com/FFmpeg/FFmpeg
cd FFmpeg
git ls-tree -r --name-only HEAD | while IFS= read -r f ; do git blame -- "$f" ; done > blame

That loop will take a long time to run (it took about 5 hours on my computer), but eventually you'll be able to extract the author from each line with something like this:

cat blame | sed -e 's/ *20[012][0-9].*//' -e 's/^[^(]*(//' > blame-author

that's based on parsing lines from the blame output that look like this:

f1ab71b0463 (Timo Rothenpieler   2017-05-11 22:53:41 +0200 26) *.ptx.c
6bcd3e05998 (Federico Tomassetti 2015-08-13 20:13:48 +0200 11) compiler:
5d3049559af COPYING.GPL (Diego Biurrun 2007-07-12 20:27:07 +0000 187) the Program or works based on it.

My crude parser is not perfect, but it's enough to get statistics out of a crude tool like blame.
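A less fragile way to pull out the author, as a sketch: git blame has a --line-porcelain mode that prints a header block for every line, including a line starting with "author ". The function names below are my own and not part of Git, and I have not timed this variant on the full repo.

```shell
# Read `git blame --line-porcelain` output on stdin and print one author
# name per blamed line. Header lines start with "author "; file content
# lines start with a tab, so they can never match.
porcelain_authors() {
    sed -n 's/^author //p'
}

# Hypothetical wrapper: blame every tracked file of the current repository
# and rank authors by surviving lines. It is only defined here; call it
# from inside a clone.
rank_authors() {
    git ls-tree -r --name-only -z HEAD |
        xargs -0 -n 1 git blame --line-porcelain -- 2>/dev/null |
        porcelain_authors | sort | uniq -c | sort -nr
}
```

Because the author comes from a dedicated header line, there is no need to guess where the name ends and the date begins.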

Let's count lines by authors, now:

cat blame-author | sort | uniq -c | sort -nr | less -N

shows the list of contributors to the code base, ranked from high to low by the number of lines last touched by that contributor according to the commit logs. Here's the top 50 list:

      1  209136 Paul B Mahol
      2  121248 Michael Niedermayer
      3  114289 Anton Khirnov
      4  109653 Andreas Rheinhardt
      5   75457 Diego Biurrun
      6   54739 Ronald S. Bultje
      7   48739 James Almer
      8   48571 Kostya Shishkov
      9   48096 Shivraj Patil
     10   44086 Martin Storsjö
     11   41019 Mark Thompson
     12   40305 Clément Bœsch
     13   37204 Stefano Sabatini
     14   34637 Vittorio Giovara
     15   26003 Luca Barbato
     16   21898 Justin Ruggles
     17   20845 Mans Rullgard
     18   20403 Lynne
     19   20172 Nicolas George
     20   19849 Vitor Sessak
     21   18044 Kaustubh Raste
     22   17297 Aurelien Jacobs
     23   16258 Måns Rullgård
     24   15242 Hao Chen
     25   14281 Peter Ross
     26   13971 Mike Melanson
     27   13943 Marton Balint
     28   11798 Guillaume Martres
     29   11284 Rostislav Pehlivanov
     30   11013 Shiyou Yin
     31   10836 foo86
     32    9895 Baptiste Coudurier
     33    9375 Derek Buitenhuis
     34    9367 Janne Grunau
     35    9214 Matthieu Bouron
     36    9160 Carl Eugen Hoyos
     37    9065 wm4
     38    8851 Fabrice Bellard
     39    8813 Zhou Xiaoyong
     40    8625 Timo Rothenpieler
     41    8410 Reimar Döffinger
     42    8361 Steven Liu
     43    7409 Timothy Gu
     44    7147 Thilo Borgmann
     45    6886 Lukasz Marek
     46    6667 Martin Vignali
     47    6445 Ben Avison
     48    6274 Limin Wang
     49    6213 rcombs
     50    6138 Daniel Kang

In this list, you can find Bellard in position 38, with 8851 lines, or 0.46% of the 1942819 lines wc -l blame-author says were analyzed.
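For anyone who wants to recompute the share for another author, here is a minimal sketch. It assumes a file with one author name per line, like my blame-author; the function name author_share is made up for this example.

```shell
# Print the percentage of lines in an author-per-line file (such as
# blame-author) that exactly match the given name.
author_share() {
    awk -v name="$1" '
        $0 == name { hits++ }
        END { if (NR > 0) printf "%.2f\n", 100 * hits / NR }
    ' "$2"
}

# The same arithmetic for the totals quoted above; prints 0.46
awk 'BEGIN { printf "%.2f\n", 100 * 8851 / 1942819 }'
```

Inside the working directory, author_share "Fabrice Bellard" blame-author would be the call for the figure discussed here.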

Methodological limitations

I should have removed tests/ref and tests/reference.pnm from my processing, since those include a lot of binary files, but without them there are still 1.8M lines, so the answer remains around 0.4 to 0.5%.

Even better, I should have identified and filtered out all binary files. My blame-author file has some binary lines due to them. Again, I believe it's a minor error, but it's there nonetheless.

The four COPYING.*GPL* files are included, but were obviously not written by whoever committed them. That's only 1680 lines, but it shows that git blame credits whoever committed a line, not whoever actually wrote it. git blame is a crude tool. 492 of those lines are attributed to Bellard himself, so leaving them out would reduce the estimate of his surviving contribution to about 0.43% of the code base (8359 of 1941139 lines).
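A sketch of how the file list could be filtered before running blame, assuming tests/ref is a directory and that the licence files all start with COPYING. at the top level. The function name exclude_noise is my own.

```shell
# Drop the paths discussed above from a list of file names on stdin:
# everything under tests/ref/, tests/reference.pnm, and the COPYING.* files.
exclude_noise() {
    grep -v -e '^tests/ref/' -e '^tests/reference\.pnm$' -e '^COPYING\.'
}
# Intended use inside a clone:
#   git ls-tree -r --name-only HEAD | exclude_noise |
#       while IFS= read -r f ; do git blame -- "$f" ; done > blame
```

This only filters by path. It does not detect binary files elsewhere in the tree.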

git blame can accept a --ignore-revs-file FILENAME option that lists commits that only apply style changes. E.g., I use that in my repos to exclude the commits where I am just reformatting Python code with black, or you could use it to ignore commits that only change CRLF to LF line endings in some files. I did not try to find style-only commits in FFmpeg but one could improve the significance of these statistics by doing so.
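One rough way to build such a file is to pick commits by their subject line. This is only a sketch: the keyword list is my guess at what style-only commits might be called, it will produce both false positives and misses, and the function name is made up.

```shell
# Read "HASH SUBJECT" lines on stdin and print the hashes whose subject
# contains one of a few style-related keywords (assumed, not exhaustive).
select_style_commits() {
    awk '{
        subject = tolower($0)
        sub(/^[^ ]+ /, "", subject)      # drop the leading hash
        if (subject ~ /cosmetic|whitespace|reindent|indentation/) print $1
    }'
}
# Intended use inside a clone (the file name is a common convention only):
#   git log --format='%H %s' | select_style_commits > .git-blame-ignore-revs
#   git blame --ignore-revs-file .git-blame-ignore-revs -- path/to/file.c
```

The resulting list should be reviewed by hand before trusting any statistics built on it.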

I didn't see the name Lantau anywhere, so I assume all of Bellard's contributions are recorded under his own name.

For future reference, should anyone actually care, my analysis is based on this commit, which is the HEAD of the master branch at the moment of writing:

commit 8ad988ac37d4d92dbb60796e26c3ad558a3eebeb (HEAD -> master, origin/master, origin/HEAD)
Author: Saliev, Rafik F <[email protected]>
Date:   Fri Dec 16 09:37:27 2022 +0000

Upvotes: 14

Hashim Aziz

Reputation: 6052

Naive answer: calculating percentage of commits

This was simpler to do than I expected; it turns out it could all be done in Git.

First I cloned FFmpeg from its Git server and waited a few minutes for Git to download the several hundred megabytes that make up the FFmpeg codebase:

git clone https://git.ffmpeg.org/ffmpeg.git

Since git shortlog -sne --all prints a full list of contributors by number of commits, I did:

$ git shortlog -sne --all | grep fabrice
613  Fabrice Bellard <[email protected]>

Interestingly, git shortlog -sne --all | grep lantau doesn't return anything, despite "Gerard Lantau" widely being cited as the pseudonym that he wrote FFmpeg under.
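Note that grep is case-sensitive by default, so a lowercase pattern only matches if the lowercase text appears somewhere on the line, for example in the e-mail address. A minimal sketch of a case-insensitive search that also adds up the counts when the same person appears under several identities (shortlog_total is a name I made up):

```shell
# Sum the commit counts of every `git shortlog -sne` line (read on stdin)
# whose name or e-mail contains the given text, ignoring case.
shortlog_total() {
    needle=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    awk -v needle="$needle" '
        index(tolower($0), needle) { total += $1 }
        END { print total + 0 }
    '
}
# Intended use inside a clone:
#   git shortlog -sne --all | shortlog_total lantau
```

A result of 0 means no name or e-mail in the history contains the text at all.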

I then got a list of all 613 of Bellard's commits with:

git log --author="Fabrice Bellard"

This shows that the last of these commits was made in 2015.

Doing:

git log --author="Fabrice Bellard" --reverse

...shows that the first one was made in December 2000, via Subversion:

commit 9aeeeb63f7e1ab7b0b7bb839a5f258667a2d2d78   
Author: Fabrice Bellard <[email protected]>
Date:   Wed Dec 20 00:02:47 2000 +0000

Initial revision

Originally committed as revision 2 to svn://svn.ffmpeg.org/ffmpeg/trunk

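As a naive answer to the question, I can calculate the number of commits Fabrice Bellard made as a percentage of all the commits ever made to FFmpeg. git log --all | wc -l prints 1412173 (1.4 million), and git shortlog -sne --all | wc -l lists 2,368 contributor identities. Be aware that git log | wc -l counts every line of the log output rather than one line per commit, so the real number of commits is considerably smaller than 1.4 million; git rev-list --count --all would give the actual figure.

613 as a percentage of 1,412,173 is 0.0434%, so on this measure Fabrice Bellard's commits are about 0.04% of the total, with the other ~2000 contributors responsible for the rest. Because the denominator is inflated, his true share of commits is higher than that, and in any case a share of commits is not the same thing as a share of the codebase.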
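To count commits rather than lines of log output, here is a minimal sketch. The function name is my own, and it assumes the default git log format, where each commit starts with a "commit <hash>" line and message lines are indented.

```shell
# Count commits in `git log` output read on stdin by matching only the
# "commit <hash>" header lines instead of counting every line.
count_log_commits() {
    grep -c -E '^commit [0-9a-f]{40}'
}
# Inside a clone, Git can also do the count directly:
#   git rev-list --count --all
#   git rev-list --count --all --author='Fabrice Bellard'
```

Dividing the second rev-list figure by the first gives the share of commits.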

This is interesting, but commits as a metric don't seem like they would paint a realistic picture - to me a more interesting but more complex metric would be how many lines of code that Fabrice Bellard wrote still exist in the FFmpeg codebase. I don't know if this is possible with Git, and if it is, I definitely don't know how to do it accurately.

Upvotes: 3
