Reputation: 541
I am working on a statistical analysis of some Python libraries whose source code is available on GitHub. Is there a way to find out how many times a certain library has been used in other applications? GitHub Insights only provides information for one month, which is not enough in my case to compare the popularity of the libraries.
Thanks in advance.
Upvotes: 2
Views: 704
Reputation: 2933
You can use githunt.
However, you will have to write some code to extract the information from the HTML pages, for example with Python's Beautiful Soup library.
There is also a Kaggle dataset, but it is not regularly updated and is limited to a specific domain.
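As a rough illustration of the scraping step mentioned above (the HTML below is a made-up snippet, not githunt's actual markup; a real page would be fetched with `requests` and have different tags and class names):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a fetched page;
# in practice you would use e.g. requests.get(url).text here.
html = """
<ul class="repos">
  <li><span class="name">repo-a</span><span class="stars">120</span></li>
  <li><span class="name">repo-b</span><span class="stars">75</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Map each repository name to its star count
repos = {
    li.find("span", class_="name").text: int(li.find("span", class_="stars").text)
    for li in soup.find_all("li")
}
print(repos)  # {'repo-a': 120, 'repo-b': 75}
```

You would need to adapt the tag and class names to whatever structure the page you scrape actually uses.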
Upvotes: 0
Reputation: 1200
Yes, there is. I have recently performed research on this topic. First and foremost, I would recommend https://sourcegraph.com/search. Sourcegraph hosts millions of repositories and allows for very powerful searches over them. On this website, you can search for e.g. content:"import my_module" language:Python
to find a significant number of uses of my_module
in practice. This tool supports many different filters and is quite useful. (I have no affiliation with Sourcegraph.)
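If you would rather not depend on an external search service, the same kind of import counting can be done over repositories you have cloned locally. This is a minimal sketch using only Python's standard `ast` module, assuming the code you want to analyze is already on disk (the directory layout is up to you):

```python
import ast
from pathlib import Path

def count_imports(root: str, module: str) -> int:
    """Count `import module` / `from module import ...` statements
    in all .py files under `root`."""
    count = 0
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                # `import nltk` or `import nltk.corpus` both count
                count += sum(
                    1 for alias in node.names
                    if alias.name.split(".")[0] == module
                )
            elif isinstance(node, ast.ImportFrom):
                # `from nltk.corpus import stopwords`
                if node.module and node.module.split(".")[0] == module:
                    count += 1
    return count
```

For example, `count_imports("cloned_repos", "nltk")` would count import statements of `nltk` across everything under a hypothetical `cloned_repos` directory. This only scales to however many repositories you are willing to clone, which is why a hosted index is more practical for broad popularity comparisons.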
I would also like to share the result of my aforementioned research here: I produced a module entitled module_dependencies
, which can be used for exactly this task. It relies on Sourcegraph, and can be used like so:
from module_dependencies import Module
from pprint import pprint
# Attempt to find 1000 imports of the "nltk" module
# in both Python files and Jupyter Notebooks each
module = Module("nltk", count="1000")
# Frequency of use of objects within the module
pprint(module.usage()[:15])
# How frequently the module was used (not particularly useful unless count="all")
print("NLTK was used", module.nested_usage()["nltk"]['occurrences'], "times")
# Show an interactive plot
module.plot()
This program outputs:
[2022-01-03 14:14:39,127] [module_dependencies.module.session] [INFO ] - Fetching Python source code containing imports of `nltk`...
[2022-01-03 14:14:42,824] [module_dependencies.module.session] [INFO ] - Fetched Python source code containing imports of `nltk` (status code 200)
[2022-01-03 14:14:42,825] [module_dependencies.module.session] [INFO ] - Parsing 6,830,859 bytes of Python source code as JSON...
[2022-01-03 14:14:42,865] [module_dependencies.module.session] [INFO ] - Parsed 6,830,859 bytes of Python source code as JSON...
[2022-01-03 14:14:42,866] [module_dependencies.module.session] [INFO ] - Extracting dependencies of 725 files of Python source code...
Parsing Files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 725/725 [00:02<00:00, 258.48files/s]
[2022-01-03 14:14:45,702] [module_dependencies.module.session] [INFO ] - Extracted dependencies of 725 files of Python source code.
[2022-01-03 14:14:45,703] [module_dependencies.module.session] [INFO ] - Fetching Jupyter Notebook source code containing imports of `nltk`...
[2022-01-03 14:14:48,726] [module_dependencies.module.session] [INFO ] - Fetched Jupyter Notebook source code containing imports of `nltk` (status code 200)
[2022-01-03 14:14:48,726] [module_dependencies.module.session] [INFO ] - Parsing 25,713,281 bytes of Jupyter Notebook source code as JSON...
[2022-01-03 14:14:48,886] [module_dependencies.module.session] [INFO ] - Parsed 25,713,281 bytes of Jupyter Notebook source code as JSON...
[2022-01-03 14:14:48,888] [module_dependencies.module.session] [INFO ] - Extracting dependencies of 495 files of Jupyter Notebook source code...
Parsing Files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 495/495 [00:02<00:00, 167.09files/s]
[2022-01-03 14:14:51,851] [module_dependencies.module.session] [INFO ] - Extracted dependencies of 495 files of Jupyter Notebook source code.
[('nltk.tokenize.word_tokenize', 327),
('nltk.download', 298),
('nltk.corpus.stopwords.words', 257),
('nltk.tokenize.sent_tokenize', 126),
('nltk.stem.porter.PorterStemmer', 115),
('nltk.stem.wordnet.WordNetLemmatizer', 99),
('nltk.tag.pos_tag', 75),
('nltk.stem.snowball.SnowballStemmer', 48),
('nltk.data.path.append', 42),
('nltk.probability.FreqDist', 42),
('nltk.tokenize.RegexpTokenizer', 42),
('nltk.tokenize.TweetTokenizer', 35),
('nltk.corpus.wordnet.synsets', 33),
('nltk.data.load', 32),
('nltk.translate.bleu_score.corpus_bleu', 29)]
NLTK was used 2487 times
It then opens an interactive plot. This plot can be interacted with to see the number of times each section of the module was used, including the root itself.
To very concisely answer your question, you can use the following:
from module_dependencies import Module
mod_name = "mymodule"
module = Module(mod_name, count="all")
print(f"{mod_name} was used {module.nested_usage()[mod_name]['occurrences']} times")
This provides a clear, verifiable number of uses in real projects hosted on GitHub (or GitLab). module_dependencies
also extracts links to the repositories and files that use your module of interest, and tracks how many stars each of those repositories has, in case that is useful for your analysis.
See https://tomaarsen.github.io/module_dependencies/ for the documentation of module_dependencies
. Once again: I am the author of this module.
Upvotes: 4