Reputation: 2538
I was recently tasked with auditing all of the python modules my team uses across our entire production code repository.
I came up with the following:
find ~/svn/ -name '*.py' |
xargs grep -hn "^import\|^from" |
awk -F ":" '{print $2}' |
awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
sed 's/,\| /\n/g' |
sort |
uniq > /tmp/pythonpkgs.txt
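In sequence, it
1. finds all Python files under ~/svn/
2. greps them for lines beginning with import or from
3. splits on the : character and uses what follows, so the file name and the line number of the output aren't included
4. if the line is of form from foo import bar, prints foo, else if it's of form import foo, prints foo
5. replaces commas and spaces with newlines, for lines like import a, b, c
6. sorts the output
7. keeps only the unique values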
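For comparison, the same stages can be sketched in Python. This is only a rough mirror of the pipeline working on lines already read into memory; the function name modules_from_lines is made up for illustration, and it deliberately keeps the pipeline's quirks (for example, any line containing "from" is treated as a from-import).

```python
import re


def modules_from_lines(lines):
    """Mimic the grep/awk/sed/sort/uniq stages on an iterable of source lines."""
    names = set()
    for line in lines:
        # grep stage: keep only lines that start with "import" or "from"
        if not re.match(r"import|from", line):
            continue
        fields = line.split()
        if "from" in line:
            # awk stage, "from foo import bar": the module is the second field
            if len(fields) > 1:
                names.add(fields[1])
        else:
            # awk + sed stages, "import a, b, c": drop the keyword,
            # then split what is left on commas and whitespace
            rest = " ".join(fields[1:])
            names.update(part for part in re.split(r"[,\s]+", rest) if part)
    # sort + uniq stages
    return sorted(names)
```

Calling modules_from_lines(open(path)) on one file at a time and merging the results would stand in for the find and xargs stages.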
I hacked this together on my own, but I imagine it could be better. How would you do it? Consolidate the awks?
Upvotes: 0
Views: 162
Reputation: 1910
Here is my consolidated awk:
/^[ \t]*import .* as/ {
    sub("^[ \t]+","");      # remove leading whitespace
    sub("[ \t]*#.*","");    # remove comments
    print $2;
    next;
}
/^[ \t]*from (.*) import/ {
    sub("^[ \t]+","");      # remove leading whitespace
    sub("[ \t]*#.*","");    # remove comments
    print $2;
    next;
}
/^[ \t]*import (.*)/ {
    sub("^[ \t]+","");      # remove leading whitespace
    sub("[ \t]*#.*","");    # remove comments
    n = split(substr($0,7),a,",");  # split on commas; n is the number of pieces
    for (i=1;i<=n;i++) {
        gsub("[ \t]+","",a[i]);     # trim whitespace
        print a[i];
    }
    next;
}
Call with:
find . -name '*.py' -exec awk -f scan.awk {} \; | sort | uniq
As noted, it doesn't take care of a few potential cases, such as lines joined with ';' or split with '\', or grouped with '()', but it would cover the majority of Python code.
Upvotes: 0
Reputation: 214979
Grepping source code for specific constructs is pretty fragile; there are many situations where it may fail. Consider, for example:
import foo ; print 123
or
import foo, \
bar
or
str = '''
import foo
'''
etc.
If you're interested in a more robust approach, this is how you can reliably parse out imported names using Python's own parser, via the ast module:
import ast

def list_imports(source):
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for name in node.names:
                yield name.name
        # node.module is None for "from . import x", so skip that case
        elif isinstance(node, ast.ImportFrom) and node.module:
            yield node.module
Usage:
for name in sorted(set(list_imports(some_source))):
    print(name)
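To run this over a whole checkout rather than one source string, something like the following might work. It is a sketch under a few assumptions of mine: the function name top_level_imports is hypothetical, only the top-level package name is kept (os for os.path), relative imports are skipped because they point inside the project, and files that fail to parse are silently ignored.

```python
import ast
import os


def top_level_imports(root):
    """Collect top-level package names imported by .py files under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            # Read bytes so ast.parse can honour any encoding declaration
            with open(path, "rb") as handle:
                source = handle.read()
            try:
                tree = ast.parse(source, filename=path)
            except (SyntaxError, ValueError):
                # e.g. a file written for a different Python version; skip it
                continue
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        found.add(alias.name.split(".")[0])
                elif isinstance(node, ast.ImportFrom):
                    # level > 0 means a relative import (from . import x)
                    if node.level == 0 and node.module:
                        found.add(node.module.split(".")[0])
    return sorted(found)
```

Usage would be along the lines of print("\n".join(top_level_imports(os.path.expanduser("~/svn")))). Note that files are parsed with the grammar of whichever Python runs the script, so code written for another major version may be skipped.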
Upvotes: 2
Reputation:
Pretty clever setup to start with, but there are a couple of places where it can be cleaned up:
1: find ~/svn/ -name *.py
2: | xargs grep -hn "^import\|^from"
3: | awk -F ":" '{print $2}'
4: | awk '{if (/from/) print $2; else {$1 = ""; print $0} }'
5: | sed 's/,\| /\n/g'
6: | sort
7: | uniq > /tmp/pythonpkgs.txt
Line 3: You don't need the first awk split/print. Just leave -n off the grep so that the line number isn't included in the output.
time find ./<<my_large_project>> -name '*.py' |
xargs grep -h "^import\|^from" |
awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
sed 's/,\| /\n/g' |
sort |
uniq
~~snip~~
real 0m0.492s
user 0m0.208s
sys 0m0.116s
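Lines 6-7, and 4-5: If you have a lot of duplicate lines, you can speed up your execution by sorting and uniq-ing before running your awk and sed. The final output can then contain duplicates again (for example, import os and from os import path both yield os), so finish with another sort | uniq if you need unique names.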
time find ./<<my_large_project>> -name '*.py' |
xargs grep -h "^import\|^from" |
sort |
uniq |
awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
sed 's/,\| /\n/g'
~~snip~~
real 0m0.464s
user 0m0.224s
sys 0m0.140s
Note that this will miss multi-line imports as described in PEP 328. Supporting these imports would make your regex search relatively non-trivial, as you would have to look for optional parentheses and keep track of the preceding whitespace.
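If you want to stay line-based anyway, one possible workaround is to let Python's tokenize module collapse each statement onto a single line first, so that parenthesized and backslash-continued imports look like ordinary one-liners. The sketch below is only an illustration: import_statements is a name I made up, it treats ';' as a statement separator, and it will not see imports that follow a colon on the same line (if x: import foo).

```python
import io
import tokenize

# Token types that carry no statement content
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.INDENT, tokenize.DEDENT}


def import_statements(source):
    """Yield each import statement in source as one space-joined line."""
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        ends_statement = (
            tok.type in (tokenize.NEWLINE, tokenize.ENDMARKER)
            or (tok.type == tokenize.OP and tok.string == ";")
        )
        if not ends_statement:
            parts.append(tok.string)
            continue
        # Only statements whose first token is the keyword count, so an
        # "import foo" inside a string literal is ignored.
        if parts and parts[0] in ("import", "from"):
            yield " ".join(parts)
        parts = []
```

Each yielded line, such as "from foo import ( a , b , )", could then be fed to the existing awk and sed stages, with the parentheses stripped as one extra step.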
Upvotes: 2