Patrick McCarthy

Reputation: 2538

Make my awk shell script more efficient (parsing Python)

I was recently tasked with auditing all of the Python modules my team uses across our entire production code repository.

I came up with the following:

find ~/svn/ -name '*.py' |
    xargs grep -hn "^import\|^from" |
    awk -F ":" '{print $2}' |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g' |
    sort |
    uniq > /tmp/pythonpkgs.txt

In sequence, it finds every .py file under ~/svn/, greps out the lines that start with import or from, strips the line-number prefix that grep -n adds, pulls out the module names (the word after from, or everything after import), splits comma-separated names onto their own lines, then sorts and de-duplicates the result into /tmp/pythonpkgs.txt.

I hacked this together on my own, but I imagine it could be better. How would you do it? Consolidate the awks?

Upvotes: 0

Views: 162

Answers (3)

swstephe

Reputation: 1910

Here is my consolidated awk:

/^[ \t]*import .* as/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*from (.*) import/ {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*import (.*)/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    n = split(substr($0,7),a,",");  # split the list after "import" on commas; n = element count
    for (i=1;i<=n;i++) {
        gsub("[ \t]+","",a[i]); # trim whitespace
        print a[i];
    }
    next;
}

Call with:

find . -name '*.py' -exec awk -f scan.awk {} \; | sort | uniq

As noted, it doesn't handle a few cases, such as lines joined with ';', split with '\', or grouped with '()', but it should cover the majority of Python code.

Upvotes: 0

georg

Reputation: 214979

Grepping source code for specific constructs is pretty fragile; there are many situations where it can fail. Consider, for example:

import foo ; print 123

or

import foo, \
   bar

or

str = '''
import foo
'''

etc.

If you're interested in a more robust approach, this is how you can reliably parse out imported names using Python's own parser (the ast module):

import ast

def list_imports(source):
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for name in node.names:
                yield name.name
        elif isinstance(node, ast.ImportFrom):
            if node.module:  # module is None for relative imports like "from . import x"
                yield node.module

Usage:

for name in sorted(set(list_imports(some_source))):
    print(name)
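
To mirror the original pipeline end to end, here is a minimal sketch that feeds every .py file under ~/svn/ through list_imports and writes the sorted result to /tmp/pythonpkgs.txt. It assumes Python 3 and readable source files; iter_python_files is just an illustrative helper name, not part of any library.

import os

def iter_python_files(root):
    # walk the tree and yield every .py path, like the original find
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py"):
                yield os.path.join(dirpath, filename)

modules = set()
for path in iter_python_files(os.path.expanduser("~/svn/")):
    with open(path) as f:
        try:
            modules.update(list_imports(f.read()))  # list_imports from above
        except SyntaxError:
            pass  # skip files that don't parse

with open("/tmp/pythonpkgs.txt", "w") as out:
    out.write("\n".join(sorted(modules)) + "\n")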

Upvotes: 2

user559633


Pretty clever setup to start with, but there are a couple places where it can be cleaned up:

1: find ~/svn/ -name '*.py' |
2: xargs grep -hn "^import\|^from" |
3: awk -F ":" '{print $2}' |
4: awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
5: sed 's/,\| /\n/g' |
6: sort |
7: uniq > /tmp/pythonpkgs.txt

Line 3: You don't need the first awk split/print -- just don't include -n on the grep so that you don't include the line number in the output.

time find ./<<my_large_project>> -name '*.py' |
    xargs grep -h "^import\|^from" |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g' |
    sort |
    uniq
~~snip~~
real    0m0.492s
user    0m0.208s
sys     0m0.116s

Lines 6-7 and 4-5: If you have a lot of duplicate lines, you can speed up execution by sorting and uniq-ing before running your awk and sed:

time find ./<<my_large_project>> -name '*.py' |
    xargs grep -h "^import\|^from" |
    sort |
    uniq |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g'
~~snip~~
real    0m0.464s
user    0m0.224s
sys     0m0.140s

Note that this will miss multi-line imports as described in PEP 0328. Supporting them would make your regex search relatively non-trivial, since you would have to look for optional parentheses and keep track of the preceding whitespace.
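
For reference, here is the kind of parenthesized multi-line import described above; the module names are just illustrative:

# PEP 328 style grouping: only the first line matches ^from, so the pipeline
# records os.path but never sees the names on the continuation lines.
from os.path import (
    join,
    basename,
    dirname,
)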

Upvotes: 2
