Patrick McCarthy

Reputation: 2538

Make my awk shell script more efficient (parsing Python)

I was recently tasked with auditing all of the Python modules my team uses across our entire production code repository.

I came up with the following:

find ~/svn/ -name '*.py' |
    xargs grep -hn "^import\|^from" |
    awk -F ":" '{print $2}' |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g' |
    sort |
    uniq > /tmp/pythonpkgs.txt

In sequence, it finds every .py file under ~/svn/, greps out the lines that start with import or from, strips the line-number prefix that grep -n adds, pulls out the module names (the word after from, or everything after import), splits comma-separated names onto their own lines, then sorts and de-duplicates the result into /tmp/pythonpkgs.txt.

I hacked this together on my own, but I imagine it could be better. How would you do it? Consolidate the awks?

Upvotes: 0

Views: 162

Answers (3)

swstephe

Reputation: 1910

Here is my consolidated awk:

/^[ \t]*import .* as/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*from (.*) import/ {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    print $2;
    next;
}
/^[ \t]*import (.*)/  {
    sub("^[ \t]+","");          # remove leading whitespace
    sub("[ \t]*#.*","");        # remove comments
    n = split(substr($0,7),a,",");  # split the list after "import" on commas; n = element count
    for (i=1;i<=n;i++) {
        gsub("[ \t]+","",a[i]); # trim whitespace
        print a[i];
    }
    next;
}

Call with:

find . -name '*.py' -exec awk -f scan.awk {} \; | sort | uniq

As noted, it doesn't handle a few cases, such as lines joined with ';', split with '\', or grouped with '()', but it should cover the majority of Python code.

Upvotes: 0

georg

Reputation: 214979

Grepping source code for specific constructs is pretty fragile; there are many situations where it can fail. Consider, for example:

import foo ; print 123

or

import foo, \
   bar

or

str = '''
import foo
'''

etc.

If you're interested in a more robust approach, this is how you can reliably parse out imported names using Python's own parser (the ast module):

import ast

def list_imports(source):
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for name in node.names:
                yield name.name
        elif isinstance(node, ast.ImportFrom):
            if node.module:  # module is None for relative imports like "from . import x"
                yield node.module

Usage:

for name in sorted(set(list_imports(some_source))):
    print(name)
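
To mirror the original pipeline end to end, here is a minimal sketch that feeds every .py file under ~/svn/ through list_imports and writes the sorted result to /tmp/pythonpkgs.txt. It assumes Python 3 and readable source files; iter_python_files is just an illustrative helper name, not part of any library.

import os

def iter_python_files(root):
    # walk the tree and yield every .py path, like the original find
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py"):
                yield os.path.join(dirpath, filename)

modules = set()
for path in iter_python_files(os.path.expanduser("~/svn/")):
    with open(path) as f:
        try:
            modules.update(list_imports(f.read()))  # list_imports from above
        except SyntaxError:
            pass  # skip files that don't parse

with open("/tmp/pythonpkgs.txt", "w") as out:
    out.write("\n".join(sorted(modules)) + "\n")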

Upvotes: 2

user559633


Pretty clever setup to start with, but there are a couple places where it can be cleaned up:

1: find ~/svn/ -name '*.py' |
2: xargs grep -hn "^import\|^from" |
3: awk -F ":" '{print $2}' |
4: awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
5: sed 's/,\| /\n/g' |
6: sort |
7: uniq > /tmp/pythonpkgs.txt

Line 3: You don't need the first awk split/print -- just don't include -n on the grep so that you don't include the line number in the output.

time find ./<<my_large_project>> -name '*.py' |
    xargs grep -h "^import\|^from" |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g' |
    sort |
    uniq
~~snip~~
real    0m0.492s
user    0m0.208s
sys     0m0.116s

Lines 6-7 and 4-5: If you have a lot of duplicate lines, you can speed up execution by sorting and uniq-ing before running your awk and sed:

time find ./<<my_large_project>> -name '*.py' |
    xargs grep -h "^import\|^from" |
    sort |
    uniq |
    awk '{if (/from/) print $2; else {$1 = ""; print $0} }' |
    sed 's/,\| /\n/g'
~~snip~~
real    0m0.464s
user    0m0.224s
sys     0m0.140s

Note that this will miss multi-line imports as described in PEP 0328. Supporting them would make your regex search relatively non-trivial, since you would have to look for optional parentheses and keep track of the preceding whitespace.
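
For reference, here is the kind of parenthesized multi-line import described above; the module names are just illustrative:

# PEP 328 style grouping: only the first line matches ^from, so the pipeline
# records os.path but never sees the names on the continuation lines.
from os.path import (
    join,
    basename,
    dirname,
)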

Upvotes: 2
