Issues calling awk from within Python using subprocess.call

Question

Having some issues calling awk from within Python. Normally, I'd do the following to call the command in awk from the command line.

Open up command line, in admin mode or not.
Change my directory to awk.exe, namely cd R\GnuWin32\bin
Call awk -F "," "{ print > ("split-" $10 ".csv") }" large.csv

My command is used to split up the large.csv file based on the 10th column into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call() but I'm having some issues. I run the following code:

def split_ByInputColumn():
     subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '","', 
              '"{ print > (\"split-\" $10 \".csv\") }"', 'large.csv'],
                  cwd = 'C:/R/GnuWin32/bin/')

and clearly, something is running when I execute the function (CPU usage, etc) but when I go to check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea on what's going wrong?

Jean-Fran&#231;ois Fabre · Accepted Answer

As I stated in my previous answer that was downvoted, you overprotect the arguments, making awk argument parsing fail.

Since there was no comment, I supposed there was a typo but it worked... So I suppose that's because I should have strongly suggested a full-fledged python solution, which is the best thing to do here (as stated in my previous answer)

Writing the equivalent in python is not trivial as we have to emulate the way awk opens files and appends to them afterwards. But it is more integrated, pythonic and handles quoting properly if quoting occurs in the input file.

I took the time to code & test it:

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)

    open_files = dict()

    with open('large.csv') as f:
        cr = csv.reader(f,delimiter=',')
        for r in cr:
            tenth_row = r[9]
            filename = "split-{}.csv".format(tenth_row)
            if not filename in open_files:
                handle = open(filename,"wb")
                open_files[filename] = (handle,csv.writer(handle,delimiter=','))
            open_files[filename][1].writerow(r)

    for f,_ in open_files.values():
        f.close()

split_ByInputColumn()

in detail:

read the big file as csv (advantage: quoting is handled properly)
compute the destination filename
if filename not in dictionary, open it and create csv.writer object
write the row in the corresponding dictionary
in the end, close file handles

Aside: My old solution, using awk properly:

import subprocess

def split_ByInputColumn():
     subprocess.call(['awk.exe', '-F', ',',
              '{ print > ("split-" $10 ".csv") }', 'large.csv'],cwd = 'some_directory')

Issues calling awk from within Python using subprocess.call

Answers (2)

Related Questions