genap

Reputation: 265

Issues calling awk from within Python using subprocess.call

I'm having trouble calling awk from within Python. Normally, I run the command from the command line as follows:

  1. Open a command prompt, in admin mode or not.
  2. Change to the directory containing awk.exe, namely cd R\GnuWin32\bin
  3. Run awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv

The command splits large.csv on its 10th column into a number of files named split-[COL VAL HERE].csv. It runs without problems from the command line. I tried to run the same command from Python using subprocess.call(), but it doesn't produce the same result. I run the following code:

def split_ByInputColumn():
     subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"', 
              '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                  cwd = 'C:/R/GnuWin32/bin/')

Clearly something runs when I execute the function (CPU usage, etc.), but when I check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea what's going wrong?

Upvotes: 1

Views: 300

Answers (2)

Jean-François Fabre

Reputation: 140266

As I stated in my previous answer, which was downvoted, you are over-protecting the arguments, which makes awk's argument parsing fail. When subprocess.call receives a list, each element is passed to awk as-is, so the extra quotes become part of the arguments themselves.

Since there was no comment, I assumed there was a typo, but the code worked... So I suppose the downvote was because I should have more strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).

Writing the equivalent in Python is not trivial, as we have to emulate the way awk opens each file once and keeps appending to it afterwards. But it is better integrated, more Pythonic, and handles quoting properly if quoting occurs in the input file.

I took the time to code and test it:

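import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)

    # maps output filename -> (file handle, csv.writer)
    open_files = dict()

    # newline='' lets the csv module handle line endings itself (Python 3)
    with open('large.csv', newline='') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_col = r[9]
            filename = "split-{}.csv".format(tenth_col)
            if filename not in open_files:
                # first time this value is seen: open the file once and keep it open
                handle = open(filename, "w", newline='')
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)

    # close every output file once the whole input has been read
    for handle, _ in open_files.values():
        handle.close()

split_ByInputColumn()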

In detail:

  • read the big file as csv (advantage: quoting is handled properly)
  • compute the destination filename
  • if filename not in dictionary, open it and create csv.writer object
  • write the row using the csv.writer stored in the dictionary for that filename
  • in the end, close file handles
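To illustrate the quoting advantage mentioned in the first bullet, here is a minimal sketch. The sample row is made up for the illustration; it compares splitting on every comma (which is what a plain -F "," field separator does) with what csv.reader returns.

```python
import csv
import io

# Hypothetical row: the second field contains a comma inside quotes.
line = 'a,"Smith, John",c\n'

# Naive splitting on every comma breaks the quoted field in two.
naive_fields = line.rstrip("\n").split(",")

# csv.reader understands the quotes and keeps the field whole.
csv_fields = next(csv.reader(io.StringIO(line), delimiter=","))

print(len(naive_fields), naive_fields)
print(len(csv_fields), csv_fields)
```

If the 10th column is counted after such a quoted field, the naive split picks the wrong column, while the csv module still picks the right one.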

Aside: My old solution, using awk properly:

import subprocess

def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd='some_directory')
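To see what "over-protecting" means in practice, here is a small sketch that does not need awk at all. It uses the Python interpreter itself as a stand-in child process (an assumption made only so the example runs anywhere) and prints the arguments the child actually receives. The helper name received_args is hypothetical.

```python
import subprocess
import sys

# Stand-in child program: print each argument it receives on its own line.
child = "import sys; print('\\n'.join(sys.argv[1:]))"

def received_args(args):
    """Run the stand-in child with args and return what it actually received."""
    out = subprocess.check_output([sys.executable, "-c", child] + args, text=True)
    return out.splitlines()

# Plain arguments: the child sees a bare comma as the separator.
plain = received_args(["-F", ","])

# Over-protected arguments: the quote characters become part of the value.
over = received_args(["-F", '","'])

print(plain)
print(over)
```

With the list form there is no shell to strip the quotes, so a separator written as '","' reaches the program as the three characters quote, comma, quote rather than as a single comma.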

Upvotes: 1

genap

Reputation: 265

Someone else posted an answer (and subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:

import subprocess

def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd='C:/R/GnuWin32/bin/')

Upvotes: 1
