Reputation: 265
I'm having some issues calling awk from within Python. Normally, I'd run the command from the command line by changing to the directory that contains awk.exe, namely cd R\GnuWin32\bin, and then running:
awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv
My command splits the large.csv file, based on the 10th column, into a number of files named split-[COL VAL HERE].csv. I have no issues running this command. I tried to run the same code in Python using subprocess.call(), but I'm having some issues. I run the following code:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"',
                     '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')
Something is clearly running when I execute the function (CPU usage, etc.), but when I check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea what's going wrong?
Upvotes: 1
Views: 300
Reputation: 140266
As I stated in my previous answer, which was downvoted, you over-protect the arguments, which makes awk's argument parsing fail.
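To see what over-protecting does, a small sketch can echo back the arguments a child process actually receives. It uses the Python interpreter itself as a stand-in for awk.exe (an assumption, made only so the example runs without awk installed), and `child_args` is a hypothetical helper, not part of the question's code.

```python
import subprocess
import sys

# Stand-in child: prints each argument it received on its own line.
ECHO = [sys.executable, "-c", "import sys; print('\\n'.join(sys.argv[1:]))"]

def child_args(args):
    """Return the argument list exactly as the child process sees it."""
    out = subprocess.check_output(ECHO + args, universal_newlines=True)
    return out.splitlines()

# Arguments as written in the question (quotes protected twice).
over_quoted = ['-F', '\",\"', '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"']
# Arguments passed as plain list items, with no shell-style quoting.
plain = ['-F', ',', '{ print > ("split-" $10 ".csv") }']

for label, args in (("over-quoted", over_quoted), ("plain", plain)):
    print(label)
    for arg in child_args(args):
        print("   ", arg)
```

In the over-quoted case the separator arrives as the three characters `","`, and the program arrives wrapped in literal double quotes, so awk most likely treats it as a string constant rather than as a `{ ... }` action. That would explain why the process runs but no split files appear.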
Since there was no comment, I assumed there was a typo, but it worked. So I suppose the downvote came because I should have more strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).
Writing the equivalent in Python is not trivial, as we have to emulate the way awk opens files and then appends to them. But it is more integrated, more Pythonic, and handles quoting properly if quoting occurs in the input file.
I took the time to code & test it:
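import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)

    # maps output filename -> (file handle, csv.writer)
    open_files = dict()
    # Python 3: the csv module expects text mode with newline=''
    with open('large.csv', newline='') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_column = r[9]
            filename = "split-{}.csv".format(tenth_column)
            if filename not in open_files:
                # first row for this value: open the file once and keep it open
                handle = open(filename, "w", newline='')
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)

    # close every output file
    for f, _ in open_files.values():
        f.close()

split_ByInputColumn()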
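In detail: the function reads the big file as CSV, computes the destination filename from the tenth column, opens that file and creates a csv.writer object the first time the filename is seen, writes each row through the matching writer, and closes all the handles at the end.
Aside: my old solution, using awk properly: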
import subprocess

def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd = 'some_directory')
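On the quoting point made earlier in this answer: splitting on every comma, which is what `awk -F ,` does, miscounts columns when a quoted field itself contains a comma, whereas `csv.reader` does not. A minimal sketch, using a made-up sample line (Python 3):

```python
import csv
import io

# Hypothetical sample row: the second field contains a comma inside quotes.
line = 'id-1,"Smith, John",42\n'

# Naive split on every comma, similar to what a plain -F , separator does.
naive = line.rstrip("\n").split(",")

# csv.reader understands the quoting and keeps the field together.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

With the naive split, the column numbering shifts by one after the quoted field, so a "10th column" lookup could silently pick the wrong value on such rows.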
Upvotes: 1
Reputation: 265
Someone else posted an answer (and then subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:
def split_ByInputColumn():
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > (\"split-\" $10 \".csv\") }', 'large.csv'],
                    cwd = 'C:/R/GnuWin32/bin/')
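For debugging this kind of problem, it can help to capture what the child process prints instead of discarding it. The sketch below is only an illustration: `run_and_report` is a hypothetical helper, and the Python interpreter stands in for awk.exe so that the example runs anywhere (Python 3.5+ for `subprocess.run`).

```python
import subprocess
import sys

def run_and_report(cmd, cwd=None):
    """Run cmd and show its exit code plus the start of its output streams."""
    result = subprocess.run(cmd, cwd=cwd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    print("exit code:", result.returncode)
    print("stdout starts with:", result.stdout[:200])
    print("stderr starts with:", result.stderr[:200])
    return result

# Stand-in command; replace it with the awk argument list when debugging.
result = run_and_report([sys.executable, "-c", "print('hello')"])
```

If awk had been writing rows to standard output instead of to the split files, this would presumably have made it visible. Capturing everything in memory may be impractical for a very large input, so this is meant as a debugging aid only.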
Upvotes: 1