youpi

Reputation: 73

Python subprocess - saving output in a new file

I use the following command to reformat a file and it creates a new file:

sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto > toto.json

It works fine on the command line.

I tried to use it from a Python script, but it does not create a new file.

I tried:

subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1], " > ",sys.argv[2]]) 

The issue is that it prints the output to stdout and raises an error:

sed: can't read >: No such file or directory
Traceback (most recent call last):
File "test.py", line 14, in <module>
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/", 
sys.argv[1], ">",sys.argv[2])
File "C:\Users\Anaconda3\lib\subprocess.py", line 291, in 
check_call raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sed', '-e', '1s/^/[/', '-e', 
's/$/,/', '-e', '$s/,$/]/', 'toto.txt', '>', 'toto.json']' returned non-zero
exit status 2.

I read other questions about subprocess and tried other commands with the option shell=True, but it did not work either. I use Python 3.6.

For information, the command adds a bracket to the first and last lines and a comma at the end of each line except the last one. So, it does:

from
a
b
c

to:

[a,
b,
c]
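
For reference, shell=True only behaves like the command line when the whole pipeline is passed as a single string, so that the shell itself performs the > redirection. A minimal sketch reusing the question's file names (it assumes sed and a POSIX shell are on PATH):

```python
import subprocess

# Create a small input file matching the example above.
with open("toto", "w") as f:
    f.write("a\nb\nc\n")

# With shell=True the command must be ONE string: the shell parses it
# and performs the > redirection itself. Interpolating untrusted file
# names into such a string is unsafe, so prefer redirecting stdout in
# Python when the names come from user input.
subprocess.call("sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto > toto.json",
                shell=True)

with open("toto.json") as f:
    print(f.read())  # [a,
                     # b,
                     # c]
```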

Upvotes: 2

Views: 4152

Answers (3)

zwer

Reputation: 25829

I had a hunch that Python can do this much faster than sed but I didn't have the time to check until now, so... Based on your comment to Arount's answer:

my real file is actually quite big, the command line is way faster than a python script

That's not necessarily true and in fact, in your case, I suspected that Python could do it many, many times faster than sed, because with Python you're not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to get the line separators.

I'm not sure how big your file is, but I generated my test example as:

with open("example.txt", "w") as f:
    for i in range(10**8):  # I would consider 100M lines as "big" enough for testing
        print(i, file=f)

Which essentially creates a 100M lines long (888.9MB) file with a different number on each line.

Now, timing your sed command alone, running at the highest priority (chrt -f 99) results in:

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
    Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
    User time (seconds): 56.89
    System time (seconds): 1.74
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1044
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 313
    Voluntary context switches: 7
    Involuntary context switches: 29
    Swaps: 0
    File system inputs: 1140560
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

The result would be even worse if you were actually to call it from Python as it would also come with the subprocess and STDOUT redirecting overheads.

However, if we leave it to Python to do all the work instead of sed:

import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        chunk = None
        last_chunk = ''  # keep a track of the last chunk so we can remove the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the last chunk
                last_chunk = chunk.replace("\n", ",\n")  # process the new chunk
            else:  # EOF
                break
    last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
    if last_chunk and last_chunk[-1] == ",":  # clear out the trailing comma (guards against empty input)
        last_chunk = last_chunk[:-1]
    f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array

without ever touching the shell results in:

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
    Command being timed: "python process_file.py example.txt output.txt"
    User time (seconds): 1.75
    System time (seconds): 0.72
    Percent of CPU this job got: 93%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4716
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 14835
    Voluntary context switches: 16
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 3120
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And given the utilization figures, the bottleneck is actually I/O; left to its own devices (or working from very fast storage instead of a virtualized HDD, as on my testbed), Python could do it even faster.

So, it took sed 32.5 times longer (comparing CPU time) to do the same task as Python. Even if you were to optimize your sed command a bit, Python would still be faster because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the numbers in the benchmarks above), and there's no (easy) way around that.
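
As a sanity check, the chunked algorithm above can be exercised on small in-memory buffers (io.StringIO standing in for real files, with a deliberately tiny CHUNK_SIZE to force several iterations):

```python
import io

CHUNK_SIZE = 4  # deliberately tiny so the test input spans several chunks

def bracketize(f_in, f_out):
    """Same chunked algorithm as in the script above, on file-like objects."""
    f_out.write("[")
    last_chunk = ""
    while True:
        chunk = f_in.read(CHUNK_SIZE)
        if not chunk:  # EOF
            break
        f_out.write(last_chunk)  # safe to flush the previous chunk now
        last_chunk = chunk.replace("\n", ",\n")
    last_chunk = last_chunk.rstrip()          # drop trailing whitespace
    if last_chunk and last_chunk[-1] == ",":  # drop the trailing comma
        last_chunk = last_chunk[:-1]
    f_out.write(last_chunk)
    f_out.write("]")

src, dst = io.StringIO("a\nb\nc\n"), io.StringIO()
bracketize(src, dst)
print(dst.getvalue())  # [a,
                       # b,
                       # c]
```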

Conclusion: Python is way faster than sed for this particular task.

Upvotes: 1

Arount

Reputation: 10431

Don't do that. Don't use any OS calls if you can avoid it.

If you are using Python, just write a pythonic Python script.

Something like:

input_filename = 'toto'
output_filename = 'toto.json'

with open(input_filename, 'r') as inputf:
    lines = ['{},\n'.format(line.rstrip()) for line in inputf]
    if lines:
        lines[-1] = lines[-1][:-2]  # drop the ",\n" added to the last line
    lines = ['['] + lines + [']']

    with open(output_filename, 'w') as outputf:
        outputf.writelines(lines)

It basically does the same as your command line.

Note that this piece of code is kind of dirty and only for example purposes. I advise you to write your own and avoid one-liners like I did.

Upvotes: 0

Serge Ballesta

Reputation: 149175

On Linux and other Unix systems, the redirection characters are not part of the command but are interpreted by the shell, so it does not make sense to pass them as parameters to a subprocess.

Fortunately, subprocess.call allows the stdout parameter to be a file object. So you should do:

subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", sys.argv[1]],
    stdout=open(sys.argv[2], "w"))
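
A closely related variant (a sketch reusing the question's file names; sed must be on PATH) manages the output handle with a context manager so it is closed and flushed deterministically; subprocess.run is available on the asker's Python 3.6:

```python
import subprocess

# Create a small input file matching the question's example.
with open("toto", "w") as f:
    f.write("a\nb\nc\n")

# The with-block guarantees the output file is flushed and closed;
# check=True raises CalledProcessError if sed exits non-zero.
with open("toto.json", "w") as out:
    subprocess.run(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/",
                    "toto"], stdout=out, check=True)

with open("toto.json") as f:
    print(f.read())  # [a,
                     # b,
                     # c]
```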

Upvotes: 3
