Reputation: 73
I use the following command to reformat a file and it creates a new file:
sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto > toto.json
It works fine on the command line.
I tried to use it from a Python script, but it does not create the new file.
I tried:
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1], " > ",sys.argv[2]])
The issue is that it prints the output to stdout and raises an error:
sed: can't read >: No such file or directory
Traceback (most recent call last):
File "test.py", line 14, in <module>
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/",
sys.argv[1], ">",sys.argv[2])
File "C:\Users\Anaconda3\lib\subprocess.py", line 291, in
check_call raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sed', '-e', '1s/^/[/', '-e',
's/$/,/', '-e', '$s/,$/]/', 'toto.txt, '>', 'toto.json']' returned non-zero
exit status 2.
I read other questions about subprocess and tried other commands with the shell=True option, but it did not work either. I use Python 3.6.
For information, the command adds a bracket on the first and last lines and a comma at the end of each line except the last one. So, it does:
from
a
b
c
to:
[a,
b,
c]
Upvotes: 2
Views: 4152
Reputation: 25829
I had a hunch that Python could do this much faster than sed, but I didn't have the time to check until now, so... Based on your comment to Arount's answer:
my real file is actually quite big, the command line is way faster than a python script
That's not necessarily true and, in fact, in your case I suspected that Python could do it many, many times faster than sed, because with Python you're not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to get the line separators.
I'm not sure how big your file is, but I generated my test example as:
with open("example.txt", "w") as f:
for i in range(10**8): # I would consider 100M lines as "big" enough for testing
print(i, file=f)
Which essentially creates a 100M-line (888.9 MB) file with a different number on each line.
Now, timing your sed command alone, running at the highest priority (chrt -f 99), results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
User time (seconds): 56.89
System time (seconds): 1.74
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1044
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 313
Voluntary context switches: 7
Involuntary context switches: 29
Swaps: 0
File system inputs: 1140560
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The result would be even worse if you were actually to call it from Python, as it would also come with the subprocess and STDOUT redirection overheads. However, if we leave it to Python to do all the work instead of sed:
import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from the second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        chunk = None
        last_chunk = ''  # keep track of the last chunk so we can remove the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the last chunk
                last_chunk = chunk.replace("\n", ",\n")  # process the new chunk
            else:  # EOF
                break
        last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
        if last_chunk[-1] == ",":  # clear out the trailing comma
            last_chunk = last_chunk[:-1]
        f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array
without ever touching the shell, it results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
Command being timed: "python process_file.py example.txt output.txt"
User time (seconds): 1.75
System time (seconds): 0.72
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4716
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 14835
Voluntary context switches: 16
Involuntary context switches: 0
Swaps: 0
File system inputs: 3120
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And given the CPU utilization, the bottleneck is actually I/O; left to its own devices (or working from very fast storage instead of a virtualized HDD, as on my testbed), Python could do it even faster.
So, it took sed 32.5 times longer (comparing user times) to do the same task as Python did. Even if you were to optimize your sed a bit, Python would still be faster, because sed is limited to a line buffer, so a lot of time gets wasted on input I/O (compare the numbers in the above benchmark), and there's no (easy) way around that.
Conclusion: Python is way faster than sed for this particular task.
Upvotes: 1
Reputation: 10431
Don't do that. Don't use any OS calls if you can avoid it.
If you are using Python, just write a Pythonic Python script.
Something like:
input_filename = 'toto'
output_filename = 'toto.json'

with open(input_filename, 'r') as inputf:
    lines = ['{},\n'.format(line.rstrip()) for line in inputf]
if lines:
    lines[-1] = lines[-1].rstrip(',\n')  # no comma after the last element
lines = ['['] + lines + [']']

with open(output_filename, 'w') as outputf:
    outputf.writelines(lines)
It basically does the same thing as your command line.
Note that this piece of code is kind of dirty and only for example purposes. I advise you to write your own and avoid one-liners like I did.
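If your real file is too big to hold comfortably in memory, a streaming variant along these lines should produce the same output (an untested sketch, reusing the same placeholder filenames):
input_filename = 'toto'
output_filename = 'toto.json'

with open(input_filename, 'r') as inputf, open(output_filename, 'w') as outputf:
    outputf.write('[')
    previous = None  # hold one line back so the last one can be written without a comma
    for line in inputf:
        if previous is not None:
            outputf.write('{},\n'.format(previous))
        previous = line.rstrip('\n')
    if previous is not None:
        outputf.write(previous)
    outputf.write(']')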
Upvotes: 0
Reputation: 149175
On Linux and other Unix systems, the redirection characters are not part of the command but are interpreted by the shell, so it does not make sense to pass them as parameters to a subprocess.
Fortunately, subprocess.call allows the stdout parameter to be a file object. So you should do:
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1]],
stdout=open(sys.argv[2], "w"))
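If you want to be sure the output file gets flushed and closed, you can open it in a with block instead; a minimal variant of the same call:
import subprocess
import sys

with open(sys.argv[2], "w") as out:  # the redirection target, opened by Python instead of the shell
    subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", sys.argv[1]],
                    stdout=out)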
Upvotes: 3