Reputation: 19
I am parsing/processing "large" raw data generated by a unix shell command. The raw data needs to be parsed to strip out some special characters.
What I ultimately want is to avoid the big temporary file and do the processing on the fly instead.
Way 1 generates a big 8 GB temporary text file (not desired) but is fast (8 minutes total execution): first I dump the shell output into a text file, then parse it with the following code. Execution time: 8 minutes; output file size: 800 MB.
rncflag = 0       # set when the current line contains "RNC"
lineNumOne = 1    # true only until the first line has been written

f = open(filepath, 'r')
fOut = open(filepathOut, "w+")

for line in f:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
            # remove the newline, remove the string on the left and remove the quotes on the right
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        else:
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline
        fOut.write(processedline)

fOut.close()
f.close()
Way 2 processes the stdout on the fly (~1.5 hours total execution): this is the one I would prefer, since I don't need to generate the intermediate raw file before parsing. I use the subprocess library to parse/process the unix shell's stdout line by line while it is being generated (as if they were the lines of the text file). The problem is that it is vastly slower than the previous way: execution time is more than 1.5 hours to produce the same 800 MB output file.
import subprocess

rncflag = 0       # set when the current line contains "RNC"
lineNumOne = 1    # true only until the first line has been written

fOut = open(filepathOut, "w+")

cmd = subprocess.Popen(isqlCmd, shell=True, stdout=subprocess.PIPE)
for line in cmd.stdout:
    if len(line) > 0:
        if "RNC" in line:
            rncflag = 1
            # remove the newline, remove the string on the left and remove the quotes on the right
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        else:
            currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        if rncflag == 1:
            if lineNumOne == 1:
                processedline = currline
                lineNumOne = 0
            else:
                processedline = '\n' + currline
            rncflag = 0
        else:
            processedline = currline
        fOut.write(processedline)

fOut.close()
I am not a Python expert, but I'm sure there is a way to speed up the processing of the unix stdout on the fly, instead of first generating the raw file and parsing it afterwards.
The purpose of the program is to clean up / parse the output of a Sybase isql query. Note: the Sybase library cannot be installed.
The Python version is 2.6.4 and cannot be changed.
Thanks in advance, any improvement is welcome.
Upvotes: 0
Views: 76
Reputation: 295766
Without the ability to reproduce the problem, a canonical answer isn't feasible -- but it's possible to provide the tools needed to narrow in on the problem.
If you switch from using subprocess.Popen(..., stdout=subprocess.PIPE) to just reading from sys.stdin unconditionally, the same code can be used in both the reading-from-a-file case (in which case you'll want to run ./yourscript <inputfile) and the pipe-from-a-process case (./runIsqlCommand | ./yourscript), so we can be confident that we're testing like-for-like.
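For illustration, a minimal sketch of that change, assuming a hypothetical output filename and omitting the question's RNC/first-line bookkeeping for brevity; only the input source differs from the asker's loop:

import sys

# Reading from sys.stdin makes the same script work for both cases:
#   ./yourscript < inputfile            (reading from a file)
#   ./runIsqlCommand | ./yourscript     (reading from a pipe)
fOut = open("cleaned_output.txt", "w+")   # hypothetical output path

for line in sys.stdin:
    if len(line) > 0:
        # same per-line cleanup the question already performs
        currline = line.strip('\n').strip('\t').replace(".", ",").replace(" ", "").lstrip(";")
        fOut.write(currline)

fOut.close()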
Once that's done, it also gives us room to put buffering in place, to prevent the two sides of the pipeline from blocking on each other unnecessarily. Doing so might look like:
./runIsqlCommand | pv | ./yourscript
...where pv is Pipe Viewer, a tool that provides a progress bar (when the total amount of content is known), a throughput indicator, and -- critically for our purposes -- a much larger buffer than the operating system's default, with room to adjust that size further (and to monitor its consumption).
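For instance, pv's -B option sets that buffer size explicitly; the 64m below is just an illustrative value, not a recommendation:
./runIsqlCommand | pv -B 64m | ./yourscript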
To determine whether the Python script is running slower than the SQL code, tell pv to display buffer consumption with the -T argument. (If this shows ----, then pv is using the splice() syscall to transfer content between the processes directly, without actually performing buffering; the -C argument will increase pv's overhead, but ensures that it's actually able to perform buffering and report on buffer content.) If the buffer is 100% full almost all of the time, we know that the SQL is being generated faster than the Python can read it; if it's usually empty, we know the Python is keeping up.
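Put together, the diagnostic run might look like this (flag placement is just one way to spell it):
./runIsqlCommand | pv -C -T | ./yourscript
Watching the buffer-percentage column while the pipeline runs shows which side is the bottleneck.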
Upvotes: 1