Rohit

Reputation: 4168

Profiling the code with memory_profiler increases the execution time

I am writing a simple application that splits a large text file into smaller files, and I have written two versions of it, one using lists and one using generators. I profiled both versions using the memory_profiler module, and it clearly showed the better memory efficiency of the generator version. Strangely enough, though, when the generator version is profiled, the execution time increases. The demonstration below explains what I mean.

Version using Lists

from memory_profiler import profile


@profile()
def main():
    file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
    input_file = open(file_name).readlines()
    num_lines_orig = len(input_file)
    parts = int(input("Enter the number of parts you want to split in: "))
    output_files = [(file_name + str(i)) for i in range(1, parts + 1)]
    st = 0
    p = int(num_lines_orig / parts)
    ed = p
    for i in range(parts-1):
        with open(output_files[i], "w") as OF:
            OF.writelines(input_file[st:ed])
        st = ed
        ed = st + p

    with open(output_files[-1], "w") as OF:
        OF.writelines(input_file[st:])


if __name__ == "__main__":
    main()

When run with the profiler:

$ time py36 Splitting\ text\ files_BAD_usingLists.py                                                                                                               

Enter the full path of file you want to split into smaller inputFiles: /apps/nttech/rbhanot/Downloads/test.txt
Enter the number of parts you want to split in: 3
Filename: Splitting text files_BAD_usingLists.py

Line #    Mem usage    Increment   Line Contents
================================================
     6     47.8 MiB      0.0 MiB   @profile()
     7                             def main():
     8     47.8 MiB      0.0 MiB       file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
     9    107.3 MiB     59.5 MiB       input_file = open(file_name).readlines()
    10    107.3 MiB      0.0 MiB       num_lines_orig = len(input_file)
    11    107.3 MiB      0.0 MiB       parts = int(input("Enter the number of parts you want to split in: "))
    12    107.3 MiB      0.0 MiB       output_files = [(file_name + str(i)) for i in range(1, parts + 1)]
    13    107.3 MiB      0.0 MiB       st = 0
    14    107.3 MiB      0.0 MiB       p = int(num_lines_orig / parts)
    15    107.3 MiB      0.0 MiB       ed = p
    16    108.1 MiB      0.7 MiB       for i in range(parts-1):
    17    107.6 MiB     -0.5 MiB           with open(output_files[i], "w") as OF:
    18    108.1 MiB      0.5 MiB               OF.writelines(input_file[st:ed])
    19    108.1 MiB      0.0 MiB           st = ed
    20    108.1 MiB      0.0 MiB           ed = st + p
    21                             
    22    108.1 MiB      0.0 MiB       with open(output_files[-1], "w") as OF:
    23    108.1 MiB      0.0 MiB           OF.writelines(input_file[st:])



real    0m6.115s
user    0m0.764s
sys     0m0.052s

When run without the profiler:

$ time py36 Splitting\ text\ files_BAD_usingLists.py 
Enter the full path of file you want to split into smaller inputFiles: /apps/nttech/rbhanot/Downloads/test.txt
Enter the number of parts you want to split in: 3

real    0m5.916s
user    0m0.696s
sys     0m0.080s

Now the version using generators:

from memory_profiler import profile


@profile()
def main():
    file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
    input_file = open(file_name)
    num_lines_orig = sum(1 for _ in input_file)
    input_file.seek(0)
    parts = int(input("Enter the number of parts you want to split in: "))
    output_files = ((file_name + str(i)) for i in range(1, parts + 1))
    st = 0
    p = int(num_lines_orig / parts)
    ed = p
    for i in range(parts-1):
        file = next(output_files)
        with open(file, "w") as OF:
            for _ in range(st, ed):
                OF.writelines(input_file.readline())

            st = ed
            ed = st + p
            if num_lines_orig - ed < p:
                ed = st + (num_lines_orig - ed) + p
            else:
                ed = st + p

    file = next(output_files)
    with open(file, "w") as OF:
        for _ in range(st, ed):
            OF.writelines(input_file.readline())


if __name__ == "__main__":
    main()

When run with the profiler option:

$ time py36 -m memory_profiler Splitting\ text\ files_GOOD_usingGenerators.py                                                                                                                                      
Enter the full path of file you want to split into smaller inputFiles: /apps/nttech/rbhanot/Downloads/test.txt
Enter the number of parts you want to split in: 3
Filename: Splitting text files_GOOD_usingGenerators.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   47.988 MiB    0.000 MiB   @profile()
     5                             def main():
     6   47.988 MiB    0.000 MiB       file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
     7   47.988 MiB    0.000 MiB       input_file = open(file_name)
     8   47.988 MiB    0.000 MiB       num_lines_orig = sum(1 for _ in input_file)
     9   47.988 MiB    0.000 MiB       input_file.seek(0)
    10   47.988 MiB    0.000 MiB       parts = int(input("Enter the number of parts you want to split in: "))
    11   48.703 MiB    0.715 MiB       output_files = ((file_name + str(i)) for i in range(1, parts + 1))
    12   47.988 MiB   -0.715 MiB       st = 0
    13   47.988 MiB    0.000 MiB       p = int(num_lines_orig / parts)
    14   47.988 MiB    0.000 MiB       ed = p
    15   48.703 MiB    0.715 MiB       for i in range(parts-1):
    16   48.703 MiB    0.000 MiB           file = next(output_files)
    17   48.703 MiB    0.000 MiB           with open(file, "w") as OF:
    18   48.703 MiB    0.000 MiB               for _ in range(st, ed):
    19   48.703 MiB    0.000 MiB                   OF.writelines(input_file.readline())
    20                             
    21   48.703 MiB    0.000 MiB               st = ed
    22   48.703 MiB    0.000 MiB               ed = st + p
    23   48.703 MiB    0.000 MiB               if num_lines_orig - ed < p:
    24   48.703 MiB    0.000 MiB                   ed = st + (num_lines_orig - ed) + p
    25                                         else:
    26   48.703 MiB    0.000 MiB                   ed = st + p
    27                             
    28   48.703 MiB    0.000 MiB       file = next(output_files)
    29   48.703 MiB    0.000 MiB       with open(file, "w") as OF:
    30   48.703 MiB    0.000 MiB           for _ in range(st, ed):
    31   48.703 MiB    0.000 MiB               OF.writelines(input_file.readline())



real    1m48.071s
user    1m13.144s
sys     0m19.652s

When run without the profiler:

$ time py36  Splitting\ text\ files_GOOD_usingGenerators.py 
Enter the full path of file you want to split into smaller inputFiles: /apps/nttech/rbhanot/Downloads/test.txt
Enter the number of parts you want to split in: 3

real    0m10.429s
user    0m3.160s
sys     0m0.016s

So, first of all, why is profiling making my code slow? Secondly, if profiling impacts execution speed, why does this effect not show up for the version of the code using lists?

Upvotes: 2

Views: 1965

Answers (1)

Rohit

Reputation: 4168

I CPU-profiled the code using line_profiler, and this time I got the answer. The reason the generator version takes more time is the lines below:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19         2      11126.0   5563.0      0.2          with open(file, "w") as OF:
    20    379886     200418.0      0.5      3.0              for _ in range(st, ed):
    21    379884    2348653.0      6.2     35.1                  OF.writelines(input_file.readline())
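
(For reference, a minimal sketch of how this line_profiler output can be collected, assuming the standard kernprof entry point; kernprof injects a bare profile decorator into builtins, so the function only needs to be decorated with @profile:)

$ kernprof -l -v Splitting\ text\ files_GOOD_usingGenerators.py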

And the reason it does not slow down the lists version is:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    19         2       9419.0   4709.5      0.4          with open(output_files[i], "w") as OF:
    20         2    1654165.0 827082.5     65.1              OF.writelines(input_file[st:ed])

For lists, the new file is written by simply slicing the list to take a copy, which is in fact a single statement. For the generator version, however, the new file is populated by reading the input file line by line, which makes the profiler trace every single line execution, and that amounts to the increased CPU time.
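
A minimal sketch (my own variant, not the original code) of how the generator-style version could keep its low memory footprint while handing the profiler only one statement per part: itertools.islice lets writelines consume the next chunk of lines lazily in a single call.

from itertools import islice


def split_file(file_name, parts):
    with open(file_name) as input_file:
        # count the lines in a streaming fashion, then rewind
        num_lines = sum(1 for _ in input_file)
        input_file.seek(0)
        p = num_lines // parts
        for i in range(1, parts + 1):
            # the last part absorbs any remainder lines
            n = p if i < parts else num_lines - p * (parts - 1)
            with open(file_name + str(i), "w") as OF:
                # islice yields the next n lines lazily, so memory stays
                # low, but writelines is a single profiled statement
                OF.writelines(islice(input_file, n))

This keeps the per-line work inside C-level iteration (islice feeding writelines) instead of Python statements that the profiler has to trace on every hit.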

Upvotes: 2
