Marcin Ziajkowski

Reputation: 11

Best way to write data to a file

Which way of writing data to a file is better?

# 1 way
whole_data = ""
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    whole_data += r_file.read()
    r_file.close()
with open("destination_file.txt", 'w') as w_file:
    w_file.write(whole_data)


# 2 way
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    with open("destination_file.txt", 'a') as w_file:
        w_file.write(r_file.read())
    r_file.close()

# 3 way: separate open/close for write
w_file = open("destination_file.txt", 'w')
for file_name in list_of_files:
    with open(file_name, 'r') as r_file:
        w_file.write(r_file.read())
w_file.close()

The 1st way first saves all the data into one big string and then writes it to the destination file. The 2nd way reads each file and immediately appends its data to the destination file. I have used both ways in my code, but I am not sure which is better. Do you know the pros and cons of each? If you know even better practices, please share. // Edited: added a 3rd way

Upvotes: 1

Views: 694

Answers (7)

Marcin Ziajkowski

Reputation: 11

Looks like the second way (file by file) is safer and faster for many files.

from os import listdir
from os.path import isfile, join
import timeit

my_path = r"./"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]
test_files = list_of_files
print(len(test_files), "files ~6.5kB per file")


def whole_data(amount):
    data = ""
    for file in test_files[:amount]:
        with open(file, 'rb') as fr:
            # note: str() on bytes stores the repr (b'...'), not decoded text
            data += str(fr.read())

    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)


def file_by_file(amount):
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files[:amount]:
            with open(file, 'rb') as fr:
                fw.write(str(fr.read()))


# all files are taken from the game The Witcher 3
# 100 files ~6.5kB per file
print('Whole data method 20/number=100:', timeit.timeit("whole_data(20)",  globals=globals(), number=100))
print('Whole data method 100/number=20:', timeit.timeit("whole_data(100)", globals=globals(), number=20 ))
# Whole data method 20/number=100: 285.6315555
# Whole data method 100/number=20: 495.45210849999995

print('File by file method 20/number=100:', timeit.timeit("file_by_file(20)",  globals=globals(), number=100))
print('File by file method 100/number=20:', timeit.timeit("file_by_file(100)", globals=globals(), number=20 ))
# File by file method 20/number=100: 212.43927700000006
# File by file method 100/number=20: 205.07520319999992
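As an aside, str() on bytes writes Python's repr (b'...') into the output rather than the original contents. If the goal is a byte-for-byte copy, opening both ends in binary mode avoids the conversion entirely; a minimal sketch (function name is mine, not part of the benchmark above):

def file_by_file_binary(amount):
    # copy raw bytes; no str()/decode round trip
    with open('binary_concat_file.txt', 'wb') as fw:
        for file in test_files[:amount]:
            with open(file, 'rb') as fr:
                fw.write(fr.read())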

Upvotes: 0

Marcin Ziajkowski

Reputation: 11

OK, I did some tests, and the result is not what I expected. First of all, I thought the += operator in the first way would be the first to raise a memory problem. In fact, the memory problem occurs only in the 2nd way of saving files. I wanted to make the operations more complex, so I added character replacement on the input files.

The only expected result was the time difference between the 1st way and the 3rd (separated) way. For 30 files only: 1st way: 0.26499 sec, 2nd way: 0.648 sec, 3rd way: 0.242 sec.

With more than 30 files, the 2nd way raised a MemoryError, so I excluded it from the tests.

For all 222 (huge) files the results are pretty predictable: 1st way: 39 sec, 3rd way: 1.577 sec.

Code:

from os import listdir
from os.path import isfile, join
import time

my_path = r"path_to_files"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]
print(len(list_of_files))

repeat = 20


if True:  # 39.381000042 sec
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 1 way
        whole_data = ""
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                whole_data += tmp
        with open("destination_file.txt", 'w') as w_file:
            w_file.write(whole_data)

    print(time.time() - now)  # print time elapsed
    # --------------- 1 way ---------------


if True:  # MemoryError
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # 2 way
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                with open("destination_file.txt", 'a') as w_file:
                    tmp = r_file.read().replace('d', 'A')  # MemoryError
                    w_file.write(tmp)

    print(time.time() - now)  # print time elapsed
    # --------------- 2 way ---------------


if True:  # 1.53500008583 sec
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")

    now = time.time()  # start counting
    for i in range(repeat):
        # separate open/close for write
        w_file = open("destination_file.txt", 'w')
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                w_file.write(tmp)
        w_file.close()

    print(time.time() - now)  # print time elapsed
    # --------------- separate ---------------
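If the input files are too large to hold in memory, the replace step can also be streamed line by line, so no whole file is ever read at once. A minimal sketch of that idea (not one of the timed variants above):

with open("destination_file.txt", 'w') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            for line in r_file:  # stream one line at a time
                w_file.write(line.replace('d', 'A'))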

Upvotes: 0

Brendano257

Reputation: 613

Here's a best-effort test harness, but I can't stress enough how little it proves in reality. Out of 3-4 runs (each of 10K trials), each method has come out on top at least once, and only by 0.1s-0.2s (over 10K trials!). That said, I'm running some IO-heavy ML models on my workstation, so someone else may produce more reliable figures. In any case, I'd say it's a syntactic choice, and performance is not a major concern.

I've made a couple of syntax changes (nested with blocks where appropriate) and moved each method into a function after setting up some files. You may also find different numbers if you change the number of lines in each file, as @gimix has said. Per his answer, the whole-data method will also use lots of memory unnecessarily, so that may be the deciding factor in writing clean, performant, and future-proof code.

import timeit

test_files = []

for n in range(100):
    file_name = f'test_file_{n}.txt'
    with open(file_name, 'w') as f:
        for i in range(10):
            f.write(f'{i}\n')
        test_files.append(file_name)


def whole_data():
    data = ""
    for file in test_files:
        with open(file, 'r') as fr:
            data += fr.read()
        
    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)


def file_by_file():
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files:
            with open(file, 'r') as fr:
                fw.write(fr.read())


print('Whole data method:', timeit.timeit("whole_data()", globals=globals(), number=10_000))
# Whole data method: 10.38545351603534
# Whole data method: 10.356000136991497

print('File by file method:', timeit.timeit("file_by_file()", globals=globals(), number=10_000))
# File by file method: 10.356590001960285
# File by file method: 10.507033439003862

Note that all of the above will probably take upwards of a minute if not run on an SSD (I'm using a performant NVMe SSD).

Upvotes: 0

gimix

Reputation: 3833

If you have a limited number of small files, I presume you will notice no difference. But if you were dealing with many very large files, the first method would consume a huge amount of RAM for basically no reason, so the second method is surely more scalable.

That said, you probably don't need to reopen (and implicitly close) the output file at every iteration, which may slow things down depending on the OS, disk/network performance, etc. You could restructure your code like this:

with open("destination_file.txt", 'a') as w_file:
    for file_name in list_of_files:
       with open(file_name, 'r') as r_file
          w_file.write(r_file.read())
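For very large inputs you can go one step further and avoid reading each whole file into memory at all. A minimal sketch using shutil.copyfileobj from the standard library, assuming binary mode and plain concatenation with no text processing:

import shutil

with open("destination_file.txt", 'wb') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'rb') as r_file:
            # copies chunk by chunk (64 KiB by default on most platforms),
            # so memory use stays flat no matter how big the inputs are
            shutil.copyfileobj(r_file, w_file)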

Upvotes: 0

PCM

Reputation: 3011

The timeit module takes two strings, a statement (stmt) and a setup. It runs the setup code once, runs the stmt code n times, and reports back the total time taken.

import timeit

def func_one():
    whole_data = ""
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            whole_data += r_file.read()
    with open("destination_file.txt", 'w') as w_file:
        w_file.write(whole_data)

stmt = 'func_one()'

# runs func_one 10 times and prints the total time taken
print(timeit.timeit(stmt, globals=globals(), number=10))

The reason I run the function 10 times is so that timeit measures an interval long enough to be meaningful, rather than a single run that gets lost in rounding.

Similarly, you can time the 2nd way:

def func_two():
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            with open("destination_file.txt", 'a') as w_file:
                w_file.write(r_file.read())

stmt = 'func_two()'

# runs func_two 10 times and prints the total time taken
print(timeit.timeit(stmt, globals=globals(), number=10))

Then you can compare the time printed out.
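If you want several independent measurements instead of one, timeit also provides repeat; taking the minimum is the usual way to filter out noise from other processes. A short sketch, reusing func_one from above:

import timeit

# five independent runs of 10 executions each; min() is the least
# noisy estimate, since slower runs reflect outside interference
times = timeit.repeat('func_one()', globals=globals(), repeat=5, number=10)
print(min(times))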

I know this is a lot, but sometimes you cannot tell which version is faster just by looking at the code.

Upvotes: 0

Eze Onyekachukwu

Reputation: 31

"The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error. I highly recommend that you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you."

check this out
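As a quick illustration (the file name here is just an example), the with form replaces the manual try/finally pattern:

# manual version: you must remember to close the file, even on error
f = open("example.txt", 'w')
try:
    f.write("data")
finally:
    f.close()

# with version: the file is closed automatically, even if write() raises
with open("example.txt", 'w') as f:
    f.write("data")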

Upvotes: 1

haytaylan

Reputation: 64

Intuitively, the 2nd way "feels" faster, but you can always try timing them both.

Upvotes: 0
