Reputation: 11
Which way of writing data to a file is better?
# 1st way
whole_data = ""
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    whole_data += r_file.read()
    r_file.close()
with open("destination_file.txt", 'w') as w_file:
    w_file.write(whole_data)

# 2nd way
for file_name in list_of_files:
    r_file = open(file_name, 'r')
    with open("destination_file.txt", 'a') as w_file:
        w_file.write(r_file.read())
    r_file.close()

# 3rd way: separate open/close for write
w_file = open("destination_file.txt", 'w')
for file_name in list_of_files:
    with open(file_name, 'r') as r_file:
        w_file.write(r_file.read())
w_file.close()
The 1st way first collects the whole data into one big string and then writes it to the destination file. The 2nd way reads each file and immediately appends its data to the destination file. I have used both ways in my code, but I am not sure which is better. Do you know any pros and cons of both ways? If you know even better practices, please share. // Edited: added 3rd way
Upvotes: 1
Views: 694
Reputation: 11
Looks like the second way is safer and faster for many files.
from os import listdir
from os.path import isfile, join
import timeit

my_path = r"./"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]
test_files = list_of_files
print(len(test_files), "files ~6.5kB per file")

def whole_data(amount):
    data = ""
    for file in test_files[:amount]:
        with open(file, 'rb') as fr:
            data += str(fr.read())  # note: str() on bytes produces a "b'...'" literal
    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)

def file_by_file(amount):
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files[:amount]:
            with open(file, 'rb') as fr:
                fw.write(str(fr.read()))  # same str()-on-bytes conversion as above
# all files are taken from the game Witcher 3
# 100 files ~6.5kB per file
print('Whole data method 20/number=100:', timeit.timeit("whole_data(20)", globals=globals(), number=100))
print('Whole data method 100/number=20:', timeit.timeit("whole_data(100)", globals=globals(), number=20 ))
# Whole data method 20/number=100: 285.6315555
# Whole data method 100/number=20: 495.45210849999995
print('File by file method 20/number=100:', timeit.timeit("file_by_file(20)", globals=globals(), number=100))
print('File by file method 100/number=20:', timeit.timeit("file_by_file(100)", globals=globals(), number=20 ))
# File by file method 20/number=100: 212.43927700000006
# File by file method 100/number=20: 205.07520319999992
Upvotes: 0
Reputation: 11
OK, I did some tests, and the results are not what I expected. I thought the += operator in the first way would be the first to raise a memory problem, but the MemoryError occurs only in the second "way" of saving files. I wanted to make the operations more complex, so I added replacing characters in the input files.
The only expected result was the time difference between the 1st way and the 3rd way (separate open/close). For only 30 files: 1st way: 0.26499 sec, 2nd way: 0.648 sec, 3rd way: 0.242 sec.
Above 30 files the 2nd way raised a MemoryError, so I excluded it from the tests.
For all 222 huge files the results are pretty predictable: 1st way: 39 sec, 3rd way: 1.577 sec.
Code:
from os import listdir
from os.path import isfile, join
import time

my_path = r"path_to_files"
list_of_files = [f for f in listdir(my_path) if isfile(join(my_path, f))]
print(len(list_of_files))
repeat = 20

if True:  # 39.381000042 sec
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")
    now = time.time()  # start counting
    for i in range(repeat):
        # 1st way
        whole_data = ""
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                whole_data += tmp
        with open("destination_file.txt", 'w') as w_file:
            w_file.write(whole_data)
    print(time.time() - now)  # print time elapsed
# --------------- 1st way ---------------

if True:  # MemoryError
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")
    now = time.time()  # start counting
    for i in range(repeat):
        # 2nd way
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                with open("destination_file.txt", 'a') as w_file:
                    tmp = r_file.read().replace('d', 'A')  # MemoryError
                    w_file.write(tmp)
    print(time.time() - now)  # print time elapsed
# --------------- 2nd way ---------------

if True:  # 1.53500008583 sec
    # clear destination_file.txt
    with open("destination_file.txt", 'w') as w_file:
        w_file.write("")
    now = time.time()  # start counting
    for i in range(repeat):
        # 3rd way: separate open/close for write
        w_file = open("destination_file.txt", 'w')
        for file_name in list_of_files:
            with open(file_name, 'r') as r_file:
                tmp = r_file.read().replace('d', 'A')
                w_file.write(tmp)
        w_file.close()
    print(time.time() - now)  # print time elapsed
# --------------- 3rd way (separate) ---------------
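A chunked variant of the 2nd way should avoid the MemoryError entirely, since no whole file is ever held in memory. Here is a minimal sketch (the 1 MiB chunk size is an arbitrary choice, not something I measured):

# Sketch only: stream each source file in fixed-size chunks so that neither a
# whole file nor a concatenated "super string" is ever held in memory.
with open("destination_file.txt", 'w') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            while True:
                chunk = r_file.read(1 << 20)  # read at most ~1 MiB of text at a time
                if not chunk:
                    break
                # a single-character replace is safe across chunk boundaries
                w_file.write(chunk.replace('d', 'A'))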
Upvotes: 0
Reputation: 613
Here's a best-effort test harness, but I can't stress enough how little it proves in reality. Out of 3-4 runs (each of 10K trials), each method has come out on top at least once, and only by 0.1s-0.2s (over 10K trials!). That said, I'm running some IO-heavy ML models on my workstation, so someone else may produce more reliable figures. In any case, I'd say it's a syntactic choice and performance is not a major concern.
I've made a couple of syntax changes (nested with statements where appropriate) and moved each method into a function after setting up some files. You may also find different numbers if you change the number of lines in each file, as @gimix has said. Per his answer, the whole-data method will also use lots of memory unnecessarily, so this may be the deciding factor in writing clean, performant, and future-proof code.
import timeit

test_files = []
for n in range(100):
    file_name = f'test_file_{n}.txt'
    with open(file_name, 'w') as f:
        for i in range(10):
            f.write(f'{i}\n')
    test_files.append(file_name)

def whole_data():
    data = ""
    for file in test_files:
        with open(file, 'r') as fr:
            data += fr.read()
    with open('whole_data_file.txt', 'w') as fw:
        fw.write(data)

def file_by_file():
    with open('line_by_line_file.txt', 'w') as fw:
        for file in test_files:
            with open(file, 'r') as fr:
                fw.write(fr.read())
print('Whole data method:', timeit.timeit("whole_data()", globals=globals(), number=10_000))
# Whole data method: 10.38545351603534
# Whole data method: 10.356000136991497
print('File by file method:', timeit.timeit("file_by_file()", globals=globals(), number=10_000))
# File by file method: 10.356590001960285
# File by file method: 10.507033439003862
Note that all of the above will probably take upwards of a minute if not run on an SSD (I'm using a performant NVMe SSD).
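If you want to squeeze some of the noise out of figures like these, timeit.repeat is made for it: run the whole measurement several times and keep the minimum, which is the run least disturbed by everything else on the machine. A sketch reusing the two functions above (repeat=5 makes it take roughly five times as long):

import timeit

# Keep the fastest of 5 complete measurements; the minimum is the run least
# disturbed by other processes competing for the disk and CPU.
best_whole = min(timeit.repeat("whole_data()", globals=globals(), repeat=5, number=10_000))
best_fbf = min(timeit.repeat("file_by_file()", globals=globals(), repeat=5, number=10_000))
print('Whole data method (best of 5):', best_whole)
print('File by file method (best of 5):', best_fbf)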
Upvotes: 0
Reputation: 3833
If you have a limited number of small files, I presume you will notice no difference; but if you were dealing with many very large files, the first method would consume a huge amount of RAM, basically for no reason, so the second method is surely more scalable.
That said, you probably don't need to reopen (and implicitly close) the output file at every iteration, which may slow things down depending on the OS, disk/network performance, etc. You may restructure your code like this:
with open("destination_file.txt", 'a') as w_file:
for file_name in list_of_files:
with open(file_name, 'r') as r_file
w_file.write(r_file.read())
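And if the individual files can themselves be very large, you can stream them in chunks instead of read()-ing each one whole. A minimal sketch using shutil.copyfileobj (binary mode is my choice here, since the data is copied unmodified):

import shutil

# copyfileobj streams each source file in fixed-size chunks, so memory use
# stays constant no matter how large the files are.
with open("destination_file.txt", 'wb') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'rb') as r_file:
            shutil.copyfileobj(r_file, w_file)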
Upvotes: 0
Reputation: 3011
The timeit module takes in two strings, a statement (stmt) and a setup. It runs the setup code, then runs the stmt code some n number of times and reports back the total length of time it took.
import timeit

setup = '''
def func_one():
    whole_data = ""
    for file_name in list_of_files:
        r_file = open(file_name, 'r')
        whole_data += r_file.read()
        r_file.close()
    with open("destination_file.txt", 'w') as w_file:
        w_file.write(whole_data)
'''
stmt = 'func_one()'
# Shows the time taken to run this function 10 times
print(timeit.timeit(stmt, setup=setup, number=10, globals=globals()))
The reason I am running the function 10 times is so that timeit reports a measurable value rather than a rounded-off one.
Similarly, you can time the 2nd way:
setup = '''
def func_two():
    for file_name in list_of_files:
        r_file = open(file_name, 'r')
        with open("destination_file.txt", 'a') as w_file:
            w_file.write(r_file.read())
        r_file.close()
'''
stmt = 'func_two()'
# Shows the time taken to run this function 10 times
print(timeit.timeit(stmt, setup=setup, number=10, globals=globals()))
Then you can compare the time printed out.
I know this is a lot, but sometimes just looking at the code cannot tell you which is faster.
Upvotes: 0
Reputation: 31
"The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error. I highly recommend that you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you."
Upvotes: 1
Reputation: 64
Intuitively, the 2nd way "feels" faster, but you can always try timing them both.
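For example, a minimal timing sketch with time.perf_counter, applied here to the (restructured) 2nd way and reusing list_of_files from the question:

import time

# Wrap either approach between two perf_counter() calls and compare the results
start = time.perf_counter()
with open("destination_file.txt", 'a') as w_file:
    for file_name in list_of_files:
        with open(file_name, 'r') as r_file:
            w_file.write(r_file.read())
print("elapsed:", time.perf_counter() - start, "seconds")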
Upvotes: 0