Reputation: 356
What is the most efficient way of reading a file formatted like this:
0 0 1 1 0 1 0 1
0 1 0 0 0 1 1 1
1 1 1 0 1 1 0 0
and storing it as a matrix like this?
[[0, 0, 1, 1, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 0, 1, 1, 0, 0]]
Please note that each line in the file is read as a string, e.g. the first one is:
"0 0 1 1 0 1 0 1"
Therefore, the characters of the string have to be split and converted to integers.
I have tried several ways, and the one that I found to be fastest involves using map():
code a)
with open(filename, "r") as file:
    matrix = []
    for line in file:
        matrix.append([value for value in map(int, line.split())])
I found multiprocessing to be much slower, but I am sure I am doing something wrong:
code b)
from multiprocessing.dummy import Pool

with open(filename, "r") as file:
    # splitting function
    def f(file):
        values = [int(char) for line in file for char in line.split()]
        return values
    # 4 threads
    with Pool(4) as pool:
        matrix = pool.map(f, file)
Do you know any more efficient way to achieve this?
Extra: If you know about multi-threading/multiprocessing, I'd appreciate any insight into why code b) is actually slower than code a)!
Thanks!
Upvotes: 2
Views: 206
Reputation: 301
If you want to grab numbers from a file, I would definitely check the pandas documentation, as it was meant for reading CSV and similar formats, or go with the answer provided by Sebastien.
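For example, something along these lines should work with pandas (a sketch; the file name "test.txt" is assumed, as in the numpy answer below):

import pandas as pd

# read space-separated values with no header row;
# each row becomes a list of ints
matrix = pd.read_csv("test.txt", sep=" ", header=None).values.tolist()
print(matrix)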
For storing the data I am using shelve; it's very easy, and it can store most Python objects.
Quote from the documentation:
A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects — anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings.
It's fast, in my experience at least (maybe I need bigger data to find better libraries). I just measured the time for writing 100k elements, each containing about 100 random integers, and it came in under 2 s.
The files can be a little larger than the raw text, but the data is saved as a dictionary.
import numpy as np
import shelve

deck = np.arange(10)
np.random.shuffle(deck)
print(deck)

# write the array under a key, then read it back
with shelve.open('dummy', 'n') as file:
    file['my_data'] = deck

with shelve.open('dummy') as file:
    print(file['my_data'])

Output:
[2 0 5 6 8 1 4 9 7 3]
[2 0 5 6 8 1 4 9 7 3]
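For reference, a rough version of the timing measurement mentioned above might look like this (a sketch; the element count, row size, and the file name 'benchmark' are assumptions, and results will vary by machine and disk):

import time
import random
import shelve

# 100k rows of ~100 random integers each (assumed sizes)
data = [[random.randint(0, 9) for _ in range(100)] for _ in range(100_000)]

start = time.perf_counter()
with shelve.open('benchmark', 'n') as db:
    db['my_data'] = data
elapsed = time.perf_counter() - start
print(f"wrote 100k rows in {elapsed:.2f}s")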
https://docs.python.org/3/library/shelve.html
Upvotes: 1
Reputation: 4482
You could simply use numpy:
import numpy as np
matrix = np.loadtxt(open("test.txt", "rb"), delimiter=" ", dtype=int).tolist()
print(matrix)
output:
[[0, 0, 1, 1, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 0, 1, 1, 0, 0]]
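Side note: np.loadtxt also accepts a file path directly, and whitespace is the default delimiter, so the call can be shortened (this should be equivalent for the input above):

import numpy as np

# loadtxt opens the file itself; the default delimiter is any whitespace
matrix = np.loadtxt("test.txt", dtype=int).tolist()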
Upvotes: 3