Miguel

Reputation: 356

Efficient way to store a file as a matrix of integers

What is the most efficient way of reading a file formatted like this:

0 0 1 1 0 1 0 1
0 1 0 0 0 1 1 1
1 1 1 0 1 1 0 0

and storing it as a matrix like this?

[[0, 0, 1, 1, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 1, 1, 1],
[1, 1, 1, 0, 1, 1, 0, 0]]

Please note that each line in the file is read as a string, e.g. the 1st one is:

"0 0 1 1 0 1 0 1"

Therefore, the characters of the string have to be split and converted to integers.

I have tried several ways, and the one I found to be fastest involves using map():

code a)

with open(filename, "r") as file:
    matrix = []
    for line in file:
        # split on whitespace and convert each token to int
        matrix.append(list(map(int, line.split())))

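(For reference, the same parse can be written as a single nested comprehension; `io.StringIO` stands in for the opened file here so the snippet is self-contained:)

```python
import io

data = "0 0 1 1 0 1 0 1\n0 1 0 0 0 1 1 1\n1 1 1 0 1 1 0 0\n"
# each line is split on whitespace and its tokens converted to int
with io.StringIO(data) as file:
    matrix = [[int(v) for v in line.split()] for line in file]
print(matrix)
```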
I found multiprocessing to be much slower, but I am sure I am doing something wrong:

code b)

from multiprocessing.dummy import Pool

# splitting function: converts one line into a list of integers
def f(line):
    return [int(value) for value in line.split()]

with open(filename, "r") as file:
    # pool of 4 threads; each call to f handles one line
    with Pool(4) as pool:
        matrix = pool.map(f, file)

Do you know any more efficient way to achieve this?

Extra: If you know about multi-threading/multiprocessing, I'd appreciate any insight into why code b) is actually slower than code a)!

Thanks!

Upvotes: 2

Views: 206

Answers (2)

Grzegorz Krug

Reputation: 301

If you want to grab numbers from a file, I would definitely check the pandas documentation, as it was meant for reading CSV files and the like, or go with the answer provided by Sebastien.
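
For example, a minimal sketch of the pandas route (`io.StringIO` stands in for the file here; the data is the example from the question):

```python
import io
import pandas as pd

data = "0 0 1 1 0 1 0 1\n0 1 0 0 0 1 1 1\n1 1 1 0 1 1 0 0\n"
# read_csv accepts any file-like object; sep=" " splits on single spaces
matrix = pd.read_csv(io.StringIO(data), sep=" ", header=None).values.tolist()
print(matrix)
```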

For storing data I use shelve; it's very easy, and it can store most Python objects.

Quote from documentation:

A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects — anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings.

Pros:

It's fast, from my experience at least (maybe I need bigger data to find better libraries). I just measured the time to write 100k elements, each containing about 100 random integers, and it took under 2 s.

Cons:

The files can be a little larger than raw text, but the data is saved as a dictionary.

Example Code:

import numpy as np
import shelve

deck = np.arange(10)
np.random.shuffle(deck)
print(deck)

with shelve.open('dummy', 'n') as file: 
    file['my_data'] = deck


with shelve.open('dummy') as file:
    print(file['my_data'])

Out:

[2 0 5 6 8 1 4 9 7 3]
[2 0 5 6 8 1 4 9 7 3]

Doc:

https://docs.python.org/3/library/shelve.html

Upvotes: 1

Sebastien D

Reputation: 4482

You could simply use numpy:

import numpy as np
matrix = np.loadtxt(open("test.txt", "rb"), delimiter=" ", dtype=int).tolist()
print(matrix)

output:

[[0, 0, 1, 1, 0, 1, 0, 1],
 [0, 1, 0, 0, 0, 1, 1, 1],
 [1, 1, 1, 0, 1, 1, 0, 0]]
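
(Since whitespace is loadtxt's default delimiter, the open() call and the delimiter argument can also be dropped and the filename passed directly; `io.StringIO` stands in for the file in this self-contained sketch:)

```python
import io
import numpy as np

data = "0 0 1 1 0 1 0 1\n0 1 0 0 0 1 1 1\n1 1 1 0 1 1 0 0\n"
# loadtxt splits on whitespace by default and accepts any file-like object
matrix = np.loadtxt(io.StringIO(data), dtype=int).tolist()
print(matrix)
```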

Upvotes: 3
