Muhammad Adeel Zahid

Reputation: 17784

Loading each line of text file as numpy array without looping

I have datasets containing data for frequent rule mining where each row has a different number of items like

9 10 5
8 9 10 5 12 15
7 3 5

Is there a way to read a file with the above contents at once and convert it to a numpy array of arrays, like:

array([array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5])], dtype=object)

I have come across the numpy.loadtxt function, but it does not handle a varying number of columns the way I want. With differing column counts, loadtxt requires specifying which columns to read. But I want to read all the values in each row.
One way to achieve this could be to manually read the file and convert each line into a numpy `array`, but I don't want to take that route because the actual datasets will be much bigger than the tiny example shown here. For instance, I am planning to use datasets from the FIMI repository. One sample dataset is the accidents data.
Edit: I used the following code to achieve what I want

import numpy as np
from io import StringIO

data = []
# d = np.loadtxt('datasets/grocery.dat')
with open('datasets/accidents.dat', 'r') as f:
    for l in f:
        # parse each line separately; lines have different lengths
        ar = np.genfromtxt(StringIO(l))
        data.append(ar)
data = np.array(data)
print(data)

But this is exactly what I want to avoid: looping in Python, because it took more than four minutes just to read the data and convert it into numpy arrays.
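A sketch of what a faster per-line parse might look like, using plain `split` instead of a `genfromtxt` call per line (StringIO stands in for the real file here; this assumes the file is whitespace-separated integers):

```python
import numpy as np
from io import StringIO

# StringIO stands in for open('datasets/accidents.dat')
f = StringIO("9 10 5\n8 9 10 5 12 15\n7 3 5\n")

# plain split + int conversion avoids genfromtxt's per-line parsing overhead
data = np.array([np.array(line.split(), dtype=int) for line in f],
                dtype=object)
```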

Upvotes: 2

Views: 1003

Answers (1)

hpaulj

Reputation: 231385

In [401]: txt="""9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5 
     ...: 9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5 
     ...: 9 10 5 
     ...: 8 9 10 5 12 15 
     ...: 7 3 5""".splitlines()                                                                        

(this approximates what we'd get with readlines)

Collecting a list of lists is straightforward, but converting the strings to numbers requires a list comprehension (or a similar per-element loop):

In [402]: alist = []                                                                                   
In [403]: for line in txt: 
     ...:     alist.append(line.split()) 
     ...:                                                                                              
In [404]: alist                                                                                        
Out[404]: 
[['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5'],
 ['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5'],
 ['9', '10', '5'],
 ['8', '9', '10', '5', '12', '15'],
 ['7', '3', '5']]
In [405]: np.array(alist)                                                                              
Out[405]: 
array([list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
       list(['7', '3', '5']), list(['9', '10', '5']),
       list(['8', '9', '10', '5', '12', '15']), list(['7', '3', '5']),
       list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
       list(['7', '3', '5'])], dtype=object)

It might be faster to convert each line to an integer array (but that's just a guess):

In [406]: alist = [] 
     ...: for line in txt: 
     ...:     alist.append(np.array(line.split(), dtype=int)) 
     ...:      
     ...:                                                                                              
In [407]: alist                                                                                        
Out[407]: 
[array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5]),
 array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5]),
 array([ 9, 10,  5]),
 array([ 8,  9, 10,  5, 12, 15]),
 array([7, 3, 5])]
In [408]: np.array(alist)                                                                              
Out[408]: 
array([array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5]), array([ 9, 10,  5]),
       array([ 8,  9, 10,  5, 12, 15]), array([7, 3, 5]),
       array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]),
       array([7, 3, 5])], dtype=object)

Given the irregular nature of the text, and the mix of array lengths in the result, there isn't much of an alternative. Arrays or lists of diverse sizes are a pretty good indicator that fast multidimensional array operations are not possible.

We could load all numbers as a 1d array with:

In [413]: np.fromstring(' '.join(txt), sep=' ', dtype=int)                                             
Out[413]: 
array([ 9, 10,  5,  8,  9, 10,  5, 12, 15,  7,  3,  5,  9, 10,  5,  8,  9,
       10,  5, 12, 15,  7,  3,  5,  9, 10,  5,  8,  9, 10,  5, 12, 15,  7,
        3,  5])

but splitting that back into per-line arrays still requires some sort of line count followed by an array split, so I doubt it would save any time.
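For what it's worth, that split could be sketched with `np.split` and a cumulative count of items per line (the sample rows stand in for the file; variable names are illustrative):

```python
import numpy as np

txt = ["9 10 5", "8 9 10 5 12 15", "7 3 5"]

counts = [len(line.split()) for line in txt]             # items per line
flat = np.fromstring(' '.join(txt), sep=' ', dtype=int)  # the 1d array above
# split at the cumulative offsets, dropping the final one
rows = np.split(flat, np.cumsum(counts)[:-1])
# rows -> [array([ 9, 10,  5]), array([ 8,  9, 10,  5, 12, 15]), array([7, 3, 5])]
```

Note the counting pass still iterates over the lines in Python, which is why it is unlikely to beat the per-line approach.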

Upvotes: 2
