Reputation: 17784
I have datasets for frequent rule mining where each row has a different number of items, like
9 10 5
8 9 10 5 12 15
7 3 5
Is there a way to read a file with the above contents at once and convert it to a numpy array of arrays, like

array([array([ 9, 10, 5]), array([ 8, 9, 10, 5, 12, 15]),
       array([7, 3, 5])], dtype=object)
I have come across the numpy.loadtxt function, but it does not cater to a varying number of columns the way I want. With different numbers of columns, loadtxt requires specifying which columns to read (via usecols), but I want to read all the values in each row.
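To illustrate the restriction, the closest loadtxt gets is something like this sketch, which assumes every row has at least three items and silently drops the rest:

import numpy as np

# Sketch: loadtxt copes with ragged rows only if we fix the columns to read,
# e.g. just the first three items of every row; longer rows lose their extras.
first3 = np.loadtxt('datasets/accidents.dat', usecols=(0, 1, 2), dtype=int)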
One way to achieve this could be to manually read the file and convert each line into a numpy array, but I don't want to take that route because the actual datasets will be a lot bigger than the tiny example shown here. For instance, I am planning to use datasets from the FIMI repository. One sample is the accidents dataset.
Edit:
I used the following code to achieve what I want
import numpy as np
from io import StringIO

data = []
# d = np.loadtxt('datasets/grocery.dat')
with open('datasets/accidents.dat', 'r') as f:
    for l in f.readlines():
        ar = np.genfromtxt(StringIO(l))   # parse one line into a float array
        data.append(ar)
print(data)

data = np.array(data)
print(data)
But this is exactly what I want to avoid: looping in Python code. It took more than four minutes just to read the data and convert it into numpy arrays.
Upvotes: 2
Views: 1003
Reputation: 231385
In [401]: txt="""9 10 5
...: 8 9 10 5 12 15
...: 7 3 5
...: 9 10 5
...: 8 9 10 5 12 15
...: 7 3 5
...: 9 10 5
...: 8 9 10 5 12 15
...: 7 3 5""".splitlines()
(this approximates what we'd get with readlines)
Collecting a list of lists is straightforward, but converting the strings to numbers would require a list comprehension (sketched after the session below):
In [402]: alist = []
In [403]: for line in txt:
...: alist.append(line.split())
...:
In [404]: alist
Out[404]:
[['9', '10', '5'],
['8', '9', '10', '5', '12', '15'],
['7', '3', '5'],
['9', '10', '5'],
['8', '9', '10', '5', '12', '15'],
['7', '3', '5'],
['9', '10', '5'],
['8', '9', '10', '5', '12', '15'],
['7', '3', '5']]
In [405]: np.array(alist)
Out[405]:
array([list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
list(['7', '3', '5']), list(['9', '10', '5']),
list(['8', '9', '10', '5', '12', '15']), list(['7', '3', '5']),
list(['9', '10', '5']), list(['8', '9', '10', '5', '12', '15']),
list(['7', '3', '5'])], dtype=object)
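The conversion mentioned above could look like this, a minimal sketch assuming the session's txt list:

# Convert each row of strings into a row of integers.
ilist = [[int(s) for s in line.split()] for line in txt]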
It might be faster to convert each line to an integer array (but that's just a guess):
In [406]: alist = []
...: for line in txt:
...: alist.append(np.array(line.split(), dtype=int))
...:
...:
In [407]: alist
Out[407]:
[array([ 9, 10, 5]),
array([ 8, 9, 10, 5, 12, 15]),
array([7, 3, 5]),
array([ 9, 10, 5]),
array([ 8, 9, 10, 5, 12, 15]),
array([7, 3, 5]),
array([ 9, 10, 5]),
array([ 8, 9, 10, 5, 12, 15]),
array([7, 3, 5])]
In [408]: np.array(alist)
Out[408]:
array([array([ 9, 10, 5]), array([ 8, 9, 10, 5, 12, 15]),
array([7, 3, 5]), array([ 9, 10, 5]),
array([ 8, 9, 10, 5, 12, 15]), array([7, 3, 5]),
array([ 9, 10, 5]), array([ 8, 9, 10, 5, 12, 15]),
array([7, 3, 5])], dtype=object)
Given the irregular nature of the text, and the mix of array lengths in the result, there isn't much of an alternative. Arrays or lists of diverse sizes are a pretty good indicator that fast multidimensional array operations are not possible.
We could load all numbers as a 1d array with:
In [413]: np.fromstring(' '.join(txt), sep=' ', dtype=int)
Out[413]:
array([ 9, 10, 5, 8, 9, 10, 5, 12, 15, 7, 3, 5, 9, 10, 5, 8, 9,
10, 5, 12, 15, 7, 3, 5, 9, 10, 5, 8, 9, 10, 5, 12, 15, 7,
3, 5])
but splitting that back into per-line arrays still requires some sort of line count followed by an array split. So I doubt it would save any time.
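For completeness, a sketch of that count-and-split, reusing the session's txt (the per-line counts still come from a Python loop, which is why the savings are doubtful):

import numpy as np

flat = np.fromstring(' '.join(txt), sep=' ', dtype=int)
counts = [len(line.split()) for line in txt]   # items per row
# np.split wants the cumulative offsets where each new row starts.
data = np.split(flat, np.cumsum(counts)[:-1])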
Upvotes: 2