Ohad Sharet

Reputation: 1142

genfromtxt: read data of different types as an array of arrays

I am trying to import data from a text file with a varying number of columns and insert it into an array of arrays. I know that the first column will always be a string and the next three columns will be integers, but so far I have only managed to read the file as an array of tuples.

I have tried using dtype=(object,int,int,int):

from io import StringIO
import numpy as np

new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
                           , delimiter=",")

print("File data:",new_result )


File data: [('01/23/2020',  32, 0,  2) ("01/31/2020' ", 436, 0, 10)]

I want the output to look like this:

[['01/23/2020' 32 0 2]
 ['01/31/2020'  436 0 10]]

so that

new_result == np.array( [['01/23/2020',32,0,2],
                         ['01/31/2020', 436, 0, 10]],dtype=object)

will be True

Upvotes: 0

Views: 253

Answers (2)


Reputation: 231665

Specifying a dtype like that produces a structured array https://numpy.org/doc/stable/user/basics.rec.html

In [40]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
    ...: new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
    ...:                            , delimiter=",")

This is a 1d array, with a compound dtype. The print display just shows the elements, or records, as tuples, but the repr display shows the dtype as well:

In [41]: new_result
array([(b'01/23/2020',  32, 0,  2), (b"01/31/2020' ", 436, 0, 10)],
      dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

In [42]: new_result.dtype
Out[42]: dtype([('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

Fields are accessed by name:

In [43]: new_result['f0']
Out[43]: array([b'01/23/2020', b"01/31/2020' "], dtype=object)

In [44]: new_result['f1']
Out[44]: array([ 32, 436])

The main structured array doc page suggests using a recfunctions function to convert dtypes:

In [46]: import numpy.lib.recfunctions as rf

Unfortunately, the object field gives that function problems:

In [48]: arr = rf.structured_to_unstructured(new_result, dtype=object)
TypeError                                 Traceback (most recent call last)
Input In [48], in <cell line: 1>()
----> 1 arr = rf.structured_to_unstructured(new_result, dtype=object)

File <__array_function__ internals>:5, in structured_to_unstructured(*args, **kwargs)

File ~\anaconda3\lib\site-packages\numpy\lib\recfunctions.py:980, in structured_to_unstructured(arr, dtype, copy, casting)
    978 with suppress_warnings() as sup:  # until 1.16 (gh-12447)
    979     sup.filter(FutureWarning, "Numpy has detected")
--> 980     arr = arr.view(flattened_fields)
    982 # next cast to a packed format with all fields converted to new dtype
    983 packed_fields = np.dtype({'names': names,
    984                           'formats': [(out_dtype, dt.shape) for dt in dts]})

File ~\anaconda3\lib\site-packages\numpy\core\_internal.py:494, in _view_is_safe(oldtype, newtype)
    491     return
    493 if newtype.hasobject or oldtype.hasobject:
--> 494     raise TypeError("Cannot change data-type for object array.")
    495 return

Let's try the dtype=None option (and clean up the string a bit):

In [49]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020 ,436 ,0 ,10")
    ...: new_result = np.genfromtxt(new_string, dtype=None, encoding="unicode"
    ...:                            , delimiter=",")

In [50]: new_result
array([('01/23/2020',  32, 0,  2), ('01/31/2020 ', 436, 0, 10)],
      dtype=[('f0', '<U11'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

This is the same as your case, except that the first field now has a string dtype instead of object.

But that doesn't help; it must be the target dtype that the function doesn't like (or both):

In [51]: arr = rf.structured_to_unstructured(new_result, dtype=object)
TypeError: Cannot change data-type for object array.

But we can convert the numeric fields, producing a 2d int array:

In [52]: arr = rf.structured_to_unstructured(new_result[['f1','f2','f3']], dtype=int)

In [53]: arr
array([[ 32,   0,   2],
       [436,   0,  10]])

Assigning fields to object array

In [65]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
    ...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', encoding="unicode"
    ...:                            , delimiter=",")

In [66]: new_result
array([(b'01/23/2020',  32, 0,  2), (b'01/31/2020', 436, 0, 10)],
      dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

Create a target array:

In [67]: arr = np.empty((2,4),object)    
In [68]: for i,f in enumerate(new_result.dtype.fields):
    ...:     arr[:,i] = new_result[f]

In [69]: arr
array([[b'01/23/2020', 32, 0, 2],
       [b'01/31/2020', 436, 0, 10]], dtype=object)

Many of the recfunctions do something like this - create a target array, and copy data by field name. Usually a structured array has many more records than fields, so this iteration by field is relatively efficient.
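
The field-by-field copy can be wrapped in a small helper. This is only a sketch: the name structured_to_object is hypothetical (it is not part of numpy.lib.recfunctions), and the sample array is built by hand here rather than read with genfromtxt.

```python
import numpy as np

def structured_to_object(rec):
    """Copy a 1d structured array into a 2d object array, one field per column.

    Hypothetical helper for illustration; not a NumPy function.
    """
    names = rec.dtype.names
    out = np.empty((rec.shape[0], len(names)), dtype=object)
    for col, name in enumerate(names):
        # Each field is a plain 1d array, so it fits one column of the target.
        out[:, col] = rec[name]
    return out

# Hand-built structured array with the same layout as the genfromtxt result.
rec = np.array(
    [("01/23/2020", 32, 0, 2), ("01/31/2020", 436, 0, 10)],
    dtype=[("f0", "U10"), ("f1", "i4"), ("f2", "i4"), ("f3", "i4")],
)
arr = structured_to_object(rec)
```

The loop runs once per field, not once per record, which is why it stays cheap when there are many rows.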


If you specify unpack, the result is a separate array for each column/field:

In [74]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
    ...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', unpack=True
    ...:                            , delimiter=",")

In [75]: new_result
[array([b'01/23/2020', b'01/31/2020'], dtype=object),
 array([ 32, 436], dtype=int32),
 array([0, 0], dtype=int32),
 array([ 2, 10], dtype=int32)]

They can then be concatenated with stack:

In [77]: np.stack(new_result, axis=1)
array([[b'01/23/2020', 32, 0, 2],
       [b'01/31/2020', 436, 0, 10]], dtype=object)

Upvotes: 1

Mathias Graabeck

Reputation: 73

This should work for your problem:

import numpy as np
example_string = "01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10"
example_string_filtered = example_string.replace(' ','').replace("'",'')
newline_split = example_string_filtered.split('\n')

result = []
for line in newline_split:
    line_split = line.split(',')
    result.append([line_split[0], int(line_split[1]), int(line_split[2]) ,int(line_split[3])])
result = np.array(result, dtype='O')

result: [['01/23/2020', 32, 0, 2], ['01/31/2020', 436, 0, 10]]
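
A variation on the same idea, as a rough sketch: clean each field separately with strip instead of removing every space and quote from the whole string, so that characters inside a field are left alone. The function name parse_rows is made up for this example, and it assumes every column after the first holds an integer.

```python
import numpy as np

def parse_rows(text):
    """Parse comma-separated lines into a 2d object array (illustrative helper)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Strip surrounding spaces and stray apostrophes from each field.
        date, *numbers = (field.strip(" '") for field in line.split(","))
        rows.append([date] + [int(n) for n in numbers])
    return np.array(rows, dtype=object)

result = parse_rows("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
```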

Upvotes: 2
