Reputation: 3781
I have a tab-delimited CSV dataset like the following:
1 2 3 4 5 6 [0, 1, 2, 3, 4, 5]
3 1 2 6 4 5 [2, 0, 1, 5, 3, 4]
7 8 9 10 11 6 [0, 1, 2, 3, 4, 5]
10 11 9 8 7 6 [3, 4, 2, 1, 0, 5]
12 13 4 14 15 6 [0, 1, 2, 3, 4, 5]
13 4 14 12 15 6 [1, 2, 3, 0, 4, 5]
16 17 18 19 20 6 [0, 1, 2, 3, 4, 5]
6 18 20 17 16 19 [5, 2, 4, 1, 0, 3]
7 21 22 23 24 6 [0, 1, 2, 3, 4, 5]
23 6 21 7 22 24 [3, 5, 1, 0, 2, 4]
25 7 21 22 23 6 [0, 1, 2, 3, 4, 5]
6 21 7 22 25 23 [5, 2, 1, 3, 0, 4]
16 26 3 27 28 6 [0, 1, 2, 3, 4, 5]
26 6 27 3 28 16 [1, 5, 3, 2, 4, 0]
7 29 24 30 31 6 [0, 1, 2, 3, 4, 5]
30 24 6 7 29 31 [3, 2, 5, 0, 1, 4]
32 33 13 34 35 36 [0, 1, 2, 3, 4, 5]
34 32 36 35 13 33 [3, 0, 5, 4, 2, 1]
7 37 38 39 40 6 [0, 1, 2, 3, 4, 5]
39 38 40 6 37 7 [3, 2, 4, 5, 1, 0]
7 41 42 43 44 6 [0, 1, 2, 3, 4, 5]
41 6 44 43 42 7 [1, 5, 4, 3, 2, 0]
7 45 46 47 48 6 [0, 1, 2, 3, 4, 5]
6 47 45 7 46 48 [5, 3, 1, 0, 2, 4]
49 2 50 51 52 6 [0, 1, 2, 3, 4, 5]
When I import this CSV file into a numpy array as follows:
dataset = numpy.loadtxt('dataset/demo_dataset.csv', delimiter='\t', dtype='str')
I obtain a numpy array with shape (25,).
I want to import this CSV file into two numpy arrays, called X and Y.
X should contain the first 6 columns, and Y should contain the last column as lists of values, not str.
How can I manage it?
Upvotes: 3
Views: 3483
Reputation: 231325
Some options using genfromtxt:
In [1047]: txt=b"""7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]"""
In [1048]: txt=[txt,txt,txt]
In [1049]: txt
Out[1049]:
[b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]',
b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]',
b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]']
Load as the default floats; the last column is nan:
In [1050]: np.genfromtxt(txt,delimiter='\t')
Out[1050]:
array([[ 7., 8., 9., 10., 11., 6., nan],
[ 7., 8., 9., 10., 11., 6., nan],
[ 7., 8., 9., 10., 11., 6., nan]])
As strings:
In [1051]: np.genfromtxt(txt,delimiter='\t',dtype='str')
Out[1051]:
array([['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]'],
['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]'],
['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]']],
dtype='<U19')
Letting it decide the best fit: the result is a structured array, with int fields and one string field.
In [1052]: np.genfromtxt(txt,delimiter='\t',dtype=None)
Out[1052]:
array([(7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]'),
(7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]'),
(7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', 'S19')])
In [1053]: _['f6']
Out[1053]:
array([b' [0, 1, 2, 3, 4, 5]', b' [0, 1, 2, 3, 4, 5]',
b' [0, 1, 2, 3, 4, 5]'],
dtype='|S19')
Refine the dtype: one field holds the 6 integer columns, the other holds the string:
In [1055]: np.genfromtxt(txt,delimiter='\t',dtype='6int,S20')
Out[1055]:
array([([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]'),
([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]'),
([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]')],
dtype=[('f0', '<i4', (6,)), ('f1', 'S20')])
The first field is your desired X; the strings in the last field need further processing (as per your other question):
In [1060]: _['f0']
Out[1060]:
array([[ 7, 8, 9, 10, 11, 6],
[ 7, 8, 9, 10, 11, 6],
[ 7, 8, 9, 10, 11, 6]])
In [1061]: __['f1']
Out[1061]:
array([b' [0, 1, 2, 3, 4, 5]', b' [0, 1, 2, 3, 4, 5]',
b' [0, 1, 2, 3, 4, 5]'],
dtype='|S20')
This last field can be converted as @chefarov does with the Y variable.
Come to think of it, I could process that string field with another call to genfromtxt. I still need to remove the [] first.
In [1101]: data=np.genfromtxt(txt,delimiter='\t',dtype='6int,S20')
In [1102]: data['f1']
Out[1102]:
array([b'[0, 1, 2, 3, 4, 5]', b'[0, 1, 2, 3, 4, 5]', b'[0, 1, 2, 3, 4, 5]'],
dtype='|S20')
In [1103]: np.genfromtxt([l.strip(b'[]') for l in data['f1']],delimiter=',',dtype=int)
Out[1103]:
array([[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]])
I've recommended, in other cases, preprocessing the lines to remove quotes, brackets and problem delimiters. genfromtxt accepts input from anything that feeds it lines. But in this case, the required processing is most of what @chefarov proposes, minus the int conversions.
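As a rough illustration of that preprocessing idea, the sketch below feeds genfromtxt from a generator that drops the brackets and turns the commas into tabs, so each line parses as 12 integer columns. The helper name clean_lines and the two sample rows are made up for the example, and it assumes every row has exactly 6 values inside the brackets.

```python
import numpy as np

def clean_lines(lines):
    """Yield each line with '[' and ']' removed and commas turned into tabs."""
    # clean_lines is a hypothetical helper, not part of numpy.
    for line in lines:
        line = line.replace("[", "").replace("]", "")
        yield line.replace(",", "\t")

# Two sample rows in the question's layout (an open file object works the same way).
raw = [
    "1\t2\t3\t4\t5\t6\t[0, 1, 2, 3, 4, 5]",
    "3\t1\t2\t6\t4\t5\t[2, 0, 1, 5, 3, 4]",
]

# autostrip=True removes the spaces left over after each former comma.
table = np.genfromtxt(clean_lines(raw), delimiter="\t", dtype=int, autostrip=True)

X = table[:, :6]   # first 6 columns
Y = table[:, 6:]   # the values that were inside the brackets
```

With a real file, something like `with open(path) as f: table = np.genfromtxt(clean_lines(f), ...)` should behave the same way, since genfromtxt only needs an iterable of lines.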
genfromtxt also accepts converters, but I found in other questions that a converter can't change one field into multiple ones.
Also, genfromtxt iterates on the file lines, decodes each one, and collects the values in a list. So it doesn't have any speed advantage over a custom reader.
Upvotes: 2
Reputation: 3893
If you use pandas, there's an (arguably) simpler approach to read the data in. First, construct a pandas.DataFrame instance, then apply a custom function to its last column to convert the string elements to np.ndarray types:
import pandas as pd
import numpy as np
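# names must be list-like, so split the string into one name per column
df = pd.read_table('dataset/demo_dataset.csv', delimiter='\t', names=list('abcdefg'))
# strip the brackets (and any surrounding spaces), then parse the comma-separated ints
convert = lambda a: np.fromstring(a.strip(' []'), count=a.count(',') + 1, sep=', ', dtype=int)
df.g = df.g.apply(convert)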
Once the mixed dataframe is constructed, X and Y can be extracted as arrays in a simple fashion:
X = df.values[:, :-1].astype(int)
Y = np.vstack(df.values[:, -1])
Upvotes: 1
Reputation: 2845
I managed to achieve this only via a custom method:
import numpy

with open('dataset/demo_dataset.csv', 'r') as fin:
    lines = fin.readlines()

# remove '\n' characters
clean_lines = [l.strip('\n') for l in lines]
# split on tab so that we get lists from strings
A = [cl.split('\t') for cl in clean_lines]
# get lists of ints instead of lists of strings
# (list(...) is needed on Python 3, where map returns an iterator)
X = [list(map(int, row[0:6])) for row in A]
# last column in Y
Y = [row[6] for row in A]
# convert each bracketed string to a list of int values
for i in range(len(Y)):
    Y[i] = list(map(int, Y[i].strip(' []').split(',')))
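The custom reader above leaves X and Y as plain lists. A possible variant that returns numpy arrays is sketched below; it swaps the strip/split step for ast.literal_eval from the standard library, which is a different technique from the one in this answer. The function name parse_rows and the sample rows are invented for the example, and it assumes every bracketed list has the same length.

```python
import ast
import numpy as np

def parse_rows(lines):
    """Split tab-delimited lines into integer features and bracketed lists."""
    # parse_rows is a hypothetical helper written for this sketch.
    X, Y = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        *features, target = line.split("\t")
        X.append([int(v) for v in features])
        # literal_eval turns the text "[0, 1, 2]" into a real Python list
        Y.append(ast.literal_eval(target.strip()))
    return np.array(X, dtype=int), np.array(Y, dtype=int)

# Sample rows in the question's layout; an open file object can be passed instead.
sample = [
    "1\t2\t3\t4\t5\t6\t[0, 1, 2, 3, 4, 5]\n",
    "3\t1\t2\t6\t4\t5\t[2, 0, 1, 5, 3, 4]\n",
]
X, Y = parse_rows(sample)
```

ast.literal_eval only evaluates Python literals, so it should be safe on untrusted files in a way that eval is not, at the cost of being slower than a plain split.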
Upvotes: 3