Reputation: 3781

Reading tab delimited csv into numpy array with different data types

I have a tab-delimited CSV dataset as follows:

1       2       3       4       5       6       [0, 1, 2, 3, 4, 5]
3       1       2       6       4       5       [2, 0, 1, 5, 3, 4]
7       8       9       10      11      6       [0, 1, 2, 3, 4, 5]
10      11      9       8       7       6       [3, 4, 2, 1, 0, 5]
12      13      4       14      15      6       [0, 1, 2, 3, 4, 5]
13      4       14      12      15      6       [1, 2, 3, 0, 4, 5]
16      17      18      19      20      6       [0, 1, 2, 3, 4, 5]
6       18      20      17      16      19      [5, 2, 4, 1, 0, 3]
7       21      22      23      24      6       [0, 1, 2, 3, 4, 5]
23      6       21      7       22      24      [3, 5, 1, 0, 2, 4]
25      7       21      22      23      6       [0, 1, 2, 3, 4, 5]
6       21      7       22      25      23      [5, 2, 1, 3, 0, 4]
16      26      3       27      28      6       [0, 1, 2, 3, 4, 5]
26      6       27      3       28      16      [1, 5, 3, 2, 4, 0]
7       29      24      30      31      6       [0, 1, 2, 3, 4, 5]
30      24      6       7       29      31      [3, 2, 5, 0, 1, 4]
32      33      13      34      35      36      [0, 1, 2, 3, 4, 5]
34      32      36      35      13      33      [3, 0, 5, 4, 2, 1]
7       37      38      39      40      6       [0, 1, 2, 3, 4, 5]
39      38      40      6       37      7       [3, 2, 4, 5, 1, 0]
7       41      42      43      44      6       [0, 1, 2, 3, 4, 5]
41      6       44      43      42      7       [1, 5, 4, 3, 2, 0]
7       45      46      47      48      6       [0, 1, 2, 3, 4, 5]
6       47      45      7       46      48      [5, 3, 1, 0, 2, 4]
49      2       50      51      52      6       [0, 1, 2, 3, 4, 5]

When I import this CSV file into a numpy array as follows:

dataset = numpy.loadtxt('dataset/demo_dataset.csv', delimiter='\t', dtype='str')

I obtain a numpy array with (25,) shape.

I want to import this csv file into two numpy arrays, called X and Y.

X should hold the first 6 columns, and Y should hold the last column as lists of values, not strings.

How can I manage it?

Upvotes: 3

Answers (3)

hpaulj

Reputation: 231325

Some options using genfromtxt:

In [1047]: txt=b"""7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]"""
In [1048]: txt=[txt,txt,txt]
In [1049]: txt
Out[1049]: 
[b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]',
 b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]',
 b'7\t8\t9\t10\t11\t6\t [0, 1, 2, 3, 4, 5]']

Load with the default float dtype - the last column is nan:

In [1050]: np.genfromtxt(txt,delimiter='\t')
Out[1050]: 
array([[  7.,   8.,   9.,  10.,  11.,   6.,  nan],
       [  7.,   8.,   9.,  10.,  11.,   6.,  nan],
       [  7.,   8.,   9.,  10.,  11.,   6.,  nan]])

As strings:

In [1051]: np.genfromtxt(txt,delimiter='\t',dtype='str')
Out[1051]: 
array([['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]'],
       ['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]'],
       ['7', '8', '9', '10', '11', '6', ' [0, 1, 2, 3, 4, 5]']], 
      dtype='<U19')

Letting it decide the best fit - result is a structured array, with int fields and one string field.

In [1052]: np.genfromtxt(txt,delimiter='\t',dtype=None)
Out[1052]: 
array([(7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]'),
       (7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]'),
       (7, 8, 9, 10, 11, 6, b' [0, 1, 2, 3, 4, 5]')], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', 'S19')])
In [1053]: _['f6']
Out[1053]: 
array([b' [0, 1, 2, 3, 4, 5]', b' [0, 1, 2, 3, 4, 5]',
       b' [0, 1, 2, 3, 4, 5]'], 
      dtype='|S19')

Refine the dtype - one field holds the 6 integer columns, the other holds the string:

In [1055]: np.genfromtxt(txt,delimiter='\t',dtype='6int,S20')
Out[1055]: 
array([([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]'),
       ([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]'),
       ([7, 8, 9, 10, 11, 6], b' [0, 1, 2, 3, 4, 5]')], 
      dtype=[('f0', '<i4', (6,)), ('f1', 'S20')])

The first field is your desired X; the strings in the last field need further processing (as per your other question):

In [1060]: _['f0']
Out[1060]: 
array([[ 7,  8,  9, 10, 11,  6],
       [ 7,  8,  9, 10, 11,  6],
       [ 7,  8,  9, 10, 11,  6]])
In [1061]: __['f1']
Out[1061]: 
array([b' [0, 1, 2, 3, 4, 5]', b' [0, 1, 2, 3, 4, 5]',
       b' [0, 1, 2, 3, 4, 5]'], 
      dtype='|S20')

This last field can be converted as @chefarov does with the Y variable.

Come to think of it, I could process that string field with another call to genfromtxt. I still need to remove the [] first.

In [1101]: data=np.genfromtxt(txt,delimiter='\t',dtype='6int,S20')
In [1102]: data['f1']
Out[1102]: 
array([b'[0, 1, 2, 3, 4, 5]', b'[0, 1, 2, 3, 4, 5]', b'[0, 1, 2, 3, 4, 5]'], 
      dtype='|S20')
In [1103]: np.genfromtxt([l.strip(b'[]') for l in data['f1']],delimiter=',',dtype=int)
Out[1103]: 
array([[0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [0, 1, 2, 3, 4, 5]])

I've recommended, in other cases, preprocessing the lines to remove quotes, brackets and problem delimiters. genfromtxt accepts input from anything that feeds it lines. But in this case, the required processing is most of what @chefarov proposes - minus the int conversions.
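
A minimal sketch of that preprocessing idea, assuming the bracketed column only ever contains comma-separated integers: replace the brackets and commas with spaces, then let genfromtxt split on whitespace (its default when no delimiter is given). The helper name flatten_brackets and the sample list are made up for illustration; with a real file you would feed it the open file object instead of sample.

```python
import numpy as np

def flatten_brackets(lines):
    """Yield each line with '[', ']' and ',' replaced by spaces (hypothetical helper)."""
    for line in lines:
        yield line.replace('[', ' ').replace(']', ' ').replace(',', ' ')

# Three rows copied from the question, standing in for the file contents.
sample = [
    "1\t2\t3\t4\t5\t6\t[0, 1, 2, 3, 4, 5]",
    "3\t1\t2\t6\t4\t5\t[2, 0, 1, 5, 3, 4]",
    "7\t8\t9\t10\t11\t6\t[0, 1, 2, 3, 4, 5]",
]

# With no delimiter given, genfromtxt splits on runs of whitespace,
# so each cleaned line parses as 12 integers.
both = np.genfromtxt(list(flatten_brackets(sample)), dtype=int)

X = both[:, :6]   # first six columns
Y = both[:, 6:]   # the six values that were inside the brackets
```

This only works because every bracketed list has the same length; ragged lists would need a different approach.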

genfromtxt also accepts converters, but I found in other questions that a converter can't change one field into multiple ones.

Also, genfromtxt iterates over the file lines, decodes each one, and collects the values in a list, so it has no speed advantage over a custom reader.

Upvotes: 2

Vlas Sokolov

Reputation: 3893

If you use pandas, there's an (arguably) simpler approach to read the data in. First, a pandas.DataFrame instance can be constructed, and we can apply a custom function to its last column to convert string elements to np.ndarray types:

import pandas as pd
import numpy as np

df = pd.read_table('dataset/demo_dataset.csv', delimiter='\t', names=list('abcdefg'))
convert = lambda a: np.fromstring(a[1:-1], count = a.count(',') + 1, sep = ', ', dtype=int)
df.g = df.g.apply(convert)

Once a mixed dataframe is constructed, X and Y can be extracted as arrays in a simple fashion:

X = df.values[:, :-1].astype(int)
Y = np.vstack(df.values[:, -1])

Upvotes: 1

stelios

Reputation: 2845

I managed to achieve this only via a custom method:

import numpy

with open('dataset/demo_dataset.csv', 'r') as fin:
    lines = fin.readlines()
    # remove '\n' characters
    clean_lines = [l.strip('\n') for l in lines]
    # split on tab so that we get lists from strings
    A = [cl.split('\t') for cl in clean_lines]
    # get lists of ints instead of lists of strings
    X = [list(map(int, row[0:6])) for row in A]
    # last column in Y
    Y = [row[6] for row in A]

    # convert string to int values
    for i in range(len(Y)):
        Y[i] = list(map(int, Y[i].strip('[]').split(',')))
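
A variant of the same custom reader, sketched with only the standard library: csv.reader does the tab splitting and ast.literal_eval turns the bracketed text into a real Python list. This is an alternative to the strip/split parsing above, not something from the original answer, and the io.StringIO sample stands in for the opened file.

```python
import ast
import csv
import io

# Stand-in for open('dataset/demo_dataset.csv', newline='');
# two rows copied from the question.
sample = io.StringIO(
    "1\t2\t3\t4\t5\t6\t[0, 1, 2, 3, 4, 5]\n"
    "3\t1\t2\t6\t4\t5\t[2, 0, 1, 5, 3, 4]\n"
)

X, Y = [], []
for row in csv.reader(sample, delimiter='\t'):
    # first six fields are plain integers
    X.append([int(v) for v in row[:6]])
    # literal_eval safely evaluates the "[0, 1, ...]" text into a list of ints
    Y.append(ast.literal_eval(row[6].strip()))
```

Passing the results to numpy.array(X) and numpy.array(Y) then gives two integer arrays with one row per line of input.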

Upvotes: 3
