user1966176
user1966176

Reputation: 311

Load text file as strings using numpy.loadtxt()

I would like to load a big text file (around 1 GB with 3*10^6 rows and 10 - 100 columns) as a 2D np-array containing strings. However, it seems like numpy.loadtxt() only takes floats as default. Is it possible to specify another data type for the entire array? I've tried the following without luck:

loadedData = np.loadtxt(address, dtype=np.str)

I get the following error message:

/Library/Python/2.7/site-packages/numpy-1.8.0.dev_20224ea_20121123-py2.7-macosx-10.8-x86_64.egg/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: cannot set an array element with a sequence

Any ideas? (I don't know the exact number of columns in my file on beforehand.)

Upvotes: 30

Views: 162627

Answers (4)

Peng Zheng
Peng Zheng

Reputation: 77

np.loadtxt(file_path, dtype=str) enter image description here

Upvotes: 3

There is also read_csv in Pandas, which is fast and supports non-comma column separators and automatic typing by column:

import pandas as pd
df = pd.read_csv('your_file',sep='\t')

It can be converted to a NumPy array if you prefer that type with:

import numpy as np
arr = np.array(df)

This is by far the easiest and most mature text import approach I've come across.

Upvotes: 17

flonk
flonk

Reputation: 3876

Is it essential that you need a NumPy array? Otherwise you could speed things up by loading the data as a nested list.

def load(fname):
    ''' Load the file using std open'''
    f = open(fname,'r')

    data = []
    for line in f.readlines():
        data.append(line.replace('\n','').split(' '))

    f.close()

    return data

For a text file with 4000x4000 words this is about 10 times faster than loadtxt.

Upvotes: 2

Hooked
Hooked

Reputation: 88118

Use genfromtxt instead. It's a much more general method than loadtxt:

import numpy as np
print np.genfromtxt('col.txt',dtype='str')

Using the file col.txt:

foo bar
cat dog
man wine

This gives:

[['foo' 'bar']
 ['cat' 'dog']
 ['man' 'wine']]

If you expect that each row has the same number of columns, read the first row and set the attribute filling_values to fix any missing rows.

Upvotes: 60

Related Questions