Reputation: 3484
I'm trying to read in a file that looks like this:
1, 2,
3, 4,
I'm using the following line:
l1,l2 = numpy.loadtxt('file.txt',unpack=True,delimiter=', ')
This gives me an error because the end comma in each row is lumped together as the last element (e.g. "2" is read as "2,"). Is there a way to ignore the last comma in each row, with loadtxt or another function?
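To illustrate why the trailing comma gets attached to the last value, here is a minimal sketch using plain string splitting. It only mimics the idea of splitting a row on a delimiter; it is not a reproduction of loadtxt's internals, and the variable names are made up for illustration.

```python
row = "1, 2,"

# Splitting on the two-character delimiter ", " leaves the final comma
# attached to the last field, because no space follows it.
fields_multi = row.split(", ")

# Splitting on "," alone separates every field, at the cost of an
# empty trailing field that still has to be ignored.
fields_single = [field.strip() for field in row.split(",")]

print(fields_multi)
print(fields_single)
```

A string such as '2,' cannot be converted to a float, which matches the error described above.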
Upvotes: 7
Views: 8477
Reputation: 121
I wanted a solution that:
I went with using numpy.genfromtxt instead, overwriting its delimiting behaviour to ignore the last element if it's empty:
import numpy as np
from numpy.lib import npyio

def _cutoff_last(func, *args, **kwargs) -> list:
    # call the original splitter, then drop a trailing empty field
    line = func(*args, **kwargs)
    if line and line[-1] == '':
        line = line[:-1]
    return line

if __name__ == '__main__':
    # overwrite delimiting behavior
    _delim_splitter_original = npyio.LineSplitter._delimited_splitter
    npyio.LineSplitter._delimited_splitter = lambda *args: _cutoff_last(_delim_splitter_original, *args)
    mat = np.genfromtxt('mat.txt', delimiter=',')
This probably should not be used in large code bases, since it patches a private numpy method and so changes the behaviour of every genfromtxt call in the process, but it is perfect for many smaller use cases.
Upvotes: 1
Reputation: 172
Depending on your needs this solution might be overkill, but when working with large sets of data files from external sources (especially Excel, but also binary, CSV, TSV, or others) I have found the pandas module to be a very convenient and efficient way to read and process data.
Given a data file test-data.txt with the following content
1, 2,
2, 3,
4, 5,
you can read the file by using
import pandas as pd
data = pd.read_csv("test-data.txt", names=("col1", "col2"), usecols=(0, 1))
In [25]: data
Out[25]:
   col1  col2
0     1     2
1     2     3
2     4     5
In [26]: data.col1
Out[26]:
0    1
1    2
2    4
The result is a DataFrame object with indexed rows and column labels that can be used for data access. If your data file contains a header, it is used directly to label the columns. Otherwise you can specify the label for each column with the names argument. The usecols argument skips the third column, which would otherwise be read as a column of NaN values.
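To make the effect of usecols visible, the following sketch reads the same content twice from an in-memory string instead of a file. The io.StringIO stand-in and the names sample and raw are assumptions for illustration, not part of the original answer.

```python
import io

import pandas as pd

sample = "1, 2,\n2, 3,\n4, 5,\n"

# Without usecols: the trailing comma produces an extra, empty third column.
raw = pd.read_csv(io.StringIO(sample), header=None)

# With usecols: only the first two fields of each row are kept.
data = pd.read_csv(io.StringIO(sample), names=["col1", "col2"], usecols=[0, 1])

print(raw)
print(data)
```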
Upvotes: 2
Reputation: 231385
usecols also works with loadtxt:
Simulate a file with text split into lines:
In [162]: txt=b"""1, 2,
3,4,"""
In [163]: txt=txt.splitlines()
In [164]: txt
Out[164]: [b'1, 2,', b'3,4,']
In [165]: x,y=np.loadtxt(txt,delimiter=',',usecols=[0,1],unpack=True)
In [166]: x
Out[166]: array([ 1., 3.])
In [167]: y
Out[167]: array([ 2., 4.])
loadtxt and genfromtxt don't work well with multicharacter delimiters.
loadtxt and genfromtxt accept any iterable, including a generator. Thus you could open the file and process the lines one by one, removing the extra character.
In [180]: def g(txt):
   .....:     t = txt.splitlines()
   .....:     for l in t:
   .....:         yield l[:-1]
In [181]: list(g(txt))
Out[181]: [b'1, 2', b'3,4']
g is a generator that yields the lines one by one, each stripped of its last character. It could be changed to read a file line by line. The generator can be passed straight to loadtxt:
In [182]: x,y=np.loadtxt(g(txt),delimiter=',',unpack=True)
In [183]: x,y
Out[183]: (array([ 1., 3.]), array([ 2., 4.]))
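A sketch of the file-based variant mentioned above might look like this. The helper name strip_trailing_delimiter and the temporary sample file are assumptions made so the example is self-contained; unlike g, it only removes the last character when it really is the delimiter.

```python
import os
import tempfile

import numpy as np


def strip_trailing_delimiter(lines, delimiter=","):
    """Yield each non-empty line without its trailing delimiter (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line.endswith(delimiter):
            line = line[: -len(delimiter)]
        if line:
            yield line


# Write a small sample file so the sketch can run on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1, 2,\n3, 4,\n")
    path = tmp.name

try:
    # The file must stay open while loadtxt consumes the generator.
    with open(path) as f:
        x, y = np.loadtxt(strip_trailing_delimiter(f), delimiter=",", unpack=True)
finally:
    os.remove(path)

print(x, y)
```

Because the lines are cleaned lazily, the whole file never has to be held in memory as text.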
Upvotes: 3
Reputation: 114811
numpy.genfromtxt is a bit more robust. If you use the default dtype (which is np.float64), it thinks there is a third column with missing values, so it creates a third column containing nan. If you give it dtype=None (which tells it to figure out the data type from the file), it returns a third column containing all zeros. Either way, you can ignore the last column by using usecols=[0, 1]:
In [14]: !cat trailing_comma.csv
1, 2,
3, 4,
Important note: I use delimiter=',', not delimiter=', '.
In [15]: np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1])
Out[15]:
array([[1, 2],
[3, 4]])
In [16]: col1, col2 = np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1], unpack=True)
In [17]: col1
Out[17]: array([1, 3])
In [18]: col2
Out[18]: array([2, 4])
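The default-dtype case described above can be checked with a short sketch. It reads from an in-memory io.StringIO object rather than trailing_comma.csv, which is an assumption made only to keep the example self-contained.

```python
import io

import numpy as np

sample = "1, 2,\n3, 4,\n"

# Default dtype (float): the empty field after the last comma becomes nan.
full = np.genfromtxt(io.StringIO(sample), delimiter=",")

# usecols keeps only the first two columns, so the nan column never appears.
trimmed = np.genfromtxt(io.StringIO(sample), delimiter=",", usecols=[0, 1])

print(full)
print(trimmed)
```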
Upvotes: 8
Reputation: 15349
It's fairly easy to roll your own file reader in Python, rather than having to rely on the constraints of numpy.loadtxt:
with open(filename, 'rt') as f:
    content = [[float(x) for x in row.split(',') if x.strip()] for row in f]
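For readers who want the columns unpacked the way loadtxt's unpack=True does, here is a slightly expanded sketch of the same idea. The function name read_rows and the in-memory sample are hypothetical, added only for illustration.

```python
import io


def read_rows(lines, delimiter=","):
    """Parse delimited rows of floats, skipping empty fields (hypothetical helper)."""
    rows = []
    for line in lines:
        fields = [field.strip() for field in line.split(delimiter)]
        values = [float(field) for field in fields if field]
        if values:  # ignore blank lines
            rows.append(values)
    return rows


sample = io.StringIO("1, 2,\n3, 4,\n")
content = read_rows(sample)

# Transpose the rows to get one list per column.
col1, col2 = (list(col) for col in zip(*content))

print(content)
print(col1, col2)
```

Note that skipping empty fields also drops empty fields in the middle of a row, so this approach only suits files where the trailing delimiter is the sole source of empty fields.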
Upvotes: 0