ylangylang

Reputation: 3484

Python - numpy.loadtxt how to ignore end commas?

I'm trying to read in a file that looks like this:

1, 2,
3, 4,

I'm using the following line:

l1,l2 = numpy.loadtxt('file.txt',unpack=True,delimiter=', ')

This gives me an error because the trailing comma in each row gets lumped into the last element (e.g. "2" is read as "2,"). Is there a way to ignore the last comma in each row, with loadtxt or another function?

Upvotes: 7

Views: 8477

Answers (5)

phil

Reputation: 121

I wanted a solution that:

  1. doesn't require manually specifying columns (as in the accepted answer)
  2. doesn't use different packages (e.g. pandas)
  3. doesn't require preprocessing
  4. dynamically works for inputs with and without trailing commas (without specifying)

I went with using numpy.genfromtxt instead, overwriting its delimiting behaviour to ignore the last element if it's empty:

import numpy as np
from numpy.lib import npyio


def _cutoff_last(func, *args, **kwargs) -> list:
    line = func(*args, **kwargs)
    if line and line[-1] == '':
        line = line[:-1]
    return line


if __name__ == '__main__':
    # overwrite delimiting behavior
    _delim_splitter_original = npyio.LineSplitter._delimited_splitter
    npyio.LineSplitter._delimited_splitter = lambda *args: _cutoff_last(_delim_splitter_original, *args)

    mat = np.genfromtxt('mat.txt', delimiter=',')

This probably should not be used in large code bases (since it changes NumPy's behaviour globally), but it is perfect for many one-off use cases.

Upvotes: 1

MrCyclophil

Reputation: 172

Depending on your needs, this solution might be overkill. But when working with large sets of data files from external sources (especially Excel, but also binary, CSV, TSV, and others), I have found the pandas module to be a very convenient and efficient way to read and process data.

Given a data file test-data.txt having the following content

1, 2,
2, 3,
4, 5,

you can read the file by using

import pandas as pd
data = pd.read_csv("test-data.txt", names=("col1", "col2"), usecols=(0, 1))

In [25]: data
Out[25]: 
   col1  col2
0     1     2
1     2     3
2     4     5

In [26]: data.col1
Out[26]: 
0    1
1    2
2    4

The result is a DataFrame object with indexed rows and labeled columns that can be used for data access. If your data file contains a header, it is used directly to label the columns; otherwise you can specify a label for each column with the names argument. The usecols argument lets you skip the third column, which would otherwise be read as a column of nan values.
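
If you need plain NumPy arrays like the ones loadtxt's unpack=True returns, the DataFrame columns convert directly. A minimal sketch (using io.StringIO as an inline stand-in for test-data.txt):

```python
import io

import pandas as pd

# inline stand-in for the test-data.txt file shown above
text = io.StringIO("1, 2,\n2, 3,\n4, 5,\n")

data = pd.read_csv(text, names=("col1", "col2"), usecols=(0, 1))

# each column converts to a plain one-dimensional NumPy array
l1 = data["col1"].to_numpy()
l2 = data["col2"].to_numpy()
```

to_numpy() hands back the column as an ordinary ndarray, so the result can be used exactly like loadtxt output.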

Upvotes: 2

hpaulj

Reputation: 231385

usecols also works with loadtxt:

Simulate a file with text split into lines:

In [162]: txt=b"""1, 2,
3,4,"""
In [163]: txt=txt.splitlines()
In [164]: txt
Out[164]: [b'1, 2,', b'3,4,']

In [165]: x,y=np.loadtxt(txt,delimiter=',',usecols=[0,1],unpack=True)
In [166]: x
Out[166]: array([ 1.,  3.])
In [167]: y
Out[167]: array([ 2.,  4.])

loadtxt and genfromtxt don't work well with multicharacter delimiters.

loadtxt and genfromtxt accept any iterable, including a generator. Thus you could open the file and process the lines one by one, removing the extra character.

In [180]: def g(txt):
   .....:     t = txt.splitlines()
   .....:     for l in t:
   .....:         yield l[:-1]

In [181]: list(g(txt))
Out[181]: [b'1, 2', b'3,4']

The generator yields the lines one by one, each stripped of its last character. The same approach could be used to read a file line by line:

In [182]: x,y=np.loadtxt(g(txt),delimiter=',',unpack=True)
In [183]: x,y
Out[183]: (array([ 1.,  3.]), array([ 2.,  4.]))
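
One caveat: l[:-1] drops the last character unconditionally, so a line without a trailing comma would lose a digit. A slightly more defensive file-based sketch (strip_trailing_comma is my own name, not part of NumPy) removes a comma only when one is present:

```python
import numpy as np

def strip_trailing_comma(path):
    # yield each line with trailing whitespace and any trailing comma removed,
    # so lines with and without the extra comma both parse cleanly
    with open(path) as f:
        for line in f:
            yield line.rstrip().rstrip(',')

# throwaway copy of the file from the question
with open('file.txt', 'w') as f:
    f.write("1, 2,\n3, 4,\n")

l1, l2 = np.loadtxt(strip_trailing_comma('file.txt'), delimiter=',', unpack=True)
```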

Upvotes: 3

Warren Weckesser

Reputation: 114811

numpy.genfromtxt is a bit more robust. If you use the default dtype (which is np.float64), it thinks there is a third column with missing values, so it creates a third column containing nan. If you give it dtype=None (which tells it to figure out the data type from the file), it returns a third column containing all zeros. Either way, you can ignore the last column by using usecols=[0, 1]:

In [14]: !cat trailing_comma.csv
1, 2,
3, 4,

Important note: I use delimiter=',', not delimiter=', '.

In [15]: np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1])
Out[15]: 
array([[1, 2],
       [3, 4]])

In [16]: col1, col2 = np.genfromtxt('trailing_comma.csv', delimiter=',', dtype=None, usecols=[0,1], unpack=True)

In [17]: col1
Out[17]: array([1, 3])

In [18]: col2
Out[18]: array([2, 4])

Upvotes: 8

jez

Reputation: 15349

It's fairly easy to roll your own file-reader in Python, rather than having to rely on the constraints of numpy.loadtxt:

content = [[float(x) for x in row.split(',') if x.strip()] for row in open(filename, 'rt')]
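
Expanded into a few lines, the same approach can close the file and unpack the two columns the question asks for. A sketch under the same assumptions (writing a throwaway file.txt with the question's data):

```python
# throwaway copy of the file from the question
with open('file.txt', 'w') as f:
    f.write("1, 2,\n3, 4,\n")

# the empty field after the trailing comma fails x.strip(), so it is dropped
with open('file.txt') as f:
    content = [[float(x) for x in row.split(',') if x.strip()] for row in f]

# transpose the rows into the two columns
l1, l2 = (list(col) for col in zip(*content))
```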

Upvotes: 0
