user2186862
user2186862

Reputation: 223

pandas read_csv ignore separator in last column

I have a file with the following structure (first row is the header, filename is test.dat):

ID_OBS          LAT     LON     ALT TP TO LT_min LT_max STATIONNAME
ALT_NOA_000   82.45  -62.52   210.0 FM  0    0.0   24.0 Alert, Nunavut, Canada

How do I instruct pandas to read the entire station name (in this example, Alert, Nunavut, Canada) as a single element? I use delim_whitespace=True in my code, but that does not work, since the station name contains whitespace characters.

Running:

import pandas as pd
test = pd.read_csv('./test.dat', delim_whitespace=True, header=1)
print(test.to_string())

Produces:

                   ID_OBS    LAT LON  ALT   TP    TO  LT_min    LT_max STATIONNAME
ALT_NOA_000 82.45  -62.52  210.0  FM    0  0.0  24.0  Alert,  Nunavut,      Canada

Quickly reading through the tutorials did not help. What am I missing here?

Upvotes: 4

Views: 3896

Answers (2)

Ami Tavory
Ami Tavory

Reputation: 76297

Your pasted sample file is a bit ambiguous: it's not possible to tell by eye if something that looks like a few spaces is a tab or not, for example.

In general, though, note that plain old Python is more expressive than Pandas, or CSV modules (Pandas's strength is elseswhere). E.g., there are even Python modules for recursive-descent parsers, which Pandas obviously lacks. You can use regular Python to manipulate the file into an easier form for Pandas to parse. For example:

import re
>>> ['@'.join(re.split(r'[ \t]+', l.strip(), maxsplit=8)) for l in open('stuff.tsv') if l.strip()]
['ID_OBS@LAT@LON@ALT@TP@TO@LT_min@LT_max@STATIONNAME',
 '[email protected]@[email protected]@FM@[email protected]@24.0@Alert, Nunavut, Canada']

changes the delimiter to '@', which, if you write back to a file, for example, you can parse using delimiter='@'.

Upvotes: 1

Stephen Rauch
Stephen Rauch

Reputation: 49784

I often approach these by writing my own little parser. In general there are ways to bend pandas to your will, but I find this way is often easier:

Code:

import re

def parse_my_file(filename):
    with open(filename) as f:
        for line in f:
            yield re.split(r'\s+', line.strip(), 8)

# build the generator        
my_parser = parse_my_file('test.dat')

# first element returned is the columns
columns = next(my_parser)

# build the data frame
df = pd.DataFrame(my_parser, columns=columns)
print(df)

Results:

        ID_OBS    LAT     LON    ALT  TP TO LT_min LT_max  \
0  ALT_NOA_000  82.45  -62.52  210.0  FM  0    0.0   24.0   

              STATIONNAME  
0  Alert, Nunavut, Canada 

Upvotes: 3

Related Questions