Reputation: 223
I have a file with the following structure (first row is the header, filename is test.dat
):
ID_OBS LAT LON ALT TP TO LT_min LT_max STATIONNAME
ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0 Alert, Nunavut, Canada
How do I instruct pandas to read the entire station name (in this example, Alert, Nunavut, Canada) as a single element? I use delim_whitespace=True
in my code, but that does not work, since the station name contains whitespace characters.
Running:
import pandas as pd
test = pd.read_csv('./test.dat', delim_whitespace=True, header=1)
print(test.to_string())
Produces:
ID_OBS LAT LON ALT TP TO LT_min LT_max STATIONNAME
ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0 Alert, Nunavut, Canada
Quickly reading through the tutorials did not help. What am I missing here?
Upvotes: 4
Views: 3896
Reputation: 76297
Your pasted sample file is a bit ambiguous: it's not possible to tell by eye if something that looks like a few spaces is a tab or not, for example.
In general, though, note that plain old Python is more expressive than Pandas, or CSV modules (Pandas's strength is elseswhere). E.g., there are even Python modules for recursive-descent parsers, which Pandas obviously lacks. You can use regular Python to manipulate the file into an easier form for Pandas to parse. For example:
import re
>>> ['@'.join(re.split(r'[ \t]+', l.strip(), maxsplit=8)) for l in open('stuff.tsv') if l.strip()]
['ID_OBS@LAT@LON@ALT@TP@TO@LT_min@LT_max@STATIONNAME',
'[email protected]@[email protected]@FM@[email protected]@24.0@Alert, Nunavut, Canada']
changes the delimiter to '@'
, which, if you write back to a file, for example, you can parse using delimiter='@'
.
Upvotes: 1
Reputation: 49784
I often approach these by writing my own little parser. In general there are ways to bend pandas to your will, but I find this way is often easier:
Code:
import re
def parse_my_file(filename):
with open(filename) as f:
for line in f:
yield re.split(r'\s+', line.strip(), 8)
# build the generator
my_parser = parse_my_file('test.dat')
# first element returned is the columns
columns = next(my_parser)
# build the data frame
df = pd.DataFrame(my_parser, columns=columns)
print(df)
Results:
ID_OBS LAT LON ALT TP TO LT_min LT_max \
0 ALT_NOA_000 82.45 -62.52 210.0 FM 0 0.0 24.0
STATIONNAME
0 Alert, Nunavut, Canada
Upvotes: 3