Reputation: 586
I'm trying to parse a long .txt
text table to a pd.DataFrame
, or any other readable form in Python. I tried successfully a way to do it, but I'm not completely happy with it and I do want to improve it.
This is a subset of my .txt
:
USAF WBAN STATION NAME CTRY ST CALL LAT LON ELEV(M) BEGIN END
007018 99999 WXPOD 7018 +00.000 +000.000 +7018.0 20110309 20130730
007026 99999 WXPOD 7026 AF +00.000 +000.000 +7026.0 20120713 20170822
007070 99999 WXPOD 7070 AF +00.000 +000.000 +7070.0 20140923 20150926
008260 99999 WXPOD8270 +00.000 +000.000 +0000.0 19960101 20100731
008268 99999 WXPOD8278 AF +32.950 +065.567 +1156.7 20100519 20120323
008307 99999 WXPOD 8318 AF +00.000 +000.000 +8318.0 20100421 20100421
008411 99999 XM20 20160217 20160217
008414 99999 XM18 20160216 20160217
008415 99999 XM21 20160217 20160217
008418 99999 XM24 20160217 20160217
010000 99999 BOGUS NORWAY NO ENRS 20010927 20041019
010010 99999 JAN MAYEN(NOR-NAVY) NO ENJA +70.933 -008.667 +0009.0 19310101 20190203
I tried the following:
Test 1: Use re
to parse the spaces:
with open('test.txt') as f:
lines = f.readlines()
parsed_lines = [re.split("\s+", line) for line in lines]
print(test)
['007018',
'99999',
'WXPOD',
'7018',
'+00.000',
'+000.000',
'+7018.0',
'20110309',
'20130730',
'']
Which is ok, but far from optimal, the station name is splitted, into another list element, which only shows my bad regex
abilities.
Test 2: Use string positions:
with open('/Users/ivan/weather_isd_noaa/data/isd-history.txt') as f:
lines = f.readlines()
colum_names = lines[0]
usaf_code = []
wban = []
station_name = []
country = []
us_state = []
call = []
lat = []
lon = []
elevation = []
begin = []
end = []
for line in lines:
usaf_code.append(line[:6])
wban.append(line[7:12])
station_name.append(line[13:43])
country.append(line[43:45])
us_state.append(line[46:50])
call.append(line[51:55])
call.append(line[57:64])
lat.append(line[57:64])
lon.append(line[56:73])
elevation.append(line[74:81])
begin.append(line[82:90])
end.append(line[91:100])
df = pd.DataFrame({
'usaf_code': usaf_code,
'wban': wban,
'station_name': station_name,
'country': country,
'us_state': us_state,
'lat': lat,
'lon': lon,
'elevation': elevation,
'begin': begin,
'end': end
})
Which yields a nice pd.DataFrame
that is fairly easy to clean. This is a good approach, but I am sure is not an excellent one, there is some way to improve this function? I want this code to be flawless and I do not trust those hardcoded positions.
I know that some tools like sed
or awk
are extremely useful, but for now I can only run python on this; sure subprocess
can do the trick, but I want to rely on Python for this.
Upvotes: 1
Views: 75
Reputation: 10860
Imo, this definitely looks like a fixed width format, i.e. every column has its own, fixed number of characters.
So, iiuc what your problem is, you should try pd.read_fwf()
.
https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_fwf.html
Upvotes: 3