topcat
topcat

Reputation: 586

Parse text file to python with unquoted strings

I'm trying to parse a long .txt text table to a pd.DataFrame, or any other readable form in Python. I tried successfully a way to do it, but I'm not completely happy with it and I do want to improve it.

This is a subset of my .txt:

USAF   WBAN  STATION NAME                  CTRY ST CALL  LAT     LON      ELEV(M) BEGIN    END

007018 99999 WXPOD 7018                                  +00.000 +000.000 +7018.0 20110309 20130730
007026 99999 WXPOD 7026                    AF            +00.000 +000.000 +7026.0 20120713 20170822
007070 99999 WXPOD 7070                    AF            +00.000 +000.000 +7070.0 20140923 20150926
008260 99999 WXPOD8270                                   +00.000 +000.000 +0000.0 19960101 20100731
008268 99999 WXPOD8278                     AF            +32.950 +065.567 +1156.7 20100519 20120323
008307 99999 WXPOD 8318                    AF            +00.000 +000.000 +8318.0 20100421 20100421
008411 99999 XM20                                                                 20160217 20160217
008414 99999 XM18                                                                 20160216 20160217
008415 99999 XM21                                                                 20160217 20160217
008418 99999 XM24                                                                 20160217 20160217
010000 99999 BOGUS NORWAY                  NO      ENRS                           20010927 20041019
010010 99999 JAN MAYEN(NOR-NAVY)           NO      ENJA  +70.933 -008.667 +0009.0 19310101 20190203

I tried the following:

Test 1: Use re to parse the spaces:

with open('test.txt') as f:
   lines = f.readlines()
   parsed_lines = [re.split("\s+", line) for line in lines]
print(test)

['007018',
 '99999',
 'WXPOD',
 '7018',
 '+00.000',
 '+000.000',
 '+7018.0',
 '20110309',
 '20130730',
 '']

Which is ok, but far from optimal, the station name is splitted, into another list element, which only shows my bad regex abilities.

Test 2: Use string positions:

with open('/Users/ivan/weather_isd_noaa/data/isd-history.txt') as f:
    lines = f.readlines()

    colum_names = lines[0]
    usaf_code = []
    wban = []
    station_name = []
    country = []
    us_state = []
    call = []
    lat = []
    lon = []
    elevation = []
    begin = []
    end = []

    for line in lines:
        usaf_code.append(line[:6])
        wban.append(line[7:12])
        station_name.append(line[13:43])
        country.append(line[43:45])
        us_state.append(line[46:50])
        call.append(line[51:55])
        call.append(line[57:64])
        lat.append(line[57:64])
        lon.append(line[56:73])
        elevation.append(line[74:81])
        begin.append(line[82:90])
        end.append(line[91:100])

    df = pd.DataFrame({
        'usaf_code': usaf_code,
        'wban': wban,
        'station_name': station_name,
        'country': country,
        'us_state': us_state,
        'lat': lat,
        'lon': lon,
        'elevation': elevation,
        'begin': begin,
        'end': end
    })

Which yields a nice pd.DataFrame that is fairly easy to clean. This is a good approach, but I am sure is not an excellent one, there is some way to improve this function? I want this code to be flawless and I do not trust those hardcoded positions.

I know that some tools like sed or awk are extremely useful, but for now I can only run python on this; sure subprocess can do the trick, but I want to rely on Python for this.

Upvotes: 1

Views: 75

Answers (1)

SpghttCd
SpghttCd

Reputation: 10860

Imo, this definitely looks like a fixed width format, i.e. every column has its own, fixed number of characters.

So, iiuc what your problem is, you should try pd.read_fwf().
https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_fwf.html

Upvotes: 3

Related Questions