Monty

Reputation: 1311

Finding text in a string matching patterns

I have a text/CSV file that contains, among others, rows that look like this:

05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S

There are other rows containing data that I am not after.

I only want to extract the timestamp and the lat/long.

The only constants in the rows I am interested in are the timestamp at the beginning, which is always 8 characters long followed by AM or PM, and the Lat/Long section, which starts with the word "Long" and ends in an "S".

Is there any way I can run through this file, strip out only these two pieces of text, and concatenate them into a new row, ignoring all rows that do not have both the timestamp as the first entry AND the Lat/Long part at the end? (Some rows have a timestamp at the beginning but no lat/long.)

Upvotes: 0

Views: 68

Answers (3)

hochl

Reputation: 12960

I do not recommend regular expressions for data in true CSV format: they are the wrong tool for CSV and the result will not be pretty. But your data does not look like true CSV, so parsing it with a regular expression might be an option. This code works for the sample you provided:

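import re

# Timestamp at the start of the line, then the Long (E/W) and Lat (N/S) parts.
pattern = re.compile(r"(\d+):(\d+):(\d+)([AP]M).*Long\s+([^EW]+[EW]).*Lat\s+([^NS]+[NS])")

with open('inputfilename') as f:
    for line in f:
        mat = pattern.match(line)
        if mat is not None:
            print(mat.groups())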

result:

('05', '21', '20', 'PM', '38 12 40.6 E', '29 2 47.2 S')

Further processing of this result is left as an exercise, but it could look like this:

hour, minute, second, am_pm, long, lat = mat.groups()
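As one possible take on that exercise, here is a minimal sketch that joins the timestamp and the lat/long into a new row and skips every line that lacks either part. The names `ROW_RE` and `extract_rows` are made up for illustration, and the output layout ("timestamp Long ... Lat ...") is an assumption about what the new row should look like.

```python
import re

# Assumed pattern: timestamp at the line start, then "Long ... E/W" and "Lat ... N/S".
ROW_RE = re.compile(
    r"(\d+:\d+:\d+[AP]M).*Long\s+([^EW]+[EW]).*Lat\s+([^NS]+[NS])"
)


def extract_rows(lines):
    """Yield 'timestamp Long ... Lat ...' for each line that has both parts."""
    for line in lines:
        mat = ROW_RE.match(line)
        if mat is None:
            continue  # no timestamp at the start, or no Long/Lat section
        timestamp, longitude, latitude = mat.groups()
        yield "{} Long {} Lat {}".format(timestamp, longitude, latitude)
```

Because `extract_rows` accepts any iterable of lines, an open file object can be passed to it directly and the results written out one per line.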

Upvotes: 1

ceremcem

Reputation: 4360

>>> s = "05:21:20PM   Driving 46 84.0         Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S"
>>> date = s.split(" ")[0]
>>> date
'05:21:20PM'
>>> long_start = "Long"
>>> lat_start = "Lat"
>>> longitude = s[s.find(long_start) + len(long_start): s.find(lat_start)]
>>> longitude
' 38 12 40.6 E '
>>> latitude = s[s.find(lat_start) + len(lat_start):]
>>> latitude
' 29 2 47.2 S'
>>> latitude = s[s.find(lat_start) + len(lat_start):].strip()
>>> latitude
'29 2 47.2 S'
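The session above only handles a row that is known to be good. A rough sketch of the same `find`-based idea wrapped in a function that returns `None` for rows missing the timestamp or the Long/Lat part could look like this. The name `split_row` is hypothetical, and the checks assume the timestamp is the first whitespace-separated field and that "Lat" follows "Long".

```python
def split_row(s):
    """Return (timestamp, longitude, latitude), or None if a part is missing."""
    fields = s.split()
    # The first field must look like a timestamp ending in AM or PM.
    if not fields or not fields[0].endswith(("AM", "PM")):
        return None
    long_pos = s.find("Long ")
    if long_pos == -1:
        return None
    # Only look for "Lat " after the position where "Long " was found.
    lat_pos = s.find("Lat ", long_pos)
    if lat_pos == -1:
        return None
    longitude = s[long_pos + len("Long "):lat_pos].strip()
    latitude = s[lat_pos + len("Lat "):].strip()
    return fields[0], longitude, latitude
```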

Upvotes: 0

Martijn Pieters

Reputation: 1124768

Use the csv module to parse out the rows, then split the last column on ; to get the lat/long coordinates:

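import csv

# newline='' lets the csv module handle line endings itself (Python 3).
with open(inputfilename, newline='') as inputfh:
    reader = csv.reader(inputfh, delimiter='\t')
    for row in reader:
        timestamp = row[0]
        # The lat/long entry is the text after the last ';' in the third column.
        lat_long = row[2].rpartition(';')[-1].strip()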

This assumes that the file is tab-separated and that the latitude/longitude entry is always the last semicolon-separated value in the third column.
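To also skip the rows the question is not after and write the two values out as a new row, a sketch along these lines might work under the same assumptions (tab-separated input, lat/long after the last semicolon of the third column). The function name `filter_rows` and the two-column tab-separated output are illustrative choices, not something the question specifies.

```python
import csv


def filter_rows(infile, outfile):
    """Copy (timestamp, lat/long) pairs from infile to outfile, skipping other rows."""
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    for row in reader:
        if len(row) < 3:
            continue  # too few columns to hold a lat/long entry
        timestamp = row[0]
        lat_long = row[2].rpartition(";")[-1].strip()
        # Keep only rows with an AM/PM timestamp and a "Long ..." entry.
        if timestamp.endswith(("AM", "PM")) and lat_long.startswith("Long"):
            writer.writerow([timestamp, lat_long])
```

Both arguments are open file objects, so the function can be exercised with `io.StringIO` buffers before pointing it at real files.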

Upvotes: 1
