Reputation: 1311
I have a text/csv file that contains, amongst others, rows that look like this:
05:21:20PM Driving 46 84.0 Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S
There are other rows containing data that I am not after.
I am only looking to extract the timestamp and the Lat/Long.
The only constants in the rows I am interested in are the timestamp at the beginning, which is always 8 characters long followed by PM or AM, and the Lat/Long part, which starts with the word "Long" and ends in an "S".
Is there any way to run through this file, strip out only these two pieces of text, and concatenate them into a new row, ignoring all rows that do not have both the timestamp as the first entry AND the Lat/Long part at the end? (Some rows have a timestamp at the beginning but no Lat/Long.)
Upvotes: 0
Views: 68
Reputation: 12960
I do not recommend using regular expressions if your data is in CSV format because this is not going to be pretty and regular expressions are the wrong tool for CSV. But because your data does not look like a true CSV format, parsing it using regular expressions might be an option and this code would work for the sample you have provided:
import re

# Timestamp at the start of the line, then the Long and Lat fields
pattern = re.compile(r"(\d+):(\d+):(\d+)([AP]M).*Long\s+([^EW]+[EW]).*Lat\s+([^NS]+[NS])")

with open('inputfilename') as f:
    for line in f:
        mat = pattern.match(line)
        if mat is not None:
            print(mat.groups())
result:
('05', '21', '20', 'PM', '38 12 40.6 E', '29 2 47.2 S')
Further processing of this result is left as an exercise, but it could look like this:
hour, minute, second, am_pm, long, lat = mat.groups()
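To get the single concatenated row asked for in the question, something along these lines might work. It is only a sketch: the name `extract_rows`, the stricter two-digit timestamp pattern, and the extra sample rows are my own assumptions, not taken from your file.

```python
import re

# Assumed layout: hh:mm:ssAM/PM at the start of the line, and a
# "Long ... E/W Lat ... N/S" block at the very end of the line.
ROW_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}[AP]M)\b.*?(Long\s+[^EW]+[EW]\s+Lat\s+[^NS]+[NS])\s*$"
)

def extract_rows(lines):
    """Return 'timestamp Long ... Lat ...' for every line that has both parts."""
    results = []
    for line in lines:
        mat = ROW_RE.match(line)
        if mat is not None:
            timestamp, position = mat.groups()
            results.append(timestamp + " " + position)
    return results

# Hypothetical sample: only the first row has both a timestamp and a Lat/Long.
sample = [
    "05:21:20PM Driving 46 84.0 Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S",
    "05:22:10PM Stopped 0 84.0 Some Road",
    "Some header row",
]
for row in extract_rows(sample):
    print(row)
```

Since `extract_rows` accepts any iterable of strings, an open file object can be passed in directly instead of the sample list.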
Upvotes: 1
Reputation: 4360
>>> s = "05:21:20PM Driving 46 84.0 Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S"
>>> date = s.split(" ")[0]
>>> date
'05:21:20PM'
>>> long_start = "Long"
>>> lat_start = "Lat"
>>> longitude = s[s.find(long_start) + len(long_start): s.find(lat_start)]
>>> longitude
' 38 12 40.6 E '
>>> latitude = s[s.find(lat_start) + len(lat_start):]
>>>
>>> latitude
' 29 2 47.2 S'
>>> latitude = s[s.find(lat_start) + len(lat_start):].strip()
>>> latitude
'29 2 47.2 S'
>>>
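The same steps can be wrapped in a function that also skips rows missing either part. This is a rough sketch; the function name `extract` and the 10-character timestamp check (8 characters plus AM/PM) are assumptions based on the sample row in the question.

```python
def extract(line, long_start="Long", lat_start="Lat"):
    """Return (timestamp, longitude, latitude), or None if the line does not qualify."""
    line = line.strip()
    timestamp = line.split(" ")[0]
    # Assumed format: a 10-character token such as 05:21:20PM at the start.
    if len(timestamp) != 10 or timestamp[-2:] not in ("AM", "PM"):
        return None
    long_pos = line.find(long_start)
    # Only look for "Lat" after "Long", so earlier text cannot match by accident.
    lat_pos = line.find(lat_start, long_pos + 1)
    if long_pos == -1 or lat_pos == -1:
        return None
    longitude = line[long_pos + len(long_start):lat_pos].strip()
    latitude = line[lat_pos + len(lat_start):].strip()
    return timestamp, longitude, latitude

s = "05:21:20PM Driving 46 84.0 Some Road; Some Ext 1; in SomePLace; Long 38 12 40.6 E Lat 29 2 47.2 S"
print(extract(s))
```

Rows for which `extract` returns `None` can simply be ignored when looping over the file.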
Upvotes: 0
Reputation: 1124768
Use the csv module to parse out the rows, then split the last column on ; to get the lat/long coordinates:
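import csv

with open(inputfilename, newline='') as inputfh:  # csv expects newline=''
    reader = csv.reader(inputfh, delimiter='\t')
    for row in reader:
        if len(row) < 3:
            continue  # skip rows without a third column
        timestamp = row[0]
        lat_long = row[2].rpartition(';')[-1].strip()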
This assumes that the file is tab-separated and that the latitude/longitude entry is always the last semicolon-separated value in the 3rd column.
Upvotes: 1