Reputation: 99
I am converting a text file into netCDF format. I am reading the data from the text file into a dataframe in which two of the columns are latitude_GPS and longitude_GPS. The input looks like:
latitude_GPS, longitude_GPS
7537.4536, 3558.4985
7672.1534, 3214.9532
They are measured in ddmm.ssss units, which means that in a value like 7537.4536, '75' is degrees, '37' is minutes and '4536' is seconds. I want to convert them into decimal degrees, except for the missing values, which have a value of 999.0.
My current code looks like this:
import pandas as pd

header_rows = 1
df = pd.read_csv(args.input_file, delim_whitespace=True, skiprows=header_rows, skip_blank_lines=True, names=column_names)
# Count the non-blank data rows in the file
num_rows = sum(1 for line in open(args.input_file) if len(line.strip()) != 0) - header_rows

def lat_lon_gps(col_index):
    return ((int(col_index)/100) + round((int(col_index%100))/60, 4) + round(round(col_index%1, 4)/3600, 4))

check_na = 999.0
i = 0
while i < num_rows:
    if df['latitude_GPS'][i] != check_na:
        df['latitude_GPS'][i] = lat_lon_gps(df['latitude_GPS'][i])
    if df['longitude_GPS'][i] != check_na:
        df['longitude_GPS'][i] = lat_lon_gps(df['longitude_GPS'][i])
    i += 1  # advance to the next row
The return part calculates (75 + 37/60 + 4536/3600). The above code returns what I want, but this part takes around 50 minutes to run for a file with 10000 rows. Is there a faster way to do it? Any thoughts would be appreciated.
Upvotes: 1
Views: 223
Reputation: 1624
The problem is that you are iterating over every row in a Python loop. You should take advantage of the vectorisation provided by pandas and numpy, which applies the arithmetic to a whole column at once.
For example:
import numpy as np
import pandas as pd
df = pd.read_csv(args.input_file,
                 names=['latitude_GPS','longitude_GPS'],
                 skiprows=1)
check_na = 999.0
def lat_lon_gps(coords):
    deg = np.floor(coords / 100)
    minutes = np.floor(((coords / 100) - deg) * 100)
    seconds = (((coords / 100) - deg) * 100 - minutes) * 100
    return deg + minutes / 60 + seconds / 3600
# Exclude NAs
logic = df.latitude_GPS != check_na
df = df[logic]
df.latitude_GPS = lat_lon_gps(df.latitude_GPS)
df.longitude_GPS = lat_lon_gps(df.longitude_GPS)
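Note that the filter above drops every row whose latitude is 999.0 and does not check longitude at all. If you would rather keep the missing rows and leave 999.0 untouched in both columns, something along these lines might work. This is only a sketch: the names `ddmm_to_decimal` and `convert_keep_missing` and the small sample frame are made up for illustration, and I am assuming the same split as above (digits after the decimal point times 100 are seconds). I also used the modulo operator instead of the repeated subtraction, which is my own variation rather than anything from the question.

```python
import numpy as np
import pandas as pd

CHECK_NA = 999.0  # sentinel for missing values, as described in the question


def ddmm_to_decimal(coords):
    """Convert ddmm.ssss values to decimal degrees for a whole column at once.

    Assumption: the fractional part times 100 is the seconds value.
    """
    deg = np.floor(coords / 100)      # leading digits are degrees
    minutes = np.floor(coords % 100)  # two digits before the decimal point
    seconds = (coords % 1) * 100      # fractional part scaled to seconds
    return deg + minutes / 60 + seconds / 3600


def convert_keep_missing(series, sentinel=CHECK_NA):
    """Convert valid entries and leave sentinel entries unchanged."""
    valid = series != sentinel
    # Series.mask replaces values where the condition is True
    return series.mask(valid, ddmm_to_decimal(series))


# Hypothetical sample data with one missing value in each column
df = pd.DataFrame({
    "latitude_GPS": [7537.4536, 999.0, 7672.1534],
    "longitude_GPS": [3558.4985, 3214.9532, 999.0],
})

for col in ["latitude_GPS", "longitude_GPS"]:
    df[col] = convert_keep_missing(df[col])
```

Each column gets its own mask, so a row with a missing latitude still has its longitude converted, and no rows are removed from the dataframe.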
Upvotes: 2