Reputation: 13
I have a database of every major airport's lat/long coords all across the world. I only need a portion of them (specifically in the USA) that are listed in a separate .csv file.
This csv file has two columns I extracted data from into two lists: The origin airport code (IATA code) and the destination airport code (also IATA).
My database has a column for IATA, and essentially I'm trying to query this database to get the respective lat/long coords for each airport in the two lists I have.
Here's my code:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///airport_coordinates.db')
# The dataframe that contains the IATA codes for the airports I need
airport_relpath = "data/processed/%s_%s_combined.csv" % (file, airline)
script_dir = os.path.dirname(os.getcwd())
temp_file = os.path.join(script_dir, airport_relpath)
fields = ["Origin_Airport_Code", "Destination_Airport_Code"]
df_airports = pd.read_csv(temp_file, usecols=fields)
# the origin/destination IATA codes for the airports I need
origin = df_airports.Origin_Airport_Code.values
dest = df_airports.Destination_Airport_Code.values
# query the database for the lat/long coords of the airports I need
sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))
indexcols = ['lat', 'long']
df_origin = pd.read_sql(sql, engine)
# testing the origin coordinates
print(df_origin)
This is the error I'm getting:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such
table: 'JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA' [SQL: "SELECT lat, long
FROM airportCoords WHERE iata IN ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK'
'MIA']"] (Background on this error at: http://sqlalche.me/e/e3q8)
It's definitely because I'm not querying it correctly (since it thinks my queries are supposed to tables).
I tried looping through the list to query each element individually, but the list contains over 604,885 elements and my computer was not able to come up with any output.
Upvotes: 0
Views: 161
Reputation: 1121844
Your error is in using string interpolation:
sql = ('SELECT lat, long FROM airportCoords WHERE iata IN %s' %(origin))
Because origin
is a Numpy array this results in a [....]
SQL identifier syntax in the query; see the SQLite documentation:
If you want to use a keyword as a name, you need to quote it. There are four ways of quoting keywords in SQLite:
[...]
[keyword]
A keyword enclosed in square brackets is an identifier. [...]
You asked SQLite to check if iata
is in a table named ['JFK' 'JFK' 'JFK' ... 'MIA' 'JFK' 'MIA']
because that's the string representation of a Numpy array.
You are already using SQLAlchemy, it would be easier if you used that library to generate all SQL for you, including the IN (....)
membership test:
from sqlalchemy import *
filter = literal_column('iata', String).in_(origin)
sql = select([
literal_column('lat', Float),
literal_column('long', Float),
]).select_from(table('airportCoords')).where(filter)
then pass sql
in as the query.
I used literal_column()
and table()
objects here to shortcut directly to the names of the objects, but you could also ask SQLAlchemy to reflect your database table directly from the engine
object you already created, then use the resulting table definition to generate the query:
metadata = MetaData()
airport_coords = Table('airportCoords', metadata, autoload=True, autoload_with=engine)
at which point the query would be defined as:
filter = airport_coords.c.iata.in_(origin)
sql = select([airport_coords.c.lat, airport_coords.c.long]).where(filter)
I'd also include the iata
code in the output, otherwise you will have no path back to connecting IATA code to the matching coordinates:
sql = select([airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata]).where(filter)
Next, as you say you have 604,885 elements in the list, so you probably want to load that CSV data into a temporary table to keep the query efficient:
engine = create_engine('sqlite:///airport_coordinates.db')
# code to read CSV file
# ...
df_airports = pd.read_csv(temp_file, usecols=fields)
# SQLAlchemy table wrangling
metadata = MetaData()
airport_coords = Table('airportCoords', metadata, autoload=True, autoload_with=engine)
temp = Table(
"airports_temp",
metadata,
*(Column(field, String) for field in fields),
prefixes=['TEMPORARY']
)
with engine.begin() as conn:
# insert CSV values into a temporary table in SQLite
temp.create(conn, checkfirst=True)
df_airports.to_sql(temp.name), engine, if_exists='append')
# Join the airport coords against the temporary table
joined = airport_coords.join(temp, airport_coords.c.iata==temp.c.Origin_Airport_Code)
# select coordinates per airport, include the iata code
sql = select([airport_coords.c.lat, airport_coords.c.long, airport_coords.c.iata]).select_from(joined)
df_origin = pd.read_sql(sql, engine)
Upvotes: 1