Reputation: 883
This is apparently a popular interview question.
There are 2 CSV files with dinosaur data. We need to query them to return dinosaurs satisfying a certain condition.
Note - We cannot use additional modules like q, fsql, csvkit etc.
file1.csv:
NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
file2.csv
NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
using the forumla : speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * SQRT(LEG_LENGTH * g), where g = 9.8 m/s^2
Write a program to read csv files, and print only names of bipedal dinosaurs, sorted by speed from fastest to slowest.
In SQL, this would be simple:
select f2.name from
file1 f1 join file2 f2 on f1.name = f2.name
where f1.stance = 'bipedal'
order by (f2.stride_length/f1.leg_length - 1)*pow(f1.leg_length*9.8,0.5) desc
How can this be done in python ?
Upvotes: 6
Views: 21041
Reputation: 1631
I'd recommend duckdb
: https://duckdb.org/docs/api/python/overview
Example:
import duckdb
duckdb.sql("""
SELECT f1.NAME FROM 'file1.csv' as 'f1'
JOIN 'file2.csv' as 'f2' ON f1.NAME = f2.NAME
WHERE STANCE = 'bipedal'
ORDER BY (STRIDE_LENGTH/LEG_LENGTH - 1)*pow(LEG_LENGTH*9.8,0.5) DESC
""")
Upvotes: 3
Reputation: 1933
I've encountered the same problem at work and decided to build an offline Desktop app where you can load CSVs and start writing SQL. You can join, group by, and etc.
This is backed by C and SQLite and can handle GBs of CSVs file in ~10 seconds. It's very fast.
Here's the app: https://superintendent.app/
This is not Python though, but it is a lot more convenient to use.
Upvotes: -1
Reputation: 883
def csvtable(file): # Read CSV file into 2-D dictionary
table = {}
f = open(file)
columns = f.readline().strip().split(',') # Get column names
for line in f.readlines():
values = line.strip().split(',') # Get current row
for column,value in zip(columns,values):
if column == 'NAME': # table['TREX'] = {}
key = value
table[key] = {}
else:
table[key][column] = value # table['TREX']['LENGTH'] = 10
f.close()
return table
# READ
try:
table1 = csvtable('csv1.txt')
table2 = csvtable('csv2.txt')
except Exception as e:
print (e)
# JOIN, FILTER & COMPUTE
table3 = {}
for value in table1.keys():
if value in table2.keys() and table2[value]['STANCE'] == 'bipedal': # Join both tables on key (NAME) and filter (STANCE)
leg_length = float(table1[value]['LEG_LENGTH'])
stride_length = float(table2[value]['STRIDE_LENGTH'])
speed = ((stride_length / leg_length) - 1) * pow((leg_length * 9.8),0.5) # Compute SPEED
table3[value] = speed
# SORT
result = sorted(table3, key=lambda x:table3[x], reverse=True) # Sort descending by value
# WRITE
try:
f = open('result.txt', 'w')
for r in result:
f.write('%s\n' % r)
f.close()
except Exception as e:
print (e)
Upvotes: 2
Reputation: 607
You can do it in pandas,
import pandas as pd
df_1 = pd.read_csv('df_1.csv')
df_2 = pd.read_csv('df_2.csv')
df_comb = df_1.join(df_2.set_index('NAME'), on = 'NAME')
df_comb = df_comb.loc[df_comb.STANCE == 'bipedal']
df_comb['SPEED'] = (df_comb.STRIDE_LENGTH/df_comb.LEG_LENGTH - 1)*pd.Series.pow(df_comb.LEG_LENGTH*9.8,0.5)
df_comb.sort_values('SPEED', ascending = False)
Not as clean as SQL!
Upvotes: 7