d-_-b
d-_-b

Reputation: 883

Querying csv files in python like sql

This is apparently a popular interview question.
There are 2 CSV files with dinosaur data. We need to query them to return dinosaurs satisfying a certain condition.

Note - We cannot use additional modules like q, fsql, csvkit etc.

file1.csv:

NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore

file2.csv

NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal

using the forumla : speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * SQRT(LEG_LENGTH * g), where g = 9.8 m/s^2

Write a program to read csv files, and print only names of bipedal dinosaurs, sorted by speed from fastest to slowest.

In SQL, this would be simple:

select f2.name from
file1 f1 join file2 f2 on f1.name = f2.name
where f1.stance = 'bipedal'
order by (f2.stride_length/f1.leg_length - 1)*pow(f1.leg_length*9.8,0.5) desc

How can this be done in python ?

Upvotes: 6

Views: 21041

Answers (5)

tlo
tlo

Reputation: 1631

I'd recommend duckdb: https://duckdb.org/docs/api/python/overview

Example:

import duckdb
duckdb.sql("""
  SELECT f1.NAME FROM 'file1.csv' as 'f1'
  JOIN 'file2.csv' as 'f2' ON f1.NAME = f2.NAME
  WHERE STANCE = 'bipedal'
  ORDER BY (STRIDE_LENGTH/LEG_LENGTH - 1)*pow(LEG_LENGTH*9.8,0.5) DESC
""")

Upvotes: 3

Tanin
Tanin

Reputation: 1933

I've encountered the same problem at work and decided to build an offline Desktop app where you can load CSVs and start writing SQL. You can join, group by, and etc.

This is backed by C and SQLite and can handle GBs of CSVs file in ~10 seconds. It's very fast.

Here's the app: https://superintendent.app/

Superintendent.app

This is not Python though, but it is a lot more convenient to use.

Upvotes: -1

d-_-b
d-_-b

Reputation: 883

def csvtable(file):     # Read CSV file into 2-D dictionary
    table = {}
    f = open(file)
    columns = f.readline().strip().split(',')       # Get column names
    
    for line in f.readlines():
        values = line.strip().split(',')            # Get current row
        for column,value in zip(columns,values):
            if column == 'NAME':                    # table['TREX'] = {}
                key = value
                table[key] = {}
            else:
                table[key][column] = value          # table['TREX']['LENGTH'] = 10
    
    f.close()
    return table


# READ
try:
    table1 = csvtable('csv1.txt')
    table2 = csvtable('csv2.txt')
except Exception as e:
    print (e)


# JOIN, FILTER & COMPUTE
table3 = {}
for value in table1.keys():
    if value in table2.keys() and table2[value]['STANCE'] == 'bipedal':             # Join both tables on key (NAME) and filter (STANCE)

        leg_length = float(table1[value]['LEG_LENGTH'])
        stride_length = float(table2[value]['STRIDE_LENGTH'])
        speed = ((stride_length / leg_length) - 1) * pow((leg_length * 9.8),0.5)    # Compute SPEED

        table3[value] = speed


# SORT
result = sorted(table3, key=lambda x:table3[x], reverse=True)                       # Sort descending by value

# WRITE
try:
    f = open('result.txt', 'w')
    for r in result:
        f.write('%s\n' % r)
    f.close()
except Exception as e:
    print (e)

Upvotes: 2

milihoosh
milihoosh

Reputation: 607

You can do it in pandas,

import pandas as pd
df_1 = pd.read_csv('df_1.csv')
df_2 = pd.read_csv('df_2.csv')

df_comb = df_1.join(df_2.set_index('NAME'), on = 'NAME')
df_comb = df_comb.loc[df_comb.STANCE == 'bipedal']
df_comb['SPEED'] = (df_comb.STRIDE_LENGTH/df_comb.LEG_LENGTH - 1)*pd.Series.pow(df_comb.LEG_LENGTH*9.8,0.5)
df_comb.sort_values('SPEED', ascending = False)

Not as clean as SQL!

Upvotes: 7

Kasalwe
Kasalwe

Reputation: 368

You can write SQL in python using pandasql.

Upvotes: 4

Related Questions