sfjac
sfjac

Reputation: 7294

Simple pandasql join is failing

I'm trying to understand why the following fails in pandasql:

import pandas as pd
import pandasql as pdsql
def pysql(q): return pdsql.sqldf(q, globals())

type_table = {
    "type_id" : [101, 102],
    "type_name" : ["Go", "Stop"]
}

types = pd.DataFrame(type_table)
print 'Table "types":'
print types

events = pd.DataFrame(
{
    "type" : [101, 101, 102, 101, 102, 102],
    "time" : [1, 2, 3, 4, 5, 6],
    "success" : [0, 0, 1, 1, 1, 1]
})
print '\nTable "events":'
print events

query = """SELECT t.type_name, count(e.success) as event_count 
FROM types AS t, events AS e
WHERE t.type_id = e.type
GROUP BY t.type_name;
"""
print "\nQuery:"
print query
print "Result:"
print pysql(query)

This returns None but I'm expecting it to return something like

  type_name  event_count
0   Go              3
1   Stop            3

Are there limitations to SQLite joins that I don't understand or is this an issue with pandasql? I've tried this example in Access and it works fine there, but perhaps there is a more portable way to do this with SQL (quite possible) or a simple way that just uses pandas?

Hmm - even the simpler join

SELECT types.type_name, events.time                                                                                     
FROM types, events                                                                                                      
WHERE types.type_id = events.type

fails. Definitely confused.


EDIT: Given @ari's suggestion, I tried this with my real data (a couple million rows) and found that for my actual example it was quite a bit faster to do

events[events.success == 0].replace(name_map).groupby('type').size()

than to do

events.replace(name_map).groupby('type').success.size()

where name_map is a map created from the types table above.

Both are way faster than the SQL queries, so I should probably just punt on that, but I'd still like to know why it doesn't work.

Upvotes: 0

Views: 1990

Answers (1)

dslack
dslack

Reputation: 845

It works if you include the JOIN statement:

query = """SELECT t.type_name, count(e.success) as event_count 
FROM types t JOIN events e ON t.type_id = e.type
GROUP BY t.type_name
"""

Upvotes: 1

Related Questions