Reputation: 6204
Anyone doing very basic statistical explorations on data living in a relational database has had to calculate cross tabulations,also known as contingency tables(wikipedia page). these are indispensable when you need to count how many items fall more than one category at the same time. For example: How many customers are female and like chocolate?
Scipy has ways to do this for matrices, using a variation of histogram2d But for meaningfull statistical analysis you need to be able to have a table (with variable names) from which you can specify which variables you would like to tabulate. Moreover, it needs to work for other types of variables, not only numeric. Actually, numeric tabulation is the more complicated one since it requires binning. R
naturally has such a function which is called table
, which could easily be ported to Python. However, remember I mention at the title that I'd like to use an ORM, why? Because cross-tabulations are much smaller that the data used to generate it, and you could have a 2x2 table calculated from billions of records on a database. My point is: In serious applications, you can't afford to bring all your data to memory and loop through it. So you'd have to have table design converted into a SQL query so that all the counting would be done by the database engine. And the ORM would take care of the necessary SQL dialect adjustments necessary so that you could run your code with any database backend.
A example of the SQL (in MySQL dialect) for a simple cross tabulation can be found here.
So now that I think I have motivated you about the problem, here are the questions: Is this functionality implemented in any Python ORM? How would you implement this using, say, SQLAlchemy or Django ORM?
Upvotes: 3
Views: 1518
Reputation: 6204
I hate to have to answer my own question, but sometimes we just can't wait for help. And since I have found the answer and a good one, I feel obliged to share with the community. So here it is:
table = self.session.query(Table.var1, Table.var2, func.count(Table)).group_by(Table.var1, Table.var2).all()
This will return a list of tuples like (row,column, count). From this list you can assemble your contingency table and calculate the margin totals if you want. It may be worth mentioning that calculating this cable took .28 seconds for a table with 296110 records, and var1 and var2 had 5 and 90 levels, respectively.
now a little function to assemble and print the (2d) table:
def pprint_table():
colnames = list(set([i[1] for i in table]))
rows = defaultdict(lambda:[0]*len(colnames))
for r in table:
rows[r[0]][colnames.index(r[1])] = r[2]
print colnames, 'total'
for rn, r in rows.items():
print rn, r, sum(r)
Upvotes: 2