user2578185
user2578185

Reputation: 437

Efficient way of creating count vectors from MySQL to Python

I have these kind of data (sample) in MySQL:

Table1:

ID     ITEM    CNT
--------------------
0001    AAB     5
0001    BBA     3
0001    BBB     8
0001    AAC     10
0002    BBA     2
0002    BBC     7
0003    FFG     2
0003    JPO     4
0003    PUI     22
..........

And i would like to find a way to import these data in Python in the form of count vectors, e.g.:

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

where the elements represent the counts for all the items for each id in this form: [AAB,AAC,BBA,BBB,BBC,FFG,JPO,PUI]

So I would like to ask, what is the best and most efficient way of implementing this? Is it better to do it from python or mysql and how?

Thank you

Upvotes: 0

Views: 107

Answers (2)

unutbu
unutbu

Reputation: 879201

It is usually more efficient -- when possible -- to manipulate the data in SQL rather than in Python.

With this setup:

import config
import MySQLdb
conn = MySQLdb.connect(
    host=config.HOST, user=config.USER,
    passwd=config.PASS, db='test')
cursor = conn.cursor()

sql = '''\
DROP TABLE IF EXISTS foo 
'''
cursor.execute(sql)

sql = '''\
CREATE TABLE foo (
    ID varchar(4),
    ITEM varchar(3),
    CNT int)
'''

cursor.execute(sql)

sql = '''\
INSERT INTO foo VALUES (%s,%s,%s)
'''

cursor.executemany(sql, [['0001', 'AAB', 5],
                         ['0001', 'BBA', 3],
                         ['0001', 'BBB', 8],
                         ['0002', 'BBA', 2]])

You could form the desired SQL with:

items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
fields = ', '.join('COALESCE({}.CNT, 0)'.format(item) for item in items)
joins = '\n'.join('''\
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = '{i}') as {i}
    ON T.ID = {i}.ID'''.format(i=item) for item in items)
sql = '''\
SELECT T.ID, {f}
FROM (SELECT DISTINCT ID from foo) as T
{j}
'''.format(f=fields, j=joins)

print(sql)

And use it like this:

result = dict()
cursor.execute(sql)
for row in cursor:
    result[row[0]] = row[1:]
print(result)    

The SQL query used is:

SELECT T.ID, COALESCE(AAB.CNT, 0), COALESCE(AAC.CNT, 0), COALESCE(BBA.CNT, 0), COALESCE(BBB.CNT, 0), COALESCE(BBC.CNT, 0), COALESCE(FFG.CNT, 0), COALESCE(JPO.CNT, 0), COALESCE(PUI.CNT, 0)
FROM (SELECT DISTINCT ID from foo) as T
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAB') as AAB
    ON T.ID = AAB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAC') as AAC
    ON T.ID = AAC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBA') as BBA
    ON T.ID = BBA.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBB') as BBB
    ON T.ID = BBB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBC') as BBC
    ON T.ID = BBC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'FFG') as FFG
    ON T.ID = FFG.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'JPO') as JPO
    ON T.ID = JPO.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'PUI') as PUI
    ON T.ID = PUI.ID

and the resultant dict looks like:

{'0001': (5L, 0L, 3L, 8L, 0L, 0L, 0L, 0L), '0002': (0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)}

I know you asked for

0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]

but there are at least two problems with this. First, 0001 is not a valid Python variable name. Variable names can not start with a digit. Second, you do not want dynamically defined variable names because it is hard to program with a bare variable name which is not known until run-time.

Instead, use the would-be variable names as keys in a dict, result. Then you can refer to the "variable" 0001 with result['0001'].

Upvotes: 1

Tarik
Tarik

Reputation: 11209

You can do it on access via a crosstab query where the row header will be the id and the column header will be the item and cnt the value to be aggregated. You can then loop through each column of each row to get your vectors. See here for help on crosstab queries: http://allenbrowne.com/ser-67.html

Upvotes: 0

Related Questions