Reputation: 437
I have these kind of data (sample) in MySQL:
Table1:
ID ITEM CNT
--------------------
0001 AAB 5
0001 BBA 3
0001 BBB 8
0001 AAC 10
0002 BBA 2
0002 BBC 7
0003 FFG 2
0003 JPO 4
0003 PUI 22
..........
And i would like to find a way to import these data in Python in the form of count vectors, e.g.:
0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]
where the elements represent the counts for all the items for each id in this form: [AAB,AAC,BBA,BBB,BBC,FFG,JPO,PUI]
So I would like to ask, what is the best and most efficient way of implementing this? Is it better to do it from python or mysql and how?
Thank you
Upvotes: 0
Views: 107
Reputation: 879201
It is usually more efficient -- when possible -- to manipulate the data in SQL rather than in Python.
With this setup:
import config
import MySQLdb
conn = MySQLdb.connect(
host=config.HOST, user=config.USER,
passwd=config.PASS, db='test')
cursor = conn.cursor()
sql = '''\
DROP TABLE IF EXISTS foo
'''
cursor.execute(sql)
sql = '''\
CREATE TABLE foo (
ID varchar(4),
ITEM varchar(3),
CNT int)
'''
cursor.execute(sql)
sql = '''\
INSERT INTO foo VALUES (%s,%s,%s)
'''
cursor.executemany(sql, [['0001', 'AAB', 5],
['0001', 'BBA', 3],
['0001', 'BBB', 8],
['0002', 'BBA', 2]])
You could form the desired SQL with:
items = 'AAB AAC BBA BBB BBC FFG JPO PUI'.split()
fields = ', '.join('COALESCE({}.CNT, 0)'.format(item) for item in items)
joins = '\n'.join('''\
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = '{i}') as {i}
ON T.ID = {i}.ID'''.format(i=item) for item in items)
sql = '''\
SELECT T.ID, {f}
FROM (SELECT DISTINCT ID from foo) as T
{j}
'''.format(f=fields, j=joins)
print(sql)
And use it like this:
result = dict()
cursor.execute(sql)
for row in cursor:
result[row[0]] = row[1:]
print(result)
The SQL query used is:
SELECT T.ID, COALESCE(AAB.CNT, 0), COALESCE(AAC.CNT, 0), COALESCE(BBA.CNT, 0), COALESCE(BBB.CNT, 0), COALESCE(BBC.CNT, 0), COALESCE(FFG.CNT, 0), COALESCE(JPO.CNT, 0), COALESCE(PUI.CNT, 0)
FROM (SELECT DISTINCT ID from foo) as T
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAB') as AAB
ON T.ID = AAB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'AAC') as AAC
ON T.ID = AAC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBA') as BBA
ON T.ID = BBA.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBB') as BBB
ON T.ID = BBB.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'BBC') as BBC
ON T.ID = BBC.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'FFG') as FFG
ON T.ID = FFG.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'JPO') as JPO
ON T.ID = JPO.ID
LEFT JOIN (SELECT ID, CNT FROM foo WHERE ITEM = 'PUI') as PUI
ON T.ID = PUI.ID
and the resultant dict looks like:
{'0001': (5L, 0L, 3L, 8L, 0L, 0L, 0L, 0L), '0002': (0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L)}
I know you asked for
0001 = [5,10,3,8,0,0,0,0]
0002 = [0,0,2,0,7,0,0,0]
0003 = [0,0,0,0,0,0,4,22]
but there are at least two problems with this. First, 0001
is not a valid Python variable name. Variable names can not start with a digit. Second, you do not want dynamically defined variable names because it is hard to program with a bare variable name which is not known until run-time.
Instead, use the would-be variable names as keys in a dict, result
. Then you can refer to the "variable" 0001
with result['0001']
.
Upvotes: 1
Reputation: 11209
You can do it on access via a crosstab query where the row header will be the id and the column header will be the item and cnt the value to be aggregated. You can then loop through each column of each row to get your vectors. See here for help on crosstab queries: http://allenbrowne.com/ser-67.html
Upvotes: 0