How to unpack result of sub-query into list-type field to result of original query in peewee?

Question

How to make peewee put ids of related table rows into additional list-like field into resulting query?

I want to make duplicates detecting manager for media files. For each file on my PC I have record in database with fields like

File name, Size, Path, SHA3-512, Perceptual hash, Tags, Comment, Date added, Date changed, etc...

Depending on the situation I want to use different patterns to be used to consider records in table as duplicates.

In the most simple case I want just to see all records having the same hash, so I

subq = Record.select(Record.SHA).group_by(Record.SHA).having(peewee.fn.Count() > 1)
subq = subq.alias('jq')
q = Record.select().join(q, on=(Record.SHA == q.c.SHA)).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)

and everything is fine. But there are lot of cases when I want to use different sets of table columns as grouping patterns. So in the worst case I use all of them except id and "Date added" column to detect exact duplicating rows in database, when I just readded the same file for few times which leads to the monster like

subq = Record.select(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).having(peewee.fn.Count() > 1)
subq = subq.alias('jq')
q = Record.select().join(q, on=(Record.SHA == q.c.SHA and Record.Name == q.c.Name and Record.Date == q.c.Date and Record.Size == q.c.Size and Record.Tags == q.c.Tags)).order_by(Record.SHA)
for r in q:
    process_record_in_some_way(r)

and this is not the full list of my fields, just example. Same thing I have to do for other patterns of sets of fields, i.e. duplicating it's list 3 times in select clause, grouping clause of subquery and then listing them all again in joining clause.

I wish I could just group the records with appropriate pattern and peewee would just list ids of all the members of each group into new list field like

q=Record.select(Record, SOME_MAJIC.alias('duplicates')).group_by(Record.SHA, Record.Name, Record.Date, Record.Size, Record.Tags).having(peewee.fn.Count() > 1).SOME_ANOTHER_MAJIC
for r in q:
    process_group_of_records(r) # r.duplicates == [23, 44, 45, 56, 100], for example

How can I do this? Listing the same parameters trice I really feel like I do something wrong.

coleifer · Accepted Answer

You can use GROUP_CONCAT (or for postgres, array_agg) to group and concatenate a list of ids/filenames, whatever.

So for files with the same hash:

query = (Record
         .select(Record.sha, fn.GROUP_CONCAT(Record.id).alias('id_list'))
         .group_by(Record.sha)
         .having(fn.COUNT(Record.id) > 1))

This is a relational database. So you're dealing all the time, everywhere, with tables consisting of rows and columns. There's no "nesting". GROUP_CONCAT is about as close as you can get.

How to unpack result of sub-query into list-type field to result of original query in peewee?

Answers (1)

Related Questions