Brian Yang
Brian Yang

Reputation: 61

How to get the distinct of multiple fields in MySQL?

I know that there're several existing questions like this one, but mine is little bit complicated (for me).

For example, I have index, data, log and flag these four tables.

index: id        program    compiler   flag_id   data_id
data:  id        machine    runtime    date      index_id
log:   index_id  data_id    log
flag:  id        flag_name

The flag_id, data_id and index_id indicate the corresponding table's main id. Those main id is auto incremental and unique in index and data tables. But flag table's id could be duplicated. For example, a program may use different flags to compile:

index: 123       jacobi     gcc        11                      345
data:  345       host1      3:21       2015-05-13 22:56:12     123
log:   123       345        "PASS"
flag:  11        "-g"
       11        "-mp"
       11        "-static"

So this program is compiled like:

gcc -g -mp -static jacobi.c

and the execution time is 3'21'' and finished by 2015-05-13 22:56:12.

The thing is, many people may run this program and may be due to various of reasons, the exact the same test may PASS or FAIL. And only the last run (based on the timestamp--date or the MAX(data_id) or the MAX(index_id) for the same test) will be showed in the webpage. How to determine whether tests are the same (DISTINCT)? These field should be the same:

program, compiler and all the flags (no order issue)

So, you could see, the flags are stored in the flag table. I don't know how to correlate the index and flag table to find the same test.

Final example to show my expected output:

index: 123       jacobi     gcc        11                      345
data:  345       host1      3:21       2015-05-13 22:56:12     123
log:   123       345        "PASS"
flag:  11        "-g"
       11        "-mp"
       11        "-static"

index: 126       jacobi     gcc        15                      397
data:  397       host2      3:14       2015-05-13 23:13:28     126
log:   126       397        "FAIL"
flag:  15        "-g"
       15        "-mp"
       15        "-static"

Here, for these two records, they are the same test since they're using the same compiler to compile the same program with the same flags. But the second one is latest one. Because we can tell either from index_id, data_id or finishing date.

If there're 100000 records in my database and maybe 30000 of them are the same tests, how could I get these 30000 tests with their last run result, PASS or FAIL? Like:

gcc -g -mp -static jacobi.c  FAIL
gcc -g -mp         jacobi.c  PASS
gcc -O3            mm.c      PASS
...

I know it's wordy, your patience and any help are really appreciated!

Upvotes: 1

Views: 70

Answers (1)

pala_
pala_

Reputation: 9010

I appear to be attracted to self-flagellation. Anyway, first thing is first - assembling your flags into a string, and getting the results.:

select i.compiler, group_concat(f.flag_name order by f.flag_name asc separator ' ') flags, i.program, d.date, l.log
  from `index` i
    inner join `flag` f
      on i.flag_id = f.id
    inner join `data` d
      on i.data_id = d.id
    inner join `log` l
      on i.id = l.index_id
  group by i.id

the order by in the group_concat ensures the same set of flags will always appear in the same order.

This should give us a result that looks like compiler | flag1 flag 2 ... | program | date

Next step is to group by compiler, the flags and the program, and get the max(date). We do this by querying the results of the first query.

select compiler, flags, program, max(`date`) max_date
  from (
    select i.compiler, group_concat(f.flag_name order by f.flag_name asc separator ' ') flags, i.program, d.date
      from `index` i
        inner join `flag` f
          on i.flag_id = f.id
        inner join `data` d
          on i.data_id = d.id
      group by i.id ) q
  group by compiler, flags, program;

We now know the most recent run time for each combo of compiler, flags and program.

We now need to join our first query back to this one, to find only the row with the max date, and the associated data:

select q1.* from (
  select i.compiler, group_concat(f.flag_name order by f.flag_name asc separator ' ') flags, i.program, d.date, l.log
      from `index` i
        inner join `flag` f
          on i.flag_id = f.id
        inner join `data` d
          on i.data_id = d.id
        inner join `log` l
          on i.id = l.index_id
      group by i.id
  ) q1
  inner join (
    select compiler, flags, program, max(`date`) max_date
      from (
        select i.compiler, group_concat(f.flag_name order by f.flag_name asc separator ' ') flags, i.program, d.date
          from `index` i
            inner join `flag` f
              on i.flag_id = f.id
            inner join `data` d
              on i.data_id = d.id
          group by i.id ) q
      group by compiler, flags, program ) q2
  on q1.compiler = q2.compiler 
    and q1.flags = q2.flags
    and q1.program = q2.program
    and q1.`date` = q2.max_date

sql fiddle is broken right now so i can't test this - but it should be pretty close.

Upvotes: 1

Related Questions