Worse_Username

Reputation: 318

Which of the two approaches to the specified task is better performance-wise with the Perl DBI module?

Here is the problem I'm facing, which I need to solve using the Perl DBI module:

Table:

      c1 c2 c3 c4 c5 c6
     ___________________
r1 |  a  b  c  d  e  f
r2 |  h  i  j  k  x  m
r3 |  n  x  p  q  r  x
r4 |  k  l  m  n  p  q

Task: determine the name of the left-most column that has value 'x' in any of the rows. In the example it is c2.

I see two ways to do this:

First

  1. Select column c$i from all the rows, starting with c1;
  2. Loop through the retrieved fields, starting from the top-most;
  3. If any of the fields has the value 'x', return the column name c$i;
  4. Otherwise, repeat steps 1-3 for the next column.

How I approximately imagine it to look in Perl:

my $col_name;
for my $i (1 .. 6) {
  # fetch every value of column c$i as an array reference
  my $fields = $dbh->selectcol_arrayref("SELECT c$i FROM `table`");
  # the first column that contains an 'x' anywhere is the answer
  if ( grep { $_ eq 'x' } @$fields ) {
    $col_name = "c$i";
    last;
  }
}
return $col_name;

Second

  1. Set variable $var to 6, the number of columns;
  2. Select columns c1 to c$var from the next row;
  3. Loop through the returned fields, starting from the left-most;
  4. If a field has the value 'x' and its column number is lower than $var, assign that column number to $var;
  5. Repeat steps 2-4 for the next row;
  6. Return $var.

How I approximately imagine it to look in Perl:

my $var = 6;    # lowest column number where 'x' has been seen so far
my $sth = $dbh->prepare('SELECT * FROM `table`');
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
  # only scan columns to the left of the current best match
  for my $i ( 0 .. $var - 1 ) {
    if ( defined $row[$i] and $row[$i] eq 'x' ) {
      $var = $i + 1;    # columns are 1-based (c1 .. c6)
      last;
    }
  }
  last if $var == 1;    # cannot do better than the first column
}
return $var;

If I understood the manual correctly, selectcol_arrayref() actually performs a separate fetch for each row in the table, so both approaches involve a two-level loop.

To people who know more about the inner workings of the Perl DBI module, my question is the following:

Which of the approaches is better performance-wise?

If it's of any significance, I'm working with a MySQL database.

EDIT: Actual table dimensions are potentially 200 columns by 1000 rows (c200 x r1000).

EDIT2:

Another idea: use a LIMIT clause to determine whether a column contains the value from within the SQL statement itself, for example:

SELECT c1
FROM table
WHERE c1='x'
LIMIT 0,1

This statement should make it possible to determine whether c1 contains the value 'x'. That would move more of the performance load to the DB engine, correct? Would this improve or worsen performance?
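
Roughly how I imagine wiring that probe into the DBI code (an untested sketch; the DSN and credentials are placeholders, and the 200-column bound comes from the actual table dimensions above):

use strict;
use warnings;
use DBI;

# placeholder connection details
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password', { RaiseError => 1 });

my $first_column;
for my $i (1 .. 200) {
  # ask the engine whether column c$i contains an 'x' anywhere
  my ($found) = $dbh->selectrow_array("SELECT 1 FROM `table` WHERE c$i = 'x' LIMIT 1");
  if ($found) {
    $first_column = "c$i";
    last;
  }
}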

Upvotes: 1

Views: 99

Answers (4)

bohica

Reputation: 5992

Assuming your columns are c1 .. c6, you can use something like this to get it in SQLite:

select distinct (case when c1 = 'x' then 'c1'
                      when c2 = 'x' then 'c2'
                      when c3 = 'x' then 'c3'
                      when c4 = 'x' then 'c4'
                      when c5 = 'x' then 'c5'
                      when c6 = 'x' then 'c6'
                      else 'x' end)
from mje
order by 1
limit 1;
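
From Perl DBI this collapses the whole job into a single round trip (a sketch; the connection details are placeholders, and mje is the table name used above):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite', '', '', { RaiseError => 1 });

# one statement, one row back: the left-most matching column name
my ($col) = $dbh->selectrow_array(<<'SQL');
select distinct (case when c1 = 'x' then 'c1'
                      when c2 = 'x' then 'c2'
                      when c3 = 'x' then 'c3'
                      when c4 = 'x' then 'c4'
                      when c5 = 'x' then 'c5'
                      when c6 = 'x' then 'c6'
                      else 'x' end)
from mje
order by 1
limit 1
SQL

print "$col\n";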

Upvotes: 0

Jamie McCarthy

Reputation: 765

The fact that you need to query your data this way tells me that it's stored in a bizarre and inappropriate way. Relational databases are meant to store relations, and the order of their columns should be irrelevant to how they logically function. Any need to refer to column order is a guaranteed sign that you're doing something wrong.

I understand that sometimes one needs to perform one-time queries to determine unusual things about data sets, but I stand by my assessment: this data is stored inappropriately.

My guess is that there are many columns that define related, sequential attributes, maybe something like "profits_1q2001", "profits_2q2001", etc. You'll want to create a separate table for those, maybe something like:

CREATE TABLE `department_profits` (
  `id` int(10) unsigned NOT NULL,
  `department_id` same_as_parent_table NOT NULL,
  `y` year(4) NOT NULL,
  `q` tinyint(3) unsigned NOT NULL,
  `profits` decimal(9,2) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_dept_quarter` (`department_id`,`y`,`q`),
  KEY `idx_profits_y_q_dept` (`profits`,`y`,`q`,`department_id`)
) ENGINE=InnoDB;

Converting the data from its current format to the proper format is left as an exercise for the reader, but it might involve 200 script-generated queries that look like:

SELECT CONCAT(
  "INSERT INTO department_profits (department_id, y, q, profits) VALUES (",
    "'", department_id, "',",
    2001, ",",
    1, ",",
    profits_1q2001,
  ");"
)
FROM old_table;

If your question is then (say) when was the first time profits exceeded $10,000 and in which department, then finding the answer becomes something like:

SELECT department_id, y, q, profits
FROM department_profits
WHERE profits > 10000
ORDER BY y, q LIMIT 1;

For the actual question you asked -- if it really is a one-off -- since there are just 200,000 data points, I would do it manually. Export the whole table as tab-separated, drag it onto Excel, "Find/Replace" to change "x" to "-999" or some small value, then "Data -> Sort" by each column in turn until your answer pops to the top. Heck, plain old "Find" might tell you your answer. With just 200 columns, it won't take long, and you might learn something new about your data by seeing it all on the screen sorted various ways :)

Upvotes: 0

Jonathan Leffler

Reputation: 754450

Short of redesigning your table so that it can be queried more effectively, I think your optimal solution is likely to be a modified version of your second option. Instead of fetching the whole result with fetchall_arrayref(), use fetchrow_arrayref() to collect one row at a time. Examine each row as you get it, and break out of the loop if the minimum column ever gets down to column 1. This minimizes the memory used by the Perl code, and it uses a single SQL statement (with multiple fetch operations, but fetchall_arrayref() also performs multiple fetch operations under the hood).
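
A minimal sketch of that scheme, using the six-column example table (the DSN and credentials are placeholders):

use strict;
use warnings;
use DBI;

# placeholder connection details
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password', { RaiseError => 1 });

my $sth = $dbh->prepare('SELECT * FROM `table`');
$sth->execute;

my $min_col = 7;    # one past the last column: nothing found yet
while ( my $row = $sth->fetchrow_arrayref ) {
  # scan only the columns to the left of the best match so far
  for my $i ( 0 .. $min_col - 2 ) {
    if ( defined $row->[$i] and $row->[$i] eq 'x' ) {
      $min_col = $i + 1;
      last;
    }
  }
  if ( $min_col == 1 ) {    # cannot do better than column 1
    $sth->finish;           # stop fetching the remaining rows
    last;
  }
}
print $min_col < 7 ? "c$min_col\n" : "no 'x' found\n";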

Upvotes: 1

Borodin

Reputation: 126742

Here is a version using SQLite. I expect the same code to work for MySQL with little or no change. It should work fine unless your database table is huge, but you don't mention its size, so I presume it's not out of the ordinary.

It simply fetches the contents of the table into memory and checks each column, one by one, to see whether any field is 'x', printing the name of the column once it is found.

use strict;
use warnings;

use DBI;
use List::Util qw/ any /;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;
my $table = $sth->fetchall_arrayref;

my $first_column;
for my $i (0 .. $#{$table->[0]}) {
  my @column = map { $_->[$i] } @$table;
  if ( any { $_ eq 'x' } @column ) {
    $first_column = $sth->{NAME}[$i];
    last;
  }
}

print $first_column, "\n";

output

c2

Update

This way is likely to be faster, as it uses the database engine to search for columns that contain an 'x', and very little data is loaded into memory.

use strict;
use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my @names = do {
  my $sth = $dbh->prepare('SELECT * FROM "table" LIMIT 0');
  $sth->execute;
  @{ $sth->{NAME_lc} };
};

my $first_column;

for my $col (@names) {

  my $sql = qq{SELECT $col from "table" WHERE $col = 'x' LIMIT 1};
  my $row = $dbh->selectrow_arrayref($sql);

  if ($row) {
    $first_column = $col;
    last;
  }
}

print $first_column, "\n";

Upvotes: 1
