Reputation: 11755
I have two variables, id
and date
. There are millions of distinct id
s, but just a few hundred distinct dates. The id
s are sequential and the dates are increasing with id
. Something like this:
id date
1 1/1/2000
2 1/1/2000
3 1/1/2000
4 1/2/2000
5 1/2/2000
In Perl, I need to create a function that will return a date
given the id
. My first thought was just to make a hash table. This will work, but given that I have millions of records, I thought it might make more sense to work with date ranges. So in the example above, rather than storing 5 records, I could store 2 records: one for each date with the earliest and latest date that correspond to the id
:
date first_id last_id
1/1/2000 1 3
1/2/2000 4 5
(In my actual data, this will allow me to store just a few thousand records, rather than millions.)
My question is, given an id
, what is the best way to look up the date given this structure? So given id=2
, I want to return 1/1/2000
because 2 is between 1 and 3, and therefore corresponds to the first record.
Thanks for any advice.
Upvotes: 2
Views: 1582
Reputation: 118128
I would probably have put the data in an SQLite database, made the id
field the primary key for the table. Use DBD::SQLite through DBI.
If you first prepare
a query that contains a placeholder for id
and repeatedly execute it for various values of id
, performance should be adequate.
Upvotes: 2
Reputation: 107040
As others have stated, you might want to try a database. Another possibility: Use a more complex data structure.
For example, if your hash table is by dates, you could have each entry in the hash be a reference to an array of ids.
Using your example:
$hash{1/1/2000} = [ 1, 2, 3];
$hash{1/2/2000} = [ 4, 5 ];
That way, if you find a date, you can quickly find all IDs for that date. Sorting the keys will allow you to find a range of dates. This is especially true if you store the dates in a more sortable format. For example, in YYYYMMDD format or in standard Unix date/time format.
For example:
$hash{20000101} = [ 1, 2, 3];
$hash{20000102} = [ 4, 5];
You said there are a few hundred dates, so sorting your dates will be fairly quick.
Are you familiar with things like hashes of arrays? You can look at the Perl documentation for Mark's very short tutorial about references and perldsc which actually shows you how to setup a hashes of arrays.
Now, looking up a date via an id...
Imagine a more complex structure. The first level will have two elements DATES
and IDS
. Then, you can have the IDS part be a reference to a hash of IDs, and the DATES key be the same structure as mentioned above. You'll have to keep these two structures in sync, though...
$dataHash->{DATES}->{20020101}->[0] = 1;
$dataHash->{DATES}->{20020101}->[2] = 2;
$dataHash->{DATES}->{20020101}->[3] = 3;
$dateHash->{IDS}->{1} = 20020101;
$dateHash->{IDS}->{2} = 20020101;
$dateHash->{IDS}->{3} = 20020101;
Hmm... This is getting complex. Maybe you should look at the Perl tutorial on object oriented programming.
Writing the stuff off the top of my head without any testing:
package DataStruct;
sub new {
my $class = shift;
my $self = {};
bless $self, $class;
my $self->_Id;
my $self->_Date;
return $self;
}
sub _Id {
my $self = shift;
my $id = shift;
my $date = shift;
$self->{IDS} = {} if not exists $self->{IDS};
if (defined $id and defined $date) {
$self->{IDS}->{$id} = $date;
}
if (defined ($id) {
return $self->{IDS}->{$id};
else {
return keys %{self->{IDS}};
}
}
sub _Date {
my $self = shift;
my $date = shift;
my $id = shift;
$self->{DATES} = {} if not exists $self->{DATES};
if (defined $date and defined $id) {
$self->{DATES}->{$date} = [] if not defined $self->{DATES}->{$date};
push @{$self->{DATES}->{$date}}, $id;
};
if ($date) {
return @{$self->{DATES}->{$date}};
}
else {
return keys %{$self->{DATES};
}
}
sub Define {
my $self = shift;
my $id = shift;
my $date = shift;
$self->_Id($id, $date);
$self->_Date($date, $id);
return $self->_Date($date);
}
sub FetchId {
my $self = shift;
my $id = shift;
return $self->_Id($id);
}
sub FetchDate {
my $self = shift;
my $id = shift;
return $self->_Date;
}
In the above, you create your initial data structure with:
my $struct = DataStruct->new;
Now, to add a date and id, you'd call:
$struct->Define($id, $date);
This will in turn call $struct->_Id($id, $date);
and $struct->_Date($date, $Id);
. Since these begin with an underscore, they're private and can only be called by other DataStruct methods. You mainly use $struct-Set to put your data in.
To fetch a particular date (or an entire range of dates), you use the $dataStruct->FetchDate($date)
method, and to fetch a particular Id you use the $dataStruct->FetchId($id);
Now, the DataStruct
package will be used to keep both the IDs hash and the Dates hashes in sync with each other and keep the complexity out of the main part of your program.
There's everything you need! All you have to do is fix my numerous errors, and probably have some routine that will convert a M/D/Y
style date to a YYYYMMDD
style date or to and from the standard date/time internal storage structure. That way, you don't have to worry about fixing the date before calling these routines. Oh, and you'll probably want some sort of error handling too. What if I give you a bad date or Id number?
As others have stated, you're better off using a database structure even if you use a faux database structure like SQLite.
However, I wanted to let you know that Perl is actually quite capable of creating some very integrate data structures which can help in cases like this.
I'd assumed from the way you stated your question, you really weren't familiar with creating these complex data structures. If not, Perl has some excellent tutorials built into Perl itself. And, the command perldoc
(which is installed with Perl) can pull up all the Perl documentation. Try perldoc perlreftut
and see if it pulls up Mark's tutorial on references.
Once you start getting into more complex data structures, you will learn to use object oriented programming to simplify their handling. Again, there are some excellent tutorials built right into Perl on this (or you can go to the Perldoc webpage).
If you already knew all of this, I apologize. However, at least you have a basis for storing and working with your data.
Upvotes: 1
Reputation: 38745
An attempt to implement Frank's idea:
Given the
sub getDateForId {
use integer;
my ($id, $data) = @_;
my $lo = 0;
my $sz = scalar @$data;
my $hi = $sz - 1;
while ( $lo <= $hi ) {
my $mi = ($lo + $hi) / 2;
if ($data->[$mi]->[0] < $id) {
$lo = $mi + 1;
} elsif ($data->[$mi]->[0] > $id) {
$hi = $mi - 1;
} else {
return $data->[$mi]->[1];
}
}
# $lo > $hi: $id belongs to $hi range
if ($hi < 0) {
return sprintf "** id %d < first id %d **", $id, $data->[0]->[0];
} elsif ($lo >= $sz) {
return sprintf "** id %d > last id %d **", $id, $data->[$sz-1]->[0];
} else {
return sprintf "%s (<== lo %d hi %d)", $data->[$hi]->[1], $lo, $hi;
}
}
and the data
my @data = (
[2, '1/1/2000' ]
, [4, '1/2/2000' ]
, [5, '1/3/2000' ]
, [8, '1/4/2000' ]
);
, the test
for my $id (0..9) {
printf "%d => %s\n", $id, getDateForId( $id, \@data );
}
prints
0 => ** id 0 < first id 2 **
1 => ** id 1 < first id 2 **
2 => 1/1/2000
3 => 1/1/2000 (<== lo 1 hi 0)
4 => 1/2/2000
5 => 1/3/2000
6 => 1/3/2000 (<== lo 3 hi 2)
7 => 1/3/2000 (<== lo 3 hi 2)
8 => 1/4/2000
9 => ** id 9 > last id 8 **
Upvotes: 0
Reputation: 8467
Use a [semi] sparse array. Performance should be fine. You're looking at a few megabytes of memory usage per million records. If you convert the date to an integer epoch before storing it, even better.
use Time::Local;
my @date_by_id;
while (<FILE>) {
chomp;
my ($id, $date) = split /\s+/;
my ($mon, $mday, $year) = split /\//, $date;
$mon--;
$year -= 1900;
$date_by_id[$id] = timelocal 0, 0, 0,
$mday, $mon, $year;
}
Performance should be good enough that you won't need to wrap it in a function. Just use $date_by_id[<ID>]
where needed, keeping in mind that it can be undef
.
Upvotes: 2
Reputation:
If you are to go with an approach like this I would think it would make the most sense to do the querying at the database level. Then, with MySQL, for example, you could query using the BETWEEN
function with something like SELECT date WHERE $id BETWEEN first_id AND last_id
Then you can create a function in Perl where you pass the id and use the query to retrieve the date.
Upvotes: 0