user2117694
user2117694

Reputation: 13

Advice on chosing data structure for complicated sort

I'm looking for some help writing some Perl code to sort a log file.

I'm a relative newbie to coding and perl in general!

I need to write my code with just the core perl modules as much as possible, though if that proves impossible I'm open to CPAN modules. The log file contains a list logged of messages, that need to be re-arranged in order. Should be simple enough, but there are lots of gotchas, which is causing me trouble on how to design my data structures. The input file format is CSV and the output needs to be the same with the messages in timestamp order and concatenated messages grouped together with the first message part first.

Gotchas

  1. The messages need to be ordered by timestamp.
  2. If the message has been split over multiple lines it will have something like the following in the final field "(part 1 of 3 of message reference 1)". For a particular message reference all the parts need to be in order, so part 1, followed by part 2, followed by part 3, etc.
  3. The hex number at the start of this field tells me whether it's a 8 bit or 16 bit reference, and 8 bit reference with the same reference number does not match (as a duplicate) to a 16 bit reference with the same number. So I need to take this into account.
  4. It's possible for message parts to be missing, so we might only get parts 1 and 2, out of 3.
  5. Duplicate message reference numbers are possible, so each message reference needs to be tied to the from field to give it a unique identity.
  6. Even with the unique identity from (3), duplicates over time are still possible (as there are only only so many message reference numbers before they reset), so I need to check the time of the last part received with a duplicate message reference. If it's more than 3 days between messages parts then I can count it as a new message.
  7. Finally there's potentially hundreds of thousands of lines in the log file that need reordering, so loading this all into memory, probably isn't an option.

It's probably best if I just put some example input data and then how it needs to come out.

Input Data

#message uniqueID,From,To,Time,flag,content,IP,concatenation info   
1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",  

Output Data

#message uniqueID,From,To,Time,flag,content,IP,concatenation info   
1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",
13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"

The things I have done so far is

  1. Convert the time field into epoch time to make any comparisons easier
  2. Can read in (and write out the file).
  3. Can parse all the CSV columns.
  4. Can split the concatenation info into its parts, i.e. where 8bit or 16bit reference, part number, total, and reference id.

Now I'm stuck coming up with the best way to filter and sort the data efficiently. I've tried playing around with hashes and loading the file first into memory so I can sort on a particular message reference, but I'm not sure that will work with a large file.

I then thought about reading it in line by line, but I could hit an issue where the second line contains the first part of a concatenated SMS, and we might not get the subsequent parts until the very end of the file, so I'm thinking maybe that also isn't a good idea.

I also thought of a database, but I think it will be too complicated to setup on the system this needs to run. Another option is perhaps writing a package and storing the complex structure as an object? Perhaps I'm over complicating things? My brain is certainly going mushy!

Anyway, any ideas or guidance would be much appreciated.

Hopefully the above is clear, but please ask me if you have any questions.

Thanks, Will

Upvotes: 1

Views: 104

Answers (2)

amon
amon

Reputation: 57640

I don't think this problem us too complicated if decomposed correctly.

As I see it, your sorting program will contain of the following stages:

  1. Extract the relevant information from each line (timestamp and concat info).
  2. Group the lines by message reference, this can be done memory-efficiently with a cache.
  3. Sort the groups by timestamp.
  4. Flatten the groups into the original lines.

The Schwartzian transform

The Schwartzian is a common pattern when sorting in Perl. It speeds up sorts where the sorting index has to be extracted from the data that is to be actually sorted, by extracting this data once, and not at every comparision. It can also be described as decorate-sort-undecorate.

An example: Sorting strings by length. Note that the naive implementation would actually be better in this case.

my @words = qw( aaa b cccc );
my @sorted_words = 
    map  { $_->[1]             } # flatten
    sort { $a->[0] <=> $b->[0] } # sort by first field (length)
    map  { [ length $_, $_ ]   } # decorate: return arrayref with key and data
    @words;
print "[@sorted_words]\n"; # prints "[b aaa cccc]"

It would be good to keep this pattern in mind for your task

1. Extraction

You already managed that. For each line, we output an array reference or similar with the following fields:

0: timestamp (in epoch)
1: part no            \
2: total parts        | these are undef if no concat info is present
3: message reference  /
4: The unmodifed line

For the CSV extraction, you should use Text::CSV, to calculate the epoch, you should look into DateTime

2. Grouping

We define a cache in forms of a hash that has message references as key, and a group as value. A group is an arrayref as the extracted format specified above, but may contain further lines in positions 5 and onward (i.e. each tagged line is a group).

For each tagged line that is received, we do the following procedure:

# pseudocode
# this is how I understood your requirements,
# but it may be wrong. The general principle still holds
# (you may need to choose a different key)
IF the line doesn't have part information, THEN
    pass it on immediately.
ELSE
    IF the hash has an entry for our message reference, THEN
        IF the timestamp of the present group is too old, THEN
            pass on the existing group.
            Add our line for this key.
        ELSE
            Update the group with our line,
            adding the original line (at position 3 + part no),
            but not the metadata to the group.
            IF the group is made complete, THEN
                pass it on immediately,
                delete this entry from the hash.
    ELSE
        Add the line as a group.
        Make sure the content is at position 3 + part no, to allow easy updating.

After no new lines are present, we pass each remaining value in the hash on to the next stage.

The important thing to realize is that you don't have to keep all lines in memory here, but only incomplete groups.

Interesting Perl functions are exists $hash{element} and delete $hash{element}. The delete may be important to save memory.

3. Sorting

We simply sort each element by timestamp. If the total data is too much for the system to handle, we can use a trick:

  1. Sort smaller chunks of the data, save these to a file.
  2. Open each file.
  3. Load the first item from each file
  4. Do-While at least one file has items left:
    1. sort all loaded items
    2. pass on the first resulting element.
    3. load the next item from the file that the current first element came from
  5. pass on the other (already loaded) items in correct order

However, this is time-expensive.

4. Flattening

Here, we only receive the sorted and grouped items. All we have to do is to output the contained lines in their correct order.

Upvotes: 2

Jim Mischel
Jim Mischel

Reputation: 134045

I would do this in two phases: combining the message parts, and sorting. That should simplify the problem somewhat.

To start, I would use an external sorting utility (the GNU sort tool, for example) to sort by message number. That will at least group all of the parts that have the same message number together. A simple sort <inputfile >outputfile will do what you need. All you're really interested in is getting all parts that start with, for example, 371,"... next to each other.

You can then write a Perl program to read the output and accumulate lines that have the same message number. When you see a different message number, filter the lines you've accumulated to assemble a message from the different parts. And write that record out to a file. You might want to write the output in a form that's more easily sorted. Perhaps by outputting the fields you're sorting on at the front of the record, zero-padded if necessary to simplify sorting.

When that's complete, you have a file that contains one record per line and, if you constructed the records correctly, you can just do another sort <inputfile >outputfile to get the data in the order that you want.

That also simplifies your programming quite a bit: you don't have to worry about writing a custom sort for the data. Instead, you write a relatively simple Perl program to transform the data so that's more easily sorted by the existing tools.

Upvotes: 0

Related Questions