deemel
deemel

Reputation: 1036

Java Algorithm: pair list entries by multiple case criteria

I fear this won't be an easy question. I've been thinking about a proper solution for this problem for a long time and hope that a fresh bunch of brains have a better view on the problem - let's get to it:

Data:

What we're working with here is a csv file containing multiple columns, the relevant ones for this problem are:

Every entry in the data we're working with here has !null values for these attributes.

Example Data:

SID,UID,query,rawdate,timestamp,timegap,epoc,lengthwords,lengthchars,rank,clickurl
5,142,westchester.gov,2006-03-20 03:55:57,Mon Mar 20 03:55:57 CET 2006,0,1142823357504,1,15,1,http://www.westchestergov.com
10,142,207 ad2d 530,2006-04-08 01:31:14,Sat Apr 08 01:31:14 CEST 2006,10000,1144452674507,3,12,1,http://www.courts.state.ny.us
11,142,vera.org,2006-04-08 08:38:42,Sat Apr 08 08:38:42 CEST 2006,11000,1144478322507,1,8,1,http://www.vera.org

Note: there are multiple entries that have the same value for 'Epoc', this is due to the tools used to gather the data

Note2: the list has a size of ~700000, just fyi

Goal: Match pairs of entries that have the same query

Scope: entries that share the same UserID

Due to the mentioned anomaly in the data gathering process, the following has to be considered:

If two entries share the same value for 'Query' and for 'Epoc' , the following elements in the list have to be checked for these criteria until the next entry has a different value for one of these attributes. The group of entries that shared the same Query and Epoc values are to be considered as -one- entry, so in order to match a pair, another entry has to be found that matches the 'Query' value. For lack of a better name, let's call a group that shares the same Query and Epoc value a 'chain'

Now that this is out, it gets a bit easier, there are 3 types of pair compositions we can get out of this:

  1. Entry & Entry
  2. Entry & Chain
  3. Chain & Chain

Type 1 here just means two entries in the list that share the same value for 'Query', but not for 'Epoc'.

So this sums up the Equal Query Pairs

There's also the case of Different Query Pairs which can be described as the following:

After we have matched the equal query pairs, there's the possibility that there are entries which have not been paired with other entries because their query didn't match - every entry that has not been matched to another entry because of this is part of the set called 'different queries'

The members of this set have to be paired without following any criteria, but chains are still treated as -one- entry of the pair.

As for matching the pairs in general, there may be no redundant pairs - a single entry can be part of n many pairs, but two individual entries can only form one pair.

EXAMPLE:

The following entries are to be paired

UID,Query,Epoc,Clickurl
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,Donuts,1141394053510,https://www.dunkindonuts.com/dunkindonuts/en.html
772,raspberry pi,1141394164710,http://www.raspberrypi.org/
772,stackoverflow,1141394274810,http://en.wikipedia.org/wiki/Buffer_overflow
772,stackoverflow,1141394274850,http://www.stackoverflow.com
772,tall women,1141394275921,http://www.tallwomen.org/
772,raspberry pi,1141394277991,http://www.raspberrypi.org/
772,Donuts,114139427999,http://de.wikipedia.org/wiki/Donut
772,stackoverflow,1141394279999,http://www.stackoverflow.com
772,something,1141399299991,http:/something.else/something/

In this example, donuts is a chain, therefore the pairs are(linenumbers without header):

My -failed- approach to the problem:

The algorithm I developed to solve this works as follow:

Iterate the list of entries until the value for UserID changes.

Then, applied to a separate list that only contains the just iterated elements that share the same UserID:

   for (int i = 0; i < list.size(); i++) {
            Entry tempI = list.get(i);
            Boolean iMatched = false;
            //boolean to save whether or not c1 is set
            Boolean c1done = false;
            Boolean c2done = false;

        //Hashsets holding the clickurl values of the entries that form a pair
        HashSet<String> c1 = null;
        HashSet<String> c2 = null;

        for (int j = i + 1; j < list.size(); j++) {
            Entry tempJ = list.get(j);
            // Queries match
            if (tempI.getQuery().equals(tempJ.getQuery())) {
                // wheter or not Entry at position i has been matched or not
                if (!iMatched) {
                    iMatched = true;
                }
                HashSet<String> e1 = new HashSet<String>();
                HashSet<String> e2 = new HashSet<String>();
                int k = 0;
                // Times match
                HashSet<String> chainset = new HashSet<String>();
                if (tempI.getEpoc() == tempJ.getEpoc()) {
                    chainset.add(tempI.getClickurl());
                    chainset.add(tempJ.getClickurl());
                } else {
                    e1.add(tempI.getClickurl());
                    if (c1 == null) {
                        c1 = e1;
                        c1done = true;
                    } else {
                        if (c2 == null) {
                            c2 = e1;
                            c2done = true;
                        }
                    }
                }
                //check how far the chain goes and get their entries
                if ((j + 1) < list.size()) {
                    Entry tempjj = list.get(j + 1);
                    if (tempjj.getEpoc() == tempJ.getEpoc()) {
                        k = j + 1;
                        //search for the end of the chain
                        while ((k < list.size())
                                && (tempJ.getQuery().equals(list.get(k)
                                        .getQuery()))
                                && (tempJ.getEpoc() == list.get(k).getEpoc())) {

                            chainset.add(tempJ.getClickurl());
                            chainset.add(list.get(k).getClickurl());
                            k++;

                        }
                        j = k + 1; //continue the iteration at the end of the chain
                        if (c1 == null) {
                            c1 = chainset;
                            c1done = true;
                        } else {
                            if (c2 == null) {
                                c2 = chainset;
                                c2done = true;
                            }
                        }

                        // Times don't match
                    }
                } else {
                    e2.add(tempJ.getClickurl());
                    if (c1 == null) {
                        c1 = e2;
                        c1done = true;
                    } else {
                        if (c2 == null) {
                            c2 = e2;
                            c2done = true;
                        }
                    }
                }

                /** Block that compares the clicks in the Hashsets and computes the resulting data
                *  left out for now to not make this any more complicated than it already is
                **/

                // Queries don't match
            } else {
                if (!dq.contains(tempJ)) { //note: dq is an ArrayList holding the entries of the differen query set
                    dq.add(tempJ);
                }
            }

            if (j == al.size() - 1) {
                if (!iMatched) {
                    dq.add(tempI);
                }
            }
        }
        if (dq.size() >= 2) {

            for (int z = 0; z < dq.size() - 1; z++) {
                if (dq.get(z + 1) != null) {
                    /** Filler, iterate dq just like the normal list with two loops
                    *
                    **/
                }
            }
        }

    }

So, using an excessive amount of loops I try to match the pairs, resulting in a horribly long runtime which's end I have not seen up until this point

Okay I hope I didn't forget anything crucial, I'll add further needed information later If you've made it this far, thanks for reading - hopefully you have an idea that might help me

Upvotes: 4

Views: 490

Answers (2)

Dylan
Dylan

Reputation: 593

Use SQL to import the data into a db and then perform the queries. Your txt file is too large; it's no wonder that it takes so long to go through it. :)

Upvotes: 1

maniek
maniek

Reputation: 7307

First, remove all but one entry from each chain. To do this, sort by (userid, query, epoch), remove duplicates.

Then, scan the sorted list. take all entries for a (userid, query) pair. If there is only one, save it in a list for later processing, else emit all pairs.

For all the entries for a given user that You have saved for later processing (these are type 2 & 3), emit pairs.

Upvotes: 0

Related Questions