best wishes

Detecting significant differences on a stream of data

There are two groups of users. Based on a user's query, I return some search results (a1, a2, a3). The results can vary depending on either the group the user belongs to or some user-specific parameter. For the same query, I want to measure whether the results returned to different users differ significantly from each other (say, when more than 7 of the first 10 results differ).
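To make the "more than 7 of the first 10" criterion concrete, here is a minimal Python sketch of what I mean. The function names `topk_difference` and `is_significantly_different`, and the defaults `k=10` and `threshold=7`, are my own placeholders. Measuring a shorter (missing-data) list against the longer one is just an assumption on my part:

```python
def topk_difference(results_a, results_b, k=10):
    """Number of top-k results that the two lists do not share.

    Assumption: if one list is shorter than k (missing data), the
    difference is measured against the longer of the two top-k sets.
    """
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    overlap = len(top_a & top_b)
    return max(len(top_a), len(top_b)) - overlap


def is_significantly_different(results_a, results_b, k=10, threshold=7):
    """True when more than `threshold` of the first k results differ."""
    return topk_difference(results_a, results_b, k) > threshold
```

This only compares set membership in the top k, so two lists with the same results in a different order count as identical.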

Is there a real-time or batch learning algorithm for this?

Here is what I am planning so far:

  1. Batch incoming events over some time interval, say 5 minutes.

  2. Group all the responses by (groupid, query), so that I have a list of records of the form

    (query1, group1, r1,r2,r3,...,r10)

    (query1, group2, r1,r4,r5,...,r11)

    (query1, group1, r2,r1,r3,...,r9)

    (query1, group2, r3,r4,r5,...,r11)

  3. Calculate the frequency distribution of results by groupid for a given query.

    (query1, group1): r1:5,r2:7,r3:10,r4:9 ... r11:10

    (query1, group2): r1:3,r2:9,r3:11,r4:11 ... r11:1

  4. Now measure how different group1 and group2 are from each other using the chi-square distance.
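The four steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: the record layout `(query, group, results)`, the helper names, and the 300-second bucket are placeholders. I use one common definition of the chi-square distance, half the sum of (p - q)^2 / (p + q) over normalised frequencies. Other definitions exist.

```python
from collections import Counter, defaultdict


def batch_id(timestamp_s, interval_s=300):
    """Step 1: map an event timestamp (seconds) to a 5-minute bucket."""
    return int(timestamp_s // interval_s)


def frequency_by_group(records):
    """Steps 2-3: count how often each result appears per (query, group).

    `records` is an iterable of (query, group, results) tuples that all
    belong to the same batch.
    """
    freqs = defaultdict(Counter)
    for query, group, results in records:
        freqs[(query, group)].update(results)
    return freqs


def chi_square_distance(counts_a, counts_b):
    """Step 4: chi-square distance between two frequency tables.

    Counts are normalised to proportions first, so a difference in the
    number of responses per group does not by itself create distance.
    Results missing from one group are treated as a count of zero.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both groups need at least one observation")
    distance = 0.0
    for key in set(counts_a) | set(counts_b):
        p = counts_a.get(key, 0) / total_a
        q = counts_b.get(key, 0) / total_b
        if p + q > 0:
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance
```

With this definition the distance is 0 for identical distributions and 1 for distributions that share no results, so any threshold would have to be chosen on that scale.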

I have a few questions about this:

  1. Are there other techniques or better-suited tests for this statistical analysis? Note that we would have to handle missing data.
  2. What are the pitfalls to be aware of? For example, what if the number of users in group1 vs. group2 is skewed?
  3. What if there are n groups instead of just 2?
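For questions 2 and 3, one candidate I am considering is the Pearson chi-square statistic over a groups x results contingency table, since it extends to n groups and its expected counts scale with each group's total. I have not validated this choice. The function name and input layout below are my own placeholders. I deliberately stop at the statistic and degrees of freedom, because each response contributes several results and the counts are therefore probably not independent, which a p-value would assume:

```python
def chi_square_statistic(group_counts):
    """Pearson chi-square statistic for an n-groups x results table.

    `group_counts` maps group id -> {result id: count}. Results absent
    from a group are treated as a count of zero.
    Returns (statistic, degrees_of_freedom).
    """
    results = sorted({r for counts in group_counts.values() for r in counts})
    row_totals = {g: sum(c.values()) for g, c in group_counts.items()}
    col_totals = {
        r: sum(c.get(r, 0) for c in group_counts.values()) for r in results
    }
    grand_total = sum(row_totals.values())
    if grand_total == 0:
        raise ValueError("the table contains no observations")

    statistic = 0.0
    for group, counts in group_counts.items():
        for r in results:
            # Expected count if every group shared one common distribution.
            expected = row_totals[group] * col_totals[r] / grand_total
            if expected > 0:
                observed = counts.get(r, 0)
                statistic += (observed - expected) ** 2 / expected

    # Only rows and columns that actually contain data count towards dof.
    rows = sum(1 for total in row_totals.values() if total > 0)
    cols = sum(1 for total in col_totals.values() if total > 0)
    return statistic, (rows - 1) * (cols - 1)
```

Groups with the same proportions but very different sizes give a statistic of 0 here, which is the behaviour I would want under skewed group sizes.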

Literature suggestions are also welcome.
