Reputation: 43

Filtering a dataset based on another dataset

I have lots of users and they have favorite colors. I have a dataset -each of its records has color data- and I want to send an email to each user. In each email, the user will see the filtered data based on his/her favorite colors which means I need to filter this dataset based on their favorite colors.

For example; users' favorite colors like that:
[User1:(“Green”, “Yellow”), User2:(“Green, Blue”), User3:(“Red”), User4:(“Orange”, “Purple”, “Red”), User5:(“Blue”, “Yellow”) …]

How can I effectively filter this dataset based on user’s favorite colors?

The most straightforward way is to loop through the user list and filter dataset by current user’s favorite colors in every iteration. However, that may cause redundant queries for the same or common colors. So, if I have 1 million users then I will make 1 million queries to the same dataset.

Can someone suggest an idea to make this process more elegant? I will do it with Python but the answer can be language independent.

Upvotes: 0

Answers (3)

Max Martynov

Reputation: 739

It will be better if you provide more details about language and tools/technologies that you use.

Is the question only in filtering of the existing dataset? Or can I make changes in the code? I have one idea if I can add some code.

I imagined how I can solve the problem don't use any tools (with pure JavaScript, for example). In this case I prefer to have two tables User -> Color (which you provided above) and Color -> User with relations between them and update both tables at once. Check the code snippet to see what I mean.

Redis (key-value database) will be a great choice for that.

I can't help you more, because the question doesn't contain any tech information, but I just leave my answer here. Perhaps this will push you to any idea :)

var USERS = {DefaultUser: {TestColor: true}};
var COLORS = {TestColor: {DefaultUser: true}};

function addColor (userId, color) {
	if (!COLORS[color]) COLORS[color] = {};
	COLORS[color][userId] = true;

	if (!USERS[userId]) USERS[userId] = {};
	USERS[userId][color] = true;
}

function removeColor (userId, color) {
	if (!COLORS[color]) return;
	delete COLORS[color][userId];

  if (!USERS[userId]) USERS[userId] = {};
	delete USERS[userId][color];
}

function findUsersByColor (color) {
	return Object.keys(COLORS[color] || {});
}

function addColorsToUsers () {
  addColor('User1', 'Green');
  addColor('User1', 'Yellow');
  addColor('User2', 'Green');
  addColor('User2', 'Blue');
  addColor('User3', 'Red');
  addColor('User4', 'Orange');
  addColor('User4', 'Purple');
  addColor('User4', 'Red');
  addColor('User5', 'Blue');
  addColor('User5', 'Yellow');
}

function runJob () {
  console.log('Result: findUsersByColor("Green")', findUsersByColor("Green"))
  removeColor("User1", "Green")
  console.log('Result: findUsersByColor("Green")', findUsersByColor("Green"))
}

addColorsToUsers();
runJob();

Upvotes: 0

SergGr

Reputation: 23788

Extending on @jake2389 idea, there are several tricks you can do. What you can really do greatly depends on how big your dataset is and how many times can you fit it in your memory (or your database). The obvious way to improve performance is to do some caching. Assume you have a method getRecordsForColors(colors) that does the real filtering (or real query to the DB). Some very naive approach would go like this (note I didn't try this code so there might be a lot of tiny mistakes):

cache = dict()

def getRecordsCached(colors):
    global cache
    if colors not in cache:
       records = getRecordsForColors(colors)
       cache[colors] = records
       return records
    else:
       return cache[colors]

The obvious drawback of this approach is that you have to hold in cache all the combinations of colors even if they are used by just 1 user and this might be a lot.

A bit more clever approach might be to choose some threshold like for example 3 colors that you can store all combinations for:

cache = dict()

def getRecordsCached(colors):
    global cache
    if colors not in cache:
       records = getRecordsForColors(colors)
       if len(colors) < threshold:
          cache[colors] = records
       return records
    else:
       return cache[colors]

This will cover most of the users and those users with rare long combinations will produce some duplicated queries.

Obviously you don't have to use a naive dict-based cache or in-memory cache at all. You can cache the data inside the same DB or you can use some specialized for cache DB like Memcached or Redis. Also instead of a threshold in a form of length of colors you may use some specialized cache library that support a LRU cache or some other replacement police

Finally if your logic is that the result for given set of colors is just a union of the results for each color, you may try to cover those rare big combinations of colors on the client side by caching results for each color alone and then if the color combination is not in cache directly, compute it by merging the items in the cached results for each color.

Upvotes: 1

user4934422

Reputation:

Since this is strictly theoretical (you don't provide which technology you wish to use), I would go ahead filter by a query that retrives users having the same matching options (colors). Now that can be accomplished either through SQL-query or LINQ to SQL if you're using .NET. If you can provide more info on what language you'll be using I can give you more specific answers.

Upvotes: 0

Filtering a dataset based on another dataset

Answers (3)

Related Questions