Guilherme Dimarchi
Guilherme Dimarchi

Reputation: 53

Best technology or algorithm to find best matches on a large database

We plan to have a large database with objects that have structure like this:

Person1: skills: ['a','b','c']

Person2: skills: ['a','b']

Person3: skills: ['d','e','f']

Person4: skills: ['a','b','d']

And then given an input of skills, the algorithm/technology shoud be able to quickly find the best fit Person given some skills.

Example: Find person with skills: a, b -> returns the list ordened like this [Person1,Person2,Person4,Person3]

So i would like some recommendations on what technology (database / language) to build this on top and which algorithm should perform good on a database with about 10k registers.

Upvotes: 1

Views: 241

Answers (2)

SaiBot
SaiBot

Reputation: 3745

You want to use an inverted index for this problem. The basic idea is to invert your representation from

1 -> a, b, c
2 -> a, b
3 -> d, e, f
4 -> a, b, d

to

a -> 1, 2, 4
b -> 1, 2, 4
c -> 1
d -> 3, 4
e -> 3
f -> 3

Now for each skill, you have a list of people capable of that skill (possibly ordered by skill level). In order to get the result for skills a, b you scan through the lists of a and b and increase the counter of each person you found, which gives you persons 1, 2, 4 each with count 3.

This is basically the same index structure as used for text search (here you have documents containing terms). Systems like elastic search include more advanced inverted indices that might suit your needs.

Upvotes: 1

desoares
desoares

Reputation: 861

Independently of the database you're planning to use queries you consider primary (the ones used more often) may have a huge benefit from indexing.

You should create the index in the same order of the queries. Based on the model you used for your example I take it that you're using a NoSQL DB. Indexes provide better performance on search but takes more time to record.

Finally I have to say that 10k is not a big collection, but querying nested arrays may be much slower without index.

Upvotes: 0

Related Questions