Reputation: 5475
Articles are parsed from an RSS feed, and each article may fall into several categories. Each article also has some metadata, such as source, upstream, etc.
Below is how we are designing the spaces. Each article is inserted into the articles space.
articles space
urlhash | article.content
abcdef | { dummy content}
Primary key: urlhash = hash(article.url).
In the category_articles space, we insert the article once for each category it falls into.
category_articles
source | category | urlhash | timestamp
bbc | arts | article1 | 27777
bbc | mobile | article8 | 27777
bbc | phone | article3 | 27778
nyt | sound | article7 | 36667
nyt | speaker | article7 | 45556
primary key = {source, category, urlhash}
secondary key = {source, category, timestamp}
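The layout above could be declared roughly as follows. This is only a sketch to be run inside Tarantool after box.cfg{}: the field types (string for source, category and urlhash, unsigned for timestamp) and the index name 'primary' are assumptions, since the question only names the 'secondary' index.

```lua
-- Sketch of the category_articles schema; field types are assumed.
local ca = box.schema.space.create('category_articles', {if_not_exists = true})

-- Primary key {source, category, urlhash}: one row per article per category.
ca:create_index('primary', {
    type = 'tree',
    parts = {1, 'string', 2, 'string', 3, 'string'},
    if_not_exists = true,
})

-- Secondary key {source, category, timestamp}: non-unique, because two
-- articles in the same source and category may share a timestamp.
ca:create_index('secondary', {
    type = 'tree',
    unique = false,
    parts = {1, 'string', 2, 'string', 4, 'unsigned'},
    if_not_exists = true,
})
```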
I need the latest articles for a given source and, optionally, a category. Below is how I framed the query.
box.space.category_articles.index.secondary:select({'nyt', 'speaker'}, {iterator = 'LE', limit = 5})
With this query I get article7 twice in the result. Currently I am filtering out the duplicates in my code. Is there a distinct(urlhash) kind of option in Tarantool?
Upvotes: 0
Views: 390
Reputation: 5475
I was able to find a better solution: use the pairs function on the index and filter the articles, tracking the unique ones in a Lua table, until I have the required number of unique articles.
index_object:pairs([key[, iterator-type]])
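A minimal sketch of that approach is below. The function name select_distinct and its parameters are hypothetical, and it assumes urlhash is the third field of each tuple, as in the category_articles layout from the question. It is written as a global function so that it can be called as a stored procedure.

```lua
-- Collect up to `limit` tuples with distinct urlhash values.
-- `index` is any object exposing index_object:pairs(key, opts),
-- e.g. box.space.category_articles.index.secondary.
function select_distinct(index, key, limit, urlhash_field)
    urlhash_field = urlhash_field or 3  -- assumed position of urlhash
    local seen, result = {}, {}
    if limit <= 0 then
        return result
    end
    -- pairs() yields tuples lazily, so iteration stops as soon as
    -- enough unique articles have been collected.
    for _, tuple in index:pairs(key, {iterator = 'LE'}) do
        local urlhash = tuple[urlhash_field]
        if not seen[urlhash] then
            seen[urlhash] = true
            result[#result + 1] = tuple
            if #result >= limit then
                break
            end
        end
    end
    return result
end
```

For the query in the question, this would be called as select_distinct(box.space.category_articles.index.secondary, {'nyt', 'speaker'}, 5). Unlike select with a fixed limit, it keeps reading past duplicates until it has 5 unique articles or the index is exhausted.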
Upvotes: 1
Reputation: 504
There are two possible options.
The second one is to use a Lua stored procedure. An example:
function select_with_distinct()
    local ca = box.space.category_articles
    -- The key and the options are passed to select() as two separate arguments.
    local rows = ca.index.secondary:select({'nyt', 'speaker'}, {iterator = 'LE', limit = 5})
    for _, v in pairs(rows) do
        -- filtering ...
    end
end
Upvotes: 0