Reputation: 161
I have lots of sets of strings; each set has different items, and some of the items overlap between sets.
Given a search set of items, I want to find the stored set that contains all of the items in the search set.
Is there something I can do with hashing?
Is union-find the right algorithm?
Is a Bloom filter what I want?
In the following example, I first check whether there is an exact match by relying on HashSet's equals.
Then I search each set and do a containsAll to see whether it contains the whole search set.
As an example, if I have a "fruits" set with ["orange", "kiwi", "apple"] and a "vegetables" set with ["cabbage", "carrot", "broccoli"], I want a search for ["orange", "kiwi"] to match "fruits".
There must be a better approach than this, but I'm not sure what it is.
package main;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class SetKeyHash {

    public static void main(String[] args) {
        HashSet<String> vegetables = new HashSet<>();
        vegetables.add("tomato");
        vegetables.add("carrot");
        vegetables.add("broccoli");

        HashSet<String> fruit = new HashSet<>();
        fruit.add("apple");
        fruit.add("kiwi");
        fruit.add("orange");

        Map<HashSet<String>, String> map = new HashMap<>();
        map.put(vegetables, "vegetables");
        map.put(fruit, "fruit");

        HashSet<String> search = new HashSet<>();
        search.add("tomato");
        search.add("carrot");
        search.add("broccoli");

        System.out.println("Exact set search");
        System.out.println(map.get(search));

        System.out.println("Partial set search");
        HashSet<String> partialSearch = new HashSet<>();
        partialSearch.add("tomato");
        partialSearch.add("broccoli");
        for (HashSet<String> set : map.keySet()) {
            if (set.containsAll(partialSearch)) {
                System.out.println(String.format("Found partial match %s", map.get(set)));
            }
        }
    }
}
Upvotes: 5
Views: 1109
Reputation: 1
You could gather, for each item, the sets (tags) it belongs to, like orange -> fruits, kiwi -> fruits, ...,
and then loop over the input items to narrow down the resulting intersection.
Something like the code below. Note that I used Set as the map value type because one item could exist in multiple sets (even though the fruits/vegetables sample data does not).
import java.util.Collection;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class a {

    private static final Map<String, Set<String>> ITEM_TAGS = Map.of(
            "orange", Set.of("fruits"),
            "kiwi", Set.of("fruits"),
            "apple", Set.of("fruits"),
            "cabbage", Set.of("vegetables"),
            "carrot", Set.of("vegetables"),
            "broccoli", Set.of("vegetables")
    );

    private static Collection<String> getTags(Set<String> items) {
        if (items == null || items.isEmpty()) {
            return Set.of();
        }
        else if (items.size() == 1) {
            // getOrDefault avoids returning null for an item that is not in the index
            return ITEM_TAGS.getOrDefault(items.iterator().next(), Set.of());
        }
        Iterator<String> iterItems = items.iterator();
        Set<String> tagScope = ITEM_TAGS.get(iterItems.next());
        if (tagScope == null || tagScope.isEmpty()) {
            return Set.of();
        }
        Set<String> result = new TreeSet<>(tagScope);
        while (iterItems.hasNext() && !result.isEmpty()) {
            Set<String> scope = ITEM_TAGS.get(iterItems.next());
            if (scope == null || scope.isEmpty()) {
                result.clear();
            }
            else {
                result.retainAll(scope);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Collection<String> fruits = getTags(Set.of("orange", "kiwi"));
        System.out.println(fruits);
    }
}
Upvotes: 0
Reputation:
You can build an index mapping each item (e.g. apple) to all the sets that contain it, then look up each search term in the index at search time and count how many of the terms each set contains:
package com.example;

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import static java.util.function.Function.identity;
import static java.util.stream.Collectors.toMap;

public class Sets {

    public static void main(String[] args) {
        var vegetables = Set.of("tomato", "carrot", "broccoli");
        var fruits = Set.of("apple", "kiwi", "orange");
        var random = Set.of("tomato", "rock");

        /*
         * A map where the key is an item (e.g. apple) and its value is a collection of all the sets
         * that contain this item.
         */
        var index = new HashMap<String, Set<Set<String>>>();
        for (Set<String> items : List.of(vegetables, fruits, random)) {
            for (String item : items) {
                index.computeIfAbsent(item, key -> new HashSet<>()).add(items);
            }
        }

        var search = Set.of("tomato", "broccoli");

        /*
         * For each of the terms in the search, find all the sets that contain each, and produce a map
         * where the key is the set, and the value is the number of search terms the set matches.
         */
        var atLeastOneMatches =
                search.stream()
                        .collect(toMap(identity(), item -> index.getOrDefault(item, Set.of())))
                        .values()
                        .stream()
                        .flatMap(Collection::stream)
                        .collect(toMap(identity(), it -> 1, Integer::sum));

        var extractMatches =
                atLeastOneMatches.entrySet().stream()
                        .filter(entry -> entry.getValue() == search.size())
                        .map(Map.Entry::getKey)
                        .toList();

        System.out.println("Search term(s): " + search);
        System.out.println(
                "Set(s) that contain at least one of the search terms: " + atLeastOneMatches.keySet());
        System.out.println("Set(s) that contain all of the search terms: " + extractMatches);
    }
}
Upvotes: 0
Reputation: 4251
Your question is actually quite broad; the topic is huge and many details are omitted here, but I'll offer and explain a few different approaches. We'll also review the Bloom filters and union operations you mentioned.
To simplify the code we'll use Google's Guava library for Java. This is the Maven pom.xml dependency to add to our project:
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>31.1-jre</version>
</dependency>
Having said that, let's take a look at the different approaches.
Since we're dealing with strings, if they are primarily quite long they could take up too much space in RAM. But as long as we don't care about the actual values of the strings in our collections, we can store only their hash codes. If our strings are mostly short, there's no need to hash them; we can just use them as they are.
Let's assume our strings may be quite long (in the real world, URLs for example). Which hashing algorithm should we use? It really depends, but the main goal is to avoid collisions as much as possible, so in this example we'll use the SHA-256 hashing algorithm, which gives a 256-bit output.
So, the way we would search for intersections in this approach is the following:
public static void main(String[] args) {
    // initialize the first set in which we'll perform searching
    Set<HashCode> vegetables = Set.of(
            hash256("tomato"),
            hash256("carrot"),
            hash256("broccoli")
    );
    // initialize the second set in which we'll perform searching
    Set<HashCode> fruit = Set.of(
            hash256("apple"),
            hash256("kiwi"),
            hash256("orange")
    );
    // collect those collections into a Map, where key = collection name, value = collection itself
    Map<String, Set<HashCode>> targetMap = Map.of("vegetables", vegetables, "fruit", fruit);
    // initialize the set to search full intersections for
    Set<HashCode> search = Set.of(hash256("tomato"), hash256("carrot"));
    // let's search
    Set<String> intersections = targetMap.entrySet().stream()
            // skip sets whose size is smaller than the search set's
            .filter(target -> target.getValue().size() >= search.size())
            // check the remaining sets with containsAll()
            .filter(entry -> entry.getValue().containsAll(search))
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    // print the names of the sets that fully contain our search set
    System.out.println(intersections);
}

// using the Guava library to get the 256-bit hash of a string
public static HashCode hash256(String input) {
    HashFunction hashFunction = Hashing.sha256();
    return hashFunction.hashString(input, StandardCharsets.UTF_8);
}
Note: while iterating through the sets we first skip all sets whose size is less than the search set's, just for optimization's sake. There's no point in checking those sets, since they are guaranteed not to have enough elements to fully contain our search set. You can also add an equals() check if you want to know whether the search set is exactly equal to the target set.
A Bloom filter is a probabilistic data structure that is good for checking whether an element is definitely not in a set, but it cannot tell you for sure that something is in the set: a query returns either "possibly in set" or "definitely not in set". The crucial point is that a Bloom filter doesn't store the elements themselves. You don't use a Bloom filter to test whether an element is present; you use it to test whether it is certainly not present, since it guarantees no false negatives. And it takes significantly less space than a hash table or hash set and speeds up the checking.
A Bloom filter works by taking each element, running it through N different hash functions, and setting to true the bit positions those hashes select in a bit vector. Once a position is flipped to true (1), it is never set back to false (0).
The number of different hash functions N, as well as the length of the bit vector, is usually derived by the implementation from a configured capacity and a maximum allowed error rate. The capacity is the number of distinct elements we expect the Bloom filter to ever see.
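To make the mechanics concrete, here is a minimal toy sketch of that idea in plain Java: a single shared bit vector, with N positions derived from two base hashes of the value. This is only an illustration, not Guava's actual implementation, and all names in it are made up for the example.
import java.util.BitSet;

// Toy Bloom filter, for illustration only (not Guava's implementation).
// N hash functions each map an element to a bit position; add() sets those
// bits, and mightContain() reports true only if every one of them is set.
public class ToyBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int numHashFunctions;

    public ToyBloomFilter(int size, int numHashFunctions) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashFunctions = numHashFunctions;
    }

    // Derive the i-th bit position from two base hashes of the value.
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.reverseBytes(h1) ^ 0x5bd1e995;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String value) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(position(value, i));
        }
    }

    public boolean mightContain(String value) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(position(value, i))) {
                return false; // definitely not present
            }
        }
        return true; // possibly present (may be a false positive)
    }

    public static void main(String[] args) {
        ToyBloomFilter bloom = new ToyBloomFilter(1 << 10, 3);
        bloom.add("tomato");
        bloom.add("carrot");
        System.out.println(bloom.mightContain("tomato")); // true
        System.out.println(bloom.mightContain("kiwi"));   // almost certainly false
    }
}
Note that a real implementation sizes the bit vector and picks N from the expected number of insertions and the allowed error rate, which is exactly what Guava does for you.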
Of course, it has disadvantages, most notably the possibility of false positives.
So, using the Guava library, which has a Bloom filter implementation, we could write our code the following way:
public static void main(String[] args) {
    // initialize the first set in which we'll perform searching
    Set<String> vegetables = Set.of("tomato", "carrot", "broccoli");
    // initialize the second set in which we'll perform searching
    Set<String> fruit = Set.of("apple", "kiwi", "orange");
    // prepare our Bloom filters:
    // expectedInsertions = 100 for this test,
    // fpp (the desired false positive probability) = 0.01, i.e. 1%
    int expectedInsertions = 100;
    BloomFilter<String> vegetablesBloom = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, 0.01);
    BloomFilter<String> fruitBloom = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, 0.01);
    // put all the vegetables and fruits into the filters
    vegetables.forEach(vegetablesBloom::put);
    fruit.forEach(fruitBloom::put);
    // collect those filters into a Map, where key = filter name, value = filter itself
    Map<String, BloomFilter<String>> targetMap =
            Map.of("vegetables", vegetablesBloom, "fruit", fruitBloom);
    // initialize the set to search full intersections for
    Set<String> search = Set.of("tomato", "carrot");
    Set<String> results = targetMap.entrySet().stream()
            // skip filters whose approximate size is smaller than the search set's
            .filter(entry -> entry.getValue().approximateElementCount() >= search.size())
            .filter(entry -> intersect(search, entry.getValue()))
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());
    System.out.println(results);
}

// check the intersection of our search set with a Bloom filter
public static boolean intersect(Set<String> search, BloomFilter<String> bloomFilter) {
    for (String value : search) {
        // return false immediately if the element is definitely not present
        if (!bloomFilter.mightContain(value)) {
            return false;
        }
    }
    return true;
}
Pay attention to the parameters of BloomFilter.create(): expectedInsertions (100 here) is the number of elements we expect to insert, and fpp (0.01 here) is the desired false positive probability, which must be positive and less than 1.0. The smaller the fpp we provide, the more accurate our filter; however, the more accurate it is, the more memory it takes.
Here we iterate through our filters, checking each element of the search collection with mightContain(value): if it returns false, then we are sure this element is not present, so this filter does not fully contain our search set.
In the previous approach we checked each element of our search collection against a Bloom filter. One operation you can do with two Bloom filters is to union them: if you have Bloom filter A and Bloom filter B, you end up with a third Bloom filter C that contains all the unique items from both A and B. To union two Bloom filters you just do a bitwise OR on every bit of A and B to get the bits of C. Simple and fast to compute.
By itself this operation isn't useful for our case, but we can use it to perform the intersection check. How? There's some math involved:
Count(Intersection(A,B)) = (Count(A) + Count(B)) - Count(Union(A,B))
For example, if the search filter counts 2 elements, the target filter counts 3, and their union still counts 3, the intersection count is (2 + 3) - 3 = 2, i.e. every element of the search set is (probably) in the target set.
With that said, let's try to perform this. Change the relevant piece of code from the previous approach to the following:
// prepare our search Bloom filter
BloomFilter<String> searchBloom = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, 0.01);
search.forEach(searchBloom::put);

Set<String> results = targetMap.entrySet().stream()
        // skip filters whose approximate size is smaller than the search set's
        .filter(entry -> entry.getValue().approximateElementCount() >= search.size())
        .filter(entry -> intersectWithUnion(searchBloom, entry.getValue()))
        .map(Map.Entry::getKey)
        .collect(Collectors.toSet());
System.out.println(results);
And the intersectWithUnion() method:
// check the intersection of our search Bloom filter with another Bloom filter
public static boolean intersectWithUnion(BloomFilter<String> searchBloom, BloomFilter<String> bloomFilter) {
    long searchCount = searchBloom.approximateElementCount();
    long targetCount = bloomFilter.approximateElementCount();
    // make a copy of the search filter
    BloomFilter<String> union = searchBloom.copy();
    // perform the union operation
    union.putAll(bloomFilter);
    // apply our formula
    long intersectionFactor = (searchCount + targetCount) - union.approximateElementCount();
    // if the intersection factor equals the initial search count, we have a full intersection
    return intersectionFactor == searchCount;
}
Here we first create a Bloom filter from our search set, then check the intersection of each target filter with our new method. The union operation is implemented in Guava by the putAll() method. Why do we call copy() on our search filter? Well, as the docs of putAll() state:
Combines this Bloom filter with another Bloom filter by performing a bitwise OR of the underlying data. The mutations happen to this instance. Callers must ensure the Bloom filters are appropriately sized to avoid saturating them.
So, putAll() mutates our filter; the copy() call is our overhead in this case.
Important note: to perform the union operation, the Bloom filters must be compatible. Basically, they must have been created with the same funnel, the same expected insertions, and the same fpp, so that they end up with the same bit size and number of hash functions; otherwise an IllegalArgumentException will be thrown.
Well, there's no single answer to this question, as it really depends on a lot of factors in your real case; no approach is always better or always more performant. Bloom filters are very useful and very memory-cheap when dealing with huge amounts of data, for example millions and millions of strings, but they have their own disadvantages. In simple cases where you don't have that much data, it is definitely better to use a straightforward approach. Hope it helps :)
Also, an interesting article about bloom filters
Upvotes: 3
Reputation: 1123
You can use an inverted index; Lucene uses one.
            vegetables(Document)   fruit(Document)
tomato               +
carrot               +
broccoli             +
apple                                     +
kiwi                                      +
orange                                    +
Other answers are probably saying the same thing: invert the map and search item by item.
If you have a lot of those sets, it's better to insert them into a database like Elasticsearch, which uses an inverted index too, or to use full-text search on MySQL (which also uses an inverted index).
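Before reaching for a search engine, here is a minimal sketch of the same inverted-index idea in plain Java, assuming the set names and contents from the question (the class and variable names are just illustrative):
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndexExample {

    public static void main(String[] args) {
        // The "documents": set name -> items it contains.
        Map<String, Set<String>> documents = Map.of(
                "vegetables", Set.of("tomato", "carrot", "broccoli"),
                "fruit", Set.of("apple", "kiwi", "orange"));

        // Build the inverted index: item -> names of the sets containing it.
        Map<String, Set<String>> index = new HashMap<>();
        documents.forEach((name, items) ->
                items.forEach(item ->
                        index.computeIfAbsent(item, k -> new HashSet<>()).add(name)));

        // Search: intersect the posting lists of every search term.
        Set<String> search = Set.of("apple", "kiwi");
        Set<String> result = new HashSet<>(documents.keySet());
        for (String term : search) {
            result.retainAll(index.getOrDefault(term, Set.of()));
        }
        System.out.println(result); // [fruit]
    }
}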
Upvotes: 1
Reputation: 1019
Another way of looking at this is preparing an index over the component sets, which here are fruits
and vegetables.
A Pythonic way to do this would be as follows; I am sure something similar can easily be done in Java.
>>> from collections import defaultdict
>>> lookup = defaultdict(lambda: set())
>>> populate_lookup = lambda x, y: [lookup[elem].add(y) for elem in x]
>>> populate_lookup(["orange", "kiwi", "apple"], "fruits")
[None, None, None]
>>> populate_lookup(["cabbage", "carrot", "broccoli"], "vegetables")
[None, None, None]
>>> search_matching = lambda x: set.intersection(*[lookup[elem] for elem in x])
>>> search_matching(["orange", "kiwi"])
{'fruits'}
>>>
Where lookup is simply a reverse dict of the elements:
>>> dict(lookup)
{'orange': {'fruits'}, 'kiwi': {'fruits'}, 'apple': {'fruits'}, 'cabbage': {'vegetables'}, 'carrot': {'vegetables'}, 'broccoli': {'vegetables'}}
>>>
If you want to solve this for a service using SQL and tables, I would suggest looking at GIN indexes in Postgres.
EDITED (Added)
An alternative to intersection
would be a Counter, as follows; it's not necessarily an optimization (set intersection will only look at the smallest set):
from collections import Counter
search_matching_counter = lambda x: [xi[0] for xi in Counter([set_name for elem in x for set_name in lookup[elem]]).items() if xi[1] == len(x)]
search_matching_counter(["orange", "kiwi"])
['fruits']
Upvotes: 1
Reputation: 36
As far as I understand your problem, you are trying to classify different vegetables and fruits, store the name of each vegetable and fruit just once, and retrieve the type of a food item by its name.
The way you are describing your search query,
"If I have the "fruits" set with ["orange", "kiwi", "apple"] and "vegetables" set with ["cabbage", "carrot", "broccoli"]. I want a search for ["orange", "kiwi"] to match "fruits"
You would have to iterate through the parameter array (["orange", "kiwi"]) and look for matches. You can keep a count of how many items in the search parameter are vegetables and how many are fruits.
Storing your data works the same way as described by Moemen. It would be
Map<String, HashSet<String>> map = new HashMap<>();
The type ("vegetables", "fruits") will be the key, stored as a string, and your items will be in a HashSet.
In its entirety, the map would look something like:
Map-> "Vegetables": ["tomato", "broccoli", "carrot"],
"fruits": ["apple", "kiwi", "orange"]
A second approach would be to have a map of string pointing to string:
map => {string: string}
You can fill this map so that it looks like:
map -> "orange": "fruit",
"apple": "fruit",
"carrot": "vegetable", and so on.
This way you can run a loop over the searched food items and retrieve whether each is a fruit or a vegetable.
I am almost certain that combined food items cannot be used to query a map directly, meaning you will have to go element by element: for ["orange", "apple"], you would loop and look up the type of orange and then of apple, something like the sketch below.
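A minimal sketch of that second approach in plain Java, assuming the sample data from the question (the class name and map contents are just illustrative):
import java.util.List;
import java.util.Map;

public class ItemTypeLookup {

    public static void main(String[] args) {
        // Second approach: each item maps directly to its type.
        Map<String, String> itemToType = Map.of(
                "orange", "fruit",
                "apple", "fruit",
                "kiwi", "fruit",
                "tomato", "vegetable",
                "carrot", "vegetable",
                "broccoli", "vegetable");

        // Query element by element, as described above.
        for (String item : List.of("orange", "apple")) {
            System.out.println(item + " -> " + itemToType.getOrDefault(item, "unknown"));
        }
    }
}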
Hope this answers your questions and gives you some insight into planning structures for data storage.
Upvotes: 1
Reputation: 404
You need to use Map<String, HashSet<String>> map = new HashMap<>(); instead.
The following solution finds all the set names whose sets contain every element of the "search" set.
HashSet<String> vegetables = new HashSet<>();
vegetables.add("tomato");
vegetables.add("carrot");
vegetables.add("broccoli");
HashSet<String> fruit = new HashSet<>();
fruit.add("apple");
fruit.add("kiwi");
fruit.add("orange");
Map<String, HashSet<String>> map = new HashMap<>();
map.put("vegetables", vegetables);
map.put("fruit", fruit);
HashSet<String> search = new HashSet<>();
search.add("tomato");
search.add("carrot");
//search.add("broccoli");
Set<String> nameOfSets = map.entrySet().stream()
.filter(entry -> entry.getValue().containsAll(search))
.map(Map.Entry::getKey).collect(Collectors.toSet());
Upvotes: 1