Reputation: 2889
I have a Java hash map which is used for generating "rules" learned by inference. For example, the input might look like:
'prevents'('scurvy','vitamin C').
'contains'('vitamin C','orange').
'contains'('vitamin C','sauerkraut').
'isa'('fruit','orange').
'improves'('health','fruit').
and the output might look like this:
prevents(scurvy, orange).
prevents(scurvy, sauerkraut).
improves(health, orange).
For a small test everything works quite well, but in my actual data set I get many instances of identical rules. I want to store each unique rule together with the number of times it occurs and write both to a file, since I think the count can serve as a naive confidence measure for how likely the rule is to be a good one.
At this juncture I store sentences like this:
public class Sentence {
    private String verb;
    private String object;
    private String subject;

    public Sentence(String verb, String object, String subject) {
        this.verb = verb;
        this.object = object;
        this.subject = subject;
    }

    public String getVerb()    { return verb; }
    public String getObject()  { return object; }
    public String getSubject() { return subject; }

    public String toString() {
        return verb + "(" + object + ", " + subject + ")";
    }
}
The class that builds the hash maps:
public class Ontology {
    private List<Sentence> sentences = new ArrayList<>();

    /*
     * The following maps store the relation of a string occurring
     * as a subject or object, respectively, to the list of Sentence
     * ordinals where they occur.
     */
    private Map<String, List<Integer>> subject2index = new HashMap<>();
    private Map<String, List<Integer>> object2index = new HashMap<>();
    /*
     * This set contains strings that occur as both subject and
     * object. This is useful for finding strings that act as a
     * link connecting two relations.
     */
    private Set<String> joints = new HashSet<>();
    public void addSentence(Sentence s) {
        // add Sentence to the list of all Sentences
        sentences.add(s);

        // add the Subject of the Sentence to the map mapping strings
        // occurring as a subject to the ordinal of this Sentence
        List<Integer> subind = subject2index.get(s.getSubject());
        if (subind == null) {
            subind = new ArrayList<>();
            subject2index.put(s.getSubject(), subind);
        }
        subind.add(sentences.size() - 1);

        // add the Object of the Sentence to the map mapping strings
        // occurring as an object to the ordinal of this Sentence
        List<Integer> objind = object2index.get(s.getObject());
        if (objind == null) {
            objind = new ArrayList<>();
            object2index.put(s.getObject(), objind);
        }
        objind.add(sentences.size() - 1);

        // determine whether we've found a "joining" string
        if (subject2index.containsKey(s.getObject())) {
            joints.add(s.getObject());
        }
        if (object2index.containsKey(s.getSubject())) {
            joints.add(s.getSubject());
        }
    }

    public Collection<String> getJoints() {
        return joints;
    }

    public List<Integer> getSubjectIndices(String subject) {
        return subject2index.get(subject);
    }

    public List<Integer> getObjectIndices(String object) {
        return object2index.get(object);
    }

    public Sentence getSentence(int index) {
        return sentences.get(index);
    }
}
and finally the code that determines the rules:
public static void main(String[] args) throws IOException {
    Ontology ontology = new Ontology();
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)");
    try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                String verb = m.group(1);
                String object = m.group(2);
                String subject = m.group(3);
                ontology.addSentence(new Sentence(verb, object, subject));
            }
        }
    }
    for (String joint : ontology.getJoints()) {
        for (Integer subind : ontology.getSubjectIndices(joint)) {
            Sentence xaS = ontology.getSentence(subind);
            for (Integer obind : ontology.getObjectIndices(joint)) {
                Sentence yOb = ontology.getSentence(obind);
                Sentence s = new Sentence(xaS.getVerb(),
                                          xaS.getObject(),
                                          yOb.getSubject());
                System.out.println(s);
            }
        }
    }
}
Is there a fast and efficient way to eliminate the duplicates from this structure, keeping only one instance of each unique rule while associating it with the number of identical instances that were observed in the original map?
I want to eliminate the duplicate rules after the sentences have been processed, but only after I have had a chance to count how often each rule occurred and to save that count as a value associated with the unique rule I keep.
Upvotes: 0
Views: 155
Reputation: 140318
If you are able to use Guava, you can use an implementation of Multiset. The example in the user guide sounds reasonably similar to your requirements.
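For example, a rough sketch with Guava's HashMultiset (this assumes Guava is on the classpath and that Sentence overrides equals and hashCode, which the class in the question does not do yet):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

Multiset<Sentence> ruleCounts = HashMultiset.create();

// wherever the rule loop currently prints a Sentence, record it instead:
ruleCounts.add(new Sentence(xaS.getVerb(), xaS.getObject(), yOb.getSubject()));

// afterwards, each distinct rule and the number of times it was derived:
for (Multiset.Entry<Sentence> entry : ruleCounts.entrySet()) {
    System.out.println(entry.getElement() + " : " + entry.getCount());
}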
Upvotes: 0
Reputation: 27946
I suggest some changes to your data model. You can quite easily store the number of times a sentence occurs in a Map as follows:
Map<Sentence, Integer> sentenceCount = new HashMap<>();
This relies on implementing equals and hashCode methods for Sentence. It automatically eliminates duplicates by using Sentence as the key.
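A minimal sketch of those overrides for the Sentence class shown in the question, assuming two sentences are equal when verb, object, and subject all match:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Sentence)) return false;
    Sentence other = (Sentence) o;
    return verb.equals(other.verb)
            && object.equals(other.object)
            && subject.equals(other.subject);
}

@Override
public int hashCode() {
    return java.util.Objects.hash(verb, object, subject);
}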
You can add new sentences to it as follows:
public void addSentence(Sentence sentence) {
    // start at zero for a new sentence, then increment its count
    sentenceCount.merge(sentence, 1, Integer::sum);
}
Now you no longer need your sentences list because you can get the set of sentences using sentenceCount.keySet().
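Writing the unique rules and their counts to a file, as asked in the question, could then look roughly like this (this assumes the usual java.io imports and an enclosing method that declares or handles IOException; rules.txt is just a placeholder name):

try (PrintWriter out = new PrintWriter(new FileWriter("rules.txt"))) {
    for (Map.Entry<Sentence, Integer> entry : sentenceCount.entrySet()) {
        // one line per unique rule, followed by how often it was derived
        out.println(entry.getKey() + " " + entry.getValue());
    }
}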
If you need the maps from subject and object to sentences, then I don't suggest you use an index: that is an error-prone approach. Instead, I suggest you make them direct maps:
Map<String, Set<Sentence>> subjectMap;
Map<String, Set<Sentence>> objectMap;
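One way to keep these maps in sync is to update them wherever a sentence is added, for example with computeIfAbsent (a sketch assuming the field names above and HashSet values):

// inside addSentence, after updating sentenceCount:
subjectMap.computeIfAbsent(sentence.getSubject(), k -> new HashSet<>()).add(sentence);
objectMap.computeIfAbsent(sentence.getObject(), k -> new HashSet<>()).add(sentence);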
You can use this to find, say, the number of times a certain subject appears:
subjectMap.get("subject").stream().mapToInt(sentenceCount::get).sum();
Upvotes: 1