Reputation: 2889
I have a Java hash map which is used for generating "rules" learned by inference. For example, the input might look like:
'prevents'('scurvy','vitamin C').
'contains'('vitamin C','orange').
'contains'('vitamin C','sauerkraut').
'isa'('fruit','orange').
'improves'('health','fruit').
and the output might look like this:
prevents(scurvy, orange).
prevents(scurvy, sauerkraut).
improves(health, orange).
For a small test everything works quite well, but in my actual data set I get many instances of identical rules. I want to store each unique rule together with the number of times it occurs and write both to a file, since I think the count can serve as a naive confidence measure for how likely the rule is to be a good one.
At this juncture I store sentences like this:
public class Sentence {
    private String verb;
    private String object;
    private String subject;

    public Sentence(String verb, String object, String subject) {
        this.verb = verb;
        this.object = object;
        this.subject = subject;
    }

    public String getVerb()    { return verb; }
    public String getObject()  { return object; }
    public String getSubject() { return subject; }

    public String toString() {
        return verb + "(" + object + ", " + subject + ")";
    }
}
The class that builds the hash maps:
public class Ontology {
    private List<Sentence> sentences = new ArrayList<>();

    /*
     * The following maps store the relation of a string occurring
     * as a subject or object, respectively, to the list of Sentence
     * ordinals where they occur.
     */
    private Map<String, List<Integer>> subject2index = new HashMap<>();
    private Map<String, List<Integer>> object2index = new HashMap<>();
    /*
     * This set contains strings that occur as both subject and
     * object. This is useful for finding strings that act as a
     * link connecting two relations.
     */
    private Set<String> joints = new HashSet<>();
    public void addSentence(Sentence s) {
        // add Sentence to the list of all Sentences
        sentences.add(s);

        // add the Subject of the Sentence to the map mapping strings
        // occurring as a subject to the ordinal of this Sentence
        List<Integer> subind = subject2index.get(s.getSubject());
        if (subind == null) {
            subind = new ArrayList<>();
            subject2index.put(s.getSubject(), subind);
        }
        subind.add(sentences.size() - 1);

        // add the Object of the Sentence to the map mapping strings
        // occurring as an object to the ordinal of this Sentence
        List<Integer> objind = object2index.get(s.getObject());
        if (objind == null) {
            objind = new ArrayList<>();
            object2index.put(s.getObject(), objind);
        }
        objind.add(sentences.size() - 1);

        // determine whether we've found a "joining" string
        if (subject2index.containsKey(s.getObject())) {
            joints.add(s.getObject());
        }
        if (object2index.containsKey(s.getSubject())) {
            joints.add(s.getSubject());
        }
    }

    public Collection<String> getJoints() {
        return joints;
    }

    public List<Integer> getSubjectIndices(String subject) {
        return subject2index.get(subject);
    }

    public List<Integer> getObjectIndices(String object) {
        return object2index.get(object);
    }

    public Sentence getSentence(int index) {
        return sentences.get(index);
    }
}
and finally the code that determines the rules:
public static void main(String[] args) throws IOException {
    Ontology ontology = new Ontology();
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)");
    try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
        String line;
        while ((line = br.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                String verb = m.group(1);
                String object = m.group(2);
                String subject = m.group(3);
                ontology.addSentence(new Sentence(verb, object, subject));
            }
        }
    }
    for (String joint : ontology.getJoints()) {
        for (Integer subind : ontology.getSubjectIndices(joint)) {
            Sentence xaS = ontology.getSentence(subind);
            for (Integer obind : ontology.getObjectIndices(joint)) {
                Sentence yOb = ontology.getSentence(obind);
                Sentence s = new Sentence(xaS.getVerb(),
                                          xaS.getObject(),
                                          yOb.getSubject());
                System.out.println(s);
            }
        }
    }
}
Is there a fast and efficient way to eliminate the duplicates from this structure, keeping only one instance of each unique rule while associating it with the number of identical instances that were observed in the original map?
I want to eliminate the duplicate rules after the sentences have been processed, but only after I have had a chance to count how often each rule occurred and to save that count as a value associated with the unique rule I keep.
Upvotes: 0
Views: 155
Reputation: 140318
If you are able to use Guava, you can use an implementation of Multiset. The example in the user guide sounds reasonably similar to your requirements.
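For example, a rough sketch with Guava's HashMultiset (this assumes Guava is on the classpath and that Sentence overrides equals and hashCode, which the class in the question does not do yet):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

Multiset<Sentence> ruleCounts = HashMultiset.create();

// wherever the rule loop currently prints a Sentence, record it instead:
ruleCounts.add(new Sentence(xaS.getVerb(), xaS.getObject(), yOb.getSubject()));

// afterwards, each distinct rule and the number of times it was derived:
for (Multiset.Entry<Sentence> entry : ruleCounts.entrySet()) {
    System.out.println(entry.getElement() + " : " + entry.getCount());
}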
Upvotes: 0
Reputation: 27946
I suggest some changes to your data model. You can quite easily store the number of times a sentence occurs in a Map as follows:
Map<Sentence, Integer> sentenceCount = new HashMap<>();
This relies on implementing equals and hashCode methods for Sentence. It automatically eliminates duplicates by using Sentence as the key.
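A minimal sketch of those overrides for the Sentence class shown in the question, assuming two sentences are equal when verb, object, and subject all match:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Sentence)) return false;
    Sentence other = (Sentence) o;
    return verb.equals(other.verb)
            && object.equals(other.object)
            && subject.equals(other.subject);
}

@Override
public int hashCode() {
    return java.util.Objects.hash(verb, object, subject);
}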
You can add new sentences to it as follows:
public void addSentence(Sentence sentence) {
    // start at zero for a new sentence, then increment its count
    sentenceCount.merge(sentence, 1, Integer::sum);
}
Now you no longer need your sentences list because you can get the set of sentences using sentenceCount.keySet().
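Writing the unique rules and their counts to a file, as asked in the question, could then look roughly like this (this assumes the usual java.io imports and an enclosing method that declares or handles IOException; rules.txt is just a placeholder name):

try (PrintWriter out = new PrintWriter(new FileWriter("rules.txt"))) {
    for (Map.Entry<Sentence, Integer> entry : sentenceCount.entrySet()) {
        // one line per unique rule, followed by how often it was derived
        out.println(entry.getKey() + " " + entry.getValue());
    }
}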
If you need the maps from subject and object to sentences, then I don't suggest you use an index: that is an error-prone approach. Instead, I suggest you make them direct maps:
Map<String, Set<Sentence>> subjectMap;
Map<String, Set<Sentence>> objectMap;
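One way to keep these maps in sync is to update them wherever a sentence is added, for example with computeIfAbsent (a sketch assuming the field names above and HashSet values):

// inside addSentence, after updating sentenceCount:
subjectMap.computeIfAbsent(sentence.getSubject(), k -> new HashSet<>()).add(sentence);
objectMap.computeIfAbsent(sentence.getObject(), k -> new HashSet<>()).add(sentence);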
You can use this to find, say, the number of times a certain subject appears:
subjectMap.get("subject").stream().mapToInt(sentenceCount::get).sum();
Upvotes: 1