Reputation: 421
Many sentiment analysis programs compute one of their features by assigning a score to relevant unigrams, bigrams, or pairs according to a lexicon. In more detail:
//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for whatever):
hold---up -0.62
how---i still -0.62
Given a sample text T, for each unigram, bigram, or pair in T, I want to check whether a correspondence is present in the lexicon.
The unigram/bigram part is easy: I load the lexicon into a Map and then iterate over my text, checking each word against the dictionary. My problem is with detecting pairs.
One way to check whether specific pairs are present in my text would be to iterate over the whole lexicon of pairs and run a regex on the text, checking for each entry whether "start_of_pair.*end_of_pair" matches. This seems very wasteful, because I'd have to iterate over the WHOLE lexicon for each text to analyze. Any ideas on how to do this in a smarter way?
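For reference, the wasteful approach described above might look like the following sketch (the `pairLexicon` map, keyed by `start---end` strings as in the lexicon sample, is a hypothetical stand-in):

```java
import java.util.Map;
import java.util.regex.Pattern;

public class NaivePairScan {
    // Naive scan: one regex per lexicon entry, so the WHOLE lexicon is
    // iterated for every text — O(|lexicon| * |text|) matching work.
    static float scorePairs(String text, Map<String, Float> pairLexicon) {
        float total = 0f;
        for (Map.Entry<String, Float> e : pairLexicon.entrySet()) {
            String[] halves = e.getKey().split("---", 2);
            // "start.*end": start must appear somewhere before end.
            Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(halves[0]) + "\\b.*\\b"
                      + Pattern.quote(halves[1]) + "\\b");
            if (p.matcher(text).find()) {
                total += e.getValue();
            }
        }
        return total;
    }
}
```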
Related questions: Most Efficient Way to Check File for List of Words and Java: Most efficient way to check if a String is in a wordlist
Upvotes: 0
Views: 601
Reputation: 421
In the end I solved it this way:
I loaded the pair lexicon as a Map<String, Map<String, Float>>, where the outer key is the first half of a pair and the inner map holds all the possible endings for that start, together with the corresponding sentiment value.
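A minimal sketch of loading lines in that format into the nested map (parsing details such as the `---` split are assumptions based on the lexicon sample above; `addPair` is a hypothetical helper):

```java
import java.util.HashMap;
import java.util.Map;

public class PairLexiconLoader {
    // Parses one lexicon line such as "hold---up -0.62" into the nested map:
    // first half -> (second half -> sentiment value).
    static void addPair(Map<String, Map<String, Float>> firstPartMap, String line) {
        int lastSpace = line.lastIndexOf(' ');          // value follows the last space
        String pair = line.substring(0, lastSpace);
        float value = Float.parseFloat(line.substring(lastSpace + 1));
        String[] halves = pair.split("---", 2);
        firstPartMap.computeIfAbsent(halves[0], k -> new HashMap<>())
                    .put(halves[1], value);
    }
}
```

Using the last space to split off the value keeps multi-word endings like `how---i still` intact.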
Basically I keep a List of possible endings (enabledTokens) that I extend each time I read a new token, and I then search this list to see whether the current token is the ending of some previously seen pair.
With a few modifications to prevent the previous token from being used right away as an ending, this is my code:
private Map<String, Map<String, Float>> firstPartMap;
private List<LexiconPair> enabledTokensForUnigrams, enabledTokensForBigrams;
private Queue<List<LexiconPair>> pairsForBigrams; // initialized with two empty lists
private Token oldToken;

public void parseToken(Token token) {
    String unigram = token.getText();
    String bigram = null;
    if (oldToken != null) {
        bigram = oldToken.getText() + " " + token.getText();
    }
    checkIfPairMatchesAndUpdateFeatures(unigram, enabledTokensForUnigrams);
    checkIfPairMatchesAndUpdateFeatures(bigram, enabledTokensForBigrams);
    List<LexiconPair> pairEndings = toPairs(firstPartMap.get(unigram));
    if (bigram != null) pairEndings.addAll(toPairs(firstPartMap.get(bigram)));
    pairsForBigrams.add(pairEndings);
    enabledTokensForUnigrams.addAll(pairEndings);
    enabledTokensForBigrams.addAll(pairsForBigrams.poll());
    oldToken = token;
}

private void checkIfPairMatchesAndUpdateFeatures(String text, List<LexiconPair> listToCheck) {
    Iterator<LexiconPair> iter = listToCheck.iterator();
    while (iter.hasNext()) {
        LexiconPair next = iter.next();
        if (next.getText().equals(text)) {
            float val = next.getValue();
            POLARITY polarity = getPolarity(val);
            for (LexiconFeatureSubset lfs : lexiconsFeatures) {
                lfs.handleNewValue(Math.abs(val), polarity);
            }
            // iter.remove();
            // return; // remove only 1 occurrence
        }
    }
}
Upvotes: 0
Reputation: 109567
One could realize a frequency map of bigrams as:
Map<String, Map<String, Integer>> bigramFrequencyMap = new TreeMap<>();
Fill the map with the desired bigrams at initial frequency 0: first lexeme, to second lexeme, to frequency count.
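A sketch of that initialization, assuming bigrams arrive already split into their two lexemes (`register` is a hypothetical helper):

```java
import java.util.Map;
import java.util.TreeMap;

public class BigramMapInit {
    // Registers a bigram "first second" with initial frequency 0:
    // first lexeme -> (second lexeme -> count).
    static void register(Map<String, Map<String, Integer>> map,
                         String first, String second) {
        map.computeIfAbsent(first, k -> new TreeMap<>()).put(second, 0);
    }
}
```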
static final int MAX_DISTANCE = 5;
A lexical scan would then keep the second-lexeme maps of the last MAX_DISTANCE lexemes.
List<Map<String, Integer>> lastLexemesSecondFrequencies = new ArrayList<>();

void processLexeme() {
    String lexeme = readLexeme();
    // Check whether a registered bigram ends at this lexeme:
    for (Map<String, Integer> prior : lastLexemesSecondFrequencies) {
        Integer freq = prior.get(lexeme);
        if (freq != null) {
            prior.put(lexeme, 1 + freq);
        }
    }
    Map<String, Integer> lexemeSecondFrequencies =
            bigramFrequencyMap.get(lexeme);
    if (lexemeSecondFrequencies != null) {
        // Could remove lexemeSecondFrequencies if already present in the list.
        lastLexemesSecondFrequencies.add(0, lexemeSecondFrequencies); // addFirst
        if (lastLexemesSecondFrequencies.size() > MAX_DISTANCE) {
            lastLexemesSecondFrequencies.remove(
                    lastLexemesSecondFrequencies.size() - 1); // removeLast
        }
    }
}
The optimization is to keep only the second halves of registered bigrams in the scan window, so only registered bigrams are ever handled.
Upvotes: 0