Alex Stone

Reputation: 47344

iOS/iPhone: how to list all keywords in a UITextView by frequency of use?

I have a UITextView containing text of arbitrary length (up to 10,000 characters). I need to parse this text, extract all keywords, and list them by frequency of use, with the most frequently used word on top, the next one below it, and so on. I will most likely present a modal UITableView once the operation completes.

I'm trying to think of an efficient and useful way to do this. I can split the string on delimiters such as whitespace and punctuation marks, which gives me an array of character sequences. I can then add each sequence as an NSMutableDictionary key and increment its count whenever I see another instance of that word. However, this may result in a list of 300-400 words, most with a frequency of 1.

Is there a good way to implement the logic that I'm describing? Should I try to sort the array in alphabetical order and try some kind of "fuzzy" logic match? Are there any NSDataDetector or NSString methods that can do this kind of work for me?

An additional question: how would I exclude common words like a, at, to, for, etc., so that they do not appear in my keyword list?

It would be great if I can take a look at a sample project that has already accomplished this task.

Thank you!

Upvotes: 1

Views: 556

Answers (3)

Alex Stone

Reputation: 47344

I ended up going with CFStringTokenizer. I'm not sure if the bridged casts below are correct, but it seems to work:

-(void)listAllKeywordsInString:(NSString*)text
    {
        if(text!=nil)
        {
            NSMutableDictionary* keywordsDictionary = [[NSMutableDictionary alloc] initWithCapacity:1024];
            NSLog(@"%@",text);
            NSLog(@"Started parsing: %@",[NSDate date]);

            // Plain __bridge transfers no ownership, so the string needs no CFRelease.
            // (__bridge_retained would add a retain that is never balanced, i.e. a leak.)
            CFStringRef string = (__bridge CFStringRef)text;

            // CFLocaleCopyCurrent follows the Copy rule, so the locale is released below
            CFLocaleRef locale = CFLocaleCopyCurrent();
            CFStringTokenizerRef tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, string, CFRangeMake(0, CFStringGetLength(string)), kCFStringTokenizerUnitWord, locale);

            unsigned tokensFound = 0;
            CFStringTokenizerTokenType tokenType = kCFStringTokenizerTokenNone;

            while(kCFStringTokenizerTokenNone != (tokenType = CFStringTokenizerAdvanceToNextToken(tokenizer))) {
                CFRange tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer);
                CFStringRef tokenValue = CFStringCreateWithSubstring(kCFAllocatorDefault, string, tokenRange);

                // This is the found word; the dictionary copies its keys
                NSString* key = (__bridge NSString*)tokenValue;

                // Increment its count; a missing key yields nil, whose intValue is 0
                NSNumber* count = [keywordsDictionary objectForKey:key];
                [keywordsDictionary setObject:[NSNumber numberWithInt:count.intValue+1] forKey:key];

                CFRelease(tokenValue);
                ++tokensFound;
            }
            NSLog(@"Ended parsing. Tokens found: %u, %@",tokensFound,[NSDate date]);
            NSLog(@"%@",keywordsDictionary);

            // Clean up
            CFRelease(tokenizer);
            CFRelease(locale);
        }
    }

Upvotes: 0

sirab333

Reputation: 3722

There are many ways to approach this.

You should definitely add all your keywords to an array (or another collection object) and iterate through it, so that you search for these keywords and only these keywords (and avoid checking for occurrences of a, at, to, for, etc.).

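NSArray *keywords = @[ /* add your keywords */ ];
NSString *textToSearchThrough = @" your text ";  // or load your text file here
NSMutableDictionary *occurrenceCounts = [NSMutableDictionary dictionary];

for (NSString *keyword in keywords) {
    NSUInteger count = 0;
    NSRange searchRange = NSMakeRange(0, textToSearchThrough.length);
    NSRange range;
    // rangeOfString: reports only the first match, so keep searching past each hit.
    // Note that this matches substrings too ("cat" is found inside "category").
    while ((range = [textToSearchThrough rangeOfString:keyword
                                               options:NSCaseInsensitiveSearch
                                                 range:searchRange]).location != NSNotFound) {
        count++;
        NSUInteger next = NSMaxRange(range);
        searchRange = NSMakeRange(next, textToSearchThrough.length - next);
    }
    if (count > 0) {
        // meaning, you did find it: record this keyword's occurrence count
        occurrenceCounts[keyword] = @(count);
    }
}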

Then you loop through your results, check the number of occurrences per keyword, purge those whose occurrence count is < minOccurrenceCount, and sort the rest from highest to lowest.
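
The purge-and-sort step could look roughly like the following. This is only a minimal sketch, assuming the counts have been collected in an NSDictionary that maps each keyword to an NSNumber. The function name KeywordsSortedByFrequency is hypothetical; minOccurrenceCount is the threshold mentioned above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any framework): returns the keywords whose
// count is at least minOccurrenceCount, ordered from most to least frequent.
// Assumes `counts` maps NSString keywords to NSNumber occurrence counts.
static NSArray<NSString *> *KeywordsSortedByFrequency(NSDictionary<NSString *, NSNumber *> *counts,
                                                      NSUInteger minOccurrenceCount) {
    // Purge keywords below the threshold.
    NSMutableDictionary<NSString *, NSNumber *> *kept = [NSMutableDictionary dictionary];
    [counts enumerateKeysAndObjectsUsingBlock:^(NSString *keyword, NSNumber *count, BOOL *stop) {
        if (count.unsignedIntegerValue >= minOccurrenceCount) {
            kept[keyword] = count;
        }
    }];

    // Sort the remaining keywords by count, highest first.
    return [kept keysSortedByValueUsingComparator:^NSComparisonResult(NSNumber *a, NSNumber *b) {
        return [b compare:a];
    }];
}
```

The returned array can back a UITableView data source directly. Keywords with equal counts come back in no particular order, so add a secondary comparison if the display order of ties matters.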

Upvotes: 2

omz

Reputation: 53561

You can use CFStringTokenizer to get the word boundaries. For counting, you could use an NSMutableDictionary, as you suggested, or an NSCountedSet, which might be slightly more efficient.

If you're not interested in words that have a frequency of 1 (or some other threshold), you would have to filter them out after counting all the words.

For ignoring certain words (a, the, for...), you need a word list specific to the language of your text. The Wikipedia article on stop words contains a couple of links, e.g. this CSV file.
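
As an illustration of the counting, stop-word, and threshold steps, here is a rough sketch. To keep it short it finds word boundaries with NSString's enumerateSubstringsInRange:options:usingBlock: and NSStringEnumerationByWords instead of CFStringTokenizer. The function names CountKeywords and WordsAtOrAboveThreshold are hypothetical, and the stop-word set is assumed to be supplied by the caller in lowercase (for example, loaded from a word list like the one mentioned above).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts the words in `text`, skipping any word that
// appears in `stopWords`. Words are lowercased so that "The" and "the" are
// counted together; `stopWords` is assumed to hold lowercase entries.
static NSCountedSet *CountKeywords(NSString *text, NSSet<NSString *> *stopWords) {
    NSCountedSet *counts = [NSCountedSet set];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange enclosingRange, BOOL *stop) {
        NSString *key = word.lowercaseString;
        if (![stopWords containsObject:key]) {
            [counts addObject:key];  // NSCountedSet tracks how often each object was added
        }
    }];
    return counts;
}

// Hypothetical helper: filters after counting, since a word's final count is
// only known once the whole text has been scanned. The result is unordered.
static NSArray<NSString *> *WordsAtOrAboveThreshold(NSCountedSet *counts, NSUInteger threshold) {
    NSMutableArray<NSString *> *kept = [NSMutableArray array];
    for (NSString *word in counts) {  // enumerates each distinct word once
        if ([counts countForObject:word] >= threshold) {
            [kept addObject:word];
        }
    }
    return kept;
}
```

Calling CountKeywords on the text view's text and then WordsAtOrAboveThreshold with a threshold of 2 would drop every word that occurs only once.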

Upvotes: 2
