Raven

Reputation: 2257

Simple Text Analysis library for C

I'm in the midst of creating my school project for our programming class. I'm making a Medical Care system console app and I want to implement this kind of feature:

When a user enters what they are feeling (e.g. feeling sick, having a sore throat, etc.), I want a C text analysis library to help me analyze and parse the input (which has been saved into a string) and determine the medicine to be given. (I'll be the one to decide which medicine goes with which symptom; I just want the library to help me analyze the info given by the user.)

Thanks!

A good example would be this one: http://www.codeproject.com/Articles/32175/Lucene-Net-Text-Analysis

Unfortunately, it's for C#.

Update: Is there any C library that can help me with even simple tokenizing and indexing of words? I know I could do it with brute-force coding, but a reliable and stable API would be better. Thanks!

Upvotes: 2

Views: 4337

Answers (3)

wildplasser

Reputation: 44250

This is what wakkerbot makes of your question. (The scores are low because wakkerbot/Hubert is all Dutch.) But the tokeniser seems to do fine on English:

[   6]:        |    29/ 27|  4.792 | weight |
------|--------+----------+---------+--------+
 0  11|  15645 |    10/ 9 | 0.15469 |  0.692 |'to'
 1   0|  19416 |    10/10 | 0.12504 |  0.646 |'i'
 2  10|  10483 |     4/ 3 | 0.10030 |   0.84 |'and'
 3   3|   3292 |     5/ 5 | 0.09403 |    1.4 |'be'
 4   7|  27363 |     3/ 3 | 0.06511 |    1.4 |'one'
 5  12|  36317 |     3/ 3 | 0.06511 |   8.52 |'this'
 6   2|  35466 |     2/ 2 | 0.05746 |   10.7 |'just'
 7   4|  12258 |     2/ 2 | 0.05301 |   0.56 |'info'
 8  18|  81898 |     2/ 2 | 0.04532 |   20.1 |'ll'
 9  20|  67009 |     3/ 3 | 0.04124 |   48.8 |'text'
10  13|  70575 |     2/ 2 | 0.03897 |    156 |'give'
11  19|  16806 |     2/ 2 | 0.03426 |   1.13 |'c'
12  14|   5992 |     2/ 2 | 0.03376 |  0.914 |'for'
13   1|   3940 |     1/ 1 | 0.02561 |   1.12 |'my'
14   5|   7804 |     1/ 1 | 0.02561 |   2.94 |'class'
15  17|   7920 |     1/ 1 | 0.02561 |   7.35 |'feeling'
16  15|  20429 |     3/ 2 | 0.01055 |   3.93 |'com'
17  16|  36544 |     2/ 1 | 0.00433 |   4.28 |'www'

To support my lex/nonlex tokeniser argument, this is the relevant part of wakkerbot's tokeniser:

/* str: input buffer; pos: current scan position; sp: pointer to the tokeniser state */
for (pos = 0; str[pos]; ) {
    switch (*sp) {
    case T_INIT: /* initial state */
        if (myisalpha(str[pos])) { *sp = T_WORD; pos++; continue; }
        if (myisalnum(str[pos])) { *sp = T_NUM; pos++; continue; }
        /* if (strspn(str+pos, "-+")) { *sp = T_NUM; pos++; continue; } */
        *sp = T_ANY; continue;
    case T_ANY: /* either whitespace or junk ("meuk"): eat it */
        pos += strspn(str+pos, " \t\n\r\f\b");
        if (pos) { *sp = T_INIT; return pos; }
        *sp = T_MEUK; continue;
    case T_WORD: /* inside a word */
        while (myisalnum(str[pos])) pos++;
        if (str[pos] == '\0') { *sp = T_INIT; return pos; }
        if (str[pos] == '.') { *sp = T_WORDDOT; pos++; continue; }
        *sp = T_INIT; return pos;
     ...

As you can see, most of the time will be spent in the line with while (myisalnum(str[pos])) pos++;, which consumes entire words. myisalnum() is a static function, which will probably be inlined. (There are similar tight loops for numbers and whitespace, of course.)

UPDATE: for completeness, the definition of myisalpha():

static int myisalpha(int ch)
{
    /* with <ctype.h>, this is a table lookup, too */
    int ret = isalpha(ch);
    if (ret) return ret;
    /* don't parse, just assume valid utf8 */
    if (ch == -1) return 0;     /* EOF */
    if (ch & 0x80) return 1;    /* any byte with the high bit set: part of a multibyte sequence */
    return 0;
}
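myisalnum() isn't shown here, but a plausible companion definition, assuming it follows the same convention of treating any high-bit byte as part of a valid UTF-8 sequence, would be:

#include <ctype.h>

/* hypothetical companion to myisalpha(), same UTF-8 assumption */
static int myisalnum(int ch)
{
    int ret = isalnum(ch);
    if (ret) return ret;
    if (ch == -1) return 0;     /* EOF */
    if (ch & 0x80) return 1;    /* assume part of a valid UTF-8 sequence */
    return 0;
}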

Upvotes: 1

wdavilaneto

Reputation: 818

Yes, there's a C++ data science toolkit called MeTA (ModErn Text Analysis toolkit). Here are its features:

  • text tokenization, including deep semantic features like parse trees
  • inverted and forward indexes with compression and various caching strategies
  • a collection of ranking functions for searching the indexes
  • topic models
  • classification algorithms
  • graph algorithms
  • language models
  • CRF implementation (POS-tagging, shallow parsing)
  • wrappers for liblinear and libsvm (including libsvm dataset parsers)
  • UTF8 support for analysis on various languages
  • multithreaded algorithms

It comes with tests and examples. In your case, I think statistical classifiers like Bayes will do the job perfectly, but you can also do manual classification. It was the best fit for my personal use case. Hope it helps.
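If you decide to do the classification directly in C instead of pulling in MeTA, a naive Bayes over keyword counts is small enough to sketch. Everything below (class names, vocabulary, counts) is made-up illustration data, not MeTA's API:

#include <stdio.h>
#include <string.h>
#include <math.h>   /* link with -lm */

/* Illustrative only: two made-up medicine classes and fabricated keyword counts. */
#define NCLASS 2
#define NVOCAB 6

static const char *class_name[NCLASS] = { "lozenge", "antiemetic" };
static const char *vocab[NVOCAB] = { "sore", "throat", "cough", "sick", "nausea", "stomach" };

/* counts[c][w]: how often word w appeared in (fabricated) training text for class c */
static const int counts[NCLASS][NVOCAB] = {
    { 8, 9, 5, 1, 0, 0 },   /* lozenge */
    { 0, 0, 1, 7, 6, 5 },   /* antiemetic */
};

static int word_id(const char *w)
{
    for (int i = 0; i < NVOCAB; i++)
        if (strcmp(w, vocab[i]) == 0) return i;
    return -1;  /* not in vocabulary */
}

/* Classify a lowercase, whitespace-separated input string. */
static const char *classify(const char *text)
{
    double logp[NCLASS];

    for (int c = 0; c < NCLASS; c++) {
        int total = 0;
        for (int w = 0; w < NVOCAB; w++) total += counts[c][w];

        logp[c] = log(1.0 / NCLASS);    /* uniform class prior */

        char buf[256];
        strncpy(buf, text, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
            int w = word_id(tok);
            if (w < 0) continue;        /* unknown word: skip */
            /* Laplace-smoothed log-likelihood */
            logp[c] += log((counts[c][w] + 1.0) / (total + NVOCAB));
        }
    }
    return logp[0] >= logp[1] ? class_name[0] : class_name[1];
}

int main(void)
{
    printf("%s\n", classify("i have a sore throat and a cough"));  /* -> lozenge */
    printf("%s\n", classify("feeling sick with nausea"));          /* -> antiemetic */
    return 0;
}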

Here's the link: https://meta-toolkit.org/

Best Regards,

Upvotes: 0

Steve

Reputation: 31642

Analyzing natural language text is one of the most difficult problems you could possibly pick.

Most likely your solution will come down to simply looking for keywords like "sick", "sore throat", etc., which can be accomplished with a simple dictionary of keywords and results.
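For instance, a minimal sketch of that dictionary approach in C (the symptom keywords and medicine names here are invented placeholders):

#include <stdio.h>
#include <string.h>

/* Made-up symptom -> medicine table, for illustration only. */
struct rule { const char *keyword; const char *medicine; };

static const struct rule rules[] = {
    { "sore throat", "lozenges" },
    { "sick",        "rest and fluids" },
    { "headache",    "paracetamol" },
};

int main(void)
{
    /* assumes the input has already been lowercased */
    const char *input = "i am feeling sick and i have a sore throat";

    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (strstr(input, rules[i].keyword))
            printf("matched '%s' -> suggest %s\n",
                   rules[i].keyword, rules[i].medicine);
    }
    return 0;
}

For robustness you'd want to lowercase the input and match on word boundaries rather than raw substrings (strstr would also match "sick" inside "sickness").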

As far as truly "understanding" what the user typed though - good luck with that.

EDIT:

A few technologies worth pointing out:

Regarding your question about a lexer: you can easily use flex if you feel you need something like that. It is probably faster (in terms of both execution speed and development speed) than trying to code the multi-token search by hand.
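For instance, a tiny flex specification along these lines would handle the keyword scanning (the keywords and messages are placeholders):

%{
/* Sketch of a symptom keyword scanner; build with:
 *   flex symptoms.l && cc lex.yy.c -o symptoms
 */
#include <stdio.h>
%}
%option noyywrap
%%
"sore throat"      { printf("symptom: sore throat\n"); }
"sick"|"nausea"    { printf("symptom: sick\n"); }
[a-zA-Z]+          { /* any other word: ignore */ }
.|\n               { /* punctuation and whitespace: ignore */ }
%%
int main(void)
{
    yylex();    /* scan stdin until EOF */
    return 0;
}

Because flex prefers the longest match, "sore throat" wins over the catch-all word rule, so multi-word keywords come for free.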

On Mac there is a very cool framework called Latent Semantic Mapping. There is a WWDC 2011 video on it, and it's awesome. You basically feed it a ton of example inputs and train it on what result you want. It may be as close as you're going to get. It is C-based.

http://en.wikipedia.org/wiki/Latent_semantic_mapping

https://developer.apple.com/library/mac/#documentation/TextFonts/Reference/LatentSemanticMapping/index.html

Upvotes: 6
