Reputation: 125

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain that can extract information such as institute, location, course, etc. from free text.

Currently I am doing this with Lucene. The steps are as follows:

Index all the data related to institutes, courses and locations.
Make shingles (word n-grams) of the free text, search each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, a course, etc.

With this approach I am missing a lot of cases; for example, B.tech can also be written as btech, b-tech or b.tech.

Is there anything available that can handle all of these kinds of things? I have heard about LingPipe and GATE, but I don't know how efficient they are.

Upvotes: 1

Answers (6)

Renaud

Reputation: 16521

B.tech can be written as btech, b-tech or b.tech

Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.

That might allow you to match the different cases.
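
To see why an edit-distance match can cover these variants, it helps to check how far apart they actually are. Below is a minimal plain-Java sketch of the Levenshtein distance. It does not use Lucene, and the class name `EditDistanceDemo` and its method are made up for illustration. It assumes the text is lowercased first, since `B` versus `b` would otherwise count as an edit.

```java
import java.util.Locale;

public class EditDistanceDemo {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String canonical = "b.tech";
        for (String variant : new String[] {"btech", "b-tech", "B.Tech", "b tech"}) {
            int d = levenshtein(canonical, variant.toLowerCase(Locale.ROOT));
            System.out.println(variant + " -> distance " + d);
        }
    }
}
```

After lowercasing, each of these variants is at most one edit away from `b.tech`, so a small fuzzy threshold would be enough, provided the analyzer keeps the term in one piece.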

Upvotes: 0

yura

Reputation: 14655

You can try graph-expression (http://code.google.com/p/graph-expression/). Here is an example of its address-parsing rules:

    GraphRegExp.Matcher Token = match("Token");
    GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
    GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
    GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
    GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
    GraphRegExp.Matcher Postcode = mark("Postcode",
            seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
    // mark(String, Matcher) creates a chunk over the sub-matcher
    GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
    // without newlines
    streetAddress = regexpNot("\n", streetAddress);
    GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

    Chunker chunker = Chunkers.pipeline(
            Chunkers.regexp("Token", "\\w+"),
            Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
            new GraphExpChunker("Address",
                    seq(
                            opt(streetAddress),
                            opt(Postoffice),
                            City,
                            StateLike,
                            Postcode,
                            Country
                    )
            ).setDebugString(true)
    );

Upvotes: 0

ffriend

Reputation: 28552

You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers, in GATE's terms) allow you to put all possible variants such as "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):

    {Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
    {Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
    {Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}

Here (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is the annotation "Token" with the feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.

Upvotes: 1

jpountz

Reputation: 9964

You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial that will help you write an annotator for UIMA:

http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code

UIMA has add-ons, in particular one for Lucene integration.

Upvotes: 0

mossaab

Reputation: 1834

You may need to write a regular expression to cover each possible form of your vocabulary.

Be careful about your choice of analyzer / tokenizer, because words like B.tech can be easily split into 2 different words (i.e. B and tech).
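
As a rough illustration, one such expression for the B.tech case could look like the sketch below. It uses only `java.util.regex`; the class name `CourseRegexDemo`, the method `findBtech` and the exact pattern are assumptions made for the example, not something taken from Lucene. It runs on the raw text, before any tokenizer has had a chance to split the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CourseRegexDemo {

    // "b", then an optional ".", "-" or whitespace separator, then "tech".
    // Case-insensitive; the lookarounds stop it from matching inside longer
    // words such as "subtech" or "btechnology".
    private static final Pattern BTECH = Pattern.compile(
            "(?<![A-Za-z0-9])b[.\\-\\s]?tech(?![A-Za-z0-9])",
            Pattern.CASE_INSENSITIVE);

    // Returns every substring of the text that looks like a B.Tech variant.
    public static List<String> findBtech(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = BTECH.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findBtech("B.Tech from Delhi, btech or b-tech in Pune"));
    }
}
```

A real vocabulary would need one pattern per course (or a generated alternation), so this approach works best when the list of forms is small.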

Upvotes: 0

hrzafer

Reputation: 1141

I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or something similar. In this table I'd keep the relations between these different forms.
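
A link table of this kind could be sketched in plain Java as a map from each surface form to one canonical name. The class name `LinkTableDemo`, the method `canonical` and the entries are hypothetical; in practice the table would more likely live in a file or a database.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class LinkTableDemo {

    // Surface form (lowercased) -> canonical course name.
    private static final Map<String, String> LINKS = new HashMap<>();
    static {
        LINKS.put("b.tech", "B.Tech");
        LINKS.put("btech", "B.Tech");
        LINKS.put("b-tech", "B.Tech");
    }

    // Looks up a surface form, ignoring case and surrounding whitespace.
    public static Optional<String> canonical(String surfaceForm) {
        String key = surfaceForm.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(LINKS.get(key));
    }

    public static void main(String[] args) {
        System.out.println(canonical("BTech"));  // known variant
        System.out.println(canonical("mba"));    // not in the table
    }
}
```

The original text stays untouched; only the lookup key is normalized, so new variants can be added to the table without reindexing anything.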

Upvotes: 0
