Reputation: 3936
I need to extract the venue, including country and city, from a Google search result. For example, I search for "IEEE Symposium on Computational Intelligence for Image Processing" using Google's Custom Search API.
I get a snippet like this:
"snippet": "The Computer Security Foundations Symposium is an annual conference for
researchers in ... It was created in 1988 as a workshop of the IEEE Computer
Society Technical Committee on Security and ... CSF-26 was held at Tulane
University, New Orleans, LA, June 26-28, 2013. ... CSFW-19 program and 5-
minute talks.",
How do I extract 'Tulane University, New Orleans' from the response? There are multiple results, but let's assume I take only the first one that contains this.
Upvotes: 0
Views: 93
Reputation: 1538
This is difficult, given that you're handling natural language. There are a few possibilities. It really depends on the input.
You could try finding these using templates or regular expressions. If you know that venues are introduced by "held at", "organized at", etc., you can use those cue phrases to extract the venues/locations.
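As a rough illustration of the template idea, here is a minimal Python sketch. The cue-phrase list, the assumption that the venue ends where a month name or the end of the sentence begins, and the function name `extract_venue` are all mine, not something the snippet format guarantees, so treat it as a starting point only.

```python
import re

# Cue phrases that often introduce a venue (assumed list, extend as needed).
CUES = r"(?:held|organi[sz]ed|hosted)\s+(?:at|in)"
# Assumption: the venue is followed by a date that starts with a month name.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Capture the text after the cue, up to a month name or the end of the sentence.
VENUE_RE = re.compile(
    CUES + r"\s+(?P<venue>.+?)(?=,?\s+" + MONTHS + r"\b|\.\s|\.$|$)"
)

def extract_venue(snippet):
    """Return the text following the first cue phrase, or None if none matches."""
    text = " ".join(snippet.split())  # collapse the line breaks in the snippet
    match = VENUE_RE.search(text)
    return match.group("venue").strip(" ,") if match else None

snippet = """The Computer Security Foundations Symposium is an annual conference for
researchers in ... It was created in 1988 as a workshop of the IEEE Computer
Society Technical Committee on Security and ... CSF-26 was held at Tulane
University, New Orleans, LA, June 26-28, 2013. ... CSFW-19 program and 5-
minute talks."""

print(extract_venue(snippet))
```

This only works when the snippet happens to use one of the cue phrases, which is exactly the limitation of the template approach.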
You could use a POS/NE tagger to tag the words. Using the Stanford CoreNLP pipeline yields (shortened, using only the relevant sentence and information):
CSF-26 NN O
was VBD O
held VBN O
at IN O
Tulane NNP ORGANIZATION
University NNP ORGANIZATION
New NNP LOCATION
Orleans NNP LOCATION
LA NNP LOCATION
June NNP DATE
26-28 CD DATE
2013 CD DATE
Each word is followed by its POS tag and then its NE tag. O stands for "Other"; the rest should be self-explanatory. You could then look for LOCATION tokens and any adjacent LOCATION or ORGANIZATION tokens.
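To make the "look for LOCATION and its neighbours" step concrete, here is a small Python sketch that works directly on the tagged output above rather than calling CoreNLP. The helper names (`parse_triples`, `group_entities`, `venue_candidates`) are hypothetical, and adjacency is judged between entity groups after the O tokens have been dropped, which is a simplification.

```python
from itertools import groupby

# Tagged output copied from above: word, POS tag, NE tag.
TAGGED = ("CSF-26 NN O was VBD O held VBN O at IN O "
          "Tulane NNP ORGANIZATION University NNP ORGANIZATION "
          "New NNP LOCATION Orleans NNP LOCATION LA NNP LOCATION "
          "June NNP DATE 26-28 CD DATE 2013 CD DATE")

def parse_triples(tagged):
    """Split the flat string into (word, pos, ne) tuples."""
    parts = tagged.split()
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]

def group_entities(triples):
    """Merge runs of consecutive tokens that share the same non-O NE tag."""
    return [(ne, " ".join(word for word, _pos, _ne in run))
            for ne, run in groupby(triples, key=lambda t: t[2])
            if ne != "O"]

def venue_candidates(entities):
    """Keep LOCATION groups plus ORGANIZATION groups next to a LOCATION group."""
    found = []
    for i, (ne, text) in enumerate(entities):
        neighbours = entities[max(i - 1, 0):i] + entities[i + 1:i + 2]
        beside_location = any(n == "LOCATION" for n, _ in neighbours)
        if ne == "LOCATION" or (ne == "ORGANIZATION" and beside_location):
            found.append(text)
    return found

entities = group_entities(parse_triples(TAGGED))
print(venue_candidates(entities))
```

Joining the returned pieces gives something close to the desired "Tulane University, New Orleans", though the tagger's mistakes will carry straight through.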
You could use a database of geographical names to find the country/city, then look at the few words on either side of each match. If you can also provide a list of commonly used "venue" words, you can include that to further improve the results. This step can be combined with any of the other methods as well.
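A minimal sketch of the lookup idea in Python follows. The two sets are tiny made-up stand-ins; a real setup would load names from a gazetteer such as GeoNames. The window size and the function name `find_places` are likewise assumptions for illustration.

```python
import re

# Hypothetical sample data standing in for a real geographical database.
PLACES = {"new orleans", "paris", "singapore"}
# Hypothetical list of words that commonly appear in venue names.
VENUE_WORDS = {"university", "hotel", "center", "centre", "institute"}

def find_places(text, window=3, max_len=2):
    """Return (place, surrounding words, near_venue_word) for each gazetteer hit."""
    tokens = re.findall(r"[\w-]+", text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate name first so "New Orleans" beats "New".
        for n in range(max_len, 0, -1):
            candidate = tokens[i:i + n]
            if len(candidate) == n and " ".join(candidate).lower() in PLACES:
                context = tokens[max(i - window, 0):i] + tokens[i + n:i + n + window]
                near_venue = any(w.lower() in VENUE_WORDS for w in context)
                hits.append((" ".join(candidate), context, near_venue))
                i += n
                break
        else:
            i += 1  # no place name starts at this token
    return hits

sentence = "CSF-26 was held at Tulane University, New Orleans, LA, June 26-28, 2013."
for place, context, near_venue in find_places(sentence):
    print(place, context, near_venue)
```

When `near_venue` is true, the words in `context` are good candidates for the venue name itself.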
This list is not exhaustive. It depends greatly on the variance of the input.
Upvotes: 1