Reputation: 15945

Determining geo location by arbitrary body of text

I am working on a project that I am not exactly sure how to approach. The problem can be summarized as follows:

Given an arbitrary body of text (something like a report), determine which geographic location each part of the report refers to.

Geographic locations range from states to counties (all within the US), so their number is limited, but each report generally contains references to multiple locations. For example, the first 5 paragraphs of a report might be about a state as a whole, and the next 5 might be about individual counties within that state, or something like that.

I am curious what the best way of approaching a problem like this would be, perhaps with a specific recommendation in terms of NLP or ML frameworks (Python or Java)?

Upvotes: 7

Answers (4)

Ash

Reputation: 3550

To do this task you need a labelled training set. You then train a classification model over that training set and use the model to predict the location of new pieces of text. You can see how these pieces work together in this sample code written on top of scikit-learn: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Labelled training set:

You can train a classifier over a training set where each sample is a (paragraph, region_id) pair. The region_id can be the id of a country, region or city.

Training a classification model:

You build a bag-of-words (e.g. unigram) model of each sample and train a classifier (e.g. Logistic Regression with L1 regularization) over the labelled training set. You can use any tool, but I recommend scikit-learn in Python, which is very simple and efficient to use.

Prediction:

After training, given a paragraph or a piece of text, the trained model predicts a region_id for it based on the words used in that text.

Remember to tune the regularization parameter over a development set to get good results (to prevent over-fitting the training samples).
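The steps above can be sketched roughly as follows. This is only a minimal illustration, not the setup from the linked tutorial or paper: the paragraphs and region ids are made up, and cross-validated grid search over `C` stands in for tuning on a separate development set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up (paragraph, region_id) samples; real data would be far larger.
paragraphs = [
    "The state legislature passed a statewide budget this year.",
    "Statewide unemployment fell across the whole state.",
    "The governor announced a statewide education plan.",
    "State agencies reported statewide revenue growth.",
    "The county commissioners approved a new county road.",
    "County sheriff deputies patrol the county fairgrounds.",
    "The county clerk published the county election results.",
    "County residents voted on the county library bond.",
]
region_ids = ["state_1"] * 4 + ["county_1"] * 4  # hypothetical ids

# Bag of unigrams followed by L1-regularized logistic regression.
pipeline = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 1))),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
])

# Tune the regularization strength C (smaller C = stronger regularization).
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(paragraphs, region_ids)

prediction = search.predict(["The county fair opens next week."])
print(search.best_params_, prediction[0])
```

With real reports you would have one label per state or county, so the problem becomes multi-class rather than the two-class toy shown here.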

Read my paper and this one on geolocation using text: http://www.aclweb.org/anthology/N15-1153

and the corresponding poster: http://www.slideshare.net/AfshinRahimi2/geolocation-twittertextnetwork-48968497

I have also written a tool called Pigeo that does exactly that and comes with a pretrained model. Besides these works, there are many other research papers on text-based geolocation that you can find.

Upvotes: 0

iliasfl

Reputation: 559

Identifying mentions of geographic locations is rather trivial using OpenNLP, GATE, etc. The main problem comes afterwards, when you have to disambiguate places with the same name. For example, in the US there are 29 places named "Bristol". Which one is the correct one?

There are several approaches you can use to disambiguate. A simple one is to gather the list of all locations mentioned in the text, get their candidate latitudes/longitudes, and then pick the set of candidates (one per mention) that has the minimum sum of distances.
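A minimal sketch of the minimum-sum-of-distances idea, using only the standard library. The candidate lists, the function names and the hand-entered, approximate coordinates are assumptions for illustration; in practice the candidates would come from a gazetteer lookup.

```python
import itertools
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def disambiguate(candidates):
    """Pick one candidate per mention so the sum of pairwise distances is minimal.

    candidates maps a mention to a list of (label, (lat, lon)) tuples.
    """
    mentions = list(candidates)
    best, best_cost = None, float("inf")
    # Exhaustive search over every combination of candidates.
    for combo in itertools.product(*(candidates[m] for m in mentions)):
        cost = sum(haversine_km(p[1], q[1])
                   for p, q in itertools.combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(mentions, best))


# Approximate coordinates, entered by hand for illustration only.
CANDIDATES = {
    "Bristol": [("Bristol, TN", (36.60, -82.19)),
                ("Bristol, CT", (41.67, -72.95))],
    "Kingsport": [("Kingsport, TN", (36.55, -82.56))],
    "Johnson City": [("Johnson City, TN", (36.31, -82.35))],
}

resolved = disambiguate(CANDIDATES)
print(resolved["Bristol"][0])
```

The exhaustive search grows exponentially with the number of ambiguous mentions, so for long documents you would likely need a greedy or windowed variant.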

A better solution that I have seen people deploy is to take all Wikipedia articles that refer to places, put them in a text index such as Lucene, and then use your text as a query to find the most promising location among the candidates by measuring some similarity score. The idea is that, besides the word "Bristol", the article for the right place will also mention a river name, a person, or something similar that appears in your text.
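To illustrate the similarity idea without setting up Lucene, here is a rough standard-library sketch that swaps in plain TF-IDF cosine similarity. The two "articles" are short invented stand-ins for real Wikipedia text, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return ({doc_id: {term: weight}}, idf) for a dict of raw texts."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vecs = {doc_id: {t: tf * idf[t] for t, tf in c.items()}
            for doc_id, c in counts.items()}
    return vecs, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(text, articles):
    """Score each candidate article against the text, best match first."""
    vecs, idf = tfidf_vectors(articles)
    # Terms that never occur in any article get weight 0 in the query.
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(text)).items()}
    return sorted(((cosine(query, v), doc_id) for doc_id, v in vecs.items()),
                  reverse=True)


# Invented stand-ins for the text of two candidate articles.
ARTICLES = {
    "Bristol, Tennessee": "Bristol is a city in Tennessee known for its "
                          "motor speedway and country music heritage.",
    "Bristol, Connecticut": "Bristol is a city in Connecticut, home to a "
                            "sports broadcasting company and a clock museum.",
}

report = ("The report covers Bristol, where the speedway drew large crowds "
          "and country music events.")
ranked = rank_candidates(report, ARTICLES)
print(ranked[0][1])
```

A real deployment would index full articles and rely on the search engine's own scoring; this sketch only shows why shared context words pull the right candidate to the top.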

Upvotes: 2

Mark Giaconia

Reputation: 3953

This is an old question, but it may be useful for others to know that Apache OpenNLP has an addon called the GeoEntityLinker. It takes document text and sentences, extracts entities (toponyms), performs lookups on the USGS and GeoNames gazetteers (Lucene indexes), resolves (or at least attempts to resolve) the toponyms in several ways, and returns the scored gazetteer entries in relation to each sentence in the document passed in. It will be released with OpenNLP 1.6 if all goes well. There is not much documentation, if any, at this point.

This is the ticket in OpenNLP Jira: https://issues.apache.org/jira/i#browse/OPENNLP-579.

This is the source code:

http://svn.apache.org/viewvc/opennlp/addons/geoentitylinker-addon/

FYI: I am the main committer working on it.

Upvotes: 2

GrantD71

Reputation: 1875

I may actually be able to help a little here (my research is in the area of Toponym Resolution).

If I understand you correctly, you are looking for a way to (1) find the place names in the text, (2) disambiguate the place name's geographic reference, and (3) spatially ground whole sentences or paragraphs.

There are a lot of open source packages that can do #1, such as Stanford CoreNLP and OpenNLP.

There are a few packages that can do #1 and #2. CLAVIN is probably the only ready-to-use open source application that can do this at the moment. Yahoo Placemaker costs money but can do it.

There isn't really a package that can do #3. There is a newer project called TEXTGROUNDER doing something called "Document Geolocation", but while the code is available, it is not set up to be run on your own input texts. I only recommend looking at it if you are itching to either start or contribute to a project trying to do something like this.

All three tasks are still part of ongoing research and can get incredibly complicated depending on the details of the source text. You didn't provide much detail about your texts, but hopefully this information can help you.
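For a limited, known set of locations (such as US states and counties), a very rough baseline for #1 and #3 is a plain gazetteer lookup per paragraph. The sketch below is only an assumption-laden illustration: the mini-gazetteer and its region ids are hypothetical, names are assumed to be unambiguous (so #2 is skipped entirely), and it is not how any of the packages above work.

```python
import re

# Hypothetical mini-gazetteer: place name -> made-up region id.
GAZETTEER = {
    "Texas": "state:TX",
    "Travis County": "county:TX/Travis",
    "Harris County": "county:TX/Harris",
}


def ground_paragraphs(text, gazetteer):
    """Return (paragraph, [region ids mentioned]) for each paragraph."""
    # Longest names first so multi-word names win over shorter overlaps.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    results = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        region_ids = []
        for match in pattern.finditer(paragraph):
            region_id = gazetteer[match.group(0)]
            if region_id not in region_ids:
                region_ids.append(region_id)
        results.append((paragraph, region_ids))
    return results


report = (
    "Texas saw growth overall.\n\n"
    "Travis County and Harris County reported the largest gains."
)
grounded = ground_paragraphs(report, GAZETTEER)
for paragraph, region_ids in grounded:
    print(region_ids, "<-", paragraph)
```

Paragraphs that mention no known name come back with an empty list; deciding what to do with those (for example, inheriting the previous paragraph's location) is exactly where the harder research questions start.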

Upvotes: 7
