Vimal Jose
Vimal Jose

Reputation: 83

Natural language processing to recognise numerical data

My requirement is to recognize and extract numerical data from a natural language sentence (English only) in response to queries. Platform is Java. For example if the user query is "What is the height of mount Everest" and we have a paragraph as:

In 1856, the Great Trigonometric Survey of British India established the first published height of Everest, then known as Peak XV, at 29,002 ft (8,840 m). In 1865, Everest was given its official English name by the Royal Geographical Society upon recommendation of Andrew Waugh, the British Surveyor General of India at the time, who named it after his predecessor in the post, and former chief, Sir George Everest.[4] Chomolungma had been in common use by Tibetans for centuries, but Waugh was unable to propose an established local name because Nepal and Tibet were closed to foreigners. (Pasted from wikipedia)

For a user query "Height of mount Everest" from the paragraph I need to get 29002 ft or 8840 m as the answer. Can anyone please suggest any possible ways of doing it in Java? Are there any open source libraries for the same?

Upvotes: 3

Views: 492

Answers (1)

AdamH
AdamH

Reputation: 2201

Obviously, doing this well is extremely difficult to do. If it's an assignment though then I'm guessing the expectation is a bit lower. Here are some thoughts to hopefully get you started:

I'd split the problem into 2 parts; parsing the question block and then passing the answer block. From the question block, you need to know 2 pieces of information, the noun of what you're searching for, and also the type of the answer. In this case the noun is Everest and the type is height. "Types" of data you can build a dictionary for fairly quickly to search your input string for (e.g. "height", "weight", "distance", "age"). The nouns are more difficult, so I'd say to just assume that every non-type in the question is a potential noun, perhaps removing a dictionary of known non-nouns (such as "at", "the", "of" etc.).

Once you've identified the noun and type from the question, you can begin scanning your answer block. I'd begin by breaking that up into sentences. Then scan each sentence for each of your nouns. If one is found in that sentence, you need to scan the sentence again for numbers (taking into account possible whitespace or comma delimiting). Finally, you need to look "around" any numbers you find for a measurement type. So in this case, your "type" that we parsed from the question was "height". You would need to create a mapping of types to measurements, so "height" would map "km, ft, in, cm, m" etc. If the number has one of these types around it, then return the number and measurement type as the answer.

Hope that gets you started. As stated above, this is not intended to be a robust, commercial solution. It's homework-level.

Upvotes: 3

Related Questions