Reputation: 483
I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issues
I have a huge list of street names in Indonesia ( > 100k rows ) stored in database, Each street name may have more than 1 word. For example : "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legit street names
have a bunch of texts ( > 1 Million rows ) in databases, that I split into sentences. Now, the features ( function to be exact ) that I need to do , is to test whether there are street names inside the sentences or no, so just a true / false test
I have tried to solve it by doing these steps:
a. Putting the street names into a Key,Value Hash
b. Split each sentences into words
c. Test whether words are in the hash
This is fast, but will not work with multiple words
Another alternatives that I thought of is to do these steps:
a. Split each sentences into words
b. Query the database with LIKE statement ( i,e. SELECT #### FROM street_table WHERE name like '%word%' )
c. If query returned a row, it means that the sentence contains street names
Now, this solution is going to be a very IO intensive.
So my question is "What is the most efficient way to do this test" ? regardless of the programming language. I do this in python mainly, but any language will do as long as I can grasp the concepts
============EDIT 1 =================
Will this be periodical ?
Yes, I will call this feature / function with an interval of 1 minute. Each call will take 100 row of texts at least and test them against the street name database
Upvotes: 1
Views: 426
Reputation: 116
The Aho-Corasick algorithm could be pretty useful. One of it's advantages is that it's run time is independent of how many words you are searching for (only how long the text is you are searching through). It will be especially useful if your list of street names is not changing frequently.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
Upvotes: 0
Reputation: 18552
A simple solution would be to create a dictionary/multimap with first-word-of-street-name=>full-street-name(s). When you iterate each word in your sentence you'll look up potential street names, and check if you have a match (by looking at the next words).
This algorithm should be fairly easy to implement and should perform pretty good too.
Upvotes: 2
Reputation: 620
Using nlp, you can determine the proper noun in a sentence. Please refer to the link below.
http://nlp.stanford.edu/software/lex-parser.shtml
The standford parser is accurate in its calculation. Once you have the proper noun, you can decide the approach to follow.
Upvotes: 1
Reputation: 8182
So you have a document and want to seach if it contains any of your list of streetnames?
Turbo Boyer-Moore is a good starting point for doing that.
Here is more information on turbo boyer moore
But, i strongly believe, you will have to do something about the organisation of your list of street names. there should be some bucket access to it, i.e. you can easily filter for street names:
Here an example: Street name: Asia-Pacific-street
You can access your list by: A (getting a starting point for all that start with an A) AS (getting a starting point for all that start with an AS)
and so on...
I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)
more information about bucketing
Upvotes: 0