Reputation: 483

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issues

I have a huge list of street names in Indonesia ( > 100k rows ) stored in database, Each street name may have more than 1 word. For example : "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legit street names
have a bunch of texts ( > 1 Million rows ) in databases, that I split into sentences. Now, the features ( function to be exact ) that I need to do , is to test whether there are street names inside the sentences or no, so just a true / false test

I have tried to solve it by doing these steps:

a. Putting the street names into a Key,Value Hash

b. Split each sentences into words

c. Test whether words are in the hash

This is fast, but will not work with multiple words

Another alternatives that I thought of is to do these steps:

a. Split each sentences into words

b. Query the database with LIKE statement ( i,e. SELECT #### FROM street_table WHERE name like '%word%' )

c. If query returned a row, it means that the sentence contains street names

Now, this solution is going to be a very IO intensive.

So my question is "What is the most efficient way to do this test" ? regardless of the programming language. I do this in python mainly, but any language will do as long as I can grasp the concepts

============EDIT 1 =================

Will this be periodical ?

Yes, I will call this feature / function with an interval of 1 minute. Each call will take 100 row of texts at least and test them against the street name database

Upvotes: 1

Answers (4)

Dylan M.

Reputation: 116

The Aho-Corasick algorithm could be pretty useful. One of it's advantages is that it's run time is independent of how many words you are searching for (only how long the text is you are searching through). It will be especially useful if your list of street names is not changing frequently.

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

Upvotes: 0

Aleksander Blomskøld

Reputation: 18552

A simple solution would be to create a dictionary/multimap with first-word-of-street-name=>full-street-name(s). When you iterate each word in your sentence you'll look up potential street names, and check if you have a match (by looking at the next words).

This algorithm should be fairly easy to implement and should perform pretty good too.

Upvotes: 2

dan-boa

Reputation: 620

Using nlp, you can determine the proper noun in a sentence. Please refer to the link below.

http://nlp.stanford.edu/software/lex-parser.shtml

The standford parser is accurate in its calculation. Once you have the proper noun, you can decide the approach to follow.

Upvotes: 1

Mare Infinitus

Reputation: 8182

So you have a document and want to seach if it contains any of your list of streetnames?

Turbo Boyer-Moore is a good starting point for doing that.

Here is more information on turbo boyer moore

But, i strongly believe, you will have to do something about the organisation of your list of street names. there should be some bucket access to it, i.e. you can easily filter for street names:

Here an example: Street name: Asia-Pacific-street

You can access your list by: A (getting a starting point for all that start with an A) AS (getting a starting point for all that start with an AS)

and so on...

I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)

more information about bucketing

Upvotes: 0

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

Answers (4)

Related Questions