Reputation: 1
I am designing a word filter that filters bad words (a list of about 200 words) out of an article (about 2,000 words). My problem is which data structure I should use to store the bad word list so that the program can find the bad words in an article as quickly as possible.
-- more details
If the bad word list has 2,000 entries, an article has 50,000 words, and the program processes about 1,000 articles in one run, which data structure should I choose to get a better-than-O(n^2) search?
Upvotes: 0
Views: 774
Reputation: 10665
You need a Bag data structure for this problem. In a Bag, elements have no order, but the structure is designed for fast lookup of an element; its lookup time complexity is O(1). So for N words in an article, the overall complexity turns out to be O(N), which is the best you can achieve in this case. Java's Set is an example of a Bag implementation in Java.
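To make this concrete, here is a minimal sketch in Python; its built-in set plays the role of the Bag (or Java Set) described above, and the words used are made up for illustration:
bad_words = {"foo", "bar", "baz"}   # O(1) average lookup per word
article = "foo went to the bar and ordered qux"
# One pass over the N words of the article: O(N) overall.
filtered = [w for w in article.split() if w not in bad_words]
print(" ".join(filtered))           # went to the and ordered qux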
Upvotes: 0
Reputation: 3947
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time

def create_word(min_len, max_len):
    # Build a random lowercase "word" with a length between min_len and max_len.
    return "".join([random.choice(string.ascii_lowercase)
                    for _ in range(random.randint(min_len, max_len))])

def create_article(length):
    return [create_word(3, 10) for _ in range(length)]

wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

# Filter using a list: each membership test scans the whole list.
print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)
print(time.time())

good_words = []
bad_words_set = set(bad_words_list)

# Filter using a set: each membership test is a hash lookup.
print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, the bad words are put in a list and the "article" is scanned word by word, checking whether each word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but much faster lookup (again using the in operator) to check whether an element is contained. That seems to be all you need to know.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and about two hundredths of a second with a set.
Upvotes: 1
Reputation: 33
You can use a hash table, because its average complexity is O(1) for both insert and search, and your data is only 2,000 words: http://en.wikipedia.org/wiki/Hash_table
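To illustrate how a hash table gets that average O(1) lookup (in Python you would normally just use the built-in set or dict, which are hash tables already), here is a toy bucket-based sketch with made-up words:
NUM_BUCKETS = 64

def insert(buckets, word):
    buckets[hash(word) % NUM_BUCKETS].append(word)

def contains(buckets, word):
    # Only one bucket is scanned, so lookups stay O(1) on average.
    return word in buckets[hash(word) % NUM_BUCKETS]

buckets = [[] for _ in range(NUM_BUCKETS)]
for w in ("foo", "bar", "baz"):   # assumed example bad words
    insert(buckets, w)
print(contains(buckets, "bar"))   # True
print(contains(buckets, "hello")) # False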
Upvotes: 1
Reputation: 7739
I think the best structure, you can use there is set
. - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
I takes log_2(n)
time to add element to structure (once-time operation) and the same answer every query. So if you will have 200 elements in data structure, your program will need to do only about 8 operations to check, does the word is existing in set
.
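Python has no built-in tree-based set, but the same O(log_2(n)) lookup can be sketched with a sorted list and binary search via the standard bisect module; the bad words here are made up for illustration:
import bisect

bad_words = sorted(["foo", "bar", "baz", "qux"])   # kept sorted so binary search works

def is_bad(word):
    # bisect_left does about log_2(len(bad_words)) comparisons.
    i = bisect.bisect_left(bad_words, word)
    return i < len(bad_words) and bad_words[i] == word

print(is_bad("bar"))    # True
print(is_bad("hello"))  # False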
Upvotes: 0