yanbin

Reputation: 1

What is a good data structure for storing a dictionary?

I am designing a word filter that can filter out bad words (200 words in the list) from an article (about 2000 words). My problem is which data structure I should use to store this bad-word list, so that the program spends as little time as possible finding the bad words in an article.

-- more details

If the bad-word list has 2000 words, each article has 50000 words, and the program will process about 1000 articles at a time, which data structure should I choose to get a search that is better than O(n^2)?

Upvotes: 0

Views: 774

Answers (4)

Meena Chaudhary

Reputation: 10665

You need a Bag data structure for this problem. In a Bag, elements have no order, but the structure is designed for fast lookup: a membership test takes O(1) time. So for N words in an article the overall complexity works out to O(N), which is the best you can achieve in this case. Java's Set (e.g. HashSet) is an example of a Bag implementation.
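As a small sketch of this idea in Python, whose built-in `set` plays the Bag role described above (the word lists here are made up for illustration):

```python
# A Python set gives average O(1) membership tests, so filtering
# an article of N words costs O(N) overall.
bad_words = {"spam", "scam", "junk"}  # stand-in for the 200-word list

article = "this article contains spam and also junk".split()
clean = [w for w in article if w not in bad_words]

print(" ".join(clean))  # prints "this article contains and also"
```

Each `w not in bad_words` check is a constant-time hash lookup, regardless of how many bad words are stored.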

Upvotes: 0

Jasper

Reputation: 3947

A dictionary is usually a mapping from one thing (a word in the first language) to another thing (a word in the second language). You don't seem to need such a mapping here, just a set of words.

Most languages provide a set data structure out of the box that has insert and membership testing methods.

A small example in Python, comparing a list and a set:

import random
import string
import time

def create_word(min_len, max_len):
    # random.randint is inclusive on both ends, so no +1 is needed
    return "".join(random.choice(string.ascii_lowercase)
                   for _ in range(random.randint(min_len, max_len)))

def create_article(length):
    return [create_word(3, 10) for _ in range(length)]

wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)

print(time.time())

good_words = []
bad_words_set = set(bad_words_list)

print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)

print(time.time())

This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".

First, they are put in a list, and the "article" is scanned word by word, checking whether each word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there is no better way than scanning the whole list.

The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but much faster membership lookup (again using the in operator). That seems to be all you need here.

On my machine, the timings are:

using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762

So it takes about 4 seconds with a list and about two hundredths of a second with a set.

Upvotes: 1

Kim

Reputation: 33

You can use a hash table, because its average complexity is O(1) for both insert and search, and your data is only 2000 words. http://en.wikipedia.org/wiki/Hash_table
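A minimal sketch in Python, whose `set` is backed by a hash table (the sample words are invented):

```python
# Python's set is implemented as a hash table: average O(1)
# insert and lookup, independent of how many words are stored.
bad_words = set()
for word in ["foo", "bar", "baz"]:   # hypothetical bad-word list
    bad_words.add(word)              # O(1) average insert

print("bar" in bad_words)   # O(1) average lookup -> True
print("qux" in bad_words)   # -> False
```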

Upvotes: 1

TN888

Reputation: 7739

I think the best structure you can use here is a set: http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29

A tree-based set takes log_2(n) time to add an element (a one-time operation) and the same time for every query. So if you have 200 elements in the data structure, your program will need only about 8 comparisons to check whether a word exists in the set.
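The log_2(n) bound quoted above corresponds to a tree-based or sorted set rather than a hash set. A rough illustration in Python using binary search over a sorted list (the word list is made up), which gives the same O(log n) behavior:

```python
import bisect

# A sorted list searched with bisect behaves like a tree-based set:
# each membership test costs about log_2(n) comparisons.
bad_words = sorted(["spam", "scam", "junk", "ads"])

def contains(sorted_words, word):
    # Find the insertion point, then check whether the word is there.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(contains(bad_words, "scam"))   # True
print(contains(bad_words, "hello"))  # False
```

With 200 sorted words, each lookup needs roughly log_2(200), i.e. about 8, comparisons.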

Upvotes: 0
