Reputation: 139
I want to extract difficult words from English text online (for example from Project Gutenberg) for vocabulary building, using Python or JavaScript. I don't want simple words, but unique vocabulary like regal, apocryphal, etc.
How do I ensure that when I split the text I only get unique vocabulary, not simple words?
Upvotes: 1
Views: 2033
Reputation: 1188
I defined a "non-common word" as a word that does not appear among the 10,000 most common English words.
The 10k most common words is an arbitrary cutoff, but as stated in the GitHub repo:
According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.
import requests
english_most_common_10k = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'
# Get the list of the 10k most common words from the TXT file in the GitHub repo
response = requests.get(english_most_common_10k)
data = response.text
set_of_common_words = set(data.split('\n'))
# Once we have the set of common words, we can just check membership.
# The check is an average-case O(1) operation,
# but you could also use, for example, a search tree with O(log n) lookups.
while True:
    word = input()
    if word in set_of_common_words:
        print(f'The word "{word}" is common')
    else:
        print(f'The word "{word}" is difficult')
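Applied to the original question, here is a sketch that pulls a plain-text book from Project Gutenberg and filters it against the same common-word set (the book URL and the simple regex tokenizer are only illustrative choices):
import re
import requests

# Rebuild the common-word set (same source as above)
common_url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'
set_of_common_words = set(requests.get(common_url).text.split('\n'))

# Example book: "Pride and Prejudice"; any plain-text Gutenberg URL works
book_url = 'https://www.gutenberg.org/files/1342/1342-0.txt'
book_text = requests.get(book_url).text

# Tokenize to lowercase alphabetic words; the common-word list is lowercase
words_in_book = set(re.findall(r'[a-z]+', book_text.lower()))

# Difficult words = words in the book that are not in the 10k most common words
difficult_words = sorted(words_in_book - set_of_common_words)
print(f'{len(difficult_words)} candidate vocabulary words')
print(difficult_words[:50])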
Upvotes: 3
Reputation: 101
You can also use pop() to remove the simple words from the English dictionary, leaving you with the difficult-words list.
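A minimal sketch of that idea, assuming the full vocabulary lives in a dict called english_dictionary (a made-up example) and reusing the common-word set from the answer above:
# english_dictionary is a hypothetical word -> definition mapping
english_dictionary = {'regal': 'fit for a monarch', 'the': 'definite article',
                      'apocryphal': 'of doubtful authenticity'}
set_of_common_words = {'the', 'and', 'of'}

# pop() each common word out of the dictionary; the None default avoids a KeyError
for common_word in set_of_common_words:
    english_dictionary.pop(common_word, None)

print(english_dictionary)  # only the difficult entries remain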
Upvotes: 0
Reputation: 2171
As @Hoog suggested, here is the pseudocode:
simple_words = {...}  # a set of common words; a set makes the membership test O(1)
difficult_words = [word for word in english_vocabulary if word not in simple_words]
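For completeness, a runnable version of the comprehension with illustrative data (the word lists are made up):
simple_words = {'the', 'and', 'king', 'letter'}                # common words in a set for fast lookup
english_vocabulary = ['the', 'regal', 'letter', 'apocryphal']  # words collected from the text
difficult_words = [word for word in english_vocabulary if word not in simple_words]
print(difficult_words)  # ['regal', 'apocryphal']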
Upvotes: 1