Kaustubh Kendurkar

Reputation: 139

Extracting difficult English words from text for vocabulary building using Python or JavaScript

I want to extract difficult words from English text available online, such as books from Project Gutenberg, for vocabulary building, using Python or JavaScript. I don't want simple words, only less common vocabulary such as "regal", "apocryphal", etc.

How can I ensure that, when I split the text, I get only the uncommon vocabulary and not the simple words?

Upvotes: 1

Views: 2033

Answers (3)

Ronald Luc

Reputation: 1188

I defined an "uncommon word" as a word that does not appear among the 10,000 most common English words.

The 10,000-word cutoff is an arbitrary boundary, but as stated in the GitHub repo:

According to analysis of the Oxford English Corpus, the 7,000 most common English lemmas account for approximately 90% of usage, so a 10,000 word training corpus is more than sufficient for practical training applications.

import requests

english_most_common_10k = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'

# Download the list of the 10k most common words from a TXT file in a GitHub repo
response = requests.get(english_most_common_10k)
response.raise_for_status()  # fail early if the download did not succeed
data = response.text

# split() with no argument also drops blank lines and stray '\r' characters
set_of_common_words = set(data.split())

# Once we have the set of common words, we can just check membership.
# The check is an O(1) operation in the average case,
# but you could also use, for example, some sort of search tree with O(log(n)) complexity
while True:
    # The word list is lowercase, so normalize the input before checking
    word = input().strip().lower()
    if word in set_of_common_words:
        print(f'The word "{word}" is common')
    else:
        print(f'The word "{word}" is difficult')
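
To apply the same check to a whole text rather than to words typed one at a time, here is a minimal sketch. The function name `extract_difficult_words`, the regex tokenizer, and the tiny `common` set are assumptions for illustration; in practice the set would be the 10k list loaded above.

```python
import re

def extract_difficult_words(text, common_words):
    """Return the sorted unique words in `text` that are not in `common_words`."""
    # Lowercase and keep alphabetic runs only, so "Regal," matches "regal".
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(set(tokens) - set(common_words))

# Tiny stand-in for the real common-word set (assumption for the example).
common = {"the", "king", "had", "a", "and", "told", "an", "story"}
sample = "The king had a regal bearing and told an apocryphal story."
print(extract_difficult_words(sample, common))  # ['apocryphal', 'bearing', 'regal']
```

Note that this simple tokenizer splits contractions such as "don't" into "don" and "t", and inflected forms that are missing from the common list will be reported as difficult.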

Upvotes: 3

Rafael Rotiroti

Reputation: 101

You can also use pop() to remove the words on a difficult-words list from an English dictionary.
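
One possible reading of this suggestion, sketched under assumptions: keep the vocabulary in a dict and pop() every common word, so that only the harder words remain. The dict contents and the name `remove_common` are hypothetical.

```python
def remove_common(vocabulary, common_words):
    """Return a copy of `vocabulary` with every common word popped out."""
    remaining = dict(vocabulary)  # copy so the caller's dict is untouched
    for word in common_words:
        # The default of None avoids a KeyError when the word is absent.
        remaining.pop(word, None)
    return remaining

# Hypothetical word -> occurrence count mapping built from some text.
vocabulary = {"the": 12, "regal": 1, "story": 3, "apocryphal": 1}
print(remove_common(vocabulary, {"the", "story", "house"}))
```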

Upvotes: 0

Novak

Reputation: 2171

As @Hoog suggested, here is the pseudocode:

simple_words = [...]
difficult_words = [word for word in english_vocabulary if word not in simple_words]
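
A runnable version of this pseudocode might look like the following, with small placeholder lists since the original leaves their contents open. Storing the simple words in a set is a choice made here so that each membership test is O(1) on average instead of a scan through a list.

```python
# Placeholder data; the real lists would come from a word-frequency source.
simple_words = {"the", "house", "story", "water"}
english_vocabulary = ["the", "regal", "house", "apocryphal", "water"]

# Keep vocabulary order, dropping anything found in the simple-word set.
difficult_words = [word for word in english_vocabulary if word not in simple_words]
print(difficult_words)  # ['regal', 'apocryphal']
```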

Upvotes: 1
