Reputation: 321
I'd like to make a program that can calculate which word or words would score the player the most points at Scrabble. However, for this, I need a list of accepted words. I have found the SOWPODS list for English, but I'd also like the program to work in French, which is my native language.
I have found this website that provides such a list, but it's split across 918 webpages, which would make copy-pasting everything rather long...
I tried to use a Python web-scraping library (I don't remember which one), but since I didn't really know how to use it, it seemed very hard. I could get the whole text of a page and then go through it character by character to select only the list of words, but as the number of characters differs on each page, this couldn't be automated very easily.
I've thought about using a regex to select only the words in capital letters (as they appear on the website), but if the pages contain other capitalized text, such as titles, then my list of words won't be correct.
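For illustration, the sort of naive pattern I had in mind (just a sketch; page_text stands for the raw text of one page, and the pattern assumes the words appear as runs of capital letters):
import re

# grab every run of two or more capital letters from the raw page text
# -- this would also match uppercase titles, which is exactly my worry
words = re.findall(r'\b[A-Z]{2,}\b', page_text)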
How could I get all the words without having to change my code for each page?
Upvotes: 1
Views: 956
Reputation: 467
It looks like the website has a consistent structure across the pages. The words are stored in a span tag with the class "mot". The library you mentioned could be BeautifulSoup, which makes the automation really easy: you just need to request each page, select the span tag and extract its text. Splitting the contents on the space character (" ") will give you a list with all the words you need from that particular page.
Let me know if this helps you and if you need more help with the coding part.
Edit: I have included some code below.
import requests
from bs4 import BeautifulSoup

def getWordsFromPage(pageNumber):
    if pageNumber == 1:
        # the first page has no number in its URL
        url = 'https://www.listesdemots.net/touslesmots.htm'
    else:
        # pages 2 to 918 follow the same pattern
        url = 'https://www.listesdemots.net/touslesmotspage{pageNumber}.htm'
        url = url.format(pageNumber=pageNumber)
    response = requests.get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # the words sit in a single <span class="mot"> on each page
    span = html_soup.find("span", "mot")
    words = span.text.split(" ")
    return words
print("Page 1")
print(getWordsFromPage(1))
print("Page 24")
print(getWordsFromPage(24))
print("Page 918")
print(getWordsFromPage(918))
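If you want the complete list in one go, you can loop over all 918 pages with the function above and dump the result to a file. This is an untested sketch; the filename is just an example, and the short pause keeps the load on the site reasonable:
import time

all_words = []
for page in range(1, 919):  # pages 1 to 918
    all_words.extend(getWordsFromPage(page))
    time.sleep(1)  # small pause between requests

# one word per line; the filename is arbitrary
with open('mots.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(all_words))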
Upvotes: 2
Reputation: 18106
All pages have the same structure and the words are in a span tag. Using requests and BeautifulSoup makes it quite easy:
import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 YaBrowser/19.10.3.281 Yowser/2.5 Safari/537.36'
}

pages = ['https://www.listesdemots.net/touslesmots.htm']  # the first page
for p in range(2, 919):  # add the remaining pages
    pages.append('https://www.listesdemots.net/touslesmotspage%s.htm' % p)

for p in pages:
    data = None
    try:
        # download each page
        r = requests.get(p, headers=headers)
        data = r.text
    except requests.RequestException:
        # we don't handle errors, we only report them
        print("Failed to download page: %s" % p)
    if data:
        # instantiate the html parser and search for span.mot
        soup = BeautifulSoup(data, 'html.parser')
        wordTag = soup.find("span", {"class": "mot"})
        words = []  # empty list instead of 0, so len() works when no tag is found
        if wordTag:
            # if a span.mot tag was found, split its contents on single spaces
            words = wordTag.contents[0].split(' ')
        print('%s got %s words' % (p, len(words)))
    time.sleep(5)  # pause between requests to go easy on the server
Output:
https://www.listesdemots.net/touslesmots.htm got 468 words
https://www.listesdemots.net/touslesmotspage2.htm got 496 words
https://www.listesdemots.net/touslesmotspage3.htm got 484 words
https://www.listesdemots.net/touslesmotspage4.htm got 468 words
....
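The time.sleep(5) between requests is there to keep the crawl polite. If some of the 918 downloads still fail intermittently, one option (a sketch, not part of the code above; the retry parameters are arbitrary) is to mount a retry strategy on a requests Session and call session.get instead of requests.get in the loop:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry transient failures a few times with exponential backoff
retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
# then in the loop above: r = session.get(p, headers=headers)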
Upvotes: 2