Thibaut B.

Reputation: 321

Scraping a list of words split into multiple webpages with Python

I'd like to make a program that can calculate which word or words would get the player the most points in Scrabble. For this, I need a list of accepted words. I have found the SOWPODS list for English, but I'd also like the program to work in French, which is my native language.

I have found this website that provides such a list, but it's split across 918 webpages, which would make copy-pasting everything rather tedious...

I tried using a Python web-scraping library (I don't remember which one) to get the words, but since I didn't really know how to use it, it seemed very hard. I managed to get the whole text of a page and could then go through it character by character to select only the list of words, but since the number of characters differs from page to page, this couldn't easily be automated.

I've thought about using a regex to select only the words in capital letters (as they appear on the website), but if there is other capitalized text on the pages, such as titles, my list of words won't be correct.
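Roughly, the idea would have been something like this (just a sketch, where page_text stands for the raw text of one page):

import re

# grab runs of two or more capital letters from the page text
# (page_text is a placeholder; this would also match capitalized
# titles or menus, and accented capitals would need extra handling)
words = re.findall(r"\b[A-Z]{2,}\b", page_text)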

How could I get all the words without having to change my code for each page?

Upvotes: 1

Views: 956

Answers (2)

mhSangar

Reputation: 467

It looks like the website has a consistent structure across its pages: the words are stored in a span tag with the class mot. The library you mentioned could be BeautifulSoup, which makes automating this really easy. You just need to request each page, select the span with the class mot and extract its text. Splitting that text on the space character (" ") will give you a list of all the words on that particular page.

Let me know if this helps you and if you need more help with the coding part.

Edit: I have included some code below.

import requests
from bs4 import BeautifulSoup

def getWordsFromPage(pageNumber):
    if pageNumber == 1:
        # page 1
        url = 'https://www.listesdemots.net/touslesmots.htm'
    else:
        # from page 2 to 918
        url = 'https://www.listesdemots.net/touslesmotspage{pageNumber}.htm'
        url = url.format(pageNumber=pageNumber)

    # download the page, parse it, and select the span holding the word list
    response = requests.get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    span = html_soup.find("span", "mot")

    words = span.text.split(" ")

    return words

print("Page 1")
print(getWordsFromPage(1))
print("Page 24")
print(getWordsFromPage(24))
print("Page 918")
print(getWordsFromPage(918))
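To build the complete list you can then loop over all 918 pages with this function. A minimal sketch (the file name mots.txt is just an example, and the pause is there to be gentle with the server):

import time

all_words = set()
for page in range(1, 919):
    all_words.update(getWordsFromPage(page))
    time.sleep(1)  # pause between requests

# mots.txt is an example file name; UTF-8 keeps any accented text intact
with open('mots.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(all_words)))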

Upvotes: 2

Maurice Meyer
Maurice Meyer

Reputation: 18106

All pages share the same structure and the words sit in a span tag, so requests and BeautifulSoup make this quite easy:

import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 YaBrowser/19.10.3.281 Yowser/2.5 Safari/537.36'
}

pages = ['https://www.listesdemots.net/touslesmots.htm']  # adds the first page
for p in range(2, 919):  # adding the remaining pages
    pages.append('https://www.listesdemots.net/touslesmotspage%s.htm' % p)

for p in pages:
    data = None
    try:
        #  download each page
        r = requests.get(p, headers=headers)
        data = r.text
    except requests.RequestException:
        #  report the failure and move on to the next page
        print("Failed to download page: %s" % p)

    if data:
        #  instantiate the html parser and search for span.mot
        soup = BeautifulSoup(data, 'html.parser')        
        wordTag = soup.find("span", {"class": "mot"})

        words = []
        if wordTag:
            #  if a span.mot tag was found, split its contents on single blanks
            words = wordTag.contents[0].split(' ')

        print('%s got %s words' % (p, len(words)))
    time.sleep(5)

Output:

https://www.listesdemots.net/touslesmots.htm got 468 words
https://www.listesdemots.net/touslesmotspage2.htm got 496 words
https://www.listesdemots.net/touslesmotspage3.htm got 484 words
https://www.listesdemots.net/touslesmotspage4.htm got 468 words
....
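If you want to keep the words rather than just count them, collect them into a set and write them out at the end. A sketch along the same lines, reusing pages and headers from above (the output file name is just an example):

allWords = set()
for p in pages:
    r = requests.get(p, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    wordTag = soup.find("span", {"class": "mot"})
    if wordTag:
        allWords.update(wordTag.contents[0].split(' '))
    time.sleep(5)

# example file name; written as UTF-8 for safety
with open('mots_scrabble.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(allWords)))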

Upvotes: 2
