Thibaut B.

Reputation: 321

Scraping a list of words split into multiple webpages with Python

I'd like to make a program that can calculate which word or words would get the player the most points in Scrabble. For this, I need a list of accepted words. I have found the SOWPODS list for English, but I'd also like the program to work in French, which is my native language.

I have found this website that provides such a list, but it's split across 918 webpages, which would make copy-pasting everything rather tedious...

I tried using a Python web-scraping library (I don't remember which one) to get the words, but since I didn't really know how to use it, it seemed very hard. I managed to get the whole text of a page and could then go through it character by character to select only the list of words, but since the number of characters differs from page to page, this couldn't easily be automated.

I've thought about using a regex to select only the words in capital letters (as they appear on the website), but if there is other capitalized text on the pages, such as titles, my list of words won't be correct.
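Roughly, the idea would have been something like this (just a sketch, where page_text stands for the raw text of one page):

import re

# grab runs of two or more capital letters from the page text
# (page_text is a placeholder; this would also match capitalized
# titles or menus, and accented capitals would need extra handling)
words = re.findall(r"\b[A-Z]{2,}\b", page_text)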

How could I get all the words without having to change my code for each page?

Upvotes: 1

Views: 956

Answers (2)

mhSangar

Reputation: 467

It looks like the website has a consistent structure across its pages: the words are stored in a span tag with the class mot. The library you mentioned could be BeautifulSoup, which makes automating this really easy. You just need to request each page, select the span with the class mot and extract its text. Splitting that text on the space character (" ") will give you a list of all the words on that particular page.

Let me know if this helps you and if you need more help with the coding part.

Edit: I have included some code below.

import requests
from bs4 import BeautifulSoup

def getWordsFromPage(pageNumber):
    if pageNumber == 1:
        # page 1
        url = 'https://www.listesdemots.net/touslesmots.htm'
    else:
        # from page 2 to 918
        url = 'https://www.listesdemots.net/touslesmotspage{pageNumber}.htm'
        url = url.format(pageNumber=pageNumber)

    # download the page, parse it, and select the span holding the word list
    response = requests.get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    span = html_soup.find("span", "mot")

    words = span.text.split(" ")

    return words

print("Page 1")
print(getWordsFromPage(1))
print("Page 24")
print(getWordsFromPage(24))
print("Page 918")
print(getWordsFromPage(918))
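To build the complete list you can then loop over all 918 pages with this function. A minimal sketch (the file name mots.txt is just an example, and the pause is there to be gentle with the server):

import time

all_words = set()
for page in range(1, 919):
    all_words.update(getWordsFromPage(page))
    time.sleep(1)  # pause between requests

# mots.txt is an example file name; UTF-8 keeps any accented text intact
with open('mots.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(all_words)))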

Upvotes: 2

Maurice Meyer
Maurice Meyer

Reputation: 18106

All pages share the same structure and the words sit in a span tag, so requests and BeautifulSoup make this quite easy:

import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 YaBrowser/19.10.3.281 Yowser/2.5 Safari/537.36'
}

pages = ['https://www.listesdemots.net/touslesmots.htm']  # adds the first page
for p in range(2, 919):  # adding the remaining pages
    pages.append('https://www.listesdemots.net/touslesmotspage%s.htm' % p)

for p in pages:
    data = None
    try:
        #  download each page
        r = requests.get(p, headers=headers)
        data = r.text
    except requests.RequestException:
        #  report the failure and move on to the next page
        print("Failed to download page: %s" % p)

    if data:
        #  instantiate the html parser and search for span.mot
        soup = BeautifulSoup(data, 'html.parser')        
        wordTag = soup.find("span", {"class": "mot"})

        words = []
        if wordTag:
            #  if a span.mot tag was found, split its contents on single blanks
            words = wordTag.contents[0].split(' ')

        print('%s got %s words' % (p, len(words)))
    time.sleep(5)

Output:

https://www.listesdemots.net/touslesmots.htm got 468 words
https://www.listesdemots.net/touslesmotspage2.htm got 496 words
https://www.listesdemots.net/touslesmotspage3.htm got 484 words
https://www.listesdemots.net/touslesmotspage4.htm got 468 words
....
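If you want to keep the words rather than just count them, collect them into a set and write them out at the end. A sketch along the same lines, reusing pages and headers from above (the output file name is just an example):

allWords = set()
for p in pages:
    r = requests.get(p, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    wordTag = soup.find("span", {"class": "mot"})
    if wordTag:
        allWords.update(wordTag.contents[0].split(' '))
    time.sleep(5)

# example file name; written as UTF-8 for safety
with open('mots_scrabble.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(allWords)))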

Upvotes: 2
