Reputation: 65
So I'm trying to make a Python script for a thesaurus. I'm a student and will be using it for writing essays, etc., to save time when changing words. So far I've been able to open thesaurus.com with my intended search word, but I can't seem to figure out how to copy the first 5 returned words, put them in a list, and then print it out.
At this point, I've checked YouTube and Google. I've also tried searching on Stack Overflow, but it was of little help, so I'm asking for help please. This is what my code looks like:
import webbrowser as wb

word = input().lower()  # input() already returns a str in Python 3
returned_words_list = []

url = 'https://www.thesaurus.com/browse/{}'.format(word)
wb.open(url, new=2)  # opens the results page in a new browser tab
I just want it to print returned_words_list to the console at this point. So far I can't even get it to automatically pull the words from the website.
Upvotes: 1
Views: 10658
Reputation: 84465
Looking at the web traffic, the page makes a request to a different URL which returns the results. You can use that endpoint, with a couple of headers, to get all the results in JSON format. Then, looking at this answer by @Martijn Pieters (+ to him), provided you use a generator you can restrict iterations with islice from itertools. You could of course just slice the full lot from the list comprehension as well. Results are returned in descending order of similarity, which is particularly useful here as you get the words with the highest similarity scores.
generator
import requests
from itertools import islice

# headers observed in the browser's request to the endpoint
headers = {'Referer': 'https://www.thesaurus.com/browse/word', 'User-Agent': 'Mozilla/5.0'}
word = input().lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers=headers).json()

if r['data']:
    # lazily pull 'term' from each synonym entry and stop after 5
    synonyms = list(islice((i['term'] for i in r['data'][0]['synonyms']), 5))
    print(synonyms)
else:
    print('No synonyms found')
list comprehension
import requests

headers = {'Referer': 'https://www.thesaurus.com/browse/word', 'User-Agent': 'Mozilla/5.0'}
word = input().lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers=headers).json()

if r['data']:
    # build the full list of terms, then keep the first 5
    synonyms = [i['term'] for i in r['data'][0]['synonyms']][:5]
    print(synonyms)
else:
    print('No synonyms found')
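If you want the scores themselves as well, each synonym entry in the endpoint's JSON appears to carry a 'similarity' field alongside 'term' (an assumption based on inspecting the response; verify against the actual payload before relying on it):
# Continuing from the snippet above (r already holds the parsed JSON).
# 'similarity' is an assumed per-synonym field in the response.
if r['data']:
    scored = [(i['term'], i['similarity']) for i in r['data'][0]['synonyms']][:5]
    print(scored)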
Upvotes: 1
Reputation: 15208
To find the results in the markup, I would rely on the data-linkid attribute:
import requests
from bs4 import BeautifulSoup

word = input().lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
# synonym links sit in <li><span><a data-linkid=...> elements
result = soup.select('li > span > a[data-linkid]')[:5]
for link in result:
    print(link.string)
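Or the same thing with lxml and XPath: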
import requests
from lxml import etree

word = input().lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)
response = requests.get(url)

tree = etree.HTML(response.text)
# same targeting as above, expressed as an XPath query
result = tree.xpath('//li/span/a[@data-linkid]')[:5]
for link in result:
    print(link.text)
P.S. In general, parsing HTML is not the best approach in the long term; I would look at free REST services such as http://thesaurus.altervista.org/.
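For example, a minimal sketch against that service, assuming its documented v1 endpoint, an API key obtained by registering on the site, and a pipe-separated 'synonyms' string in the response (all of which should be verified against the current docs):
import requests

API_KEY = 'your_api_key'  # hypothetical placeholder; register on the site for a real key
word = input().lower()

# Endpoint and parameter names assumed from the service's documentation.
r = requests.get(
    'http://thesaurus.altervista.org/thesaurus/v1',
    params={'word': word, 'language': 'en_US', 'key': API_KEY, 'output': 'json'},
).json()

# Assumed response shape: entries under 'response', each holding a
# pipe-separated 'synonyms' string.
synonyms = []
for entry in r.get('response', []):
    synonyms.extend(entry['list']['synonyms'].split('|'))
print(synonyms[:5])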
Upvotes: 0
Reputation: 387
As comments have mentioned, BeautifulSoup (bs4) is a great library for this. You can use bs4 to parse the entire page, then narrow in on the elements you want: first the ul element that contains the words, and then the a elements that each hold a word.
import requests
from bs4 import BeautifulSoup

word = "hello"
url = 'https://www.thesaurus.com/browse/{}'.format(word)
r = requests.get(url)
returned_words_list = []

soup = BeautifulSoup(r.text, 'html.parser')
# note: these generated class names are fragile and may change when the site is redeployed
word_ul = soup.find("ul", {"class": 'css-1lc0dpe et6tpn80'})

for idx, elem in enumerate(word_ul.findAll("a")):
    returned_words_list.append(elem.text.strip())
    if idx >= 4:  # stop after the first 5 words
        break

print(returned_words_list)
Upvotes: 0