Chukwudumebi Orji

Reputation: 65

How to copy information from a web page in python

So I'm trying to make a Python script for a thesaurus. I'm a student and will be using it when writing essays, etc., to save time when changing words. So far I've been able to open thesaurus.com with my intended search word, but I can't figure out how to copy the first 5 returned words, put them in a list, and print it out.

At this point, I've checked YouTube and Google. I've also tried searching on Stack Overflow, but it was of little help, so I'm asking for help here. This is what my code looks like:

import webbrowser as wb
import antigravity  # unused here; importing it just opens an xkcd comic in the browser

word = str(input()).lower()
returned_words_list = []
url = 'https://www.thesaurus.com/browse/{}'.format(word)

# Opens the page in a browser tab, but gives the script no access to the page contents
wb.open(url, new=2)

I just want it to print returned_words_list to the console at this point. So far I can't even get it to automatically pull the words from the website.

Upvotes: 1

Views: 10658

Answers (3)

QHarr

Reputation: 84465

Looking at the web traffic, the page makes a request to a different URL which returns the results. You can use that endpoint, with a couple of headers, to get all results in JSON format. Then, following this answer by @Martijn Pieters (+ to him), provided you use a generator you can restrict the number of iterations with islice from itertools. You could of course just slice the full list from a list comprehension as well. Results are returned in descending order of similarity, which is particularly useful here, as you get the words with the highest similarity scores.


generator

import requests
from itertools import islice

headers = {'Referer':'https://www.thesaurus.com/browse/word','User-Agent' : 'Mozilla/5.0'}
word = str(input()).lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers = headers).json()

if r['data']:
    synonyms = list(islice((i['term'] for i in r['data'][0]['synonyms']), 5))
    print(synonyms)
else:
    print('No synonyms found')

list comprehension

import requests

headers = {'Referer':'https://www.thesaurus.com/browse/word','User-Agent' : 'Mozilla/5.0'}
word = str(input()).lower()
r = requests.get('https://tuna.thesaurus.com/relatedWords/{}?limit=6'.format(word), headers = headers).json()
if r['data']:
    synonyms = [i['term'] for i in r['data'][0]['synonyms']][:5]
    print(synonyms)
else:
    print('No synonyms found')
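The islice point above can be checked in isolation. This short sketch (the terms generator is made up for illustration) shows that islice consumes only the first five items from a generator, leaving the rest unproduced:

```python
from itertools import islice

# Hypothetical generator standing in for the synonym stream; islice pulls
# only the first five items, so the remaining 995 are never generated.
def terms():
    for n in range(1000):
        yield 'word{}'.format(n)

first_five = list(islice(terms(), 5))
print(first_five)  # ['word0', 'word1', 'word2', 'word3', 'word4']
```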

Upvotes: 1

vladimir

Reputation: 15208

To find the results in the markup, I would rely on the data-linkid attribute:

  1. First way, based on BeautifulSoup:
import requests
from bs4 import BeautifulSoup

word = str(input()).lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
result = soup.select('li > span > a[data-linkid]')[:5]

for link in result:
    print(link.string)
  2. Second way, based on lxml:
import requests
from lxml import etree

word = str(input()).lower()
url = 'https://www.thesaurus.com/browse/{}'.format(word)

response = requests.get(url)
tree = etree.HTML(response.text)
result = tree.xpath('//li/span/a[@data-linkid]')[:5]

for link in result:
    print(link.text)

P.S. In general, parsing HTML is not the best approach in the long term; I would look at free REST services such as http://thesaurus.altervista.org/ instead.
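To sketch what a REST-based approach might look like: the endpoint, query parameters, and response shape below are my assumptions about the Altervista Thesaurus API (it requires a free API key), so treat this as a starting point rather than a working client:

```python
# import requests  # needed only for the live call, commented out below

API_KEY = 'your-key-here'  # placeholder: register on the site for a real key

def build_url(word):
    # Assumed endpoint and parameter names for the Altervista Thesaurus API
    return ('http://thesaurus.altervista.org/thesaurus/v1'
            '?word={}&language=en_US&output=json&key={}'.format(word, API_KEY))

def extract_synonyms(payload, limit=5):
    # Assumed response shape: {"response": [{"list": {"synonyms": "a|b|c"}}]}
    words = []
    for entry in payload.get('response', []):
        words.extend(entry['list']['synonyms'].split('|'))
    return words[:limit]

# Live call (uncomment once you have a key):
# data = requests.get(build_url('hello')).json()
# print(extract_synonyms(data))

# Offline check against a made-up payload in the assumed shape:
sample = {'response': [{'list': {'synonyms': 'hi|howdy|greetings'}}]}
print(extract_synonyms(sample))  # ['hi', 'howdy', 'greetings']
```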

Upvotes: 0

James

Reputation: 387

As comments have mentioned, BeautifulSoup (bs4) is a great library for this. You can use bs4 to parse the entire page and then zero in on the elements you want: first the ul element that contains the words, and then the a elements that each hold a word.

import requests
from bs4 import BeautifulSoup

word = "hello"
url = 'https://www.thesaurus.com/browse/{}'.format(word)
r = requests.get(url)
returned_words_list = []

soup = BeautifulSoup(r.text, 'html.parser')
# Note: these generated CSS class names are brittle and may change when the site is rebuilt
word_ul = soup.find("ul", {"class": "css-1lc0dpe et6tpn80"})
for elem in word_ul.find_all("a")[:5]:
    returned_words_list.append(elem.text.strip())

print(returned_words_list)

Upvotes: 0
