Vedank Pande

Reputation: 446

How can I scrape text from the table in this page?

I'm trying to scrape words and their meanings from the word list on this page using bs4 and selenium, but I'm not sure how to loop through the <tr> and <td> tags after I get the table HTML from the bs4 find_all method:

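from selenium import webdriver
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html"

# The driver was never created in the original snippet; Chrome is an assumption here
driver = webdriver.Chrome()
driver.get(root)
content = driver.page_source
soup = BeautifulSoup(content,'html.parser')
# find_all returns a list of matches; take the first one
table = soup.find_all('table',attrs={'class': 'tablex border1'})[0]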

The table variable now holds the HTML for the whole table. Here's a snippet from the start and end:

<table class="tablex border1"> <tbody><tr><td><a href="https://gre.graduateshotline.com/a.pl?word=introspection" target="_blank">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=philanthropist" target="_blank">philanthropist</a></td>
.
.
.
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=refine" target="_blank">refine</a></td>
<td>make or become pure cultural </td></tr>
</tbody></table>

I'm not sure how I can access the words and their meanings using it. Any ideas?

Upvotes: 1

Views: 91

Answers (3)

MendelG

Reputation: 20128

You want the first and second <td> in each row of the <table>: the first holds the word and the second holds its meaning.

You can use the nth-of-type(n) CSS selector to specifically access the first or second <td>. For example: td:nth-of-type(1)

To use a CSS selector, use the select_one() method instead of .find().

Note:

  • There is no need to use Selenium. This can be done with the requests module, which avoids starting a browser and so runs considerably faster.
  • Since you are only looking for one <table>, use the find() method, which returns the first matching tag, instead of find_all().

Here is a complete working example:

import requests
from bs4 import BeautifulSoup


root = "https://www.graduateshotline.com/gre-word-list.html"
soup = BeautifulSoup(requests.get(root).content, "html.parser")
table = soup.find("table", attrs={"class": "tablex border1"})

fmt_string = "{:<50} {:<30}"
for tag in table:
    try:
        print(
            fmt_string.format(
                tag.select_one("td:nth-of-type(1)").text,
                tag.select_one("td:nth-of-type(2)").text,
            )
        )
    except AttributeError:
        continue

Output:

introspection                                      examining one's own thoughts and feelings
philanthropist                                     one who loves mankind         
antidote                                           medicine used against a poison or a disease
strive                                             to make great efforts, to struggle
ambidextrous                                       able to use the left hand or the right equally well
precursors                                         a person or thing that precedes, as in a  process or job.
retrospective                                      Looking back on past          
introvert                                          one who turns towards himself 

The format string {:<50} {:<30} left-aligns the word in a column 50 characters wide and the meaning in a column 30 characters wide, which lines up the output.
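As a small illustration of that format string on its own, here is a sketch using two rows copied from the table snippet in the question (the rows list and the lines variable are just for demonstration). Note that the width is a minimum: a meaning longer than 30 characters is not truncated, and a shorter one is padded with trailing spaces.

```python
# Two sample rows taken from the table snippet in the question
rows = [
    ("introspection", "examining one's own thoughts and feelings"),
    ("refine", "make or become pure cultural"),
]

# "<" left-aligns; 50 and 30 are minimum field widths, not limits
fmt_string = "{:<50} {:<30}"

lines = [fmt_string.format(word, meaning) for word, meaning in rows]
for line in lines:
    print(line)
```

In both lines the meaning starts at the same column, because the word is always padded to 50 characters before the separating space.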

Upvotes: 2

TankorSmash

Reputation: 12777

You want to iterate through all the table rows and pull the text out of each row's pair of td elements.

for row in table.find_all("tr"):
  tds = row.find_all("td")
  print(f"{tds[0].text}: {tds[1].text}")

...
repel: refuse to accept/cause dislike
superimpose: put something on the top
centurion: leader of a unit of 100 soldiers
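If you would rather keep the pairs than print them, the same loop can fill a dictionary. This is only a sketch: it parses a trimmed inline copy of the table snippet from the question instead of fetching the page, and the words dictionary and the length check are additions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed inline sample based on the snippet in the question (not the live page)
html = """
<table class="tablex border1">
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=introspection" target="_blank">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=refine" target="_blank">refine</a></td>
<td>make or become pure cultural </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "tablex border1"})

words = {}
for row in table.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) < 2:
        # Skip rows that do not hold a word/meaning pair
        continue
    # get_text(strip=True) drops surrounding whitespace such as the trailing space after "cultural"
    words[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)

print(words)
```

A dictionary makes lookups by word easy, though a list of tuples might be a better fit if the page ever repeats a word.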

For what it's worth, you can use python-requests to get a webpage's content without spinning up a browser:

import requests

content = requests.get(root).text

Upvotes: 1

Md. Fazlul Hoque

Reputation: 16187

You can also let pandas parse the table for you. read_html() returns a list of DataFrames, one for each <table> it finds, so you can collect the data this way:

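from io import StringIO

import pandas as pd
import requests

link = 'https://www.graduateshotline.com/gre-word-list.html'
# Send a browser-like User-Agent with the request
r = requests.get(link, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
table_data = pd.read_html(StringIO(r.text))
print(table_data)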

Upvotes: 2
