Reputation: 446
I'm trying to scrape words and their meanings from the word list on this page using bs4 and selenium, but I'm not sure how to loop through the <tr> and <td> tags after I get the table HTML from the bs4 find_all method:
from selenium import webdriver
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html"
# The posted snippet used driver without creating it; Chrome is assumed here.
driver = webdriver.Chrome()
driver.get(root)
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'tablex border1'})[0]
Now the table variable holds the HTML for the whole table. Here's a snippet from the start and end:
<table class="tablex border1"> <tbody><tr><td><a href="https://gre.graduateshotline.com/a.pl?word=introspection" target="_blank">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=philanthropist" target="_blank">philanthropist</a></td>
.
.
.
<tr><td><a href="https://gre.graduateshotline.com/a.pl?word=refine" target="_blank">refine</a></td>
<td>make or become pure cultural </td></tr>
</tbody></table>
I'm not sure how I can access the words and their meanings using it. Any ideas?
Upvotes: 1
Views: 91
Reputation: 20128
You want to access the first and second <td> of each row under the <table> to get the correct tags.
You can use the nth-of-type(n) CSS selector to specifically access the first or second <td>, for example td:nth-of-type(1).
To use a CSS selector, use the select_one() method instead of find().
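As a minimal sketch of the selector on its own, the snippet below parses a two-row table shaped like the one in the question and picks the first and second cell of each row. The SAMPLE_HTML string and the pairs list are made up for illustration; no request is sent to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample, shaped like the table in the question.
SAMPLE_HTML = """
<table class="tablex border1">
<tr><td><a href="#">introspection</a></td>
<td>examining one's own thoughts and feelings</td></tr>
<tr><td><a href="#">philanthropist</a></td>
<td>one who loves mankind</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", attrs={"class": "tablex border1"})

pairs = []
for row in table.select("tr"):
    # select_one() returns None when nothing matches, so check before reading text.
    word = row.select_one("td:nth-of-type(1)")
    meaning = row.select_one("td:nth-of-type(2)")
    if word is not None and meaning is not None:
        pairs.append((word.get_text(strip=True), meaning.get_text(strip=True)))

print(pairs)
```

Checking for None explicitly avoids relying on an AttributeError to skip rows that lack a second cell.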
Note:
There's no need to use Selenium here; this can be done with the requests module, which will greatly improve the performance of your code.
To get the <table>, use the find() method, which returns the first found tag, instead of find_all().
Here is a complete working example:
import requests
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html"
soup = BeautifulSoup(requests.get(root).content, "html.parser")
table = soup.find("table", attrs={"class": "tablex border1"})

fmt_string = "{:<50} {:<30}"
for tag in table:
    try:
        print(
            fmt_string.format(
                tag.select_one("td:nth-of-type(1)").text,
                tag.select_one("td:nth-of-type(2)").text,
            )
        )
    except AttributeError:
        continue
Output:
introspection                                      examining one's own thoughts and feelings
philanthropist                                     one who loves mankind
antidote                                           medicine used against a poison or a disease
strive                                             to make great efforts, to struggle
ambidextrous                                       able to use the left hand or the right equally well
precursors                                         a person or thing that precedes, as in a process or job.
retrospective                                      Looking back on past
introvert                                          one who turns towards himself
The format string {:<50} {:<30} left-aligns the two values in fields 50 and 30 characters wide, which lines the output up in columns.
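To see what the format spec does in isolation, here is a small sketch using one word and meaning from the output above. The variable names are only illustrative.

```python
fmt_string = "{:<50} {:<30}"

word = "antidote"
meaning = "medicine used against a poison or a disease"

line = fmt_string.format(word, meaning)

# The word is padded with spaces on the right up to 50 characters,
# then comes the single literal space from the format string.
print(repr(line[:51]))

# A value longer than its field width (30 here) is not truncated;
# the width is only a minimum.
print(line[51:])
```

Because the width is a minimum, a word longer than 50 characters would push its meaning to the right instead of being cut off.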
Upvotes: 2
Reputation: 12777
You want to iterate through all the table rows and pull out the text from the pair of td elements in each one.
for row in table.find_all("tr"):
    tds = row.find_all("td")
    print(f"{tds[0].text}: {tds[1].text}")
...
repel: refuse to accept/cause dislike
superimpose: put something on the top
centurion: leader of a unit of 100 soldiers
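If the pairs are needed later rather than just printed, one option is to collect them into a dict. This is a sketch on a made-up SAMPLE_HTML string (including a deliberately incomplete row), not the real page; the words name is also just illustrative. The length check is there because indexing tds[1] on a row with a single cell would raise IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the middle row has only one cell on purpose.
SAMPLE_HTML = """
<table class="tablex border1">
<tr><td><a href="#">repel</a></td><td>refuse to accept/cause dislike</td></tr>
<tr><td>incomplete row</td></tr>
<tr><td><a href="#">centurion</a></td><td>leader of a unit of 100 soldiers</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

words = {}
for row in table.find_all("tr"):
    tds = row.find_all("td")
    if len(tds) < 2:
        # Skip rows that do not have both a word and a meaning.
        continue
    words[tds[0].get_text(strip=True)] = tds[1].get_text(strip=True)

for word, meaning in words.items():
    print(f"{word}: {meaning}")
```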
For what it's worth, you can use python-requests to get a webpage's content without spinning up a browser:
import requests
content = requests.get(root).text
Upvotes: 1
Reputation: 16187
The table data is present in the plain response, so you can collect it with pandas this way; read_html parses the tables in the HTML and returns a list of DataFrames.
from io import StringIO

import pandas as pd
import requests

link = 'https://www.graduateshotline.com/gre-word-list.html'
r = requests.get(link, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO.
table_data = pd.read_html(StringIO(r.text))
print(table_data)
Upvotes: 2