Reputation: 23
UPDATE: HedgeHog's answer worked. To overcome the numpy issue, I uninstalled numpy-1.19.4 and installed the previous version numpy-1.19.3.
[Python 3.9.0 and BeautifulSoup 4.9.0.]
I am trying to use the BeautifulSoup library in Python to parse the HTML table found on the Department of Justice's Office of Legal Counsel website, and write the data to a CSV file. The table can be found at https://www.justice.gov/olc/opinions?keys=&items_per_page=40.
The table is deeply nested within 11 <div> elements. The abridged, prettified version of the HTML down to the table's location is:
<html>
<body>
<section>
<11 continually nested div elements>
...
<table>
</table>
...
</divs>
</section>
</body>
</html>
The table is a simple three-column table, topped with a header row (which is inside a <thead> element), as shown below:
Date | Title | Headnotes |
---|---|---|
01/19/2021 | Preemption of State and Local Requirements Under a PREP Act Declaration | The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines. |
The <tr> elements have one of four different classes:

- <tr class="odd views-row-first"> - exists only on the very first row after the header row
- <tr class="even"> - appears on every even table row
- <tr class="odd"> - appears on every odd row after the first row
- <tr class="even views-row-last"> - appears on the very last row (the user can choose to see 10, 20, or 40 items per page, which means the last row will always be even)

Within the <tr> elements, each <td> element naturally corresponds to one of the data types (date, title, headnotes). Notwithstanding the specific <tr> class, each table row follows the same general format:
<tr class="odd-or-even/first-or-last">
<td class="views-field views-field-field-opinion-post-date active">
<span class="date-display-single" . . . >
01/01/1970
</span>
</td>
<td class="views-field views-field-field-opinion-attachment-file">
<a href="/olc/files/file-number/download">
Title
</a>
</td>
<td class="views-field views-field-field-opinion-overview">
<p>
Headnotes
</p>
<p>
Some headnotes have multiple paragraph elements.
</p>
</td>
</tr>
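For reference, a single row of this shape parses cleanly in isolation (a sketch using the placeholder values above, not the live data):

```python
from bs4 import BeautifulSoup

# One row, matching the general format above (placeholder values).
row_html = """
<tr class="odd views-row-first">
  <td class="views-field views-field-field-opinion-post-date active">
    <span class="date-display-single">01/01/1970</span>
  </td>
  <td class="views-field views-field-field-opinion-attachment-file">
    <a href="/olc/files/file-number/download">Title</a>
  </td>
  <td class="views-field views-field-field-opinion-overview">
    <p>Headnotes</p>
  </td>
</tr>
"""
row = BeautifulSoup(row_html, "html.parser")
print([td.get_text(strip=True) for td in row.find_all("td")])
# ['01/01/1970', 'Title', 'Headnotes']
```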
All of the Python scripts I have used have started with this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
f = open("olc-op.csv", "w", encoding="utf-8")
headers = "Date, Title, Headnotes \n"
f.write(headers)
My tinkering has primarily been focused on the find_all() argument and the for loop.
The problem I am having is that I either get only a single row in my CSV file or the error in the title of this post.
Since all of the <td> elements I want to scrape are within the <tbody> element, I ran tbody through find_all():
results = soup.find_all("tbody")
In the for loop, I specified <td> as the element, followed by the class name applied to each piece of data:
for result in results:
    date = result.find("td", class_="views-field views-field-field-opinion-post-date active").text
    title = result.find("td", class_="views-field views-field-field-opinion-attachment-file").text
    headnotes = result.find("td", class_="views-field views-field-field-opinion-overview").text
    data = date + "," + title + "," + headnotes
    f.write(data)
The output of the above code in the CSV file is:
Date,Title,Headnotes
01/19/2021 ,
Preemption of State and Local Requirements Under a PREP Act Declaration ,
The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
Yes, the data is technically separated by a comma, but not in the way I intended. There is also some unneeded whitespace after the header row.
I replaced the .text at the end of the .find() statements with .striped_strings, which returned the following TypeError:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
To try and overcome this error, I changed f.write(data) to f.write(str(data)) in the for loop, and received the same TypeError.
I did some further research, and changed the end of each variable assignment in the for loop from .striped_strings to .get_text(strip=True). I also changed my f.write() statement to:
f.write(date + "," + title + "," + headnotes)
These changes yielded one perfectly scraped table row, in addition to the header row:
Date, Title, Headnotes
01/19/2021,Preemption of State and Local Requirements Under a PREP Act Declaration,The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
But I obviously wanted to loop over the entire table and get all of the table rows.
The second-to-last thing I tried was to get more specific in the find_all() statement. I changed it from tbody to tr with no class specified, so it would (I thought) return all of the <tr> elements, which I could then parse for the specific <td> elements. Instead, I got this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
The final change I made was to switch .get_text(strip=True) back to .text, which resulted in the error in the title of this post:
AttributeError: 'NoneType' object has no attribute 'text'
Where have I gone wrong?
Upvotes: 2
Views: 145
Reputation: 25048
pandas
Always ask yourself: is there an easier way to reach my goal? There is - you can simply use pandas to do it in two lines; in your case, it does all of the work for you. I will also go through your question and try to answer it.
Example
import pandas as pd
pd.read_html('https://www.justice.gov/olc/opinions?keys=&items_per_page=40')[0].to_csv('olc-op.csv', index=False)
Since you put real effort into asking your question, I will go a few bonus miles and explain what happened. There are two major points that prevented you from reaching your goal.
Selecting the right things
The reason there is only one line in your CSV is this call:
soup.find_all("tbody")
Your loop only runs once, because there is only one tbody. You figured out the structure and talked about the <tr> elements, but did not select them for looping.
Writing your lines
Even if you had fixed the above, you would still have found only one line in the CSV, because the \n was missing from your string.
I hope that helps you understand what went wrong, and that you can use it in case pandas does not work (because of dynamically served content, ...).
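One more detail worth noting (my reading of your traceback, so treat it as an educated guess): the .striped_strings you tried is a misspelling of .stripped_strings. BeautifulSoup treats an unknown attribute name as a search for a child tag with that name, which returns None - hence the 'NoneType' in your TypeError:

```python
from bs4 import BeautifulSoup

td = BeautifulSoup("<td><span>01/19/2021</span></td>", "html.parser").td

# Misspelled name: BeautifulSoup falls back to td.find("striped_strings");
# no such child tag exists, so the result is None.
print(td.striped_strings)         # None
# None + "," is what raised: TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

# Correct spelling: a generator of the tag's stripped strings.
print(list(td.stripped_strings))  # ['01/19/2021']
```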
Example
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
with open("olc-op.csv", "a+", encoding="utf-8") as f:
    headers = "Date, Title, Headnotes \n"
    f.write(headers)

    for result in soup.select("tbody tr"):
        tds = result.find_all("td")
        date = tds[0].get_text(strip=True)
        title = tds[1].get_text(strip=True)
        headnotes = tds[2].get_text(strip=True)
        data = date + "," + title + "," + headnotes + '\n'
        f.writelines(data)
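One caveat with the plain string concatenation above: the headnotes themselves contain commas, so those columns will split apart in the CSV. The stdlib csv module quotes such fields automatically - a sketch with placeholder rows (same selectors as above, no network call):

```python
import csv
import io

from bs4 import BeautifulSoup

# Placeholder markup with the same tbody/tr/td structure as the live page.
html = """
<table><tbody>
  <tr class="odd views-row-first">
    <td>01/19/2021</td>
    <td>Preemption of State and Local Requirements Under a PREP Act Declaration</td>
    <td>First clause, second clause, third clause.</td>
  </tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

buf = io.StringIO()  # stands in for open("olc-op.csv", "w", newline="", encoding="utf-8")
writer = csv.writer(buf)  # quotes fields containing commas, so columns stay aligned
writer.writerow(["Date", "Title", "Headnotes"])
for row in soup.select("tbody tr"):
    writer.writerow(td.get_text(strip=True) for td in row.find_all("td"))

print(buf.getvalue())
```

The row with commas in its headnotes comes out as a single, properly quoted third column instead of being split across several columns.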
Upvotes: 1