Reputation: 23
UPDATE: HedgeHog's answer worked. To overcome the numpy issue, I uninstalled numpy-1.19.4 and installed the previous version numpy-1.19.3.
[Python 3.9.0 and BeautifulSoup 4.9.0.]
I am trying to use the BeautifulSoup library in Python to parse the HTML table found on the Department of Justice's Office of Legal Counsel website, and write the data to a CSV file. The table can be found at https://www.justice.gov/olc/opinions?keys=&items_per_page=40.
The table is deeply nested within 11 <div> elements. The abridged, prettified version of the HTML down to the table's location is:
<html>
<body>
<section>
<11 continually nested div elements>
...
<table>
</table>
...
</divs>
</section>
</body>
</html>
The table is a simple three-column table, topped with a header row (which is inside a <thead> element), as shown below:
Date | Title | Headnotes |
---|---|---|
01/19/2021 | Preemption of State and Local Requirements Under a PREP Act Declaration | The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines. |
The <tr> elements have one of four different classes:

- <tr class="odd views-row-first"> - exists only on the very first row after the header row
- <tr class="even"> - appears on every even table row
- <tr class="odd"> - appears on every odd row after the first row
- <tr class="even views-row-last"> - appears on the very last row (the user can choose to see 10, 20, or 40 items per page, which means the last row will always be even)

Within the <tr> elements, each <td> element naturally corresponds to one of the data types (date, title, headnotes). Notwithstanding the specific <tr> class, each table row follows the same general format:
<tr class="odd-or-even/first-or-last">
<td class="views-field views-field-field-opinion-post-date active">
<span class="date-display-single" . . . >
01/01/1970
</span>
</td>
<td class="views-field views-field-field-opinion-attachment-file">
<a href="/olc/files/file-number/download">
Title
</a>
</td>
<td class="views-field views-field-field-opinion-overview">
<p>
Headnotes
</p>
<p>
Some headnotes have multiple paragraph elements.
</p>
</td>
</tr>
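For reference, a single row of this shape parses cleanly in isolation (a sketch using the placeholder values above, not the live data):

```python
from bs4 import BeautifulSoup

# One row, matching the general format above (placeholder values).
row_html = """
<tr class="odd views-row-first">
  <td class="views-field views-field-field-opinion-post-date active">
    <span class="date-display-single">01/01/1970</span>
  </td>
  <td class="views-field views-field-field-opinion-attachment-file">
    <a href="/olc/files/file-number/download">Title</a>
  </td>
  <td class="views-field views-field-field-opinion-overview">
    <p>Headnotes</p>
  </td>
</tr>
"""
row = BeautifulSoup(row_html, "html.parser")
print([td.get_text(strip=True) for td in row.find_all("td")])
# ['01/01/1970', 'Title', 'Headnotes']
```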
All of the Python scripts I have used have started with this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
f = open("olc-op.csv", "w", encoding="utf-8")
headers = "Date, Title, Headnotes \n"
f.write(headers)
My tinkering has primarily been focused on the find_all() argument and the for loop.
The problem I am having is that I either get only a single row in my CSV file or the error in the title of this post.
Since all of the <td> elements I want to scrape are within the <tbody> element, I ran tbody through find_all():
results = soup.find_all("tbody")
In the for loop, I specified <td> as the element, followed by the class name applied to each piece of data:
for result in results:
    date = result.find("td", class_="views-field views-field-field-opinion-post-date active").text
    title = result.find("td", class_="views-field views-field-field-opinion-attachment-file").text
    headnotes = result.find("td", class_="views-field views-field-field-opinion-overview").text
    data = date + "," + title + "," + headnotes
    f.write(data)
The output of the above code in the CSV file is:
Date,Title,Headnotes
01/19/2021 ,
Preemption of State and Local Requirements Under a PREP Act Declaration ,
The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
Yes, the data is technically separated by a comma, but not in the way I intended. There is also some unneeded whitespace after the header row.
I replaced the .text at the end of the .find() statements with .striped_strings, which returned the following TypeError:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
To try and overcome this error, I changed f.write(data) to f.write(str(data)) in the for loop, and received the same TypeError.
I did some further research, and changed the end of each variable assignment in the for loop from .striped_strings to .get_text(strip=True). I also changed my f.write() statement to:
f.write(date + "," + title + "," + headnotes)
These changes yielded one perfectly scraped table row, in addition to the header row:
Date, Title, Headnotes
01/19/2021,Preemption of State and Local Requirements Under a PREP Act Declaration,The Public Readiness and Emergency Preparedness Act and the COVID -19 declaration issued by the Secretary of Health and Human Services under that Act preempt state or local requirements, such as state licensing laws, that would prohibit or effectively prohibit qualifying state-licensed pharmacists from ordering and administering FDA-approved COVID -19 tests and FDA-authorized or FDA-licensed COVID -19 vaccines.
But I obviously wanted to loop over the entire table and get all of the table rows.
The second-to-last thing I tried was to get more specific in the find_all() statement. I changed it from tbody to tr with no class specified, so it would (I thought) return all of the <tr> elements, which I could then parse for the specific <td> elements. Instead, I got this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
The final change I made was to switch .get_text(strip=True) back to .text, which resulted in the error in the title of this post:
AttributeError: 'NoneType' object has no attribute 'text'
Where have I gone wrong?
Upvotes: 2
Views: 145
Reputation: 25048
pandas
Always ask yourself: is there an easier way to reach my goal? There is - you can simply use pandas to do it in two lines; in your case, it does all of the work for you. I will also go through your question and try to answer it.
Example
import pandas as pd
pd.read_html('https://www.justice.gov/olc/opinions?keys=&items_per_page=40')[0].to_csv('olc-op.csv', index=False)
Since you put real effort into asking your question, I will go a few bonus miles and explain what happened. There are two major points that prevented you from reaching your goal.
Selecting the right things
The reason there is only one line in your CSV is this call:
soup.find_all("tbody")
Your loop only runs once, because there is only one tbody. You figured out the structure and talked about the <tr> elements, but did not select them for looping.
Writing your lines
Even if you had fixed the above, you would still have found only one line in the CSV, because the \n was missing from your string.
I hope that helps you understand what went wrong, and that you can use it in case pandas does not work (because of dynamically served content, ...).
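One more detail worth noting (my reading of your traceback, so treat it as an educated guess): the .striped_strings you tried is a misspelling of .stripped_strings. BeautifulSoup treats an unknown attribute name as a search for a child tag with that name, which returns None - hence the 'NoneType' in your TypeError:

```python
from bs4 import BeautifulSoup

td = BeautifulSoup("<td><span>01/19/2021</span></td>", "html.parser").td

# Misspelled name: BeautifulSoup falls back to td.find("striped_strings");
# no such child tag exists, so the result is None.
print(td.striped_strings)         # None
# None + "," is what raised: TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

# Correct spelling: a generator of the tag's stripped strings.
print(list(td.stripped_strings))  # ['01/19/2021']
```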
Example
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.justice.gov/olc/opinions?keys=&items_per_page=40")
soup = BeautifulSoup(r.text, "html.parser")
with open("olc-op.csv", "a+", encoding="utf-8") as f:
    headers = "Date, Title, Headnotes \n"
    f.write(headers)

    for result in soup.select("tbody tr"):
        tds = result.find_all("td")
        date = tds[0].get_text(strip=True)
        title = tds[1].get_text(strip=True)
        headnotes = tds[2].get_text(strip=True)
        data = date + "," + title + "," + headnotes + '\n'
        f.writelines(data)
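One caveat with the plain string concatenation above: the headnotes themselves contain commas, so those columns will split apart in the CSV. The stdlib csv module quotes such fields automatically - a sketch with placeholder rows (same selectors as above, no network call):

```python
import csv
import io

from bs4 import BeautifulSoup

# Placeholder markup with the same tbody/tr/td structure as the live page.
html = """
<table><tbody>
  <tr class="odd views-row-first">
    <td>01/19/2021</td>
    <td>Preemption of State and Local Requirements Under a PREP Act Declaration</td>
    <td>First clause, second clause, third clause.</td>
  </tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

buf = io.StringIO()  # stands in for open("olc-op.csv", "w", newline="", encoding="utf-8")
writer = csv.writer(buf)  # quotes fields containing commas, so columns stay aligned
writer.writerow(["Date", "Title", "Headnotes"])
for row in soup.select("tbody tr"):
    writer.writerow(td.get_text(strip=True) for td in row.find_all("td"))

print(buf.getvalue())
```

The row with commas in its headnotes comes out as a single, properly quoted third column instead of being split across several columns.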
Upvotes: 1