Reputation: 93
I wrote this code to get news on a specific topic from CNN, but I am getting an error. Here is the code:
from bs4 import BeautifulSoup
import requests
import csv
import re
serch_term = input('What News are you looking for today? ')
url = f'https://edition.cnn.com/search?q={serch_term}'
page = requests.get(url)
doc = BeautifulSoup(page, "html.parser")
page_text = doc.find_all(class_="cnn-search__result-headline")
print(page_text)
But I am getting this error. I have already tried several things, but none of them worked for me:
What News are you looking for today? coronavirus
Traceback (most recent call last):
File "c:\Users\user\Desktop\Informatik\Praktik\Projekte\Python\news_automation\main.py", line 10, in <module>
doc = BeautifulSoup(page, "html.parser")
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\__init__.py", line 312, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
I have already searched online and tried a number of fixes, but none of them worked. Does anyone know what is wrong?
Upvotes: 2
Views: 4976
Reputation: 497
As for why your original code is not working: the page variable is a requests Response object, but Beautiful Soup expects the markup itself (a string of HTML), which is why its internal len(markup) call fails.
This is fixed with doc = BeautifulSoup(page.text, "html.parser")
However, it looks like CNN uses a JavaScript library to render its site, which makes scraping the data somewhat harder: requests only returns the initial HTML, so you would need a tool like Selenium or pyppeteer to run a headless browser and render the content.
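To illustrate the TypeError explained above, here is a minimal sketch that needs neither requests nor bs4. FakeResponse and try_len are hypothetical names made up for this example; FakeResponse only mimics the relevant property of a Response object (it has a .text attribute but no length), and the traceback in the question shows that Beautiful Soup calls len() on whatever markup it is given.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response: has .text but defines no __len__."""

    def __init__(self, text):
        self.text = text


def try_len(markup):
    """Return len(markup), or the TypeError message if the object has no length."""
    try:
        return len(markup)
    except TypeError as exc:
        return str(exc)


page = FakeResponse("<html><body><h1>Headline</h1></body></html>")

# Passing the object itself fails in the same way as in the question.
print(try_len(page))

# Passing the HTML string works, which is why page.text fixes the error.
print(try_len(page.text))
```

The first call reports that the object has no len(), while the second returns the length of the HTML string.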
Following on from @Lenatian's answer, the easiest way is to call the CNN API directly and not scrape the page at all. I have updated the code to check for errors and to handle the case where the API returns no results. If the call succeeds, it prints a list of the headlines matching the search query.
You can also print the value of results to see all the other data that is sent back from the API.
import requests

# search_term = input('What News are you looking for today? ')
search_term = "Corona Virus"
url = f"https://search.api.cnn.io/content?q={search_term}"
response = requests.get(url)
# Stop early if the API call itself failed
if not response.ok:
    raise Exception("There was an error calling the API")
response = response.json()
results = response.get("result")
if not results:
    print("Could not find any results for your search term :(")
else:
    # Only print headlines when there is at least one result
    print([result.get("headline") for result in results])
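The headline extraction above can also be separated from the network call, which makes it easy to try out without hitting the API. This is only a sketch: extract_headlines is a hypothetical helper name, the two payloads are made-up sample data, and the "result"/"headline" keys are assumed to match what the code above reads from the response.

```python
def extract_headlines(payload):
    """Return the headlines found in a decoded API payload, or [] if there are none."""
    # A missing or empty "result" key is treated as "no results".
    results = payload.get("result") or []
    return [item.get("headline") for item in results if item.get("headline")]


# Hypothetical sample payloads; a real response is assumed to carry more fields.
sample_payload = {
    "result": [
        {"headline": "First headline", "url": "https://example.com/1"},
        {"headline": "Second headline"},
    ]
}
empty_payload = {"result": []}

print(extract_headlines(sample_payload))
print(extract_headlines(empty_payload))
```

In the real script you would pass it the decoded response, for example extract_headlines(response.json()).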
Upvotes: 0
Reputation: 132
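I tested it myself. You need to pass the text of the response, not the response object itself, to BeautifulSoup, so change this line of code as follows:
from: page = requests.get(url) to: source = requests.get(url) followed by page = source.text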
Extra information:
I found that you can use search.api.cnn.io as follows and load the response directly as JSON, as in the code below. All you then need to do is extract the information you need.
import json
import requests

# serch_term is the variable defined in the question's code
url = f"https://search.api.cnn.io/content?q={serch_term}"
extra_parameters_sample_url = "https://search.api.cnn.io/content?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100&from=0"
source = requests.get(url).text
json_response = json.loads(source)
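Rather than writing the query string by hand, the extra parameters from the sample URL can be assembled with the standard library, which also takes care of encoding spaces in the search term. This is a sketch only: build_search_url is a hypothetical helper, and the parameter names (sort, category, size, from) are simply copied from the sample URL above, not taken from any official documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the sample URL above.
BASE_URL = "https://search.api.cnn.io/content"


def build_search_url(term, sort="newest", categories=None, size=10, start=0):
    """Build a search URL; parameter names are assumed from the sample URL."""
    params = {"q": term, "sort": sort, "size": size, "from": start}
    if categories:
        # The sample URL passes categories as one comma-separated value.
        params["category"] = ",".join(categories)
    # safe="," keeps the commas in the category list readable.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_search_url("corona virus", categories=["health", "world"], size=100))
```

The resulting string can be passed straight to requests.get. Alternatively, requests accepts a params dictionary and performs the same encoding itself.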
Upvotes: 2