Noam Kramer

Reputation: 93

requests.exceptions.InvalidURL: Failed to parse: <Response [200]> in python

I wrote this code to get news on a specific topic from CNN, but right now I'm getting an error. Here is the code:

from bs4 import BeautifulSoup
import requests
import csv
import re

serch_term = input('What News are you looking for today? ')

url = f'https://edition.cnn.com/search?q={serch_term}'
page = requests.get(url)
doc = BeautifulSoup(page, "html.parser")

page_text = doc.find_all(class_="cnn-search__result-headline")
print(page_text)

But I am getting this error. I have already tried a bunch of things, but none of them worked for me:

What News are you looking for today? coronavirus
Traceback (most recent call last):
  File "c:\Users\user\Desktop\Informatik\Praktik\Projekte\Python\news_automation\main.py", line 10, in <module>
    doc = BeautifulSoup(page, "html.parser")
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\site-packages\bs4\__init__.py", line 312, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()

I googled it as well, without success. Does someone know what is wrong?

Upvotes: 2

Views: 4976

Answers (2)

James Burgess

Reputation: 497

As for why your original code is not working: the page variable is a Response object, but BeautifulSoup expects the HTML markup itself (a string).

This is fixed with doc = BeautifulSoup(page.text, "html.parser")
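
To see the parsing step in isolation, here is a minimal sketch that feeds BeautifulSoup a plain string instead of a Response. The sample_html markup and the extract_headlines helper are made up for illustration (the real CNN markup will differ); only the class name comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for page.text; the real CNN HTML will differ.
sample_html = """
<div class="cnn-search__result-headline"><a href="/a">First headline</a></div>
<div class="cnn-search__result-headline"><a href="/b">Second headline</a></div>
"""


def extract_headlines(html: str) -> list[str]:
    """Parse an HTML string and return the text of every headline element."""
    doc = BeautifulSoup(html, "html.parser")
    nodes = doc.find_all(class_="cnn-search__result-headline")
    return [node.get_text(strip=True) for node in nodes]


headlines = extract_headlines(sample_html)
print(headlines)
```

With a live request you would call extract_headlines(page.text) instead of passing the sample string.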

However, it looks like CNN uses a JavaScript library to render its site, which makes scraping the data somewhat harder: you would need a tool like Selenium or pyppeteer to run a headless browser and render the content.

Following on from @Lentian's answer, the easiest way is to call the CNN API and not scrape the page at all. I have updated the code to check for errors and to handle the case where the API returns no results. If successful, it prints a list of the headlines matching the search query.

You can also print the value of results to see all the other data that the API sends back.

import requests

# search_term = input('What News are you looking for today? ')
search_term = "Corona Virus"
url = "https://search.api.cnn.io/content"

# Passing the term via params lets requests URL-encode it (e.g. the space).
response = requests.get(url, params={"q": search_term})

if not response.ok:
    raise Exception("There was an error calling the API")

data = response.json()
results = data.get("result")

if not results:
    print("Could not find any results for your search term :(")
else:
    # Print the headline of every result.
    print([result.get("headline") for result in results])
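
If you want to try the extraction step without calling the API, here is a small offline sketch. The sample_body payload is hypothetical: only the "result" key and the "headline" field come from the code above, the "url" field is invented for illustration, and headlines_from_payload is just a helper name I picked.

```python
import json

# Hypothetical response body. Only "result" and "headline" are taken from the
# answer above; the "url" field is made up for illustration.
sample_body = (
    '{"result": ['
    '{"headline": "Story A", "url": "https://example.com/a"}, '
    '{"url": "https://example.com/b"}'
    "]}"
)


def headlines_from_payload(payload: dict) -> list[str]:
    """Return the headline of every result that actually has one."""
    results = payload.get("result") or []  # tolerate a missing or null "result"
    return [item["headline"] for item in results if item.get("headline")]


payload = json.loads(sample_body)
print(sorted(payload["result"][0].keys()))  # which fields the first result carries
print(headlines_from_payload(payload))
```

Skipping results without a headline avoids None entries in the printed list.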

Upvotes: 0

Lentian Latifi

Reputation: 132

I tested it myself. Keep the response in its own variable and pass its text on to BeautifulSoup:

source = requests.get(url)
page = source.text

Extra information:

I found that you can use search.api.cnn.io directly and load the response as JSON, as in the code below. All that is left is to extract the information you need.

import json
import requests

url = f"https://search.api.cnn.io/content?q={serch_term}"

# Sample URL with extra parameters:
extra_parameters_sample_url = "https://search.api.cnn.io/content?q=coronavirus&sort=newest&category=business,us,politics,world,opinion,health&size=100&from=0"

source = requests.get(url).text
json_response = json.loads(source)
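
Rather than gluing the query string together by hand, you could build it with the standard library so that spaces and special characters get encoded. This is only a sketch: build_search_url is a hypothetical helper, the parameter names (q, sort, size, from) are the ones from the sample URL above, and the default values are my own assumptions.

```python
from urllib.parse import urlencode

API_URL = "https://search.api.cnn.io/content"


def build_search_url(term: str, size: int = 10, offset: int = 0, sort: str = "newest") -> str:
    """Build a search URL; the default values are assumptions, not API defaults."""
    params = {"q": term, "sort": sort, "size": size, "from": offset}
    # urlencode escapes each value, e.g. a space becomes "+".
    return f"{API_URL}?{urlencode(params)}"


print(build_search_url("Corona Virus"))
```

The resulting string can be passed straight to requests.get.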

Upvotes: 2
