Robert Padgett

Reputation: 103

Using Python requests for several URLs in a DataFrame

I have a CSV that I read with pandas, and it looks like this:

|   | URL               | Status Code |
|---|-------------------|-------------|
| 0 | www.example.com   | 404         |
| 1 | www.example.com/2 | 404         |

I want to check if the URLs on the second column are still responding with 404. I have this code:

url = df['URL']
urlData = requests.get(url).content
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
print(rawData)

I get the following error:

InvalidSchema: No connection adapters were found for '0    http://www.example.com
1    http://www.example.com/2
Name: URL, dtype: object'

I searched several questions but could not find the answer. Any help is appreciated.

Upvotes: 1

Views: 1777

Answers (3)

thehappycheese

Reputation: 353

If you are in a Jupyter notebook, you can easily use pandas-aiohttp (disclaimer: I just published this package):

import pandas as pd
import pandas_aiohttp

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

data = await example_urls.aiohttp.get_text()
data

0    {\n  "userId": 1,\n  "id": 1,\n  "title": "sun...
1    {\n  "userId": 1,\n  "id": 2,\n  "title": "qui...
dtype: object

Note: you can add assert pandas_aiohttp on the line after import pandas_aiohttp to stop your IDE from flagging the apparently "unused" import. The package works by registering a custom Series accessor (i.e. monkey patching, which I feel is only OK because pandas documents it as a feature).
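For context, registering such an accessor uses pandas' documented extension API. Here is a minimal toy sketch of the mechanism (this is not the actual pandas_aiohttp source; the accessor name and method are made up for illustration):

import pandas as pd

@pd.api.extensions.register_series_accessor("greet")
class GreetAccessor:
    """Toy accessor: gives every Series a .greet namespace."""
    def __init__(self, series: pd.Series):
        self._series = series

    def hello(self) -> pd.Series:
        # Prefix every value in the Series with a greeting.
        return "hello, " + self._series.astype(str)

s = pd.Series(["world", "pandas"])
print(s.greet.hello())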

If you are not in a Jupyter notebook, there is some extra work to start your own async event loop:

import pandas as pd
import pandas_aiohttp
import asyncio

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

async def main():
    data = await example_urls.aiohttp.get_text()
    print(data)

asyncio.run(main())

By default this uses 100 parallel connections, so it should be much faster than most other methods.
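For comparison, fetching many URLs concurrently with plain aiohttp looks roughly like this (a hedged sketch of the general technique, not pandas_aiohttp's actual implementation; the limit parameter mimics its default of 100 connections):

import asyncio
import aiohttp

async def fetch_all(urls, limit=100):
    # Cap the number of simultaneous requests with a semaphore.
    semaphore = asyncio.Semaphore(limit)

    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                return await response.text()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

texts = asyncio.run(fetch_all([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
]))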

Upvotes: 0

randomir

Reputation: 18697

requests.get is not broadcastable, so you'll have to either call it for each URL with pandas.Series.apply:

>>> df['New Status Code'] = df.URL.apply(lambda url: requests.get(url).status_code)
>>> df
   Status Code                URL  New Status Code
0          404    www.example.com              404
1          404  www.example.com/2              404

or use numpy.vectorize:

>>> import numpy
>>> vectorized_get = numpy.vectorize(lambda url: requests.get(url).status_code)
>>> df['New Status Code'] = vectorized_get(df.URL)

Upvotes: 4

David

Reputation: 775

df['URL'] is going to return a Series of data, not a single value. I suspect your code is blowing up on the requests.get(url).content line.

Can you post more of the code?

You may want to look at the apply function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html.
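For example, apply lets you call requests.get once per URL while also guarding against dead hosts (a hedged sketch; the check_url helper and the timeout value are illustrative, not from the question):

import requests

def check_url(url, timeout=5):
    """Return the HTTP status code, or None if the request itself fails."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None

# df is the DataFrame from the question.
df['New Status Code'] = df['URL'].apply(check_url)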

Upvotes: 0
