Reputation: 103
I have a CSV that I read using pandas, and it looks like this:
|   | URL               | Status Code |
|---|-------------------|-------------|
| 0 | www.example.com   | 404         |
| 1 | www.example.com/2 | 404         |
I want to check if the URLs in the second column are still responding with 404. I have this code:
import io
import pandas as pd
import requests

url = df['URL']
urlData = requests.get(url).content
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
print(rawData)
I get the following error:
InvalidSchema: No connection adapters were found for '0 http://www.example.com
Name: URL, dtype: object'
I searched several questions but could not find the answer. Any help is appreciated.
Upvotes: 1
Views: 1777
Reputation: 353
If you are in a Jupyter notebook you can easily use pandas-aiohttp (disclaimer: I just published this package):
import pandas as pd
import pandas_aiohttp
example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])
data = await example_urls.aiohttp.get_text()
0 {\n "userId": 1,\n "id": 1,\n "title": "sun...
1 {\n "userId": 1,\n "id": 2,\n "title": "qui...
dtype: object
Note: you can add assert pandas_aiohttp on the line after import pandas_aiohttp to prevent your IDE from flagging the apparently "unused import". This package works by registering a custom accessor (i.e. monkey patching, which I feel is only OK because pandas documents it as a feature).
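For context, here is a minimal sketch of that accessor mechanism using pandas' documented pd.api.extensions.register_series_accessor API; the accessor name "demo" and its method are made up for illustration:

import pandas as pd

# Registering "demo" makes `some_series.demo` resolve to an instance of
# DemoAccessor constructed with that Series.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, series: pd.Series):
        self._series = series

    def shout(self) -> pd.Series:
        # Upper-case every string element of the wrapped Series.
        return self._series.str.upper()

s = pd.Series(["hello", "world"])
print(s.demo.shout())
# 0    HELLO
# 1    WORLD
# dtype: object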
If you are not in a Jupyter notebook, there is some extra work to start your own async event loop:
import pandas as pd
import pandas_aiohttp
import asyncio

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

async def main():
    data = await example_urls.aiohttp.get_text()
    print(data)

asyncio.run(main())
By default this will use 100 parallel connections, which should be far faster than fetching the URLs one at a time.
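If you are curious what that parallelism looks like without the package, here is a minimal sketch using aiohttp directly, with an asyncio.Semaphore bounding the number of in-flight requests; the limit of 100 mirrors the default mentioned above, and the helper names are my own:

import asyncio
import aiohttp
import pandas as pd

async def fetch_status(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        async with session.get(url) as response:
            return response.status

async def fetch_all(urls, limit=100):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_status(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])
statuses = asyncio.run(fetch_all(urls))
print(pd.Series(statuses, index=list(urls)))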
Upvotes: 0
Reputation: 18697
The requests.get call is not broadcastable, so you'll have to either call it for each URL with pandas.Series.apply:
>>> df['New Status Code'] = df.URL.apply(lambda url: requests.get(url).status_code)
>>> df
Status Code URL New Status Code
0 404 www.example.com 404
1 404 www.example.com/2 404
or use numpy.vectorize:
>>> import numpy
>>> vectorized_get = numpy.vectorize(lambda url: requests.get(url).status_code)
>>> df['New Status Code'] = vectorized_get(df.URL)
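As a side note, if you only need the status code and not the response body, requests.head avoids downloading the body (allow_redirects defaults to False for HEAD, so it is set explicitly here); whether the target servers answer HEAD the same way they answer GET is an assumption:

>>> df['New Status Code'] = df.URL.apply(lambda url: requests.head(url, allow_redirects=True).status_code)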
Upvotes: 4
Reputation: 775
df['URL'] returns a whole Series of data, not a single value, so I suspect your code is blowing up on the requests.get(url).content line.
Can you post more of the code?
You may want to look at the apply function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html.
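For example, here is a minimal sketch of that apply approach; the timeout value and the error handling are illustrative choices, not requirements:

import pandas as pd
import requests

def check_status(url, timeout=10):
    # Return the HTTP status code, or None if the request fails
    # (connection error, timeout, etc.). Note that URLs without a
    # scheme, like 'www.example.com', need an 'http://' prefix
    # before requests can fetch them.
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None

df = pd.DataFrame({"URL": ["http://www.example.com", "http://www.example.com/2"]})
df["New Status Code"] = df["URL"].apply(check_status)
print(df)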
Upvotes: 0