Jaffer Wilson

Reputation: 7273

How to find out the HTTP status faster?

I have a 10 GB file that mostly contains URLs. I am trying to get the HTTP status code of each URL and store the results in another file with a .CSV extension.
I searched and found a way to get the status code of a single URL using Python:

import requests
request = requests.get('http://www.example.com')
print(request.status_code)

But this takes only one URL, and my file is much larger. I do not know how to feed the URLs from a file into this code, or how to store the output in .CSV format.
It is also not fast. I am looking for a solution that will give me results faster for the 10 GB file.
I tried the Ubuntu command also:

xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective},%{http_code}\n' < Input_File.txt > output.CSV

But it is also not multi-threaded. It takes a single line at a time and then stores the result in the CSV.
So, my question is: how can I make this work faster for a 10 GB file? If there is a solution in any programming language, I will be happy to implement it.
Here is a sample file of URLs - a small chunk from my 10 GB file:
https://drive.google.com/file/d/0BzQ6rtO2VN95c0YzclhySVZYNDQ/view?usp=sharing
I want to store the output in CSV as:

URL,Http Status code

For example:

http://google.com,200  
http://example.com,503  

I hope this helps to explain my query.

Upvotes: 2

Views: 498

Answers (1)

e4c5

Reputation: 53774

What curl can do, Python requests can often do, and do better. Like curl, it also supports the HEAD method:

import requests
# A HEAD request fetches only the headers, so the response body is never downloaded
response = requests.head('http://www.example.com')
print(response.status_code)
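
This only checks a single URL, though. As a minimal sketch of one way to scale it to the question's file (assuming one URL per line; Input_File.txt and output.csv are placeholder names taken from the question, and the timeout, batch size, and worker count are guesses you would tune), you can run the HEAD requests on a thread pool and stream the results into the CSV:

import csv
import itertools
import requests
from concurrent.futures import ThreadPoolExecutor

def check(url):
    # HEAD fetches only the headers; report a marker instead of crashing on bad URLs
    try:
        return url, requests.head(url, timeout=10).status_code
    except requests.RequestException:
        return url, 'ERROR'

with open('Input_File.txt') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL', 'Http Status code'])
    urls = (line.strip() for line in infile if line.strip())
    with ThreadPoolExecutor(max_workers=50) as pool:
        # Submit URLs in batches so the 10 GB file never sits in memory at once
        while True:
            batch = list(itertools.islice(urls, 1000))
            if not batch:
                break
            for url, status in pool.map(check, batch):
                writer.writerow([url, status])

Threads fit this job because it is almost entirely network I/O; the right worker count depends on your bandwidth and on how many concurrent requests the target servers will tolerate.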

Upvotes: 1
