Reputation: 7273
I have a 10 GB file that mostly contains URLs. I am trying to get the HTTP status code of each URL and store the results in another file with a .csv extension.
I searched around and found a way to get the status code of a single URL using Python:
import requests
request = requests.get('http://www.example.com')
print(request.status_code)
But this handles only one URL at a time, and my file is far larger. I am not sure of the best way to feed the URLs from a file into this code, or how to store the output in CSV format.
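To show what I mean, a sequential version along these lines is roughly what I have in mind (just a minimal sketch, assuming one URL per line in the input file and using Python's csv module):

import csv
import requests

# Read one URL per line and write "URL,status" rows to the CSV
with open('Input_File.txt') as infile, open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL', 'Http Status code'])
    for line in infile:
        url = line.strip()
        if not url:
            continue
        try:
            # A HEAD request is enough for the status code and avoids downloading the body
            response = requests.head(url, allow_redirects=True, timeout=10)
            writer.writerow([url, response.status_code])
        except requests.RequestException:
            writer.writerow([url, 'error'])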
Even then it is not fast. I am looking for a faster solution that can get through the 10 GB file in a reasonable time.
I also tried this command on Ubuntu:
xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective},%{http_code}\n' < Input_File.txt > output.CSV
But it does not seem to be multi-threaded either; it appears to take a single line at a time and then write it to the CSV.
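In Python, something like the following is the kind of multi-threaded approach I am imagining (only a rough sketch, assuming requests plus a standard-library thread pool; I do not know whether this is the right way to handle a 10 GB file):

import csv
import requests
from concurrent.futures import ThreadPoolExecutor

def check(url):
    # Return the URL together with its HTTP status code, or 'error' if the request fails
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        return url, response.status_code
    except requests.RequestException:
        return url, 'error'

with open('Input_File.txt') as infile, open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL', 'Http Status code'])
    urls = (line.strip() for line in infile if line.strip())
    # 50 worker threads is only a guess; the right number depends on bandwidth and the remote servers
    # Note: pool.map submits all URLs up front, so a file this big would really need batching
    with ThreadPoolExecutor(max_workers=50) as pool:
        for url, status in pool.map(check, urls):
            writer.writerow([url, status])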
So, my question is: how can I make this run faster for a 10 GB file? If there is a solution in any programming language, I will be happy to implement it.
Here is a sample file of URLs - a small chunk of my 10 GB file:
https://drive.google.com/file/d/0BzQ6rtO2VN95c0YzclhySVZYNDQ/view?usp=sharing
I want to store the output in CSV as:
URL,Http Status code
For example:
http://google.com,200
http://example.com,503
I hope this helps to clarify my query.
Upvotes: 2
Views: 498