Reputation: 1486
I have a file with a million URLs. The data file looks like this:
http://wonderland.cjfallon.ie/
http://www.youtube.com/
http://www.starfall.com/
http://education.scholastic.co.uk/
http://www.scoilnet.ie/
http://www.nessy.com/
http://www.senteacher.org/
http://scoop.it/
http://www.moviemaker.com/
http://learni.st/
http://www.twitter.com/
http://www.facebook.com/
http://www.gutenberg.org/
http://www.gutenberg.org/cache/epub/42361/pg42361.txt
I want to crawl them. The bottleneck is network I/O, so I want to use multiple threads or gevent to handle it.
My multithreaded code works well: https://gist.github.com/young001/5449751
But the gevent version (https://gist.github.com/young001/baa3eebbf7342c5ac077) always goes wrong:
status is 200
status is 200
Internal error in evhttp
the url is down http://web2.socialcomputingmagazine.com/the_social_graph_issues_and_strategies_in_2008.htm
the reason
status is 200
status is 200
status is 200
status is 200
status is 200
status is 200
status is 301
status is 200
status is 301
status is 200
status is 200
Internal error in evhttp
And then it stalls. I don't know why this happens.
Any help?
It seems everything should work, but it doesn't, and it's driving me crazy.
Upvotes: 0
Views: 139
Reputation: 9515
I can reproduce it here after fixing up your sample.
Basically, this seems to be a gevent bug: it sometimes reports "Internal error in evhttp".
The source code says:
# sometimes this happens, don't know why
sys.stderr.write("Internal error in evhttp\n")
You'll have to either debug that or use something else, or just retry when it fails.
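For the "just retry" option, here is a minimal sketch of a retry wrapper. It assumes the failure shows up in your code as an exception raised by whatever function does the request, which I have not verified for the evhttp error. The names fetch_with_retry, fetch, retries, delay and sleep are made up for illustration and are not part of gevent.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on any exception.

    `fetch` is a hypothetical callable that performs one request and
    raises on failure. `sleep` is injectable so a cooperative sleep
    can be passed in when running inside greenlets.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # Back off a little longer after each failed attempt,
            # but do not wait after the final one.
            if attempt < retries:
                sleep(delay * attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Under gevent you would probably want to pass sleep=gevent.sleep (unless time is monkey-patched) so the wait does not block the other greenlets. Note that this only helps with requests that fail; it does not address the stall itself.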
Upvotes: 1