Nick

Reputation: 1268

User Submitted Pages

I am creating a web application where users will be allowed to submit URL links to various pieces of content. Pretty standard. The site then follows the URL and downloads its contents. It dawned on me quite quickly that this is a potential security concern: the user could easily link me to an enormous image or just junk data, and obviously I don't want to tie up all my bandwidth downloading it. So I have some general web programming questions.

  1. How much can I trust the HTTP headers they send me? Presumably, the entire thing could be a lie. Can I rely on the Content-Length header, or could it easily be fabricated? What about MIME types?

  2. With question 1 in mind, does this mean it is best practice to treat everything as a stream, download it in chunks, and then just abort the process after we have exceeded a certain data limit? If so, what would be an appropriate limit if I am downloading single images and average HTML pages?

  3. Somewhat off topic, but which HTTP status codes are generally accepted as good (basically, ones where I would give my application the go-ahead to fetch the body in chunks)? Any besides 200?

Can anyone recommend a decent book (ideally online) that covers this type of information, preferably in Python or just language-agnostic?

Thank you!

Upvotes: 0

Views: 41

Answers (1)

Kombajn zbożowy

Reputation: 10703

  1. Yes, the entire response can be a lie. An HTTP server should adhere to the protocol, but you can never be sure that a malicious one won't send you corrupted data.

  2. Right, you should abort either when the actual content runs longer than the declared Content-Length or when it exceeds a hard threshold. For the limit value, you need to experiment. Here is some research on this. Maybe 5 MB per web page would be a good start (see the sketch after this list).

  3. You could possibly follow redirects (301), but apart from that, stick to 200 only.
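Here is a minimal sketch of all three points using the requests library. The function name, the 5 MB cap, and the chunk size are illustrative choices, not fixed recommendations:

    import requests

    MAX_BYTES = 5 * 1024 * 1024  # 5 MB cap, per the suggestion above
    CHUNK_SIZE = 64 * 1024       # read 64 KB at a time

    def fetch_limited(url):
        # stream=True defers the body download until we iterate over it.
        # requests follows redirects (301/302) by default, so the final
        # status code is what we check here -- point 3.
        with requests.get(url, stream=True, timeout=10) as resp:
            if resp.status_code != 200:
                raise ValueError(f"unexpected status: {resp.status_code}")

            # Point 1: treat Content-Length as a hint only -- it may be
            # missing or an outright lie.
            declared = resp.headers.get("Content-Length")
            if declared and declared.isdigit() and int(declared) > MAX_BYTES:
                raise ValueError("declared size over limit")

            # Point 2: count the bytes actually received and abort once
            # the threshold is crossed, regardless of what the headers said.
            body = bytearray()
            for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                body.extend(chunk)
                if len(body) > MAX_BYTES:
                    raise ValueError("download exceeded limit, aborting")
            return bytes(body)

The same logic applies to MIME types: the Content-Type header is just another claim, so if the type matters (e.g. images only), sniff the first bytes of the body you actually received rather than trusting the header.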

Upvotes: 2
