DeepBlue

Reputation: 694

Security of fetching URL content in PHP

I am concerned about the safety of fetching content from an unknown URL in PHP.

We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, so that we can show the links as content cards.

Because the url is provided by the user, I am worried about the possibility of getting malicious code in the process.

I have another question: does curl_exec actually download the full file to the server? If yes then is it possible that viruses or malware be downloaded when using curl?

Upvotes: 3

Views: 1338

Answers (5)

Luddig

Reputation: 2819

Expanding on the answer made by Ray Radin.

Tips on precautionary measures

He is correct that if you use a sound process to search the fetched resource, there should be no problem in fetching whatever URL is provided. Some examples of precautions:

  • Don't store the file in a public-facing directory on your web server. That exposes you to the file being executed.
  • Don't store it in a database without care; this might lead to a second-order SQL injection attack.
  • In general, don't store anything from the resource you are requesting. If you have to, use a specific whitelist of what you are searching for.

Check the header information

Even though there is no foolproof way of validating what you are requesting from a specific URL, there are ways to make your life easier and prevent some potential issues.

For example, a URL might point to a large binary, a large image file or something similar.

Make a HEAD request first to get the header information. Then look at the Content-Type and Content-Length headers to see whether the content is an HTML document of reasonable size.

You should not trust these headers, however, since they can be spoofed. Checking them will still make sure that ordinary non-malicious content, such as a huge download, won't crash your script. Requesting image files is presumably something you don't want users to do anyway.
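A minimal sketch of that HEAD pre-check with cURL might look like the following. The function names, the 1 MB limit and the timeout values are my own assumptions for illustration, not something from the question. The decision logic is kept in a separate pure function so it can be checked without a network connection.

```php
<?php
// Hypothetical helper: decide from (untrusted) header values whether a full
// fetch is worth attempting. $contentLength < 0 means "not reported".
function looksLikeSmallHtml(?string $contentType, float $contentLength, int $maxBytes = 1048576): bool
{
    // Content-Type may carry parameters, e.g. "text/html; charset=UTF-8".
    if ($contentType === null || stripos(trim($contentType), 'text/html') !== 0) {
        return false;
    }
    // An unknown length is accepted here; the real size must be enforced
    // while downloading, because the header can be missing or wrong.
    return $contentLength < 0 || $contentLength <= $maxBytes;
}

// Hypothetical wrapper: sends a HEAD request and applies the check above.
function headLooksLikeHtml(string $url, int $maxBytes = 1048576): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD instead of GET
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $ok     = curl_exec($ch) !== false;
    $type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if unknown

    return $ok && looksLikeSmallHtml(is_string($type) ? $type : null, (float) $length, $maxBytes);
}
```

Treat a `true` result only as "worth trying"; some servers answer HEAD differently from GET, so the limits still have to be enforced on the actual download.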

Guzzle

I recommend using Guzzle for your requests, since in my opinion it provides some functionality that should make this easier.

Upvotes: 1

stack reader

Reputation: 167

It is safe, but you will need to properly validate the data before using it, as you should with any input data anyway.

Upvotes: 0

Gopikrishna Mallik

Reputation: 21

You can use httpclient.class instead of file_get_contents or curl, because it connects to the page through a socket. After downloading the data you can extract the meta data using preg_match.

Upvotes: 1

anwerj

Reputation: 2488

The short answer is that file_get_contents is safe for retrieving data, and so is curl. It is up to you what you do with that data.
A few guidelines:
1. Never run eval on that data.
2. Don't save it to a database without filtering.
3. Better yet, don't use file_get_contents or curl at all.

Use: get_meta_tags

array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');

You will get all meta tags parsed and filtered into an array.

Upvotes: 1

Rei

Reputation: 6363

Using cURL is similar to using fopen() and fread() to fetch content from a file. Whether it is safe depends on what you're doing with the fetched content.

From your description, your server works as some kind of intermediary that extracts specific subcontent from the fetched HTML. Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server.

Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say), everything else that is not what you're looking for in the fetched content is ignored, which means your users are automatically protected.

Thus, in my opinion, there is no need to worry. Of course, this relies on the assumption that the content extraction process is sound. Someone should take a look at it and confirm it.
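To make "a sound extraction process" a bit more concrete, here is a rough sketch using PHP's DOM extension. The function name `extractOpenGraph`, the list of allowed properties and the 300-byte cap are assumptions of mine, not taken from the question. The idea is to keep only whitelisted tags, accept only http/https for URL-type values, and HTML-escape everything before it can reach a page.

```php
<?php
// Hypothetical helper: returns only whitelisted Open Graph properties,
// already escaped for output in HTML.
function extractOpenGraph(string $html, array $allowed = ['og:title', 'og:description', 'og:image', 'og:url']): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = strtolower(trim($meta->getAttribute('property')));
        if (!in_array($property, $allowed, true) || isset($tags[$property])) {
            continue; // not on the whitelist, or already seen
        }
        $content = trim($meta->getAttribute('content'));
        if (in_array($property, ['og:image', 'og:url'], true)) {
            // Reject javascript:, data: and other non-web schemes.
            $scheme = strtolower((string) parse_url($content, PHP_URL_SCHEME));
            if ($scheme !== 'http' && $scheme !== 'https') {
                continue;
            }
        }
        // The value is attacker-controlled: cap its length and escape it.
        $tags[$property] = htmlspecialchars(substr($content, 0, 300), ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    }
    return $tags;
}
```

Because the values come back escaped, they can be echoed into the card markup directly; if you store them instead, it is probably cleaner to store the raw value and escape at output time.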

does curl_exec actually download the full file to the server?

It depends on what you mean by "full file". If you mean "the entire HTML content", then yes. If you mean "including all the CSS and JS files that the fetched HTML content may refer to", then no.
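Since curl_exec will by default read the whole response, however large, one option is to cap the download yourself with a write callback. This is only a sketch: the names `makeCappedWriter` and `fetchHtmlCapped`, the 1 MB limit and the timeouts are assumptions for illustration.

```php
<?php
// Hypothetical factory: builds a CURLOPT_WRITEFUNCTION callback that
// appends to $buffer and refuses any chunk that would exceed $maxBytes.
function makeCappedWriter(string &$buffer, int $maxBytes): callable
{
    return function ($ch, string $chunk) use (&$buffer, $maxBytes): int {
        if (strlen($buffer) + strlen($chunk) > $maxBytes) {
            return 0; // a return value other than strlen($chunk) aborts the transfer
        }
        $buffer .= $chunk;
        return strlen($chunk);
    };
}

// Hypothetical wrapper: returns the body, or null on failure or oversize.
function fetchHtmlCapped(string $url, int $maxBytes = 1048576): ?string
{
    $body = '';
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_WRITEFUNCTION  => makeCappedWriter($body, $maxBytes),
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    // With a write callback set, curl_exec() returns true or false only.
    return curl_exec($ch) === false ? null : $body;
}
```

With this in place an oversized response makes the transfer fail, so the script never holds more than the limit in memory.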

is it possible that viruses or malware be downloaded when using curl?

The answer is yes. The fetched HTML content may contain malicious code; however, if you don't execute it, no harm will come to you.

Again, I'm assuming that your content extraction process is sound.

Upvotes: 8
