user10848
user10848

Reputation: 161

How to get domain name from URL with PHP?

I have 2 tables (url_feed and clean_domains). I'm trying to copy all of the data from url_feed into clean_domains and at the same time insert the domain from the url column.

Also, it will change the status from queued to complete in url_feed after it copies the row to clean_domains.

Here's what the 2 tables look like:

url_feed

id | url                                    | matches | networks                    | status
1  | http://whatever.com/example1.php       | 5       | Facebook::Twitter Inc       | queued
2  | http://www.example.com/other-stuff.php | 2       | MySpace::Facebook::Twitter  | queued
3  | http://www.test.com/random-text        | 12      | Instagram::Twitter          | queued

clean_domains

id | domain       | url                                     | matches | networks                    | status
1  | whatever.com | http://whatever.com/example1.php        | 5       | Facebook::Twitter Inc       | queued
2  | example.com  | http://www.example.com/other-stuff.php  | 2       | MySpace::Facebook::Twitter  | queued
3  | test.com     | http://www.test.com/random-text         | 12      | Instagram::Twitter          | queued

Here's my code:

<?php
$con=mysqli_connect("localhost","redacted","redacted","redacted");

mysqli_query($con,"INSERT INTO clean_domains
(id,domain,url,matches,networks)
  SELECT
    id,
    SUBSTRING_INDEX(REPLACE(REPLACE(REPLACE(REPLACE(url, 'http://', ''), 'https://', ''), 'http://www.', ''), 'https://www.', ''), '/', 1),
    url,
    matches,
    networks
  FROM url_feed
  WHERE status = 'queued'");

mysqli_query($con,"UPDATE url_feed
SET    status = 'complete'
WHERE status = 'queued' AND
id IN (SELECT id
FROM   clean_domains)");

mysqli_close($con);
?>

My code is working for 99% of domains, but I can't quite figure out how to make it work perfectly.

Here's the 3 times it seems to not work perfectly:

  1. Colons - http://example.com:88/testing - URLs like this are having their domains output as example.com:88 whereas I'd want it to be example.com.

  2. IP Addresses - http://188.123.44.12/test.php - For IPs it seems to be correctly entering the IP address into the database. In the example it would enter 188.123.44.12 as the domain - but I don't want that. I only want domain names, so if it's an IP, it shouldn't be copied over. It should be marked as complete in url_feed and move on to the next row.

  3. Subdomains - http://subdomain.whatever.example.com/test.html - This is being entered into the domain column as subdomain.whatever.example.com when I want it to be example.com instead.

The only way I can think of verifying if a domain being entered is truly a domain would be to run a whois query on each of them. If it doesn't come back as a valid one, it removed the first block of text. For example, it wouldn't get a valid result for subdomain.whatever.example.com so then it tries whatever.example.com and then it tries example.com until the result is valid or it skips it and marks the status column as complete.

Any ideas on what I can change to make this work properly?

Here's where I am right now with this:

$_url_string = 'https://testfewfew.dsd.google.co.uk/testing/whatever';
preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", parse_url($_url_string, PHP_URL_HOST), $_domain_tld);
echo $_domain_tld[0];

Upvotes: 3

Views: 2635

Answers (1)

Alex Andrei
Alex Andrei

Reputation: 7283

Just use built-in php function parse_url

You can filter a subdomain from a hostname like this

$url = 'http://subdomain.whatever.example.com/test.html';

$data = parse_url($url);

$host = $data['host'];

$hostname = explode(".", $host);
$domain = $hostname[count($hostname)-2] . "." . $hostname[count($hostname)-1];

print $domain;

Will output

example.com

If you have an url with a port, parse_url will deal with it easily, example

$url = 'http://example.com:88/testing';

$data = parse_url($url);

print_r($data);

Will output

Array
(
    [scheme] => http
    [host] => example.com
    [port] => 88
    [path] => /testing
)

And below you check if the hostname is a valid IP address or not

$url = 'http://188.123.44.12/test.php';

$data = parse_url($url);

print_r($data);

$hostIsIpAddress = ip2long($data['host']) !== false;

var_dump($hostIsIpAddress);

Which will output bool(true) or bool(false) respectively

Upvotes: 5

Related Questions