Reputation: 265

How to remove duplicate domains from a URL list using javascript

I am stuck on a rather simple problem: removing duplicate domains from a list of URLs using JavaScript.

Here's what I am currently doing: I have an array called 'list' that holds the URLs. I extract the domain from each URL and put the domains in a new array called 'domain'.

Then I use two for loops to go through the entire list and check for duplicate domains. If the domains match, I splice the duplicate one out. But it seems to be removing too many, and I am pretty sure I am doing something wrong. Can somebody tell me what I am doing wrong, or suggest a simpler/better way of doing it?

for (i=0; i<list.length; i++) {

    for (j=i+1; j<list.length; j++) {

        if (domain[i] == domain[j]) {

            console.log('REMOVING:');
            console.log(i + '. ' + list2[i]);
            console.log(j + '. ' + list2[j]);
            console.log(domain[i]);
            console.log(domain[j]);

            list.splice(j,1);

        }
    }
}

This is not a 'how to remove duplicates from an array' question. I have a list of URLs and need to check for, and remove, only the URLs whose 'domains' are duplicates. So if I have 4 URLs from youtube, I need to keep only the first one and remove the rest.

Upvotes: 1

Answers (6)

rodneyrehm

Reputation: 13557

Try to get rid of the domains array. Instead build a map of already "used" domains:

var urls = [
  'http://example.org/page-1.html',
  'http://example.org/page-2.html',
  'http://google.com/search.html',
  'http://mozilla.com/foo.html',
];

var domains = {};
var uniqueUrls = urls.filter(function(url) {
  // whatever function you're using to parse URLs
  var domain = extractDomain(url);
  if (domains[domain]) {
    // we have seen this domain before, so ignore the URL
    return false;
  }
  // mark domain, retain URL
  domains[domain] = true;
  return true;
});

console.log(uniqueUrls);
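
The snippet above leaves `extractDomain` up to the reader. As a minimal sketch, assuming an environment that provides the standard `URL` constructor (modern browsers and Node.js), it could look like the following. Stripping a leading `www.` is an assumption about what counts as "the same domain", not something the question specifies, and the sample URLs are made up for illustration.

```javascript
// Hypothetical helper: returns the hostname of an absolute URL with a
// leading "www." removed (an assumption, adjust to your own rules).
function extractDomain(url) {
  try {
    return new URL(url).hostname.replace(/^www\./, '');
  } catch (e) {
    // Not a parseable absolute URL: fall back to the raw string,
    // so the entry is treated as its own "domain" and kept.
    return url;
  }
}

var urls = [
  'http://example.org/page-1.html',
  'http://www.example.org/page-2.html',
  'http://google.com/search.html',
  'http://mozilla.com/foo.html'
];

// Object.create(null) avoids clashes with inherited keys such as "constructor".
var seen = Object.create(null);
var uniqueUrls = urls.filter(function (url) {
  var domain = extractDomain(url);
  if (seen[domain]) {
    return false; // domain already seen, drop this URL
  }
  seen[domain] = true; // first URL for this domain, keep it
  return true;
});

console.log(uniqueUrls);
```

Because `filter` walks the array in order, the first URL seen for each domain is the one that survives, which matches the "keep only the first one" requirement in the question.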

Upvotes: 2

Kalman

Reputation: 8131

If you are able to use the Underscore.js library, it's as simple as

yourArray = _.uniq(yourArray);

http://underscorejs.org/#uniq

Upvotes: 0

Nielsvh

Reputation: 1219

The best way to remove duplicates is to use a map. The example has an array of URIs with some duplicates. First insert the strings into an object, then iterate over the object to create an array. Boom, no duplicates.

function getHostName(url) {
    var match = url.match(/:\/\/(www[0-9]?\.)?(.[^/:]+)/i);
    if (match != null && match.length > 2 && typeof match[2] === 'string' && match[2].length > 0) {
        return match[2];
    }
    else {
        return null;
    }
}

var uris = ["http://foo.org/barbar","http://www.bar.com/foo/bar/bar.html","http://foo.bar/lorem/","http://foo.org","https://bar.bar","http://foo.org","http://bar.bar"];
var urisObj = {};
for(var i = 0;i<uris.length;i++){
  urisObj[getHostName(uris[i])] = getHostName(uris[i]);
}

uris = Object.keys(urisObj).map(function(x) { return urisObj[x];});

console.log(uris);

Edit:

Using http://www.primaryobjects.com/2012/11/19/parsing-hostname-and-domain-from-a-url-with-javascript/ to get the host name from a string.

Upvotes: 0

Kalman

Reputation: 8131

If you want to keep your original approach (or something very similar to it), iterate down the array (with i--) instead of up the array (with i++), as in the following code:

var list = ["abc", "cba", "abc", "abc", "abc", "abc"];

for (var i = list.length - 1; i >= 0; i--) {

  for (var j = i-1; j >= 0; j--) {

    if (list[i] == list[j]) {

        console.log('REMOVING:');
        console.log(i + '. ' + list[i]);
        console.log(j + '. ' + list[j]);
        console.log(list[i]);
        console.log(list[j]);

        list.splice(i, 1);

    }
  }
}

console.log(list);
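
For comparison, the forward loop from the question can also be kept. The over-removal there appears to come from two things: `list` is spliced while the parallel `domain` array is not, so the two drift out of step, and `j` is not stepped back after a removal, so the element that shifts into position `j` is skipped. A sketch that addresses both, using made-up sample data with the domains hard-coded:

```javascript
var list = [
  'http://youtube.com/a',
  'http://example.org/x',
  'http://youtube.com/b',
  'http://youtube.com/c',
  'http://example.org/y'
];
// Parallel array of domains, as in the question (sample values, hard-coded here).
var domain = ['youtube.com', 'example.org', 'youtube.com', 'youtube.com', 'example.org'];

for (var i = 0; i < list.length; i++) {
  for (var j = i + 1; j < list.length; j++) {
    if (domain[i] === domain[j]) {
      // Remove the duplicate from BOTH arrays so their indexes stay aligned.
      list.splice(j, 1);
      domain.splice(j, 1);
      // The next element has shifted into position j, so re-check that index.
      j--;
    }
  }
}

console.log(list);
```

This stays quadratic in the number of URLs, so for long lists a lookup object or a Set is likely the better fit.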

Upvotes: 0

kemiller2002

Reputation: 115508

You can let an object handle the checking for you.

var a = [];

a.push('http://test');
a.push('http://that');
a.push('http://that');
a.push('http://that');

var o = {};

for(var ii = 0; ii < a.length; ii++){
    // only the key matters; a repeated URL just overwrites the same key
    o[a[ii]] = true;
}

var nA = [];

for (var k in o) {
    nA.push(k);
}

Upvotes: 0

Rob M.

Reputation: 36511

ES5: filter the array and keep an item only if its index equals the index of its first occurrence in the array:

list.filter(function(elem, pos, arr) {
   return arr.indexOf(elem) === pos;
});

ES6: use a Set

const uniqueDomains = [ ...new Set(list) ];

or if you can't use the spread operator:

Array.from(new Set(list))
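
The snippets above compare whole strings, so two different URLs on the same domain are both kept. As a sketch of how the Set idea could be keyed on the domain instead, assuming the standard `URL` constructor is available and every entry is a valid absolute URL; `uniqueBy` is a hypothetical helper name and the sample URLs are made up:

```javascript
// Keep the first item for each distinct key returned by keyFn.
function uniqueBy(items, keyFn) {
  const seen = new Set();
  return items.filter((item) => {
    const key = keyFn(item);
    if (seen.has(key)) {
      return false; // key already seen, drop this item
    }
    seen.add(key);
    return true;
  });
}

const sample = [
  'https://youtube.com/watch?v=1',
  'https://youtube.com/watch?v=2',
  'https://example.org/',
  'https://youtube.com/watch?v=3'
];

// new URL(...) throws on relative or malformed input; this sketch assumes valid URLs.
const uniqueByDomain = uniqueBy(sample, (url) => new URL(url).hostname);

console.log(uniqueByDomain);
```

Passing the key function in separately keeps the de-duplication logic independent of how a "domain" is defined, so the hostname rule can be swapped without touching `uniqueBy`.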

Upvotes: 3
