Reputation: 187
Can Matlab eliminate the path in URL and leave only the domain part? Does Matlab have any function to eliminate the path behind?
Let's say, example 1:
input :http://www.mathworks.com/help/images/removing-noise-from-images.html
output :http://www.mathworks.com
Upvotes: 2
Views: 379
Reputation: 30589
This regexp
pattern should do the trick:
>> str = 'http://www.mathworks.com/help/images/removing-noise-from-images.html';
>> out = regexp(str,'\w*://[^/]*','match','once')
out =
'http://www.mathworks.com'
The search pattern '\w*://[^/]*'
says look for a string that starts with some "word" characters ('\w*
) corresponding to the protocol (e.g. http, https, rtsp), followed by the ubiquitous ://
, and then any number of characters that are not a forward slash ([^/]*
).
Edit: The 'once'
option should eliminate a nested cell
.
UPDATE: just the hostname, allowing inputs with no protocol.
>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
'https://www.mathworks.com/help/matlab/ref/[email protected]';
'google.com/voice'}
>> out = regexp(str,'([^/]*)(?=/[^/])','match','once')
out =
'www.mathworks.com'
'www.mathworks.com'
'google.com'
UPDATE 2: regexp
madness!
>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
'https://www.mathworks.com/help/matlab/ref/[email protected]';
'google.com/voice';
'http://monkey.org/';
'stackoverflow.com/';
'meta.stackoverflow.com'};
>> out = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once')
out =
'http://www.mathworks.com'
'https://www.mathworks.com'
'google.com'
'http://monkey.org'
'stackoverflow.com'
'meta.stackoverflow.com'
% hostname.m
function hostnames = hostname(str)
hostnames = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once');
Upvotes: 2
Reputation: 221624
Code:
function output_url = domain_name(input_url)
c1 = strfind(input_url,'//');
ind1 = strfind(input_url,'/');
if isempty(c1) && isempty(ind1)
output_url = input_url; % For case like - www.mathworks.com
return;
end
if ~isempty(c1)
if numel(ind1)>2
output_url = input_url(1:ind1(3)-1); % For cases like - http://www.mathworks.com/ or http://www.mathworks.com/something/
else
output_url = input_url; % For case like - http://www.mathworks.com
end
else
output_url = input_url(1:ind1(1)-1); % For cases like - www.mathworks.com/ or www.mathworks.com/something/
end
return;
Example runs:
%% Long URLs with extensions
disp(domain_name('www.mathworks.com/help/images/removing-noise-from-images.html'))
disp(domain_name('http://www.mathworks.com/help/images/removing-noise-from-images.html'))
%% Short URLs without HTTP://
disp(domain_name('www.mathworks.com'))
disp(domain_name('www.mathworks.com/'))
%% Short URLs with HTTP://
disp(domain_name('http://www.mathworks.com'))
disp(domain_name('http://www.mathworks.com/'))
Return:
www.mathworks.com
http://www.mathworks.com
www.mathworks.com
www.mathworks.com
http://www.mathworks.com
http://www.mathworks.com
An alternative method and probably efficient one would be to use REGEXP, but apparently I prefer numbers.
Edit 1: If you prefer to use bunch of URLs at the sametime, you may use a cell array. Obviously, the output would be a cell array too. Look at the following MATLAB script to get a feel of it -
% Input
in_urls_cell = [{'http://mathworks.com/'},{'mathworks.com/help/matlab/ref/strcmpi.html'},{'mathworks.com/help/matlab/ref/[email protected]'}];
% Get domain name
out_urls_cell = cell(size(in_urls_cell));
for count = 1:numel(in_urls_cell)
out_urls_cell(count)={domain_name(cell2mat(in_urls_cell(count)))};
end
% Display only domain name
for count = 1:numel(out_urls_cell)
disp(cell2mat(out_urls_cell(count)));
end
The above script returns -
http://mathworks.com
mathworks.com
mathworks.com
Upvotes: 2