stackoverflow

Reputation: 409

How to parse HTML tags in Matlab using regexp?

I'm short on time and specifically want to extract a string like the one below. The problem is that the tag carries attributes, so it isn't of the simple form <a> data </a>.

Given

s = '<em style="font-size:medium"> 5,888 </em>'

how can I extract just 5,888 in MATLAB?

Upvotes: 4

Views: 4784

Answers (2)

stackoverflow

Reputation: 409

Thanks, folks, for your help. I'm basically trying to get the population of a US county in MATLAB. I thought I'd share my code, though it's not the most elegant. It might help some soul. :)

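county = 'morris';
state = 'ks';

% Encode spaces so the county name is valid inside the query string.
county = strrep(county, ' ', '+');
str = sprintf('https://www.google.com/search?&q=population+%s+%s', county, state);
s = urlread(str);   % urlread is not recommended in recent releases; webread is the newer alternative

% Capture the text of the first <em ...>...</em> element only.
pop = regexp(s, '<em[^>]*>(.*?)</em>', 'tokens', 'once');
pop = strrep(char(pop), ',', '');   % drop thousands separators
pop = str2double(pop);              % NaN if no number was found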
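
To check the pattern without a network request, it can be run against the sample string from the question. This is a minimal sketch, not part of the original answer; the variable names tok and value are illustrative, and it assumes the number sits inside a single <em> element.

```matlab
% Sample string taken from the question.
s = '<em style="font-size:medium"> 5,888 </em>';

% [^>]* skips any attributes inside the opening tag;
% (.*?) lazily captures everything up to the first closing </em>.
tok = regexp(s, '<em[^>]*>(.*?)</em>', 'tokens', 'once');

% Trim the padding, remove the thousands separator, convert to a number.
value = str2double(strrep(strtrim(tok{1}), ',', ''));
% value is 5888
```

If the page contains no <em> element, tok comes back empty and tok{1} raises an error, so a real script would want an isempty(tok) check first.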

Upvotes: 3

Rody Oldenhuis

Reputation: 38032

You will find useful info here, or here, or here, all of which are google-first-page results and would have been faster than asking a question here.

Anyway, quick-'n-dirty way: You can filter on the <> symbols:

>> s = '<em style="font-size:medium"> 5,888 </em> <sometag> test </sometag>'    
>> a = regexp(s, '[<>]');    
>> s( cell2mat(arrayfun(@(x,y)x:y, a(2:2:end-1)+1, a(3:2:end)-1, 'uni',false)) )

ans = 

   5,888 test
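
The one-liner above is dense, so here is the same idea unrolled into a loop. This is only a sketch: the names starts, stops and out are illustrative, and it assumes well-formed tags with no stray < or > characters in the text itself.

```matlab
s = '<em style="font-size:medium"> 5,888 </em> <sometag> test </sometag>';

% Positions of every '<' and '>' in the string, in order of appearance.
a = regexp(s, '[<>]');

% a(2), a(4), ... are the '>' characters that close a tag;
% a(3), a(5), ... are the '<' characters that open the next tag.
% The visible text lies strictly between each such pair.
starts = a(2:2:end-1) + 1;
stops  = a(3:2:end)   - 1;

% Concatenate the text segments found between consecutive tags.
out = '';
for k = 1:numel(starts)
    out = [out, s(starts(k):stops(k))]; %#ok<AGROW>
end
```

The result keeps the whitespace that surrounded each piece of text, so out still needs strtrim or similar before further use.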

Or, slightly more robust and much cleaner, replace every tag (the angle brackets and everything between them) with the empty string:

>> s = regexprep(s, '<.*?>', '')
s = 

   5,888 test
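
Stripping the tags leaves the padding that was around each piece of text. A possible follow-up, sketched here with an illustrative variable name txt, is to collapse the whitespace and trim the ends:

```matlab
s = '<em style="font-size:medium"> 5,888 </em> <sometag> test </sometag>';

txt = regexprep(s, '<.*?>', '');    % remove every tag (lazy match, one tag at a time)
txt = regexprep(txt, '\s+', ' ');   % collapse runs of whitespace to a single space
txt = strtrim(txt);                 % drop leading and trailing whitespace
% txt is '5,888 test'
```

The lazy quantifier matters here: a greedy '<.*>' would match from the first < to the last > and delete the text along with the tags.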

Upvotes: 3
