Reputation: 122
I have about 3000 rows in my Google Spreadsheet and each row contains data about one article from our website. In one column (e.g. A:A) is stored formated text in HTML. I need extract all URLs inside href=""
attribute from this column and work with them later. (It could be array or text string separated with coma or space in B column)
I tryied to use REGEXTRACT
formula but it gives me only the first result. Then I tryied to use REGEXREPLACE
but I'm unable to write proper expression to get only URL links.
I know that it is not proper way to use regex to get anything from HTML. Is there another way to extract these values from HTML text in one cell?
Link to sample data: Google Spreadsheet
Thak you in advance! I'm real newbie here and in scripting, parsing etc. too.
Upvotes: 4
Views: 2761
Reputation: 201388
How about this samples? I used href=\"(.*?)\"
for retrieving the URL. The sample of regex101.com is here.
=TEXTJOIN(CHAR(10),TRUE,ARRAYFORMULA(IFERROR(REGEXEXTRACT(SPLIT(a1,">"),"href="&CHAR(34)&"(.*?)"&CHAR(34)))))
In this case, since REGEXEXTRACT
retrieves only the first matched string, after the cell data is separated by SPLIT
, the URL is retrieved by REGEXEXTRACT
.
function myFunction(str){
var re = /href=\"(.*?)\"/g;
var result = "";
while ((res=re.exec(str)) !== null) {
result += res[1] + "\n";
};
return result.slice(0,-1);
}
This script can be used as a custom function. When you use this, please put =myFunction(A1)
to a cell.
The result is the same to above method.
If I misunderstand your question, I'm sorry.
Upvotes: 6