Břeťa
Břeťa

Reputation: 122

Extract href attribute from HTML text in Google Sheets

I have about 3000 rows in my Google Spreadsheet and each row contains data about one article from our website. In one column (e.g. A:A) is stored formated text in HTML. I need extract all URLs inside href="" attribute from this column and work with them later. (It could be array or text string separated with coma or space in B column)

I tryied to use REGEXTRACT formula but it gives me only the first result. Then I tryied to use REGEXREPLACE but I'm unable to write proper expression to get only URL links.

I know that it is not proper way to use regex to get anything from HTML. Is there another way to extract these values from HTML text in one cell?

Link to sample data: Google Spreadsheet

Thak you in advance! I'm real newbie here and in scripting, parsing etc. too.

Upvotes: 4

Views: 2761

Answers (1)

Tanaike
Tanaike

Reputation: 201388

How about this samples? I used href=\"(.*?)\" for retrieving the URL. The sample of regex101.com is here.

1. Using Google spreadsheets functions :

=TEXTJOIN(CHAR(10),TRUE,ARRAYFORMULA(IFERROR(REGEXEXTRACT(SPLIT(a1,">"),"href="&CHAR(34)&"(.*?)"&CHAR(34)))))

In this case, since REGEXEXTRACT retrieves only the first matched string, after the cell data is separated by SPLIT, the URL is retrieved by REGEXEXTRACT.

Result :

enter image description here

2. Using Google Apps Script :

function myFunction(str){
  var re  = /href=\"(.*?)\"/g;
  var result = "";
  while ((res=re.exec(str)) !== null) {
    result += res[1] + "\n";
  };
  return result.slice(0,-1);
}

This script can be used as a custom function. When you use this, please put =myFunction(A1) to a cell.

Result :

The result is the same to above method.

If I misunderstand your question, I'm sorry.

Upvotes: 6

Related Questions