Venom John

Reputation: 43

Extracting and replacing html link tag with regex

I am trying to do some HTML scraping with JavaScript and would like to take an a href link and convert it into a hyperlink in a Discord embed. I am having trouble with regex, which I am finding very difficult to learn. I assume I will also need another regex to capture the whole tag so I can replace it with my desired output?

This is an example raw html that I have:

An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>

To make this readable within a Discord embed, I am looking for this output:

An **example**, also known as a [**example type**](https://www.example.com/example%20type)

I have tried extracting the URL via regex, and I can match it. However, I am having trouble extracting both the link and the link text (I think it's called the target? The 'example type' in the example) and then replacing the string with my desired output. This is what I have so far (https://regexr.com/73574):

/href="[^"]+/g

This matches href="https://www.example.com/example%20type, which feels like a very early step: it includes 'href' in the match, and it does not capture the target.

EDIT: I apologise, I did not think about additional cases. What if the string has multiple links, with text after them? For example:

An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a> is the first example, and now I have <a href="https://www.example.com/second">second</a> example

with a desired output of:

An **example**, also known as a [**example type**](https://www.example.com/example%20type) is the first example, and now I have [**second**](https://www.example.com/second) example

Upvotes: 1

Views: 1710

Answers (3)

Venom John

Reputation: 43

Solution:

const input = 'An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a> first and second here <a href="https://www.example.com/no%20u">no u</a> and then done noice';
const output = input.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, '[**$2**]($1)')

console.log(output);

Regex breakdown:

  • <a href=" - Matches the opening <a href=" of the HTML tag
  • ([^"]+) - A capturing group that matches one or more characters that are not double quotes (the URL)
  • "> - Matches the closing double quote and the '>' that ends the opening tag
  • ([^<]+) - Another capturing group, matches one or more characters that are not a less-than symbol (the link text)
  • <\/a> - Matches the closing HTML tag

I then use the replace method seen in my output variable. It takes two arguments (regex, replaceWith). The first is the regex. The second, [**$2**]($1), refers to the capturing groups in the regex: $1 is the link from the HTML tag, and $2 is the link text (the 'target'; in my input variable, the first one is 'example type'). The only essential parts of the second argument are $2 and $1; the rest formats them the way I wanted, [**target**](link).
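As a variation on the same idea, replace also accepts a function as its second argument, which receives the full match followed by each capture group. The sketch below assumes the same simple anchor shape as above (href as the only attribute, no nested tags); the helper name anchorsToMarkdown and the sample string are made up for illustration.

```javascript
// Hypothetical helper: converts simple <a href="...">text</a> anchors
// into Discord-style Markdown links. Assumes href is the only attribute.
function anchorsToMarkdown(html) {
  // Same pattern as the solution; the g flag replaces every anchor.
  const anchorPattern = /<a href="([^"]+)">([^<]+)<\/a>/g;
  // The replacer gets the full match first, then group 1 (URL) and group 2 (text).
  return html.replace(anchorPattern, (_match, url, text) => `[**${text}**](${url})`);
}

const sample =
  'An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>' +
  ' is the first example, and now I have <a href="https://www.example.com/second">second</a> example';

console.log(anchorsToMarkdown(sample));
```

A function replacer is handy if the text or URL needs extra processing (trimming, escaping) before it goes into the embed; for plain substitution the '$2'/'$1' string form is equivalent.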

Upvotes: 2

AbsoluteZero

Reputation: 401

You can use regular expression groups to capture things that interest you. My regular expression here might be far from perfect but I don't think that's important here - it shows you a way and you can always improve it if needed.

Things you have to do:

  • prepare regex that captures groups that you need (anchor tag, anchor text, anchor url),
  • remove the anchor tag completely from the text
  • inject anchor text and anchor href into the final string

Here's a quick code example of that:

const anchorRegex = /(<a\shref="([^"]+)">(.+?)<\/a>)/i;
const textToBeParsed = `An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>`;

const parseText = (text) => {
  // Run the regex against the function argument, not the outer variable.
  const matches = anchorRegex.exec(text);

  if (!matches) {
    console.warn("Something went wrong...");

    return;
  }

  // matches[0] is the whole match; the three groups follow in order.
  const [, fullAnchorTag, anchorUrl, anchorText] = matches;

  // Swap the anchor tag for the Markdown link in place, so any text
  // after the link keeps its position. Only the first anchor is handled.
  return text.replace(fullAnchorTag, () => `[**${anchorText}**](${anchorUrl})`);
};

console.log(parseText(textToBeParsed));
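The example above handles only the first anchor it finds, because the regex has no g flag. For the multi-link case from the question's edit, one possible extension is to iterate over every match with String.prototype.matchAll. This is a sketch rather than part of the original answer; globalAnchorRegex and parseAllAnchors are hypothetical names.

```javascript
// Same pattern as above, minus the outer group, with the g flag that matchAll requires.
const globalAnchorRegex = /<a\shref="([^"]+)">(.+?)<\/a>/gi;

// Hypothetical helper: replaces every matched anchor with a Markdown link.
const parseAllAnchors = (text) => {
  let result = text;

  // Each match is [fullMatch, group1 (URL), group2 (link text)].
  for (const [fullAnchorTag, anchorUrl, anchorText] of text.matchAll(globalAnchorRegex)) {
    // A function replacer avoids special handling of "$" in the replacement.
    result = result.replace(fullAnchorTag, () => `[**${anchorText}**](${anchorUrl})`);
  }

  return result;
};

const twoLinks =
  'An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>' +
  ' is the first example, and now I have <a href="https://www.example.com/second">second</a> example';

console.log(parseAllAnchors(twoLinks));
```

Text without any anchors passes through unchanged, since the loop body never runs.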

Upvotes: 1

akash

Reputation: 587

Try this: (?<=href=")[^"]*

By using a lookbehind, you can check that the match is preceded by href=" without including that text in the match itself.

Demo: https://regex101.com/r/2qMnPt/1
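A short sketch of how this pattern could be used in JavaScript, assuming an engine that supports lookbehind (current Node.js and modern browsers do). Note that it extracts only the URLs, not the link text, so on its own it does not produce the Markdown replacement asked for. The variable names are illustrative.

```javascript
const html =
  'An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>' +
  ' and a <a href="https://www.example.com/second">second</a> example';

// With the g flag, match returns every URL; the lookbehind keeps href=" out of each result.
// match returns null when nothing is found, so fall back to an empty array.
const urls = html.match(/(?<=href=")[^"]*/g) ?? [];

console.log(urls);
```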

Upvotes: 1
