kei

Reputation: 20511

Extract and update URLs in HTML string through powershell

I have this string (hundreds of them actually) containing URLs and I would like to update them.

Here's the old URL format
http://oldDomain/a/b/document.aspx?p1=v1&p2=NEEDED_VALUE&morePsHere=moreVsHere

and here's what I need them to look like after the update
http://newDomain/c/d/NEEDED_VALUE

Pretty much all I need to do is extract the value of p2 from the old URL and append it to http://newDomain/c/d/ to create the new URL.

I assumed the string I was going to get would look like this:

$s = "http://oldDomain/a/b/document.aspx?p1=v1&p2=001&morePsHere=moreVsHere,
      http://oldDomain/a/b/document.aspx?p1=v1&p2=002&morePsHere=moreVsHere,
      http://oldDomain/a/b/document.aspx?p1=v1&p2=003&morePsHere=moreVsHere"

and I was able to update it using the following:

$newURLStart = "http://newDomain/c/d/"
$newStr = $null
$s.Split(",") | ForEach {
  if ($_.IndexOf("p2=") -ne -1)
  {
    $neededValue = $_.Substring($_.IndexOf("p2=")+3)
    if ($neededValue.IndexOf("&") -ne -1)
    {
      $neededValue = $neededValue.Substring(0,$neededValue.IndexOf("&"))
    }
    $newStr = $newStr + ", " + $newURLStart + $neededValue
  }
}
$newStr = $newStr.TrimStart(", ")
$s = $newStr

BUT, it turns out that the string I'm going to get isn't plaintext and would actually look something like:

$s = '<div class="someClass"><p>SomeText</p><ul>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
      </ul></div>'

This is a bit more complex than the comma-delimited string I expected! I need help updating my script to handle it. I'm thinking regex might come into play here to grab the URLs inside the href attributes, but I'm pretty much a noob when it comes to that.

Upvotes: 2

Views: 4015

Answers (3)

npinti

Reputation: 52185

If you put all the strings in a file, you could do something like this:

Get-Content "testregex.html" | % {$_ -replace 'href=".+?;.+?=(.+?)&amp;(.+?)"', 'href="http://newDomain/c/d/$1"'} | Set-Content "newtestregex.html"

Takes as input this file:

<div class="someClass"><p>SomeText</p><ul>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
      </ul></div>

Yields:

<div class="someClass"><p>SomeText</p><ul>
      <li><a href="http://newDomain/c/d/001">LINK ONE</a></li>
      <li><a href="http://newDomain/c/d/002">LINK TWO</a></li>
      <li><a href="http://newDomain/c/d/003">LINK THREE</a></li>
      </ul></div>
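
If the HTML is already in a variable rather than a file, the same idea should work directly on the string. Here is a minimal sketch, assuming the $s from the question; the pattern and the $pattern / $updated names are my own variation, which looks for p2 by name instead of relying on it being the second parameter:

```powershell
# Sample input in the shape shown in the question.
$s = '<div class="someClass"><p>SomeText</p><ul>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
      </ul></div>'

# Match an href that points at oldDomain, capture the p2 value by name,
# then consume the rest of the attribute up to its closing quote.
# [?&;] covers "?p2=", "&p2=" and the HTML-encoded "&amp;p2=".
$pattern = 'href="http://oldDomain/[^"]*?[?&;]p2=([^&"]+)[^"]*"'

# -replace works on the whole multi-line string and replaces every match.
$updated = $s -replace $pattern, 'href="http://newDomain/c/d/$1"'
$updated
```

Links that do not contain a p2 parameter are left untouched, since the pattern simply does not match them.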

Upvotes: 1

Vish

Reputation: 2164

I simplified your input somewhat, but here it is. (BTW please please store this regex in a post-it next to your desk - it helps me again and again! :) )

I make the following assumptions:

  • that the input URL is present only within <li> tags
  • that the URI always contains the arguments (p1 and p2)

Code:

# Here's the input.
# I assume you can figure out how to extract the <li> tags from your input

$ip = '<li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
'

# loop through each line.
$ip -split "`n" | foreach {

        $_ -match "(?<=p2=).*(?=&amp;)"
        $matches
        # now insert the logic to put the regex match into your destination URL
} 

More info on the regex used (and a web result):

  • The -match operator puts the regex match in an automatic variable called $matches.
  • In the above code, $matches is updated for each line of the string that matches.
  • The lookbehind (?<=p2=) and the lookahead (?=&amp;) tell PowerShell to look for a match that is bounded by the expressions p2= and &amp; without including them. Whatever sits between them is your value.

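Here's the output for $matches. The last 003 shows up twice because the final line of $ip is empty: -match fails on it, and $matches keeps its previous value.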

Name                           Value
----                           -----
0                              001
0                              002
0                              003
0                              003
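
To fill in the "insert the logic" step, something along these lines could work. It is only a sketch under the same assumptions as above; $newUrlStart and $newUrls are names I made up for illustration, and I switched to a lazy .*? so the match stops at the first &amp; after p2:

```powershell
# Same simplified input as above.
$ip = '<li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
'

$newUrlStart = 'http://newDomain/c/d/'

# Split on LF or CRLF, and only build a URL when the line actually matched.
# Testing -match inside the if avoids reusing a stale $Matches value.
$newUrls = foreach ($line in ($ip -split '\r?\n')) {
    if ($line -match '(?<=p2=).*?(?=&amp;)') {
        $newUrlStart + $Matches[0]
    }
}

$newUrls
```

Because the empty trailing line never enters the if block, each value is emitted once.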

Upvotes: 1

carlpett

Reputation: 12603

You can make this a bit easier by using PowerShell's excellent XML capabilities, provided the HTML fragment is well-formed XML (as your sample is). First, convert your string into XML: $xmlData = [xml] $s. Now, we can simply navigate it using properties: $xmlData.div.ul.li.a.href will go into the HTML you got, and automatically expand into collections as needed:

PS C:\Users\carlpett> $xmlData.div.ul.li.a.href
http://oldDomain/a/b/document.aspx?p1=v1&p2=001&morePsHere=moreVsHere
http://oldDomain/a/b/document.aspx?p1=v1&p2=002&morePsHere=moreVsHere
http://oldDomain/a/b/document.aspx?p1=v1&p2=003&morePsHere=moreVsHere

Now, it's just a simple regex to do the actual replacement: $xmlData.div.ul.li.a.href -replace 'http:\/\/oldDomain\/.+p2=([^&]+).+','http://newDomain/c/d/$1'

So, wrapping it up:

$xmlData = [xml] $s
$xmlData.div.ul.li.a.href -replace 'http:\/\/oldDomain\/.+p2=([^&]+).+','http://newDomain/c/d/$1'
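
The snippet above only prints the new URLs. If you need them written back into the HTML string, a sketch like the following might do; it assumes the fragment parses as XML, and the XPath query and the loop are my additions rather than anything required by the approach:

```powershell
# Sample input in the shape shown in the question.
$s = '<div class="someClass"><p>SomeText</p><ul>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=001&amp;morePsHere=moreVsHere">LINK ONE</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=002&amp;morePsHere=moreVsHere">LINK TWO</a></li>
      <li><a href="http://oldDomain/a/b/document.aspx?p1=v1&amp;p2=003&amp;morePsHere=moreVsHere">LINK THREE</a></li>
      </ul></div>'

$xmlData = [xml] $s

# Visit every <a> element that has an href, wherever it sits in the tree.
foreach ($a in $xmlData.SelectNodes('//a[@href]')) {
    # The XML parser has already decoded &amp; to &, so match on a plain &.
    $old = $a.GetAttribute('href')
    $new = $old -replace '^http://oldDomain/.*[?&]p2=([^&]+).*$', 'http://newDomain/c/d/$1'
    $a.SetAttribute('href', $new)
}

# Serialize the modified document back into the string.
$s = $xmlData.OuterXml
$s
```

Note that the serialized string may not keep the original indentation and line breaks, so compare the links rather than the raw text.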

Upvotes: 1
