samash

Reputation: 23

Regex to match URL in Powershell

I am new to programming and PowerShell. I've put together the following script, which parses all the emails in a specified folder and extracts the URLs from them. The script uses a regex pattern to identify the URLs and writes them to a text file. That output is then run through another command where I am trying to remove the http:// or https:// portion (I need help figuring this out), and the results are placed into another text file, which I go through again to remove duplicates.

The main issue I am having is that the regex doesn't appear to extract the URLs correctly. What I get is something like the example I have created below:

The URL is http://www.dropbox.com/3jksffpwe/asdj.exe, but I end up getting:

dropbox.com/3jksffpwe/asdj.exe
dropbox.com 
drop  
dropbox

The script is

#Adjust paths to location of saved Emails
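# Two separate wildcard paths; a single quoted string containing a comma is treated as one path
$in_files = 'C:\temp\*.eml', 'C:\temp\*.msg'
$out_file = 'C:\temp\Output.txt'
$Working_file = 'C:\temp\working.txt'
$Parsed_file = 'C:\temp\cleaned.txt'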

# Removes the old output file from earlier runs.
if (Test-Path $Parsed_file) {
  Remove-Item $Parsed_file
}

# Regex to parse each email and extract the URLs to a text file
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $in_files -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $out_file

# Parses the extracted URLs to strip out the http or https portion
Get-Content $out_file | ForEach-Object { $_.SubString(7) } | Out-File $Working_file


# Parses the list again to remove exact duplicates
$set = @{}
Get-Content $Working_file | % {
    if (!$set.Contains($_)) {
        $set.Add($_, $null)
        $_
    }
} | Set-Content $Parsed_file


#Removes the files no longer required  
Del $out_file, $Working_file  

#Confirms if the email messages should be removed  
$Response = Read-Host "Do you want to remove the old messages? (Y|N)"  

If ($Response -eq "Y") {del *.eml, *.msg}

#Opens the output file in notepad  
Notepad $Parsed_file  

Exit   

Thanks for any help

Upvotes: 2

Views: 12776

Answers (2)

user5156318

Reputation: 41

Try this RegEx:

(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)

But remember that PowerShell's -match operator captures only the first match. To capture all matches, you could do something like this:

$txt = "https://test.com, http://tes2.net, http:/test.com, http://test3.ro, text, http//:wrong.value"
$hash = @{}
$txt | Select-String -AllMatches '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)' | % { $hash."Valid URLs" = $_.Matches.Value }
$hash
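
As a rough sketch of how this could slot into the rest of the script from the question: the snippet below pulls every match out of a couple of sample lines with `[regex]::Matches`, strips a leading http:// or https:// with `-replace` (so the prefix length does not matter), and drops duplicates with `Sort-Object -Unique`. The sample text and the names `$lines`, `$pattern`, `$urls` and `$cleaned` are made up for illustration, and sorting the result is an assumption about what is acceptable; it is not part of the original script.

```powershell
# Sample input standing in for the lines of the saved emails (made up for illustration).
$lines = @(
    'Get it from http://www.dropbox.com/3jksffpwe/asdj.exe today',
    'Mirrors: https://test.com, http://tes2.net, https://test.com'
)

# The pattern from this answer: a scheme, then ://, then everything up to whitespace or a comma.
$pattern = '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)'

# [regex]::Matches returns every match in a string, not just the first one.
$urls = foreach ($line in $lines) {
    [regex]::Matches($line, $pattern) | ForEach-Object { $_.Value }
}

# Remove a leading http:// or https://, then keep one copy of each remaining string.
$cleaned = $urls | ForEach-Object { $_ -replace '^https?://', '' } | Sort-Object -Unique

$cleaned
```

Anchoring the replacement with `^` means only a scheme at the very start of each extracted URL is removed, which avoids the fixed offset of `SubString(7)` cutting one character too few from https:// URLs.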

Best of luck! Enjoy!

Upvotes: 4

nitishagar

Reputation: 9413

A regex for matching URLs can look like this:

(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Check for more info here.
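
One practical wrinkle: this pattern contains single quotes, double quotes and a backtick, so it is awkward to type as an ordinary PowerShell string literal. A possible approach, sketched below, is a single-quoted here-string, which takes everything between `@'` and `'@` literally. The sample sentence and the names `$pattern`, `$text` and `$found` are invented for illustration, and the script file is assumed to be saved in an encoding that preserves the non-ASCII quote characters (for example UTF-8 with a BOM when running under Windows PowerShell).

```powershell
# Single-quoted here-string: nothing inside is expanded or needs escaping.
$pattern = @'
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
'@

# Made-up sample text; note the sentence-ending period right after the URL.
$text = 'Download http://www.dropbox.com/3jksffpwe/asdj.exe.'

# Collect the text of every match found in the sample.
$found = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }

$found
```

In this sample the period that ends the sentence is left out of the match, because the last character class in the pattern excludes trailing punctuation.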

Upvotes: 3
