Reputation: 7074
We have a directory of 3000+ HTML files that are migrating to a SharePoint site, and we need to scrub some of the data.
Specific situations:
1. The files contain the header <?xml version="1.0" encoding="utf-8"?> that SharePoint doesn't like. We plan to just delete that header line.
2. The files link to foo1.htm or foo.htm. We want to change both to an absolute link of http:\\sharepoint.site\home.aspx.
3. The files contain an addButton(...) call that we want to replace with ''.
Here's my function so far:
function scrubXMLHeader {
    # Counts of all files vs. the .htm files selected for scrubbing
    $srcfiles = Get-ChildItem $backupGuidePath -Filter "*.htm"
    $srcfilecount = (Get-ChildItem $backupGuidePath).Count
    $selfilecount = $srcfiles.Count
    # Input and Output Path variables
    $sourcePath = $backupGuidePath
    $destinationPath = $workScrubPath
    "Input From: $($sourcePath)" | Log $messageLog -echo
    " Output To: $($destinationPath)" | Log $messageLog -echo
    #
    $temp01 = Get-ChildItem $sourcePath -Filter "*.htm"
    foreach ($file in $temp01)
    {
        # Join-Path handles the separator whether or not $destinationPath ends with one
        $outfile = Join-Path $destinationPath $file.Name
        # Keep every line except the XML declaration
        $content = Get-Content $file.FullName | ? { $_ -notmatch "<\?xml[^>]+>" }
        Set-Content -Path $outfile -Force -Value $content
    }
}
I want to add the following two edits to each document:
-replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
-replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', ''
I'm not sure how to combine those into a single statement so that I open each file, perform all the changes, then save and close it, instead of three separate open-edit-save/close transactions. I'm also not sure, with all the quotes and commas, of the best way to escape these characters, or whether the single quotes surrounding the whole string are sufficient.
Understanding that "asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system" but that "it's sometimes appropriate to parse a limited, known set of HTML", and being limited in my toolset to PowerShell, I'm trying to understand the best way to add the two -replace lines to the existing $content variable. Separated by commas within the curly braces? Piped to each other?
Is the following the best strategy, or is there something better?
$content = Get-Content $file.Fullname | ? {$_ -notmatch "<\?xml[^>]+>",
-replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")',
-replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', '' }
Upvotes: 4
Views: 4954
Reputation: 68341
If I'm reading the question correctly, I think this might do what you want:
# -notmatch and -replace take regular expressions, so the literal strings are escaped
$Regex0 = '<\?xml[^>]+>'
$Regex1 = [regex]::Escape('("foo.htm", "", ">", "Home", "foo1.htm")')
$Replace1 = '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
$Regex2 = [regex]::Escape('addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");')
foreach ($file in $temp01)
{
    $outfile = Join-Path $destinationPath $file.Name
    # Drop the XML header line, then apply both replacements to the remaining lines
    @(Get-Content $file.FullName) -notmatch $Regex0 -replace $Regex1, $Replace1 -replace $Regex2, '' |
        Set-Content -Path $outfile -Force
}
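On the escaping part of the question: -notmatch and -replace treat their pattern as a regular expression, so single quotes only stop PowerShell from expanding the string. The parentheses, dots and question marks inside it are still regex metacharacters. Here is a minimal sketch for checking the chained operators without touching any files. The $lines array is made-up sample data, and the addLink name in it is hypothetical, not taken from the real documents:

```powershell
# Hypothetical sample lines standing in for the content of one HTML file.
$lines = @(
    '<?xml version="1.0" encoding="utf-8"?>',
    'addLink("foo.htm", "", ">", "Home", "foo1.htm")',
    'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");',
    '<p>unchanged</p>'
)

# [regex]::Escape turns a literal string into a pattern that matches it exactly.
$Regex0   = '<\?xml[^>]+>'
$Regex1   = [regex]::Escape('("foo.htm", "", ">", "Home", "foo1.htm")')
$Replace1 = '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
$Regex2   = [regex]::Escape('addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");')

# On an array, -notmatch filters out matching elements and -replace rewrites each element.
$result = $lines -notmatch $Regex0 -replace $Regex1, $Replace1 -replace $Regex2, ''
$result
```

Replacing the addButton call with '' leaves an empty line where it used to be. If the whole line should disappear instead, a second -notmatch with $Regex2 in place of the last -replace would drop it, assuming nothing else on that line needs to be kept.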
Upvotes: 2