Reputation: 39
I have a script that scrapes the raw HTML off a webpage. The output text file has 17 lines at the top that I want removed. How would one delete entire lines in PowerShell?
The generated lines are unique every time I run the script.
Current code:
$scrape = Invoke-WebRequest -Uri "http://example.com/webpage"
$scrape.RawContent | Out-File -FilePath C:\Users\outputlocation.txt -Append
It then creates a file and gives me "stats" of the scraped webpage at the top of the file since it's the raw content. Deleting the first 17 lines would solve my issue.
Thanks!
Upvotes: 2
Views: 496
Reputation: 30113
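The following code snippet (and its output) shows the line index of the <!DOCTYPE> declaration and how to skip the "stats" lines ($scrapeContent). It also shows that $scrapeContent does not differ from $scrape.Content. The index varies from page to page, so the number of lines to skip is not fixed.
The code: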
$urls = @(
    "https://example.com",
    "https://www.iana.org/domains/reserved",
    "https://stackoverflow.com/questions/72561233",
    "https://stackoverflow.com/users/19112607/lordzeus"
)
if ( -not ( Get-Variable scrapes -ErrorAction SilentlyContinue )) {
    # computed conditionally to save time and resources while debugging
    $scrapes = $urls |
        ForEach-Object {
            [PSCustomObject]@{
                Url = $PSItem
                HtmlWebResponseObj = Invoke-WebRequest -Uri $PSItem
            }
        }
}
foreach ($aux in $scrapes) {
    $scrape = $aux.HtmlWebResponseObj
    $scrapeArray = $scrape.RawContent -split '\r?\n'
    # or $scrape.RawContent -split [System.Environment]::NewLine
    $sDeclarationIndex = $scrapeArray.IndexOf(
        ($scrapeArray -match '^<!DOCTYPE')[0] )
    $scrapeContent = $scrape.RawContent.Substring(
        $scrape.RawContent.ToUpper().IndexOf('<!DOCTYPE'))
    # Write-Output
    (
        (   $sDeclarationIndex,
            ($scrapeContent -match '^<!DOCTYPE'),
            ($scrapeContent -eq $scrape.Content),
            $aux.Url
        ) -join ', '
    )
}
Output: .\SO\72561233.ps1
13, True, True, https://example.com
18, True, True, https://www.iana.org/domains/reserved
20, True, True, https://stackoverflow.com/questions/72561233
21, True, True, https://stackoverflow.com/users/19112607/lordzeus
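Because the number of "stats" lines differs per page, a fixed skip count is fragile. As a rough sketch, and assuming the "stats" are the HTTP status line and headers, separated from the HTML by the first blank line, the raw text can be split once at that blank line. The sample string and the variable names $rawContent, $headerBlock, $body and $skipCount are made up for illustration; in the real script $rawContent would be $scrape.RawContent.

```powershell
# Hypothetical sample standing in for $scrape.RawContent (no network needed).
$rawContent = "HTTP/1.1 200 OK`r`nContent-Type: text/html`r`nServer: example`r`n`r`n<!doctype html>`r`n<html><body>Hi</body></html>"

# Split once at the first blank line:
# the first part is the header block, the second part is the page body.
$headerBlock, $body = $rawContent -split '\r?\n\r?\n', 2

# Number of "stats" lines, counting the blank separator line as well.
$skipCount = ($headerBlock -split '\r?\n').Count + 1

$body
```

Since the output above reports $scrapeContent as equal to $scrape.Content for every URL tested, writing $scrape.Content to the file instead of $scrape.RawContent may be all that is needed.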
Upvotes: 1
Reputation: 165
I suggest adding an intermediate cleanup step:
$scrape = Invoke-WebRequest -Uri "http://example.com/webpage"
$scrape.RawContent | Out-File raw.txt
Get-Content raw.txt | Select-Object -Skip 17 |
    Out-File -FilePath C:\Users\outputlocation.txt -Append
The raw content is written to a file first because the scrape object may not contain the proper CR/LF combinations between lines.
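If the temporary file is unwanted, one possible variation is to split the string in memory on either LF or CRLF and skip the leading lines there. This is only a sketch: the sample text and the names $rawContent, $skip and $lines are hypothetical, and the real script would use $scrape.RawContent and the 17 from the question.

```powershell
# Hypothetical stand-in for $scrape.RawContent: 3 unwanted lines, then the page.
# Line endings are deliberately mixed (LF and CRLF).
$rawContent = "stat 1`nstat 2`r`nstat 3`r`n<html>`r`n</html>"
$skip = 3   # the question uses 17

# Split on LF or CRLF, then drop the first $skip lines.
$lines = $rawContent -split '\r?\n' | Select-Object -Skip $skip

$lines
# To write the result instead of displaying it:
# $lines | Out-File -FilePath C:\Users\outputlocation.txt -Append
```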
Upvotes: 1