LordZeus

Reputation: 39

How do I delete entire lines in PowerShell?

I have a script that scrapes the raw HTML from a webpage. The output text file ends up with 17 lines at the top that I want to remove. How can I delete entire lines in PowerShell?

The generated lines are unique every time I run the script.

Current code:

$scrape = Invoke-Webrequest -uri "http://example.com/webpage"

$scrape.rawcontent | Out-File -FilePath C:\Users\outputlocation.txt -append

It then creates a file and gives me "stats" of the scraped webpage at the top of the file since it's the raw content. Deleting the first 17 lines would solve my issue.

Thanks!

Upvotes: 2

Views: 496

Answers (2)

JosefZ

Reputation: 30113

The following code snippet and its output show three things, one per output column:

  • (1st column) the number of "stats" lines (the status line and HTTP response headers in RawContent) is not always 17;
  • (2nd column) an HTML document is expected to start with a <!DOCTYPE> declaration, so cutting the raw content at that declaration ($scrapeContent) skips the "stats" lines; and
  • (3rd column) the computed $scrapeContent is identical to $scrape.Content.

The code:

$urls = @(
    "https://example.com",
    "https://www.iana.org/domains/reserved",
    "https://stackoverflow.com/questions/72561233",
    "https://stackoverflow.com/users/19112607/lordzeus"
)
if ( -not ( Get-Variable scrapes -ErrorAction SilentlyContinue )) {
    # computed conditionally to save time and resources while debugging
    $scrapes = $urls |
        ForEach-Object {
            [PSCustomObject]@{
                Url=$PSItem;
                HtmlWebResponseObj = Invoke-Webrequest -uri $PSItem
            }
        }
}
foreach ($aux in $scrapes) {
    $scrape = $aux.HtmlWebResponseObj
    $scrapeArray = $scrape.RawContent -split '\r?\n'
    # or           $scrape.RawContent -split [System.Environment]::NewLine
    $sDeclarationIndex = $scrapeArray.IndexOf(
        ($scrapeArray -match '^<!DOCTYPE')[0] )
    $scrapeContent = $scrape.RawContent.Substring(
        $scrape.RawContent.ToUpper().IndexOf('<!DOCTYPE'))
    # Write-Output
    (
        ( $sDeclarationIndex,
          ($scrapeContent -match '^<!DOCTYPE'),
          ($scrapeContent -eq $scrape.Content),
          $aux.Url
        ) -join ', '
    )
}

Output: .\SO\72561233.ps1

13, True, True, https://example.com
18, True, True, https://www.iana.org/domains/reserved
20, True, True, https://stackoverflow.com/questions/72561233
21, True, True, https://stackoverflow.com/users/19112607/lordzeus
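
The cut at the <!DOCTYPE> declaration can be wrapped in a small function. This is only a minimal sketch: the function name Get-HtmlBody and the sample string are made up for illustration, and the sample stands in for $scrape.RawContent so the snippet runs without a network call.

```powershell
# Hypothetical helper (not part of the code above): returns the part of a
# raw HTTP response text that starts at the <!DOCTYPE declaration.
function Get-HtmlBody {
    param([Parameter(Mandatory)][string]$RawContent)

    # Case-insensitive search, so '<!doctype html>' is found as well
    $start = $RawContent.IndexOf('<!DOCTYPE', [System.StringComparison]::OrdinalIgnoreCase)
    if ($start -lt 0) {
        # No declaration found: return the input unchanged rather than guess
        return $RawContent
    }
    return $RawContent.Substring($start)
}

# Made-up sample standing in for $scrape.RawContent
$sample = "HTTP/1.1 200 OK`r`nContent-Type: text/html`r`n`r`n<!doctype html>`r`n<html></html>"
$body = Get-HtmlBody -RawContent $sample
$body
```

With a real response, the result of Get-HtmlBody -RawContent $scrape.RawContent could be piped to Out-File. Since the third column above is True for every URL tested, writing $scrape.Content directly is likely the simpler option when that property is available.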

Upvotes: 1

Alex Korobchevsky

Reputation: 165

I suggest adding an intermediate cleanup step:

$scrape = Invoke-Webrequest -uri "http://example.com/webpage"
$scrape.RawContent | Out-File raw.txt
Get-Content raw.txt | Select-Object -Skip 17 |
    Out-File -FilePath C:\Users\outputlocation.txt -Append

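The raw content is written to a file first because the scrape object may not contain proper CR/LF line breaks; Get-Content then reads the file back one line at a time, so the first 17 lines can be skipped.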
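
If the intermediate file is not wanted, the same skip can probably be done in memory by splitting the raw text on line breaks first. This is a sketch rather than a tested replacement: the sample string and the $skip value are made up so the snippet runs on its own, and the pattern '\r?\n' is chosen to accept both CRLF and bare LF.

```powershell
# Made-up sample standing in for $scrape.RawContent:
# 3 leading lines (status line, one header, blank line), then 2 body lines
$raw = "HTTP/1.1 200 OK`r`nContent-Type: text/html`r`n`r`n<!DOCTYPE html>`r`n<html></html>"

# Number of leading lines to drop; 17 in the question, 3 for this sample
$skip = 3

# Split on CRLF or bare LF, then drop the leading lines
$lines = @($raw -split '\r?\n' | Select-Object -Skip $skip)

# Print the remaining lines; with a real response, pipe them to Out-File instead
$lines
```

Note that a fixed line count only works while the number of leading lines stays the same from run to run.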

Upvotes: 1
