Reputation: 15
I'm trying to write a script that will grab the Fortune 100 URLs from here, put those into an array, and then write a runspace that uses Invoke-WebRequest
to get the content of those URLs and writes that content to a file. This is the code that I have so far:
#Importing Modules
Import-Module PoshRSJob
#variable declaration
$page = Invoke-WebRequest https://www.zyxware.com/articles/4344/list-of-fortune-500-companies-and-their-websites
$links = $page.Links
$tables = @($page.ParsedHtml.GetElementsByTagName("TABLE"))
$tableRows = $tables[0].Rows
#loops through the table to get only the top 100 urls.
$urlArray = @()
foreach ($tablerow in $tablerows) {
$urlArray += New-Object PSObject -Property @{'URLName' = $tablerow.InnerHTML.Split('"')[1]}
#Write-Host ($tablerow.innerHTML).Split('"')[1]
$i++
if ($i -eq 101) {break}
}
#Number of Runspaces to use
#$RunspaceThreads = 1
#Declaring Variables
$ParamList = @($urlArray)
$webRequest = @()
$urlArray | start-rsjob -ScriptBlock {
#$webRequest = (Invoke-WebRequest $using:ParamList)
#Invoke-WebRequest $urlArray
#Invoke-WebRequest {$urlArray}
#Get-Content $urlArray
}
The problem that I'm running into right now is that I can't get Invoke-WebRequest or Get-Content to give me the contents of the URLs that are actually contained in the array. You can see that in the script block, I commented out some lines that didn't work.
My question is: using a runspace, what do I need to do to pull the data from all the URLs in the array using Get-Content, and then write that to a file?
Upvotes: 0
Views: 1829
Reputation: 1427
You can adjust your current loop to collect the first 100 company URLs. The version below skips the empty entry at the front, which has no URL. Consider using [PSCustomObject] @{ URLName = $url }, which replaces the legacy New-Object PSObject.
$urlArray = @()
$i = 0
foreach ($tablerow in $tablerows) {
$url = $tablerow.InnerHTML.Split('"')[1]
if ($url) {
# Only add an object when the url exists
$urlArray += [PSCustomObject] @{ URLName = $url }
$i++
if ($i -eq 100) {break}
}
}
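To see why the `if ($url)` check matters, here is a minimal sketch of what `Split('"')[1]` returns for two kinds of rows. The two innerHTML strings are made-up samples, not copied from the page, and the sketch assumes strict mode is off (the PowerShell default), where an out-of-range array index yields `$null` instead of an error.

```powershell
# Hypothetical innerHTML strings; the real rows on the page may be laid out differently.
$dataRow   = '<td>1</td><td>Walmart</td><td><a href="http://www.walmart.com">www.walmart.com</a></td>'
$headerRow = '<th>Rank</th><th>Company</th><th>Website</th>'

# The text between the first pair of double quotes is the href value.
$dataUrl = $dataRow.Split('"')[1]

# No quotes means Split returns a single element, so index 1 is out of range
# and evaluates to $null when strict mode is off.
$headerUrl = $headerRow.Split('"')[1]

# Filtering on truthiness drops the $null, which is what `if ($url)` does in the loop.
$rows = @($dataRow, $headerRow)
$kept = @($rows | ForEach-Object { $_.Split('"')[1] } | Where-Object { $_ })
```

Only the data row survives the filter, so the counter in the loop advances only for rows that actually carry a URL.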
To run the requests in parallel, pass a script block to Start-RSJob; Invoke-WebRequest then runs once per piped element, each in its own runspace job. Note that in this example $_ refers to the current array element from the pipeline, which is an object with a URLName property. Be a little careful which variables you use inside the script block: it does not run in the caller's scope, so variables defined outside it might not be resolved the way you expect.
# Run the webrequests in parallel
# $_ refers to a PSCustomObject with the @{ URLName = $url } property
$requests = ($urlArray | start-rsjob -ScriptBlock { Invoke-WebRequest -Uri $_.URLName })
You can then wait for all the jobs to complete and post-process the results. Here only the length of each page's content is written, because the pages themselves are lengthy.
# Get the results
# $_.Content.Length gets the length of the content to not spam the output with garbage
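# Wait-RSJob blocks until the jobs have finished, so Receive-RSJob sees complete output
$result = Get-RSJob | Wait-RSJob | Receive-RSJob | ForEach-Object { $_.Content.Length }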
Write-Host $result
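The question also asks how to write each page to a file, which the snippet above does not cover. One possible approach, as a rough sketch: compute a target path for every URL up front, then let each job save its page with the -OutFile parameter of Invoke-WebRequest. The names ConvertTo-SafeFileName, Save-UrlContent, and the OutFile property are hypothetical helpers introduced for illustration, and the file-naming rule is an arbitrary choice.

```powershell
# Hypothetical helper: derive a file name from a URL.
function ConvertTo-SafeFileName {
    param([Parameter(Mandatory)][string]$Url)
    # Drop the scheme, then replace anything that is not a letter, digit, dot, or hyphen.
    $name = ($Url -replace '^https?://', '') -replace '[^A-Za-z0-9.-]', '_'
    return $name.Trim('_') + '.html'
}

# Hypothetical wrapper: start one runspace job per URL and save each page under $OutDir.
function Save-UrlContent {
    param(
        [Parameter(Mandatory)][object[]]$UrlArray,
        [Parameter(Mandatory)][string]$OutDir
    )
    Import-Module PoshRSJob

    # Use the absolute path, since a job's working directory may differ from the caller's.
    $dir = (New-Item -ItemType Directory -Path $OutDir -Force).FullName

    # Attach the target path to each object so the script block only needs $_.
    $work = foreach ($item in $UrlArray) {
        [PSCustomObject]@{
            URLName = $item.URLName
            OutFile = Join-Path $dir (ConvertTo-SafeFileName -Url $item.URLName)
        }
    }

    $jobs = $work | Start-RSJob -ScriptBlock {
        Invoke-WebRequest -Uri $_.URLName -OutFile $_.OutFile -UseBasicParsing
    }

    # Wait for every job, surface anything the jobs emitted, then clean up.
    $jobs | Wait-RSJob | Receive-RSJob
    $jobs | Remove-RSJob
}
```

With the array from earlier, a call might look like `Save-UrlContent -UrlArray $urlArray -OutDir .\pages`. Passing the path as a property of the piped object avoids having to reference outer variables inside the script block.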
Upvotes: 1