Ross Lyons

Reputation: 171

Formatting output from Invoke-WebRequest in PowerShell

Information

I want to scrape a page on our local intranet where our HR team upload new starter information, and either hold that information in a usable format or export it to a CSV for another script to use.

Currently our service desk team manually check this intranet page and create the users one by one from the information our HR team enter.

Naturally, this is a very time-consuming task that could easily be automated. Unfortunately, our HR team are not open to any changes to the process at the moment because of other work they are focusing on. It's internal politics, so sadly they can't be convinced.

I have managed to use Invoke-WebRequest to get the content of the page, but the formatting is awful. It comes back as a load of HTML, and I'm working through multiple steps of splitting and string replacing. That doesn't feel optimal, and I suspect there is a better way to get the results I want.

Current Script

$webRequest = Invoke-WebRequest -Uri "http://intranet-site/HR/NewStarterList.php?action=ItToComp" -Headers @{"Header Info here"} -UseDefaultCredentials

$content = $webRequest.Content

$initialReplace = $content -replace '(?<=<).*?(?=>)', ' '
$split = $initialReplace -split "< >< >< >"
$split = $split -split "< >< >"
$split = $split -replace '< >',""
$split = $split[5..$($split.count)]

As you can see, this is not really ideal, and I'm wondering if there is a better way to grab just the information I need from the page.

The initial content returns as below (I have shortened and replaced any names to make it easy on the eye)

<html>
<head>
<title>New Starter List</title>
<link rel="STYLESHEET" type="text/css" href="/common/StyleSheet/Reports.css" /> <style> TD  {font-family: Verdana; font-size: 8pt; border-left: solid 0px black; border-right: solid 0px black;}    </style>
<script type="text/javascript" src="../../../cgi-bin/calendar/tableH.js"></script>
</head>
<body>
<img src="/common/images/logo.gif" border="0">
<br>
<br>
<b><span style="font-size: 12pt; font-variant: small-caps; ">New Starter List</span></b>
<br>Logged In As &quot;UserName&quot;<br>
<br>
<tableonMouseOver="javascript:trackTableHighlight(this.event,'FFFF66');"onMouseOut="javascript:highlightTableRow(0);" border="4" frame="border" width="80%" rules="none" cellspacing="6%" cellpadding="6%">
<th align="left">Date Started</th>
<th align="left">Name</th>
<th align="left">Initials</th>
<th align="left">Department</th>
<th align="left">Contact</th>
<th align="left">IT Completed?</th>
<th align="left">Supervisor Completed?</th>
<tr colspan="6"><td  align="left">25 Sep 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3117">Joe Bloggs</a></td>
<td  align="left">JXBL</td>
<td  align="left">Team A</td>
<td  align="left">Manager 1</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<tr colspan="6"><td  align="left">08 Jul 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3149">Harry Bloggs</a></td>
<td  align="left">HXBL</td>
<td  align="left">Team B</td>
<td  align="left">Manager 2</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<th align="left" colspan="7">72 starters</th>
</table>
</body>
</html>

After I run my splits and replaces, it looks like the below (again, names changed):

25 Sep 2019
Joe Bloggs
JXBL
Team 1
Manager 1
No
Yes
08 Jul 2019
Harry Bloggs
HXBL
Team 2
Manager 2
No
Yes
72 starters

The idea is then to be able to run with this information to automate our on-boarding process.

I feel like I am missing something obvious and that there is a neater or more efficient way to do this. This is the first time I've used Invoke-WebRequest, and I'm finding it troublesome as it is.

Expected Results

What I want is preferably an array of users with properties for each bit of info, like a CSV or a PSObject.

So when I call a variable holding the info, I want it to return something like the below:

StartDate         : 25 Sep 2019
Name              : Joe Bloggs
Initials          : JXBL
Department        : Team 1
Manager           : Manager 1
IT                : No
Supervisor        : No

StartDate         : 08 Jul 2019
Name              : Harry Smith
Initials          : HXSM
Department        : Team 2
Manager           : Manager 2
IT                : Yes
Supervisor        : No

Similar Questions

I only saw one question that looked like it may cover what I wanted, but it ended up being about needing a "try-catch" loop. Similar Question Link

Please let me know if you need any further information, or if you have any questions.

Thanks in advance for the help.

EDIT

Added in an expected results bit, as I realized this was missing.

Upvotes: 0

Views: 1660

Answers (1)

user2883951

The trick is to have something to denote the lines you want to keep.

In your sample above, the link stands out:

  <a href="NewStarterInfo.php?id=3117">

So, if you import the page as a single array, you can parse that array finding only lines that contain "NewStarterInfo.php" for example.
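A minimal sketch of that filter for the array case. The `$lines` array here is a hypothetical stand-in for the page content already split into one string per line, and the regular expression is just one way to pull out the link text:

```powershell
# Hypothetical sample: the page content already split into one string per line.
$lines = @(
    '<td  align="left">25 Sep 2019</td>',
    '<td  align="left"><a href="NewStarterInfo.php?id=3117">Joe Bloggs</a></td>',
    '<td  align="left">JXBL</td>',
    '<td  align="left"><a href="NewStarterInfo.php?id=3149">Harry Bloggs</a></td>'
)

# Keep only the lines that contain the link, and capture the text
# between the opening <a ...> tag and the closing </a>.
$names = foreach ($line in $lines) {
    if ($line -match 'NewStarterInfo\.php[^>]*>([^<]+)</a>') {
        $Matches[1]
    }
}

$names
```

Collecting the results in `$names` rather than writing them to the host means they can be reused later in the script.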

$a = @"
<html>
<head>
<title>New Starter List</title>
<link rel="STYLESHEET" type="text/css" href="/common/StyleSheet/Reports.css" /> <style> TD  {font-family: Verdana; font-size: 8pt; border-left: solid 0px black; border-right: solid 0px black;}    </style>
<script type="text/javascript" src="../../../cgi-bin/calendar/tableH.js"></script>
</head>
<body>
<img src="/common/images/logo.gif" border="0">
<br>
<br>
<b><span style="font-size: 12pt; font-variant: small-caps; ">New Starter List</span></b>
<br>Logged In As &quot;UserName&quot;<br>
<br>
<tableonMouseOver="javascript:trackTableHighlight(this.event,'FFFF66');"onMouseOut="javascript:highlightTableRow(0);" border="4" frame="border" width="80%" rules="none" cellspacing="6%" cellpadding="6%">
<th align="left">Date Started</th>
<th align="left">Name</th>
<th align="left">Initials</th>
<th align="left">Department</th>
<th align="left">Contact</th>
<th align="left">IT Completed?</th>
<th align="left">Supervisor Completed?</th>
<tr colspan="6"><td  align="left">25 Sep 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3117">Joe Bloggs</a></td>
<td  align="left">JXBL</td>
<td  align="left">Team A</td>
<td  align="left">Manager 1</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<tr colspan="6"><td  align="left">08 Jul 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3149">Harry Bloggs</a></td>
<td  align="left">HXBL</td>
<td  align="left">Team B</td>
<td  align="left">Manager 2</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<th align="left" colspan="7">72 starters</th>
</table>
</body>
</html>
"@

With $a set to the content of the page, loop through it.

foreach ($x in $a.Split("<"))  # Break the text at each "<" so every tag starts a new fragment.
{
    # Only the name link contains "NewStarterInfo.php"; the text after ">" is the person's name.
    if ($x.Contains("NewStarterInfo.php")) { Write-Host $x.Split(">")[1] }
}

This takes the whole page held in a single string variable (not an array), finds the fragments that contain a person's name, and displays each name.

If you actually have an array, then you can omit the .split("<") from the foreach statement.
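The same idea can be stretched to return objects with a property per column, which is closer to the expected results in the question. This is only a sketch under some assumptions: every data row has exactly seven `<td>` cells in the header order shown above, the property names are borrowed from the question's expected output (with "Contact" mapped to `Manager`), and `$html` is a shortened sample rather than the live page.

```powershell
# Shortened sample of the page content (the two rows from the question).
$html = @'
<tr colspan="6"><td  align="left">25 Sep 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3117">Joe Bloggs</a></td>
<td  align="left">JXBL</td>
<td  align="left">Team A</td>
<td  align="left">Manager 1</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<tr colspan="6"><td  align="left">08 Jul 2019</td>
<td  align="left"><a href="NewStarterInfo.php?id=3149">Harry Bloggs</a></td>
<td  align="left">HXBL</td>
<td  align="left">Team B</td>
<td  align="left">Manager 2</td>
<td  align="left">No</td>
<td  align="left">Yes</td></tr>
<th align="left" colspan="7">72 starters</th>
'@

# Take the inner HTML of every <td> cell, then strip nested tags such as <a>.
$cells = @(
    [regex]::Matches($html, '(?s)<td[^>]*>(.*?)</td>') |
        ForEach-Object { ($_.Groups[1].Value -replace '<[^>]+>', '').Trim() }
)

# Assumption: seven cells per row, in the same order as the table header.
$starters = for ($i = 0; $i + 6 -lt $cells.Count; $i += 7) {
    [pscustomobject]@{
        StartDate  = $cells[$i]
        Name       = $cells[$i + 1]
        Initials   = $cells[$i + 2]
        Department = $cells[$i + 3]
        Manager    = $cells[$i + 4]
        IT         = $cells[$i + 5]
        Supervisor = $cells[$i + 6]
    }
}

$starters
```

From there, `$starters | Export-Csv -Path .\starters.csv -NoTypeInformation` would write the rows to a CSV (the file name is only an example). If the page layout changes, the seven-cells-per-row assumption is the first thing to recheck.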

Upvotes: 1
