Reputation: 2731
I am trying to batch up rows from a large json doc, and insert them into SQL Server.
The following code runs, but it inserts only one row at a time into the SQL Server table; I think it is actually only every 1000th row.
add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int] $id
    [long] $id64
    [string] $name
    [EliteCoords] $coords
    [string] $date
}

$dt = New-Object system.data.datatable
$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)

while ($reader.Read()) {
    $cnt = 0
    $dt.Clear()
    while ($cnt -le 1000) {
        $cnt = $cnt + 1
        if ($reader.TokenType -eq 'StartObject') {
            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])
            $dr = $dt.NewRow()
            $dr["ID"] = $row.id
            $dr["ID64"] = $row.id64
            $dr["Name"] = $row.name
            $dr["Coordsx"] = $row.coords.x
            $dr["Coordsy"] = $row.coords.y
            $dr["Coordsz"] = $row.coords.z
            $dr["DiscoveryDate"] = $row.date
            $dt.Rows.Add($dr)
        }
    }
    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
}
$stream.Close()
I know the problem is that the if block is skipped on all but the first iteration of the inner while loop, because the token type changes from StartObject to StartArray.
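For reference, here is roughly how I confirmed what the reader sees; it is just a minimal token-dump sketch, assuming (as in my file) a top-level JSON array of objects:

# Same Newtonsoft assembly load as above.
add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"

# Dump the first few tokens so the sequence is visible:
# StartArray first, then StartObject / PropertyName / ... / EndObject for each record.
$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
$shown  = 0
while ($reader.Read() -and $shown -lt 20) {
    Write-Host ("{0,-14} {1}" -f $reader.TokenType, $reader.Value)
    $shown++
}
$stream.Close()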
I tried putting an additional reader loop inside, but of course that reads the whole file.
I also tried reading the array rather than the object, but that fails because of the nested JSON.
How should I structure the loops, so that I can batch up and process 1000 rows?
Upvotes: 2
Views: 442
Reputation: 2731
The answer was to replace the inner counting loop with a break, and to add an outer loop that continues until the end-of-stream marker is hit.
add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
#install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int] $id
    [long] $id64
    [string] $name
    [EliteCoords] $coords
    [string] $date
}

$dt = New-Object system.data.datatable
$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)

while ($stream.EndOfStream -eq $false) {
    $cnt = 0
    $dt.Clear()
    while ($reader.Read()) {
        if ($reader.TokenType -eq 'StartObject') {
            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])
            $dr = $dt.NewRow()
            $dr["ID"] = $row.id
            $dr["ID64"] = $row.id64
            $dr["Name"] = $row.name
            $dr["Coordsx"] = $row.coords.x
            $dr["Coordsy"] = $row.coords.y
            $dr["Coordsz"] = $row.coords.z
            $dr["DiscoveryDate"] = $row.date
            $dt.Rows.Add($dr)
            $cnt = $cnt + 1
        }
        if ($cnt -gt 9999) {break}
    }
    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt -Timeout 0
}
$stream.Close()
Upvotes: 0
Reputation: 116980
Your problem is that you are flushing and clearing the table for every call to $reader.Read() -- i.e., every row.
Instead, you need to accumulate rows until you reach 1000, and then flush:
$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
try {
    $serializer = [Newtonsoft.Json.JsonSerializer]::CreateDefault()
    while ($reader.Read()) {
        # If the reader is positioned at the start of an object then accumulate a row.
        if ($reader.TokenType -eq 'StartObject') {
            $row = $serializer.Deserialize($reader, [EliteSystem])
            $dr = $dt.NewRow()
            $dr["ID"] = $row.id
            $dr["ID64"] = $row.id64
            $dr["Name"] = $row.name
            $dr["Coordsx"] = $row.coords.x
            $dr["Coordsy"] = $row.coords.y
            $dr["Coordsz"] = $row.coords.z
            $dr["DiscoveryDate"] = $row.date
            $dt.Rows.Add($dr)
        }
        # If we have accumulated 1000 rows, flush them.
        if ($dt.Rows.Count -ge 1000) {
            write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
            $dt.Clear()
        }
    }
    # Flush any remaining rows.
    if ($dt.Rows.Count -gt 0) {
        write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
        $dt.Clear()
    }
}
finally {
    $reader.Close()
    $stream.Close()
}
Notes:
You should probably dispose of the StreamReader and JsonTextReader in a finally block in the event of an exception. For a pretty way to do this, see How to implement using statement in powershell?.
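As a rough sketch of that idea, one way to write such a helper is shown below (the Use-Object function is hand-rolled for illustration; it is not a built-in cmdlet or part of the SqlServer or Newtonsoft packages):

# Hand-rolled "using"-style helper for illustration only.
function Use-Object {
    param (
        [Parameter(Mandatory)] [Object] $InputObject,
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock
    )
    try {
        . $ScriptBlock
    }
    finally {
        if ($InputObject -is [System.IDisposable]) {
            $InputObject.Dispose()
        }
    }
}

# Disposing the JsonTextReader also closes the underlying StreamReader,
# because JsonReader.CloseInput defaults to true.
Use-Object ($reader = [Newtonsoft.Json.JsonTextReader]::new((Get-Item c:\code\systemsWithCoordinates.json).OpenText())) {
    while ($reader.Read()) {
        # ... accumulate rows and flush in batches, as above ...
    }
}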
Allocating the serializer only once should improve performance a little, at no cost.
Without an example of the JSON file we can't say whether there are additional problems with the EliteSystem data model used for each row. For instance, if the JSON file is actually a jagged 2d array, this might not work.
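For reference, the code above expects each array element to look roughly like the record below. This shape is reconstructed from the EliteSystem class, not taken from the actual file, so treat it as an assumption; a quick way to sanity-check the model is to deserialize a hand-written sample:

# Assumed record shape, reconstructed from the EliteSystem class; the real file may differ.
$sampleJson = @'
{ "id": 1, "id64": 1234567890123, "name": "Example System",
  "coords": { "x": 1.5, "y": -2.25, "z": 3.0 }, "date": "2017-01-01 00:00:00" }
'@

$sr = [System.IO.StringReader]::new($sampleJson)
$jr = [Newtonsoft.Json.JsonTextReader]::new($sr)
$system = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($jr, [EliteSystem])
$system.coords.x    # prints 1.5 if the model matches this assumed shape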
Upvotes: 1