Pete Carter

Reputation: 2731

Data being inserted row-by-row rather than in batches

I am trying to batch up rows from a large json doc, and insert them into SQL Server.

The following code works, but it inserts only one row at a time into the SQL Server table. I think it is actually only every 1000th row that ends up being inserted.

add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int]         $id
    [long]        $id64
    [string]      $name
    [EliteCoords] $coords
    [string]      $date 
}

$dt = New-Object system.data.datatable

$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
while ($reader.Read()) {
    $cnt = 0
    $dt.Clear()
    while ($cnt -le 1000) {
        $cnt = $cnt + 1

        if ($reader.TokenType -eq 'StartObject') {

            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)       
        }
    }

    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
}

$stream.Close()

I know the problem is that the if block is skipped for all but the first iteration of the inner while loop, because the token type changes from StartObject to StartArray.

I tried putting an additional reader loop inside, but of course that reads the whole file.

I also tried just reading the array rather than the objects, but of course that fails because of the nested JSON.
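
For reference, the file is one big JSON array of system objects, with the coordinates nested inside each object, along these lines (values made up for illustration):

[
    {
        "id": 1,
        "id64": 1234567890,
        "name": "Example System",
        "coords": { "x": 12.5, "y": -3.25, "z": 140.75 },
        "date": "2017-01-01 00:00:00"
    },
    ...
]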

How should I structure the loops, so that I can batch up and process 1000 rows?

Upvotes: 2

Views: 442

Answers (2)

Pete Carter

Reputation: 2731

The answer was to replace the inner loop with a break, and to add an outer loop that continues until the end of the stream is reached.

add-type -path "C:\Program Files\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\newtonsoft.json.dll"
#install-module sqlserver -AllowClobber

class EliteCoords {
    [double] $x
    [double] $y
    [double] $z
}

class EliteSystem {
    [int]         $id
    [long]        $id64
    [string]      $name
    [EliteCoords] $coords
    [string]      $date 
}

$dt = New-Object system.data.datatable

$dt.Columns.Add("ID",[int])
$dt.Columns.Add("ID64",[long])
$dt.Columns.Add("Name",[string])
$dt.Columns.Add("Coordsx",[decimal])
$dt.Columns.Add("Coordsy",[decimal])
$dt.Columns.Add("Coordsz",[decimal])
$dt.Columns.Add("DiscoveryDate",[string])

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
while ($stream.EndOfStream -eq $false) {
    $cnt = 0
    $dt.Clear()
    while ($reader.Read()) {

        if ($reader.TokenType -eq 'StartObject') {

            $row = [Newtonsoft.Json.JsonSerializer]::CreateDefault().Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)

            $cnt = $cnt + 1
        }
        if ($cnt -gt 9999) {break}
    }
    write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt -Timeout 0
}

$stream.Close()

Upvotes: 0

dbc

Reputation: 116980

Your problem is that you are flushing and clearing the table for every call to $reader.Read() -- i.e., every row.

Instead, you need to accumulate rows until you reach 1000, and then flush:

$stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()
$reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
try {
    $serializer = [Newtonsoft.Json.JsonSerializer]::CreateDefault()
    while ($reader.Read()) {
        # If the reader is positioned at the start of an object then accumulate a row.
        if ($reader.TokenType -eq 'StartObject') {                
            $row = $serializer.Deserialize($reader, [EliteSystem])

            $dr = $dt.NewRow()

            $dr["ID"]            = $row.id
            $dr["ID64"]          = $row.id64
            $dr["Name"]          = $row.name
            $dr["Coordsx"]       = $row.coords.x
            $dr["Coordsy"]       = $row.coords.y
            $dr["Coordsz"]       = $row.coords.z
            $dr["DiscoveryDate"] = $row.date

            $dt.Rows.Add($dr)       
        }

        # If we have accumulated 1000 rows, flush them.
        if ($dt.Rows.Count -ge 1000) {
            write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
            $dt.Clear()
        }
    }

    # Flush any remaining rows.
    if ($dt.Rows.Count -gt 0) {
        write-sqltabledata -ServerInstance ELITEDANGEROUS -Database EDSM -Schema Staging -Table SystemsWithCoordinates -InputData $dt
        $dt.Clear()
    }
}
finally {
    $reader.Close()
    $stream.Close()
}

Notes:

  • You should probably dispose of the StreamReader and JsonTextReader in a finally block in the event of an exception, as the code above does. For a prettier way to do this, see How to implement using statement in powershell?; a rough sketch follows these notes.

  • Allocating the serializer only once should improve performance a little, at no cost.

  • Without seeing the actual JSON file, we can't say whether there are additional problems with the EliteSystem data model used for each row. For instance, if the JSON file is actually a jagged 2d array, this might not work.
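
As promised above, here is a rough sketch of the "using"-style approach from the first note. The Use-Object helper is an assumption sketched from the linked question, not a built-in cmdlet; it simply guarantees disposal even if the script block throws.

function Use-Object {
    param (
        [Parameter(Mandatory)] [AllowNull()] [object] $InputObject,
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock
    )
    try {
        # Run the caller's code.
        . $ScriptBlock
    }
    finally {
        # Dispose whatever was handed in, even if the script block threw.
        if ($InputObject -is [System.IDisposable]) {
            $InputObject.Dispose()
        }
    }
}

# Hypothetical usage with the stream from the code above; the JsonTextReader is
# still closed in its own finally block, mirroring the answer.
Use-Object ($stream = (Get-Item c:\code\systemsWithCoordinates.json).OpenText()) {
    $reader = [Newtonsoft.Json.JsonTextReader]::new($stream)
    try {
        # ... accumulate rows and flush batches here, exactly as above ...
    }
    finally {
        $reader.Close()
    }
}

Wrapping the stream this way means the file handle is released even if Write-SqlTableData fails part-way through.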

Upvotes: 1
