Igor Kuznetsov

Reputation: 415

fast line search in big array - powershell

I have an array (exported from a CSV file) with ~100k lines. The file size is ~22 MB on disk. All I need is to find the line with some data, process it, and load it into MSSQL (I need to sync the CSV data with MSSQL).

The problem is that the search takes almost 1 second (~ TotalMilliseconds : 655,0788)!

$csv.Where({$_.'Device UUID' -eq 'lalala'})

Any way just to speed it up?
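
For reference, the timing can be reproduced like this (a sketch; the path and 'lalala' stand in for the real file and a real Device UUID value):

$csv = Import-Csv .\export.csv          # placeholder path for the ~22 MB, ~100k-line file
(Measure-Command {
    $csv.Where({ $_.'Device UUID' -eq 'lalala' })
}).TotalMilliseconds                    # ~655 ms in my case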

Upvotes: 1

Views: 3040

Answers (3)

js2010

Reputation: 27606

Playing around with the sqlite shell:

'Device UUID' | set-content file.csv
1..2200kb | % { get-random } | add-content file.csv # 1.44 sec, 25mb 

'.mode csv
.import file.csv file' | sqlite3 file  # 2.92 sec, 81mb


# last row
'select * from file where "device uuid" = 2143292650;' | sqlite3 file
# 'select * from file where "device uuid" > 2143292649 and "device uuid" < 2143292651;' | sqlite3 file

2143292650


(history)[-1] | % { $_.endexecutiontime - $_.startexecutiontime }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 570
Ticks             : 5706795
TotalDays         : 6.60508680555556E-06
TotalHours        : 0.000158522083333333
TotalMinutes      : 0.009511325
TotalSeconds      : 0.5706795
TotalMilliseconds : 570.6795

# 34 ms after this:
# 'create index deviceindex on file("device uuid");' | sqlite3 file
# with ".timer on", it's 1ms, after the table is loaded

Upvotes: 0

mklement0

Reputation: 440337

If you only need one or a few lookups, you can consider the following alternative to Mathias R. Jessen's helpful answer. Note that, like Mathias' solution, it requires reading all rows into memory at once:

# Load all rows into memory.
$allRows = Import-Csv file.csv

# Get the *index* of the row with the column value of interest.
# Note: This lookup is case-SENSITIVE.
$rowIndex = $allRows.'Device UUID'.IndexOf('lalala')

# Retrieve the row of interest by index, if found.
($rowOfInterest = if ($rowIndex -ne -1) { $allRows[$rowIndex] })

Once the rows are loaded into memory (as [pscustomobject] instances, which itself won't be fast), the array lookup - via member-access enumeration - is reasonably fast, thanks to .NET performing the (linear) array search, using the System.Array.IndexOf() method.

The problem with your .Where({ ... }) approach is that iteratively calling a PowerShell script block ({ ... }) many times is computationally expensive.
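
A rough way to see the difference (a sketch, assuming $allRows from above and a value that actually occurs in the data - IndexOf() returns -1 otherwise):

(Measure-Command { $allRows.Where({ $_.'Device UUID' -eq 'lalala' }) }).TotalMilliseconds   # script block invoked per row
(Measure-Command { $allRows[$allRows.'Device UUID'.IndexOf('lalala')] }).TotalMilliseconds  # single .NET IndexOf() search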


It comes down to the following trade-off:

  • Either: Spend more time up front to build up a data structure ([hashtable]) that allows efficient lookup (Mathias' answer)

  • Or: Read the file more quickly, but spend more time on each lookup (this answer).

Upvotes: 1

Mathias R. Jessen

Reputation: 174990

Load all 100K rows into a hashtable, using the Device UUID property as the key - this will make it much faster to locate a row than by iterating the whole array with .Where({...}):

$deviceTable = @{}
Import-Csv .\path\to\device_list.csv |ForEach-Object {
  $deviceTable[$_.'Device UUID'] = $_
}

This will now take significantly less than 1 second:

$matchingDevice = $deviceTable['lalala']
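
To verify the per-lookup cost (a sketch, assuming $deviceTable was built as above):

(Measure-Command {
  $matchingDevice = $deviceTable['lalala']
}).TotalMilliseconds   # hashtable lookups are effectively constant-time, typically well under 1 ms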

Upvotes: 5
