user2726975

Reputation: 1353

PowerShell: count lines in an extremely large file

I have an extremely large text file, 250 GB in size, that's given to us by a vendor. They also give us a control file that is supposed to contain the number of lines in the large file. Sometimes there is a mismatch. How do I count the lines in PowerShell? I tried these commands, and after more than half an hour they still hadn't finished.

Get-Content C:\test.txt | Measure-Object -Line

(gc C:\test.txt | Measure-Object | select Count).Count

Any help is appreciated. Thanks, MR

Upvotes: 15

Views: 10235

Answers (3)

dharmatech

Reputation: 9497

The File

I have a 1.39 GB CSV file:

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         10/4/2021   1:23 PM     1397998768 XBTUSD.csv

WSL - through NTFS

wc -l in WSL, accessing the file via /mnt/c/Users.

Slower because the file is on the NTFS side.

$ time wc -l XBTUSD.csv
41695261 XBTUSD.csv

real    0m10.935s
user    0m0.951s
sys     0m1.427s

WSL - file on Linux side.

wc -l in WSL, with the file in /tmp.

$ time wc -l /tmp/XBTUSD.csv
41695261 /tmp/XBTUSD.csv

real    0m0.447s
user    0m0.258s
sys     0m0.189s

Standard PowerShell approach

Measure-Command { Get-Content .\XBTUSD.csv | Measure-Object -Line }

Days              : 0
Hours             : 0
Minutes           : 7
Seconds           : 52
Milliseconds      : 353
Ticks             : 4723537381
TotalDays         : 0.00546705715393518
TotalHours        : 0.131209371694444
TotalMinutes      : 7.87256230166667
TotalSeconds      : 472.3537381
TotalMilliseconds : 472353.7381

Using LINQ methods in PowerShell: ReadLines and Count

Measure-Command {
    [System.Linq.Enumerable]::Count(
        [System.IO.File]::ReadLines((ls .\XBTUSD.csv).FullName)) 
}
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 7
Milliseconds      : 263
Ticks             : 72636842
TotalDays         : 8.40704189814815E-05
TotalHours        : 0.00201769005555556
TotalMinutes      : 0.121061403333333
TotalSeconds      : 7.2636842
TotalMilliseconds : 7263.6842

Switch in PowerShell

Measure-Command { 
    $count = 0; switch -File .\XBTUSD.csv { default { ++$count } }; $count 
}
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 44
Milliseconds      : 975
Ticks             : 449752555
TotalDays         : 0.000520546938657407
TotalHours        : 0.0124931265277778
TotalMinutes      : 0.749587591666667
TotalSeconds      : 44.9752555
TotalMilliseconds : 44975.2555

Summary

  • Fastest overall is wc -l in WSL.

  • Fastest in PowerShell is the LINQ ReadLines + Count approach (see the sketch below).
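
Applied to the 250 GB file from the question, the LINQ approach keeps memory use flat, because ReadLines streams the file rather than reading it into memory all at once. A minimal sketch (the path stands in for the asker's file):

# Streams the file line by line; memory use stays roughly constant.
[System.Linq.Enumerable]::Count(
    [System.IO.File]::ReadLines('C:\test.txt'))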

Upvotes: 13

mklement0

Reputation: 437062

If performance matters, avoid the use of cmdlets and the pipeline; use switch -File:

$count = 0
switch -File C:\test.txt { default { ++$count } }

switch -File enumerates the lines of the specified file; the default condition matches any line.


To give a sense of the performance difference:

# Create a sample file with 100,000 lines.
1..1e5 > tmp.txt
# Warm up the file cache
foreach ($line in [IO.File]::ReadLines("$pwd/tmp.txt")) { }

(Measure-Command { (Get-Content tmp.txt | Measure-Object).Count }).TotalSeconds

(Measure-Command { $count = 0; switch -File tmp.txt { default { ++$count } } }).TotalSeconds

Sample results from my Windows 10 / PSv5.1 machine:

1.3081307  # Get-Content + Measure-Object
0.1097513  # switch -File

That is, on my machine the switch -File command was about 12 times faster.
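
For the question's actual task, the same loop can feed a comparison against the vendor's control file. A minimal sketch, assuming (hypothetically) that the control file holds just the expected line count on its first line:

# Count the lines without loading the file into memory.
$count = 0
switch -File C:\test.txt { default { ++$count } }

# Hypothetical control-file format: the expected count on the first line.
$expected = [long] (Get-Content C:\control.txt -First 1)

if ($count -eq $expected) { "OK: $count lines, matches the control file." }
else { "Mismatch: counted $count, control file says $expected." }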

Upvotes: 20

Mike Twc

Reputation: 2355

For such a huge file I'd rather use a utility written in C. Install Git Bash; it ships with the wc command:

wc -l yourfile.txt

I tested it on a 5 GB / 50M-line file (on an HDD); it took about 40 seconds. The best PowerShell solution took about 2 minutes. You may also want to inspect your file: if it has an auto-incrementing index column or a constant row size, you can derive the line count without scanning the whole file.
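
If the rows really are fixed-width, the count falls out of the file size with no scan at all. A minimal sketch, assuming (hypothetically) that every row occupies exactly the same number of bytes, line terminator included:

# Hypothetical: every row, newline included, is exactly the same size.
$rowBytes  = 34                          # measure one sample row first
$fileBytes = (Get-Item C:\test.txt).Length
[math]::Floor($fileBytes / $rowBytes)    # line count by simple division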

Upvotes: 6
