Reputation: 1353
I have an extremely large text file of size 250 GB that's given to us by a vendor. They also give us a control file that is supposed to contain the number of lines in the large file. Sometimes there is a mismatch. How do I count lines in PowerShell? I tried the following commands, and they ran for more than half an hour without finishing:
Get-Content C:\test.txt | Measure-Object -Line
(gc C:\test.txt | Measure-Object | select Count).Count
Any help is appreciated. Thanks, MR
Upvotes: 15
Views: 10235
Reputation: 9497
I have a 1.39 GB csv file:
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 10/4/2021 1:23 PM 1397998768 XBTUSD.csv
wc -l in WSL, accessing the file via /mnt/c/Users. Slower because the file is on the NTFS side.
$ time wc -l XBTUSD.csv
41695261 XBTUSD.csv
real 0m10.935s
user 0m0.951s
sys 0m1.427s
wc -l in WSL, with the file in /tmp.
time wc -l /tmp/XBTUSD.csv
41695261 /tmp/XBTUSD.csv
real 0m0.447s
user 0m0.258s
sys 0m0.189s
Get-Content and Measure-Object in PowerShell:
Measure-Command { Get-Content .\XBTUSD.csv | Measure-Object -Line }
Days : 0
Hours : 0
Minutes : 7
Seconds : 52
Milliseconds : 353
Ticks : 4723537381
TotalDays : 0.00546705715393518
TotalHours : 0.131209371694444
TotalMinutes : 7.87256230166667
TotalSeconds : 472.3537381
TotalMilliseconds : 472353.7381
ReadLines and Count in PowerShell:
Measure-Command {
[System.Linq.Enumerable]::Count(
[System.IO.File]::ReadLines((ls .\XBTUSD.csv).FullName))
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 7
Milliseconds : 263
Ticks : 72636842
TotalDays : 8.40704189814815E-05
TotalHours : 0.00201769005555556
TotalMinutes : 0.121061403333333
TotalSeconds : 7.2636842
TotalMilliseconds : 7263.6842
switch -File in PowerShell:
Measure-Command {
$count = 0; switch -File .\XBTUSD.csv { default { ++$count } }; $count
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 44
Milliseconds : 975
Ticks : 449752555
TotalDays : 0.000520546938657407
TotalHours : 0.0124931265277778
TotalMinutes : 0.749587591666667
TotalSeconds : 44.9752555
TotalMilliseconds : 44975.2555
Fastest overall is wc -l in WSL, with the file on the Linux side. Fastest in PowerShell is the LINQ ReadLines/Count approach.
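As a rough PowerShell equivalent of what wc -l does (my own sketch, not part of the benchmarks above): count LF bytes with a raw stream. It assumes LF or CRLF line endings and a newline-terminated last line.

$stream = [System.IO.File]::OpenRead((Get-Item .\XBTUSD.csv).FullName)
$buffer = New-Object byte[] (1MB)
$count = 0L
while (($read = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) {
    $i = 0
    # Scan this chunk for LF (byte 10); Array.IndexOf keeps the search in native code.
    while (($i = [System.Array]::IndexOf($buffer, [byte]10, $i, $read - $i)) -ge 0) {
        $count++
        $i++
    }
}
$stream.Dispose()
$count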
Upvotes: 13
Reputation: 437062
If performance matters, avoid the use of cmdlets and the pipeline; use switch -File:
$count = 0
switch -File C:\test.txt { default { ++$count } }
switch -File enumerates the lines of the specified file; the default condition matches any line.
To give a sense of the performance difference:
# Create a sample file with 100,000 lines.
1..1e5 > tmp.txt
# Warm up the file cache
foreach ($line in [IO.File]::ReadLines("$pwd/tmp.txt")) { }
(Measure-Command { (Get-Content tmp.txt | Measure-Object).Count }).TotalSeconds
(Measure-Command { $count = 0; switch -File tmp.txt { default { ++$count } } }).TotalSeconds
Sample results from my Windows 10 / PSv5.1 machine:
1.3081307 # Get-Content + Measure-Object
0.1097513 # switch -File
That is, on my machine the switch -File command was about 12 times faster.
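For the original 250 GB use case, the counter can be wrapped in a function and checked against the vendor's control file. A minimal sketch; the control-file format (a single line holding the expected count) is an assumption here:

function Get-LineCount {
    param([string]$Path)
    $count = 0
    switch -File $Path { default { ++$count } }
    $count
}

# Hypothetical control-file format: the first line holds the expected count.
$expected = [long](Get-Content C:\control.txt -First 1)
$actual = Get-LineCount C:\test.txt
if ($actual -ne $expected) {
    "Mismatch: control file says $expected, actual count is $actual"
}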
Upvotes: 20
Reputation: 2355
For such a huge file I'd rather go with a utility written in C. Install Git Bash; it should have the wc command:
wc -l yourfile.txt
I tested it on a 5 GB / 50M-line file (on an HDD); it took about 40 s. The best PowerShell solution took about 2 minutes. You may also check your file: it might have an auto-incrementing index or a constant row size.
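If the rows really are a constant size, you can estimate the count from the file length without reading the whole file. A minimal sketch, assuming a single-byte encoding and CRLF (2-byte) line endings:

$file = Get-Item C:\test.txt
# Measure the first line to get the bytes per row.
$reader = New-Object System.IO.StreamReader($file.FullName)
$bytesPerRow = $reader.ReadLine().Length + 2   # +2 for CRLF; assumes single-byte chars
$reader.Dispose()
[math]::Floor($file.Length / $bytesPerRow)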
Upvotes: 6