Reputation: 25
I have a batch file that checks a TXT file (over one million lines, 13 MB) for duplicate lines. It takes over 2 hours to run. How can I speed it up? Thank you!
TXT file
11
22
33
44
.
.
.
44 (over one million lines)
Existing batch file
setlocal
set var1=*
sort original.txt>sort.txt
for /f %%a in ('type sort.txt') do (call :run %%a)
goto :end
:run
if %1==%var1% echo %1>>duplicate.txt
set var1=%1
goto :eof
:end
Upvotes: 1
Views: 89
Reputation: 67216
This method uses the findstr command as in aschipfl's answer, but in this case each line and its duplicates are removed from the file after being processed by findstr. This method could be faster if the number of duplicates in the file is high; otherwise it will be slower because of the high volume of data manipulated in each turn. Only a test can confirm this point...
@echo off
setlocal EnableDelayedExpansion
del duplicate.txt 2>NUL
copy /Y original.txt input.txt > NUL
:nextTurn
for %%a in (input.txt) do if %%~Za equ 0 goto end
< input.txt (
   set /P "line="
   findstr /X /C:"!line!"
   find /V "!line!" > output.txt
) >> duplicate.txt
move /Y output.txt input.txt > NUL
goto nextTurn
:end
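
One rough way to run such a test is to print the clock before and after each candidate script and compare the readings. This is only a minimal sketch; it assumes the method under test has been saved as finddup.bat, which is a hypothetical name chosen for illustration.

```batch
@echo off
rem Rough timing harness: print the clock, run the script, print the clock again.
rem finddup.bat is a hypothetical name for whichever script is being measured.
echo Start: %time%
call finddup.bat
echo End:   %time%
```

Running it once per candidate script against the same input file should give comparable numbers, though the first run may be slower if the file is not yet cached.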
Upvotes: 2
Reputation: 67216
This should be the fastest method using a Batch file:
@echo off
setlocal EnableDelayedExpansion
set var1=*
sort original.txt>sort.txt
(for /f %%a in (sort.txt) do (
   if "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt
Upvotes: 2
Reputation: 34949
Supposing you provide the text file as the first command line argument, you could try the following:
@echo off
for /F "usebackq delims=" %%L in ("%~1") do (
    for /F "delims=" %%K in ('
        findstr /X /C:"%%L" "%~1" ^| find /C /V ""
    ') do (
        if %%K GTR 1 echo %%L
    )
)
This returns all duplicate lines, but multiple times each, namely as often as each occurs in the file.
Upvotes: 0
Reputation: 80113
@echo off
setlocal enabledelayedexpansion
set var1=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" echo %%a
  set "var1=%%a"
 )
)>"u:\q42574625_2.txt"
GOTO :EOF
This may be faster, but I don't have your file to test against. I used a file named q42574625.txt containing some dummy data for my testing.
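
If you need comparable dummy data, something along these lines could generate it. This is a sketch under assumptions: the file name testdata.txt and the line count of 1000 are illustrative, not taken from the question.

```batch
@echo off
setlocal EnableDelayedExpansion
rem Write 1000 random two-digit numbers (10..99), one per line.
rem testdata.txt and the count of 1000 are illustrative assumptions.
(for /L %%i in (1,1,1000) do (
    set /A "n=!random! %% 90 + 10"
    echo !n!
)) > testdata.txt
```

With only 90 possible values spread over 1000 lines, the file is guaranteed to contain duplicates. Raise the loop limit to approach the size of the real file.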
It's not clear whether you want only one instance of a duplicate line or not. Your code would produce 5 "duplicate" lines if there were 6 identical lines in the source file.
Here's a version which will report each duplicated line only once:
@echo off
setlocal enabledelayedexpansion
set var1=*
set var2=*
(
 for /f %%a in ('sort q42574625.txt') do (
  if "%%a"=="!var1!" IF "!var2!" neq "%%a" echo %%a&SET "var2=%%a"
  set "var1=%%a"
 )
)>"u:\q42574625_2.txt"
GOTO :EOF
Upvotes: 0