Reputation: 25

Windows Batch FOR Loop improvement

I have a batch file that checks for duplicate lines in a TXT file (over one million lines, 13 MB). It takes over 2 hours to run. How can I speed it up? Thank you!

TXT file

11
22
33
44
.
.
.
44 (over one million lines)

Existing Batch

setlocal
set var1=*
sort original.txt>sort.txt
for /f %%a in ('type sort.txt') do (call :run %%a)
goto :end
:run
if %1==%var1% echo %1>>duplicate.txt
set var1=%1
goto :eof
:end

Upvotes: 1

Answers (4)

Aacini

Reputation: 67216

This method uses the findstr command as in aschipfl's answer, but here each line and its duplicates are removed from the file after findstr has checked it. It could be faster if the file contains many duplicates; otherwise it will be slower because of the large volume of data processed on each pass. Only a test can confirm this.

@echo off
setlocal EnableDelayedExpansion

del duplicate.txt 2>NUL
copy /Y original.txt input.txt > NUL

:nextTurn
for %%a in (input.txt) do if %%~Za equ 0 goto end

< input.txt (
   set /P "line="
   findstr /X /C:"!line!"
   find /V "!line!" > output.txt
) >> duplicate.txt

move /Y output.txt input.txt > NUL
goto nextTurn

:end
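
To run the test mentioned above, a rough timing wrapper along these lines might help. It is only a sketch: method_sort.cmd and method_findstr.cmd are hypothetical names for files holding the two scripts to compare, and both are assumed to read original.txt from the current directory.

```batch
@echo off
rem Rough timing sketch. The two script names below are hypothetical;
rem replace them with the files that hold the methods being compared.
echo Start:  %time%
call method_sort.cmd
echo Middle: %time%
call method_findstr.cmd
echo End:    %time%
```

Each %time% is expanded when its own line is executed, so the three values bracket the two runs. Comparing them by eye is enough to tell a run of minutes from a run of hours.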

Upvotes: 2

Aacini

Reputation: 67216

This should be the fastest method using a Batch file:

@echo off
setlocal EnableDelayedExpansion

set var1=*
sort original.txt>sort.txt
(for /f %%a in (sort.txt) do (
   if "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt

Upvotes: 2

aschipfl

Reputation: 34949

Supposing you provide the text file as the first command line argument, you could try the following:

@echo off
for /F "usebackq delims=" %%L in ("%~1") do (
    for /F "delims=" %%K in ('
        findstr /X /C:"%%L" "%~1" ^| find /C /V ""
    ') do (
        if %%K GTR 1 echo %%L
    )
)

This returns all duplicate lines, but each one is printed as many times as it occurs in the file.
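
If each duplicate should be listed only once, the output could be post-processed. The following is a minimal sketch, assuming the output of the script above was redirected to a file; dups_raw.txt and dups_unique.txt are hypothetical names. It sorts the lines and echoes a line only when it differs from the previous one.

```batch
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical file names:
rem   dups_raw.txt    - raw output of the script above (repeated entries)
rem   dups_unique.txt - receives each duplicated line once
set "prev="
(
  for /F "delims=" %%L in ('sort dups_raw.txt') do (
    rem After sorting, equal lines are adjacent, so echo only on a change.
    if not "%%L"=="!prev!" echo %%L
    set "prev=%%L"
  )
) > dups_unique.txt
```

This assumes simple lines like the numeric sample data. Lines containing exclamation marks or double quotes would need extra care, because delayed expansion and the quoted comparison can alter or break on them, and for /F skips empty lines and lines starting with a semicolon.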

Upvotes: 0

Magoo

Reputation: 80113

@echo off
setlocal enabledelayedexpansion
set var1=*
(
for /f %%a in ('sort q42574625.txt') do (
 if "%%a"=="!var1!" echo %%a
 set "var1=%%a"
)
)>"u:\q42574625_2.txt"

GOTO :EOF

This may be faster, but I don't have your file to test against.

I used a file named q42574625.txt containing some dummy data for my testing.

It's not clear whether you want only one instance of a duplicate line or not. Your code would produce 5 "duplicate" lines if there were 6 identical lines in the source file.

Here's a version which will report each duplicated line only once:

@echo off
setlocal enabledelayedexpansion
set var1=*
set var2=*
(
for /f %%a in ('sort q42574625.txt') do (
 if "%%a"=="!var1!" IF "!var2!" neq "%%a" echo %%a&SET "var2=%%a"
 set "var1=%%a"
)
)>"u:\q42574625.txt"

GOTO :EOF

Upvotes: 0
