Tester

Reputation: 11

Multiple FINDSTR command to get the desired results

The idea is to list every URL that returned a 404 error, preceded by the IDs above it that show which block the URL belongs to, and to put the FileName line at the top of the output file.

I have been trying to loop FINDSTR so that it returns the line just before each previously found line number. Can anybody help?

Sample file:

FileName:  LastABC-1563220.xml
-------------------------------
123456786
12348
1234DEF
-------------------------------
http://Product.com/1234DEF
HTTP/1.1 404 Not Found - 0.062000
http://Product.com/1234DEF_1
HTTP/1.1 200 OK - 0.031000
123456785
12349
1234EFG
-------------------------------
http://Product.com/1234EFG
HTTP/1.1 200 OK - 0.031000
123456784
12340
1234FGH
-------------------------------
http://Product.com/1234FGH
HTTP/1.1 200 OK - 0.031000
http://Product.com/1234FGH_1
HTTP/1.1 404 Not Found - 0.079000
http://Product.com/1234FGH_2
HTTP/1.1 404 Not Found - 0.067000
http://Product.com/1234FGH_4
HTTP/1.1 404 Not Found - 0.047000

Desired output:

FileName:  LastABC-1563220.xml
123456786 12348 1234DEF
http://Product.com/1234DEF

123456784 12340 1234FGH
http://Product.com/1234FGH_1
http://Product.com/1234FGH_2
http://Product.com/1234FGH_4

Script I have so far:

del "%FailingURLS%" 2>nul
set numbers=
for /F "delims=:" %%a in ('findstr /I /N /C:"404 Not Found" %Formatedfile%') do (
    set /A before=%%a-1
    set "numbers=!numbers!!before!: "
)
(for /F "tokens=1* delims=:" %%a in ('findstr /N "^" %Formatedfile% ^| findstr /B "%numbers%"') do echo %%b) > %FailingURLS%

Upvotes: 1

Views: 200

Answers (3)

aschipfl

Reputation: 34909

Here is a script (let us call it extract-failed-urls.bat) that demonstrates one possible way to accomplish your task. It contains quite a few explanatory rem remarks to help you understand what happens:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1"      & rem // (`%~1` represents the first command line argument)
set "_URLP=://"      & rem // (partial string that every listed URL contains)
set "_RESP=HTTP/1.1" & rem // (partial string that every response begins with)
set "_ERRN=404"      & rem // (specific error number in response to recognise)

rem // Determine the total number of lines contained in the given file:
(for /F %%C in ('^< "%_FILE%" find /C /V ""') do set "CNT=%%C") || goto :EOF
rem // Read from the given file:
< "%_FILE%" (
    rem // Clear IDs and URL buffers, and preset flag:
    set "IDS=" & set "URL=" & set "FLAG=#"
    setlocal EnableDelayedExpansion
    rem // Read and write first line of file separately:
    set /A "CNT-=1" & set "LINE=" & set /P LINE="" & < nul set /P ="!LINE!"
    rem // Loop through the remaining lines:
    for /L %%I in (1,1,!CNT!) do (
        rem // Read a line and process only non-empty one:
        set /P LINE="" && (
            rem // Try to split off response prefix:
            set "REST=!LINE:*%_RESP% =!"
            rem // Determine kind of current line:
            if "!LINE:-=!" == "" (
                rem // Line contains only hyphens `-`, so clear URL buffer:
                set "URL="
            ) else if not "!LINE!" == "!LINE:*%_URLP%=!" (
                rem // Line contains an URL, so store to URL buffer, set flag:
                set "URL=!LINE!" & set "FLAG=#"
            ) else if "!LINE!" == "%_RESP% !REST!" (
                rem // Line contains a response, so gather number:
                for /F %%R in ("!REST!") do (
                    rem /* Specific error encountered, hence write IDs, if any,
                    rem    clear IDs buffer, then write stored URL, if any: */
                    if "%%R" == "%_ERRN%" (
                        if defined IDS echo/& echo(!IDS!
                        set "IDS=" & if defined URL echo(!URL!
                    )
                )
                rem // Clear URL buffer and set flag:
                set "URL=" & set "FLAG=#"
            ) else (
                rem /* No other condition fulfilled, hence line contains an ID,
                rem    so put ID into IDs buffer, clear URL buffer and flag: */
                if defined FLAG (set "IDS=!LINE!") else set "IDS=!IDS! !LINE!"
                set "URL=" & set "FLAG="
            )
        )
    )
    endlocal
)

endlocal
exit /B

To run it against an input file named sample.txt use a command line like this:

extract-failed-urls.bat "sample.txt"

To write the output to another file named failed-urls.txt use this:

extract-failed-urls.bat "sample.txt" > "failed-urls.txt"

With the data from the sample input file from the question, the output would be the following:

FileName:  LastABC-1563220.xml
123456786 12348 1234DEF
http://Product.com/1234DEF

123456784 12340 1234FGH
http://Product.com/1234FGH_1
http://Product.com/1234FGH_2
http://Product.com/1234FGH_4

This approach distinguishes between the following types of input lines, each of which triggers certain actions:

  1. the first line (the one beginning with FileName:):
    • just output the line unedited (without a trailing line-break);
  2. lines that contain only hyphens (-------------------------------):
    • clear the buffer that holds the (last) URL;
  3. lines that hold a URL, which are those containing ://:
    • store (overwrite) the URL to a buffer;
    • set a flag to clear the buffer for IDs (later);
  4. lines that hold a response, which are those beginning with HTTP/1.1 + SPACE:
    • if the error number is 404:
      • output the content of the buffer for IDs (if any);
      • clear the buffer for IDs;
      • output the content of the buffer for the URL (if any);
    • clear the buffer that holds the (last) URL;
    • set a flag to clear the buffer for the IDs (later);
  5. lines that contain an ID, so all the others:
    • if the flag to clear the buffer for the IDs is set, then, well, clear the buffer;
    • append the ID to the buffer for IDs (SPACE-separated);
    • clear the buffer that holds the (last) URL;
    • reset the flag to clear the buffer for the IDs;

Here is a simpler approach that relies on the fact that an ID block in the input file always contains three lines, followed by a hyphen-only line and then by pairs of URL and response lines (if not, an error message appears):

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1"      & rem // (`%~1` represents the first command line argument)
set "_URLP=://"      & rem // (partial string that every listed URL contains)
set "_RESP=HTTP/1.1" & rem // (partial string that every response begins with)
set "_ERRN=404"      & rem // (specific error number in response to recognise)

rem // Determine the total number of lines contained in the given file:
(for /F %%C in ('^< "%_FILE%" find /C /V ""') do set "CNT=%%C") || goto :EOF

rem // Read from the given file:
< "%_FILE%" (
    rem // Clear IDs buffer and such for previous lines:
    set "IDS=#" & set "PREV1=" & set "PREV2="
    setlocal EnableDelayedExpansion
    rem // Read and write first line of file separately:
    set /A "CNT-=1" & set "LINE=" & set /P LINE="" & < nul set /P ="!LINE!"
    rem // Read and check second line of file separately:
    set /A "CNT-=1" & set "LINE=" & set /P LINE="" & if not "!LINE:-=!" == "" goto :ERROR
    rem // Loop through the remaining lines:
    set /A "CNT/=2" & for /L %%I in (1,1,!CNT!) do (
        rem // Read a line and process only non-empty one:
        set /P LINE1="" && (
            rem // Read another line and process only non-empty one:
            set /P LINE2="" && (
                rem // Determine kind of first line:
                if not "!LINE1!" == "!LINE1:*%_URLP%=!" (
                    rem /* First line contains a URL, so next line must be a response;
                    rem    hence try to split off response prefix: */
                    set "REST=!LINE2:*%_RESP% =!"
                    rem // Check second line whether it is really a response:
                    if "!LINE2!" == "%_RESP% !LINE2:*%_RESP% =!" (
                        rem // Line indeed contains a response, so gather number:
                        for /F %%R in ("!REST!") do (
                            rem /* Specific error encountered, hence write IDs, if any,
                            rem    clear IDs buffer, then write URL from first line: */
                            if "%%R" == "%_ERRN%" (
                                if defined IDS echo/& echo(!IDS!
                                set "IDS=" & echo(!LINE1!
                            )
                        )
                    ) else goto :ERROR
                    rem // Clear buffers for previous lines:
                    set "PREV1=" & set "PREV2="
                ) else (
                    rem /* First line does not contain an URL, so it contains an ID,
                    rem    hence check if buffers for previous lines already contain
                    rem    data, which must be IDs, so store them all in IDs buffer,
                    rem    and check if the second line contains only hyphens `-`: */
                    if defined PREV1 if "!LINE2:-=!" == "" (
                        set "IDs=!PREV1! !PREV2! !LINE1!"
                    ) else goto :ERROR
                    rem // Store both lines into buffer for previous lines:
                    set "PREV1=!LINE1!" & set "PREV2=!LINE2!"
                )
            ) || exit /B 0
        ) || exit /B 0
    )
    endlocal
)

endlocal
exit /B

:ERROR
if defined IDS > con echo/
if "!" == "" endlocal
>&2 echo ERROR: expected file format violated!
exit /B 2

The calling convention as well as the output based on your input data are the same as above.
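
Both scripts read the file through redirected standard input, consuming one line per set /P call inside a for /L loop that runs once per counted line. Isolated from the rest of the logic, that reading technique might look like the following sketch, which merely prints every line with a running number in front; the numbering is only there for illustration and is not part of either script above.

```batch
@echo off
setlocal EnableExtensions DisableDelayedExpansion

set "_FILE=%~1" & rem // (`%~1` represents the first command line argument)

rem // Determine the total number of lines contained in the given file:
(for /F %%C in ('^< "%_FILE%" find /C /V ""') do set "CNT=%%C") || goto :EOF

rem // Redirect the file into the whole block, so each `set /P` reads the next line:
< "%_FILE%" (
    setlocal EnableDelayedExpansion
    for /L %%I in (1,1,!CNT!) do (
        rem // Clear the variable first, because `set /P` keeps the old value on empty lines:
        set "LINE=" & set /P LINE=""
        echo(%%I: !LINE!
    )
    endlocal
)

endlocal
exit /B
```

Unlike for /F, this way of reading does not skip empty lines, which is why the scripts above can test each line with `set /P LINE="" && (...)` and ignore the empty ones explicitly.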

Upvotes: 0

Aacini

Reputation: 67216

This is the way I would do it:

@echo off
setlocal EnableDelayedExpansion

del PreviousLines.txt 2>nul
set "ids="
(for /F "delims=" %%a in (test.txt) do (
   set "line=%%a"
   if "!line:~0,9!" equ "FileName:" (
      echo(!line!>> PreviousLines.txt
   ) else if "!line:~0,5!" equ "http:" (
      if defined ids echo(!ids!>> PreviousLines.txt
      set "ids="
      echo(!line!>> PreviousLines.txt
   ) else if "!line:~0,4!" equ "HTTP" (
      rem It is an "OK" or "Not Found" line...
      rem If is "Not Found", show previous lines
      if "!line:Not Found=!" neq "!line!" type PreviousLines.txt
      rem Anyway, reset previous lines
      del PreviousLines.txt 2>nul
      set "ids="
   ) else if "!line:~0,5!" neq "-----" (
      set "ids=!ids!!line! "
   )
)) > FailingURLS.txt

Output:

FileName:  LastABC-1563220.xml
123456786 12348 1234DEF 
http://Product.com/1234DEF
http://Product.com/1234FGH_1
http://Product.com/1234FGH_2
http://Product.com/1234FGH_4

I don't understand why you show the 123456784 12340 1234FGH IDs before http://Product.com/1234FGH_1, because those IDs belong to http://Product.com/1234FGH, which is OK...

Upvotes: 1

Compo

Reputation: 38613

Your question is too broad as it stands, so the following is an example showing one method of retrieving the '404' URLs from the file, which I assume to be your main issue.

@Echo Off
SetLocal EnableExtensions DisableDelayedExpansion
Set "Src=formattedfile.txt"
Set "Str=404 Not Found"
(Set LF=^
% 0x0A %
)
For /F %%A In ('Copy /Z "%~f0" Nul')Do Set "CR=%%A"
SetLocal EnableDelayedExpansion
FindStr /RC:".*!CR!*!LF!.*%Str%" "%Src%"
EndLocal
Pause

Just modify the value of Src on line 3 to match the name of your formatted text file.

Output from your provided file content:

http://Product.com/1234DEF
http://Product.com/1234FGH_1
http://Product.com/1234FGH_2
http://Product.com/1234FGH_4
Press any key to continue . . .

Upvotes: 0
