Michael Heneghan

Reputation: 307

Using awk, how do I pull out matched strings and other data in one command?

I'm trying to parse a number of files in a path for certain string patterns (e.g. new File()), which can span multiple lines in a file.

The information I'm trying to return is:

1. Filename/path
2. Number of occurrences of the string pattern in the file
3. Code found, i.e. new File()
4. Line number(s) the code was found on

Here are the example contents of test.txt:

    new
    File()
    new File()
    new
    
    
    
    File()
    File() new
    new File() test new File()


A more realistic example would be (made-up code, not compilable):

    package gw.plugin.document.impl


    @Export
    abstract class BaseLocalDocumentContentSource implements 
    InitializablePlugin
    {

      private static var DOCUMENTS_PATH = "documents.path"

      public property get DemoDocumentsURL() : URL {
        return new URL("file", "", DemoDocumentsPath)
      }


      construct() {
      }

      protected function buildDocumentsPath(documentRootDir : String, 
      documentTmpDir : String) {
        if (DocumentsPathParameter.HasContent) {
          DemoDocumentsPath = getAbsolutePath(DocumentsPathParameter, 
    documentRootDir)
          if (!new test 
          File(DemoDocumentsPath).equals(new File(DocumentsPathParameter))) {
              Logger.DOCUMENT.warn((typeof this).RelativeName + " has a 
    relative path specified for its documents.path parameter, so it will store 
    documents in the app container's temporary directory. For production use, 
    the configuration should be changed to a full directory path, not a 
    relative path")
              DocumentsPath = getAbsolutePath(DocumentsPathParameter, documentTmpDir)
          var file = new File(DocumentsPath)
          if (!file.exists() && file.isDirectory()) {
              file.mkdirs()
          }
      } else {
          DocumentsPath = DemoDocumentsPath
      }
    }
    Logger.DOCUMENT.info("Documents path: " + DocumentsPath)
      }

      protected function updateDocument(strDocUID : String, isDocument : InputStream) {
    try {
        var file = getDocumentFile(strDocUID)
        if (!FileUtil.isFile( file ) || file.isReservedFileName()) {
            throw new IllegalArgumentException("Document ${strDocUID} does not exist!")
        }
        var backupFile = new File(file.getPath() + ".bak")
        if (not file.renameTo(backupFile) ) { // renamed physical file, 'file' still has previous name
          throw new RuntimeException("Failed to rename file to ${backupFile}")
            }
        copyToFile(isDocument, file)
        try {
          backupFile.delete()
        }
        catch (e : Throwable) {
          Logger.DOCUMENT.warn("DocMgmt failed to delete '${backupFile}'")
        }
    } catch (e : Exception) {
        throw new RuntimeException("Exception encountered trying to update document with doc UID: ${strDocUID}", e)
    }
      }

    protected function getDocumentFile(relativePath : String, checkDemoFolder : boolean) : File {
    var file = new File(getDocumentsDir(), relativePath)
    if (!file.exists() && checkDemoFolder) {
        file = new File(getDemoDocumentsDir(), relativePath)
    }
    return file
    }

      protected function makeSubDirPath(diw : IDocumentInfoWrapper) : String {
      var subDirPath = diw.getSubDirForDocument()
      var dirDoc = new File(getDocumentsDir() + subDirPath)
      if (not dirDoc.Directory) {
          dirDoc.mkdirs()
      }
      return subDirPath
      }


     private static function getAbsolutePath(path : String, rootPath : String) : String {
        var retVal = path
        if (path.startsWith("\\") || path.startsWith("/") || (path.length() > 1 && path.charAt(1) == ":" as char)) {
        retVal = path
        } else {
        retVal = rootPath + File.separator + path
        }
        try {
        retVal = (new File(retVal)).getCanonicalPath()
        } catch (e : IOException) {
        throw new RuntimeException("Could not get absolute path from relative path: ${path}", e)
        }
        return retVal.replaceAll("\\\\","/")
      }

    }

I have looked at grep, pcregrep, sed and awk. The folder I'm searching is very large, so I'm trying to return all the required data in one command instead of running four commands and traversing the folder more than once.

I've found awk the most applicable, but I have very limited experience with all of the programs I mentioned, and I don't have authorisation to install pcregrep in the environment, so I can't use that.

Here's my attempt with awk so far. It is wrong and probably poorly done, so be gentle :)

    awk '{
       if(/new[[:space:]]*/) {
         line1=NR;
         code1=$0;
       } if(/File\(\)/) { 
         count[$0]++; 
         line2=NR; 
         if(line1 != line2) {
           code2=$0;
           printf "Found on lines %d, %d, code = %s %s \nNumber of occurrences = %d", line1, line2, code1, code2, count[$0]
         } else { 
           printf "Found on line %d, code = %s \nNumber of occurrences = %d", line1, code1, count[$0]
         } 
       }
    }' test.txt 

I know that my count of occurrences is incorrect, as I'm counting the occurrences per match as opposed to the total in the file. I'm also getting some weird output such as the below:

     File()n lines 1, 2, code = new
     Number of occurrences = 1
     ound on line 3, code = new File()
    Number of occurrences = 1
     File()n lines 4, 8, code = new
     Number of occurrences = 2
     ound on line 9, code = File() new
    Number of occurrences = 1

Here code2 overwrites the first few characters of the printed line instead of appearing where I'd expect.

Expected output would be something like:

    test.txt (Filename) 
    5 (number of occurrences of new File() pattern) 
    new File() Found on lines 1 & 2 
    new File() Found on line 3 
    new File() Found on lines 4 & 8 
    new File() Found on line 10 
    new File() Found on line 10 

Or something similar to this.

Output of cat -vte test.txt is:

    new^M$
    File()^M$
    new File()^M$
    new ^M$
    ^M$
    ^M$
    ^M$
    File()^M$
    File() new^M$
    new File() test new File()

Any help would be appreciated.

Upvotes: 1

Views: 175

Answers (2)

RavinderSingh13

Reputation: 133770

This takes the code from @anubhava's answer and adapts it to work with multiple files; it was written and tested in GNU awk. GNU awk is required because the code uses the ENDFILE special pattern, whose action runs after the last record of each file.

I added the -v IGNORECASE="1" option because the OP confirmed in the comments that the match needs to be case-insensitive.

    awk -v IGNORECASE="1" -v msg='new File() Found on line ' '
    # BEGINFILE (GNU awk) runs before each file: reset state, print that file name
    BEGINFILE { p=n=0; print FILENAME, "(Filename)" }
    {
       # report every complete match on the current line
       while(match($0, /new[[:blank:]]+File\(\)/)) {
          print msg FNR
          ++n
          $0 = substr($0, RSTART+RLENGTH)
       }
    }
    # a trailing "new" may be completed by File() on a later line
    /new[[:blank:]]*$/ {
       p = FNR
       next
    }
    p && NF {
       if (/^[[:blank:]]*File\(\)/) {
          print msg p, "&", FNR
          ++n
       }
       p = 0
    }
    ENDFILE {
       print n, "(number of occurrences of new File() pattern)"
    }' *.txt
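Since ENDFILE and IGNORECASE are both GNU awk extensions, here is a rough sketch of how the same per-file report might be produced with a POSIX awk. It is an illustration under assumptions rather than a tested drop-in replacement: the function name `scan_new_file` and the sample files `a.txt` and `b.txt` are made up for the example, case-insensitivity is approximated by lowercasing each line with `tolower()`, and empty files are silently skipped because the `FNR == 1` rule never fires for them.

```shell
# Hypothetical wrapper; the name scan_new_file is not from the original answers.
scan_new_file() {
  awk -v msg='new File() Found on line ' '
  function report() {
     print n + 0, "(number of occurrences of new File() pattern)"
  }
  FNR == 1 {                      # first record of each input file
     if (NR > 1) report()         # summary for the previous file
     p = n = 0
     print FILENAME, "(Filename)"
  }
  { $0 = tolower($0) }            # portable stand-in for IGNORECASE
  {
     while (match($0, /new[[:blank:]]+file\(\)/)) {
        print msg FNR
        ++n
        $0 = substr($0, RSTART + RLENGTH)
     }
  }
  /new[[:blank:]]*$/ { p = FNR; next }
  p && NF {
     if (/^[[:blank:]]*file\(\)/) {
        print msg p, "&", FNR
        ++n
     }
     p = 0
  }
  END { if (NR > 0) report() }    # summary for the last file
  ' "$@"
}

# Two small made-up sample files to try it on
printf 'new\nFile()\nNEW file()\n' > a.txt
printf 'x = new File() + new File()\n' > b.txt
scan_new_file a.txt b.txt
```

Lowercasing `$0` is harmless here only because the report prints line numbers and a fixed message, never the matched text itself. For a large directory tree, the awk command itself (not the shell function) could presumably be placed after `find ... -type f -exec awk '...' {} +` so the tree is walked once.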

Upvotes: 1

anubhava

Reputation: 786291

You may use this awk:

    awk -v msg='new File() Found on line ' 'BEGIN {print ARGV[1], "(Filename)"} {while(match($0, /new[[:blank:]]+File\(\)/)) {print msg NR; ++n; $0 = substr($0, RSTART+RLENGTH)}} /new[[:blank:]]*$/ {p = NR; next} p && NF {if (/^[[:blank:]]*File\(\)/) {print msg p, "&", NR; ++n} p = 0} END {print n, "(number of occurrences of new File() pattern)"}' test.txt

    test.txt (Filename)
    new File() Found on line 1 & 2
    new File() Found on line 3
    new File() Found on line 4 & 8
    new File() Found on line 10
    new File() Found on line 10
    5 (number of occurrences of new File() pattern)

A more readable form:

    awk -v msg='new File() Found on line ' '
    BEGIN {print ARGV[1], "(Filename)"}
    {
       while(match($0, /new[[:blank:]]+File\(\)/)) {
          print msg NR
          ++n
          $0 = substr($0, RSTART+RLENGTH)
       }
    }
    /new[[:blank:]]*$/ {
       p = NR
       next
    }
    p && NF {
       if (/^[[:blank:]]*File\(\)/) {
          print msg p, "&", NR
          ++n
       }
       p = 0
    }
    END {
       print n, "(number of occurrences of new File() pattern)"
    }' test.txt
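One caveat: the `cat -vte` output in the question shows `^M$` at the end of most lines, so the file has Windows (CRLF) line endings. A trailing carriage return would likely stop `/new[[:blank:]]*$/` from matching, and it is also consistent with the overwritten output the OP saw. The sketch below assumes that stripping a trailing `\r` from each record is acceptable; the function name `count_new_file` is made up for the example, and the `printf` line recreates the sample file from the `cat -vte` listing.

```shell
# Recreate the sample file with CRLF endings, as shown by cat -vte in the question
printf 'new\r\nFile()\r\nnew File()\r\nnew \r\n\r\n\r\n\r\nFile()\r\nFile() new\r\nnew File() test new File()' > test.txt

# Hypothetical wrapper; the name count_new_file is not from the original answer.
count_new_file() {
  awk -v msg='new File() Found on line ' '
  BEGIN { print ARGV[1], "(Filename)" }
  { sub(/\r$/, "") }              # drop a trailing carriage return, if any
  {
     while (match($0, /new[[:blank:]]+File\(\)/)) {
        print msg NR
        ++n
        $0 = substr($0, RSTART + RLENGTH)
     }
  }
  /new[[:blank:]]*$/ { p = NR; next }
  p && NF {
     if (/^[[:blank:]]*File\(\)/) {
        print msg p, "&", NR
        ++n
     }
     p = 0
  }
  END { print n + 0, "(number of occurrences of new File() pattern)" }
  ' "$@"
}

count_new_file test.txt
```

Converting the files beforehand (for example with `dos2unix`, if it is available) would be an alternative to the `sub()` call.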

Upvotes: 2
