Sriharsha Kalluru

Reputation: 1823

grep -f: print matches in the order of the pattern file

I have a requirement to grep patterns from a file, but I need the matches in the order of the patterns.

$ cat patt.grep
name1
name2

$ grep -f patt.grep myfile.log
name2:some xxxxxxxxxx
name1:some xxxxxxxxxx

The output shows name2 first because it is found first in the log, followed by name1. However, I need name1 to be printed first, following the order of the patterns in patt.grep.

The expected output is:

name1:some xxxxxxxxxx
name2:some xxxxxxxxxx

Upvotes: 11

Views: 6831

Answers (7)

spawn

Reputation: 390

Here's a Python script that wraps grep to do it. Features:

  • Matches of a pattern that occurs multiple times are printed consecutively
  • The whole line is printed, even though grep's --only-matching option is used internally
  • Prints a warning if a pattern is not found
  • It's reasonably fast, but does not work for regexes, so it is best used with grep -Fw
#!/usr/bin/env python3

# grep -f in order of pattern file.
# If a pattern occurs multiple times in the input, all matches are printed thereunder.

import argparse
import sys
import subprocess
from collections import defaultdict

def eprint(*args, **kwargs):
    print('kgrep.py', *args, file=sys.stderr, **kwargs)


class FileHelper:
    def __init__(self, filepath):
        self.file = open(filepath, "rb", buffering=1024*1024)
        self.line_nb = 0

    # Loop through our file until the specified line number
    def readline(self, line_nb):
        if self.line_nb == line_nb:
            # already got that one
            return None
        assert line_nb > self.line_nb
        line = None
        while self.line_nb < line_nb:
            line = self.file.readline()
            self.line_nb += 1
        if not line:
            # readline() returns b'' at EOF, never None
            eprint("line_nb", line_nb, "not found")
            exit(1)
        # we use the \n later anyway, so do not line.rstrip()
        return line


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', '-f' , help="", required=True)
    parser.add_argument('--only-matching', '-o', action='store_true', help="")

    args, unknown_args = parser.parse_known_args()
    input_file = None
    for arg in unknown_args:
        if arg.startswith('-'):
            continue
        if input_file is not None:
            eprint('multiple input files not supported:', input_file, arg)
            exit(1)
        input_file = arg

    if input_file is None:
        eprint('missing input file')
        exit(1)
    grep_args = 'grep -f - -o -n'.split(' ')
    grep_args.extend(unknown_args)

    proc = subprocess.Popen(grep_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=sys.stderr, bufsize=1024*1024)

    # First pass all needles to grep (but remember them)
    input_ = sys.stdin.buffer if args.file == '-' else open(args.file, "rb")
    needles = []
    while True:
        line = input_.readline()
        if not line:
            break
        proc.stdin.write(line)
        needles.append(line.rstrip())

    proc.stdin.flush()
    proc.stdin.close() # close stdin to signal end of input

    only_m = args.only_matching
    helper_file = FileHelper(input_file)
    matches_dict = defaultdict(list)
    # Read grep's line-number prefixed output and extract the full line
    while True:
        line = proc.stdout.readline()
        if not line:
            break
        line_nb, grep_match = line.split(b':', 1)
        full_line = grep_match if only_m else helper_file.readline(int(line_nb))
        if full_line is not None:
            matches_dict[grep_match.rstrip()].append(full_line)

    for needle in needles:
        line = matches_dict.get(needle)
        if line is None:
            eprint("warning: needle not found:", needle.decode())
            continue

        # we remember that we already printed a match by setting the first el to None
        if line[0] is None:
            continue
        for m in line:
            sys.stdout.buffer.write(m)
        line[0] = None

    exit(proc.wait())


if __name__ == '__main__':
    try:
        main()
    except (BrokenPipeError, KeyboardInterrupt) as e:
        # avoid additional broken pipe error. s. https://stackoverflow.com/a/26738736
        sys.stderr.close()
        # KeyboardInterrupt has no errno attribute, so fall back to 1
        exit(getattr(e, 'errno', None) or 1)

Upvotes: 0

Isidor Lipsch

Reputation: 51

This should do it

awk -F":" 'NR==FNR{a[$1]=$0;next}{ if ($1 in a) {print a[$1]} else {print $1, $1} }' myfile.log patt.grep > z

Upvotes: 1

mklement0

Reputation: 438073

This can't be done in grep alone.

For a simple and pragmatic, but inefficient solution, see owlman's answer. It invokes grep once for each pattern in patt.grep.

If that's not an option, consider the following approach:

grep -f patt.grep myfile.log |
 awk -F: 'NR==FNR { l[$1]=$0; next } $1 in l {print l[$1]}' - patt.grep
  • Passes all patterns to grep in a single pass,
  • then sorts them based on the order of patterns in patt.grep using awk:
    • first reads all output lines (passed via stdin, -, i.e., through the pipe) into an assoc. array using the 1st :-based field as the key
    • then loops over the lines of patt.grep and prints the corresponding output line, if any.

Constraints:

  • Assumes that all patterns in patt.grep match the 1st :-based token in the log file, as implied by the sample output data in the question.
  • Assumes that each pattern only matches once - if multiple matches are possible, the awk solution would have to be made more sophisticated.

Upvotes: 0

owlman

Reputation: 161

You can pipe patt.grep to xargs, which will pass the patterns to grep one at a time.

By default, xargs appends arguments at the end of the command. But in this case, grep needs myfile.log to be the last argument. So use the -I option to define a placeholder (hello in the command below) that xargs replaces with each argument.

cat patt.grep | xargs -Ihello grep hello myfile.log

Upvotes: 6

devnull

Reputation: 123518

A simple workaround would be to sort the log file before grep:

grep -f patt.grep <(sort -t: myfile.log)

However, this might not yield results in the desired order if patt.grep is not sorted.
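
The caveat can be illustrated with a small Python sketch. It only mimics the combination of grep and sort for literal patterns that match the first `:`-separated field; the helper name `sort_matches` is hypothetical, and Python's default string ordering may differ from the locale-aware ordering of sort.

```python
def sort_matches(patterns, log_lines):
    """Mimic filtering with literal patterns, then sorting alphabetically."""
    matches = [
        line for line in log_lines
        if any(line.startswith(p + ":") for p in patterns)
    ]
    return sorted(matches)


log = ["name2:some xxxxxxxxxx", "name1:some xxxxxxxxxx"]

# Pattern list already sorted: the sorted output happens to follow pattern order.
print(sort_matches(["name1", "name2"], log))

# Pattern list in reverse order: the output is still alphabetical,
# so it no longer follows the order of the pattern list.
print(sort_matches(["name2", "name1"], log))
```

Both calls print the name1 line first, which shows that the result depends on the alphabetical order of the data and not on the order of the patterns.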

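To preserve the order specified in the pattern file, you might use awk instead, reading the log first and then looking up each pattern (this keeps only the last matching line per pattern):

awk -F: 'NR==FNR{a[$1]=$0;next} $1 in a{print a[$1]}' myfile.log patt.grep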

Upvotes: 1

Tajinder

Reputation: 2338

I tried the same scenario and solved it easily with the command below. If your data is in the same format as shown in the question, you can use this:

grep -f patt.grep myfile.log | sort


Upvotes: 1

J. Katzwinkel

Reputation: 1953

Use the regexes in patt.grep one after another in order of appearance by reading line-wise:

while IFS= read -r ptn; do grep -- "$ptn" myfile.log; done < patt.grep
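
The same one-pass-per-pattern idea can be sketched in Python. This is only an illustration under stated assumptions: the function name `grep_each_pattern` is hypothetical, and patterns are interpreted as Python regular expressions, whose syntax differs in places from grep's basic regular expressions.

```python
import re


def grep_each_pattern(patterns, log_lines):
    """Scan the log once per pattern, in pattern order.

    Assumption: `log_lines` is a list (it is iterated once per pattern)
    and each pattern is a valid Python regular expression.
    """
    result = []
    for pattern in patterns:
        regex = re.compile(pattern)
        # Keep every line in which the pattern occurs anywhere.
        result.extend(line for line in log_lines if regex.search(line))
    return result


if __name__ == "__main__":
    patterns = ["name1", "name2"]
    log = ["name2:some xxxxxxxxxx", "name1:some xxxxxxxxxx"]
    for line in grep_each_pattern(patterns, log):
        print(line)
```

As with the shell loop, a line that matches two different patterns appears twice in the result, and the cost grows with the number of patterns because the log is scanned once for each.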

Upvotes: 2
