David Ebbo

Reputation: 43183

How to do a grep-like search on a very long line?

grep is great at finding lines that match a pattern. But what if you have a file with a single extremely long line (say a 100MB file), and you want to find chunks within it that match a pattern?

For each match, you'd want to print the character offset, and the matched string, with extra characters on either side for context.

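In Python, you could write something like this (max() clamps the left edge, because a negative slice start would count from the end of the string; an end past the string length is clamped automatically):

[(m.start(), s[max(m.start()-50, 0):m.end()+50]) for m in re.finditer(regex, s)]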
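For reference, here is a minimal runnable sketch of the same idea as a generator. The function name grep_long_line and the context parameter are made up for illustration, and it assumes the whole line has already been read into a string.

```python
import re


def grep_long_line(s, regex, context=50):
    """Yield (offset, match plus surrounding context) for each regex match in s."""
    for m in re.finditer(regex, s):
        start = max(m.start() - context, 0)  # a negative start would count from the end of s
        end = m.end() + context              # slicing already clamps an end past len(s)
        yield m.start(), s[start:end]


# Hypothetical sample data: one long line with a single match in the middle.
line = "a" * 60 + "needle" + "b" * 60
for offset, snippet in grep_long_line(line, r"needle", context=5):
    print(offset, snippet)  # 60 aaaaaneedlebbbbb
```

Offsets here are 0-based character positions, as returned by m.start().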

But is there some way to do the equivalent using standard linux command line tools?

Upvotes: 2

Views: 315

Answers (2)

The question "How to truncate long matching lines returned by grep or ack" gives a good solution, provided the line fits into memory:

grep -Eo '.{0,15}needle.{0,15}' longlines.txt

If the lines don't necessarily fit into memory, then have a look at bgrep: https://unix.stackexchange.com/questions/223078/best-way-to-grep-a-big-binary-file/758528#758528

bgrep `printf %s needle | od -t x1 -An -v | tr -d '\n '` myfile.bin
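To show what searching without loading the whole line involves, here is a rough Python sketch that reads the file in fixed-size chunks and keeps only a small tail between reads. It is not how bgrep is implemented; the function name search_stream and its parameters are hypothetical, and it handles only a literal byte string, not a regex.

```python
def search_stream(f, needle, context=15, chunk_size=1 << 20):
    """Yield (byte offset, needle plus context) for each literal match in binary file f."""
    if not needle:
        raise ValueError("needle must not be empty")
    buf = b""    # sliding window over the file
    base = 0     # file offset of buf[0]
    pos = 0      # index in buf where the next search starts
    eof = False
    while not eof:
        chunk = f.read(chunk_size)
        eof = not chunk
        buf += chunk
        while True:
            i = buf.find(needle, pos)
            if i < 0:
                # No full match left; a partial match may still start in the tail.
                pos = max(pos, len(buf) - len(needle) + 1)
                break
            if not eof and i + len(needle) + context > len(buf):
                pos = i  # trailing context not read yet; retry after the next read
                break
            yield base + i, buf[max(i - context, 0):i + len(needle) + context]
            pos = i + len(needle)
        # Drop everything except the context needed before the next search position.
        drop = max(pos - context, 0)
        buf = buf[drop:]
        base += drop
        pos -= drop
```

Open the file in binary mode and iterate, for example: for off, ctx in search_stream(open("myfile.bin", "rb"), b"needle"): print(off, ctx). Memory use stays at roughly one chunk plus the needle and context lengths, regardless of how long the line is.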

Upvotes: 2

oguz ismail

Reputation: 50750

For each match, you'd want to print the offset, and the matched string, with extra characters on either side for context.

You can do that with awk like this:

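awk '{
  i = 1
  # Search the rest of the line repeatedly; offsets are 1-based.
  while (match(substr($0, i), /regex/)) {
    off = i + RSTART - 1
    start = off > 50 ? off - 50 : 1
    # Print the offset and the match with up to 50 characters on each side.
    print off, substr($0, start, off - start + RLENGTH + 50)
    i = off + RLENGTH
  }
}' file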

Upvotes: 3
