Mousumi

Reputation: 33

How to find a multiple line string in a file using shell script?

I want to find the string "Time series prediction with ensemble models" in a PDF file using a shell script. I am using

pdftotext "$file" - | grep "$string"

where $file is the PDF file name and $string is the string above. This finds the line when the entire string is on a single line, but not when the string is split across lines like this:

Time series prediction with
ensemble models

How can I resolve this? I am new to Linux, so a detailed explanation is appreciated.
Thanks in advance.

Upvotes: 0

Views: 323

Answers (2)

Ozan

Reputation: 1084

pdftotext may put extra spaces or line breaks between words because of the nature of the PDF format. To catch all possibilities, match one or more whitespace characters (\s\+) between the words. The following works as you want:

pdftotext "$file" - | grep -ozi "Time\s\+series\s\+prediction\s\+with\s\+ensemble\s\+models"

From the grep man page:

-o, --only-matching
          Print only the matched (non-empty) parts  of  a  matching  line,
          with each such part on a separate output line.

-z, --null-data
          Treat  the  input  as  a set of lines, each terminated by a zero
          byte (the ASCII NUL character) instead of a newline.   Like  the
          -Z  or --null option, this option can be used with commands like
          sort -z to process arbitrary file names.

-i, --ignore-case
          Ignore  case  distinctions  in  both  the  PATTERN and the input
          files.  (-i is specified by POSIX.)
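If GNU grep's -z is not available, a different approach is to normalize the whitespace before searching. The following is only a minimal sketch of that idea, not the command above: find_phrase is a hypothetical helper name, and it assumes the phrase itself is written with single spaces between words.

```shell
#!/bin/sh
# find_phrase (hypothetical helper): read text on stdin and succeed (exit 0)
# if the phrase occurs, no matter how the words are separated by spaces,
# tabs, or line breaks in the input.
find_phrase() {
    phrase=$1
    # tr turns every whitespace character (newlines included) into a space
    # and squeezes runs of spaces into one, so the text becomes a single line.
    # grep -F matches the phrase literally, -i ignores case, -q prints nothing.
    tr -s '[:space:]' ' ' | grep -qiF -- "$phrase"
}

# Example with inline text standing in for the pdftotext output:
if printf 'Time series prediction with\nensemble models\n' |
    find_phrase "Time series prediction with ensemble models"
then
    echo "found"
fi
```

With a real PDF the input would come from pdftotext "$file" - instead of printf. Unlike grep -o, this only reports whether the phrase is present; it does not print the matched text.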

Upvotes: 1

anubhava

Reputation: 786091

You can use the -z option available in GNU grep for this:

pdftotext "$file" - | grep -z "Time series prediction with.*ensemble models"

As per man grep:

-z, --null-data
     Treat  the  input  as  a  set  of  lines,  each terminated by a zero byte (the ASCII
     NUL character) instead of a newline. Like the -Z or --null option, this option can be
     used with commands like sort -z to process  arbitrary file names.

Upvotes: 0
