Reputation: 33
I want to find the string "Time series prediction with ensemble models" in a PDF file using a shell script. I am using
pdftotext "$file" - | grep "$string"
where $file is the PDF file name and $string is the string above. It finds the line when the entire string sits on one line, but it cannot find the string when it is split across lines like this:
Time series prediction with
ensemble models
How can I resolve this? I am new to Linux, so a detailed explanation would be appreciated.
Thanks in advance.
Upvotes: 0
Views: 323
Reputation: 1084
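pdftotext may put extra spaces or line breaks between words because of the nature of the PDF format. To catch all possibilities, allow any run of whitespace between the words. The following does what you want (the - after the file name makes pdftotext write to standard output instead of a .txt file):
pdftotext "$file" - | grep -ozi "Time\s\+series\s\+prediction\s\+with\s\+ensemble\s\+models"
From the grep man page: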
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input
files. (-i is specified by POSIX.)
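To see the effect without a PDF at hand, here is a minimal sketch that runs the same pattern on sample text. The sample text and the variable names are made up for illustration, and GNU grep is assumed (the -z option and \s are GNU extensions).

```shell
#!/bin/sh
# Hypothetical sample standing in for the output of: pdftotext "$file" -
sample=$(printf 'Intro line\nTime series prediction with\nensemble models\nLast line\n')

# Same pattern as in the answer: \s\+ matches one or more whitespace characters.
pattern='Time\s\+series\s\+prediction\s\+with\s\+ensemble\s\+models'

# -z makes grep treat the whole input as one record, so \s\+ can span the line break.
# -o prints only the match, terminated by a NUL byte; tr turns that NUL into a newline.
match=$(printf '%s\n' "$sample" | grep -ozi "$pattern" | tr '\0' '\n')

printf '%s\n' "$match"
```

The match is printed with its original line break, so the phrase appears on two lines here.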
Upvotes: 1
Reputation: 786091
You can use the -z option available in GNU grep for this:
pdftotext "$file" - | grep -z "Time series prediction with.*ensemble models"
Without -o, grep prints the whole matching record, which with -z is normally the entire text.
As per man grep:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero byte (the ASCII
NUL character) instead of a newline. Like the -Z or --null option, this option can be
used with commands like sort -z to process arbitrary file names.
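If you only need to know whether the phrase occurs at all, a different approach (not the -z technique above) is to collapse all whitespace to single spaces before searching. This is a minimal sketch; contains_phrase is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# contains_phrase PHRASE
# Hypothetical helper: reads text on stdin, succeeds if PHRASE occurs in it
# once every run of whitespace (spaces, tabs, newlines) is squeezed to one space.
contains_phrase() {
    # tr -s replaces each whitespace character with a space and squeezes repeats.
    # grep -q: no output, exit status only; -i: ignore case; -F: fixed string, no regex.
    tr -s '[:space:]' ' ' | grep -qiF -- "$1"
}

# Example on sample text (stands in for: pdftotext "$file" - | contains_phrase "$string")
if printf 'Time series prediction with\nensemble   models\n' |
    contains_phrase "Time series prediction with ensemble models"; then
    echo "found"
else
    echo "not found"
fi
```

Because the phrase is matched as a fixed string, characters such as . or * in $string need no escaping.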
Upvotes: 0