user3702916

Reputation: 523

grep full sentences containing a word into a document

Given a word, I would like to extract every full sentence (from one "." to the next ".") that contains it and write the result to a document. For example, given this text:

Dijkstra's original algorithm does not use a min-priority queue. For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex. It can also be used for finding costs of shortest paths from a single vertex to a single destination vertex by stopping the algorithm once the shortest path to the destination vertex has been determined.

I would like to get the entire sentence that contains "graph":

For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.

It would also be useful to include the first sentence in the results if it contains "graph", since there is no dot before it.

Upvotes: 3

Views: 3306

Answers (4)

Sylvain Leroux

Reputation: 52040

A crude heuristic:

cat text |
    tr '\n' ' ' |
    sed 's|[[:alpha:]]\{3\}\.[[:blank:]]*|&\'$'\n''|g' |
    grep -Fi 'graph'
  • First, tr replaces every end-of-line in the input file with a space (I don't know whether your input needs this).
  • Then, sed puts each sentence on its own line, assuming that a dot preceded by three letters marks the end of a sentence. Depending on your input file, you might need to adjust this part to lower the false-positive rate.
  • Finally, a simple grep keeps only the sentences containing the required word (case-insensitive).

Given your input file, this will produce the following result:

For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.
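
If the three-letter threshold needs tuning for other inputs, the same pipeline could be wrapped in a small function. This is only a sketch: the name sentences_with and its optional second argument are hypothetical and not part of the commands above.

```shell
#!/bin/sh
# sentences_with WORD [MINLETTERS]  (hypothetical helper, for illustration only)
# Reads text on stdin and prints every sentence containing WORD, ignoring case.
# MINLETTERS is the number of letters that must precede a dot for the dot
# to count as the end of a sentence (default 3, as in the pipeline above).
sentences_with() {
    word=$1
    min=${2:-3}
    # The replacement ends with backslash-newline, the portable way to insert
    # a newline with sed; "|g" must therefore start at column 0.
    tr '\n' ' ' |
        sed "s|[[:alpha:]]\{$min\}\.[[:blank:]]*|&\\
|g" |
        grep -Fi -- "$word"
}
```

Usage would be something like sentences_with graph < text. Passing 2 as the second argument also splits after two-letter words, at the cost of more false positives. Because the newline is added after each sentence-ending dot, the first sentence of the text is found too, even though no dot precedes it.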


This answer has been made POSIX-compliant with the kind help of mklement0
(see comments below)

Upvotes: 1

Tom Fenech

Reputation: 74685

Assuming the text file dijk doesn't actually contain any newlines, you could do this in perl:

perl -MLingua::EN::Sentence=get_sentences -ne '
print "$_\n" for grep { /graph/ } @{get_sentences($_)}' dijk

The Lingua::EN::Sentence module is smart enough to deal with well-known abbreviations and you can add your own if necessary.

Output:

For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.

If the newlines do actually exist in the input, it should be possible to adapt the script without too much difficulty.


edit

If there are newlines in the input, you could do this instead:

perl -MLingua::EN::Sentence=get_sentences -0777 -e '
$t = <>;         # slurp the whole file
$t =~ tr{\n}{ }; # convert newlines to spaces
print "$_\n" for grep { /graph/ } @{get_sentences($t)}' dijk

Of course, by now this is looking a lot more like a full-blown perl script rather than a one-liner!

Alternatively, as mentioned by @mklement0, you could use the external tool tr to perform the translation and pass the result to the original script (the <(...) process substitution needs a shell such as bash, ksh or zsh):

perl -MLingua::EN::Sentence=get_sentences -ne '
print "$_\n" for grep { /graph/ } @{get_sentences($_)}' <(tr '\n' ' ' < dijk)

Upvotes: 4

Noufal Ibrahim

Reputation: 72805

Here's one way to do it.

tr '\n' ' ' < input.txt | tr '.' '\n' | grep graph > output.txt

It converts all newlines into spaces (so that the whole text is on a single line). It then converts every . into a newline so that you have one sentence per line. Finally, it greps for the relevant string and puts the matched sentences into the output file.

When run on your paragraph, it mostly works, but the dots in i.e. are treated as sentence ends, so the matching sentence is cut off after "(i". That can be rectified by first changing a few fixed strings such as i.e. and e.g. into ie and eg.
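
The abbreviation workaround could be sketched like this, assuming that only i.e. and e.g. need protecting. The function name graph_sentences is made up for the example.

```shell
#!/bin/sh
# graph_sentences (hypothetical name): reads text on stdin and prints the
# sentences that contain "graph".
# Step 1: join all lines into one.
# Step 2: rewrite "i.e." and "e.g." as "ie" and "eg" so their dots do not
#         count as sentence ends (only these two abbreviations are handled).
# Step 3: turn every remaining dot into a newline, one sentence per line.
# Step 4: keep the lines that contain "graph".
graph_sentences() {
    tr '\n' ' ' |
        sed -e 's/i\.e\./ie/g' -e 's/e\.g\./eg/g' |
        tr '.' '\n' |
        grep graph
}
```

It would be called as graph_sentences < input.txt > output.txt. Note that the printed sentences lose their final dot and show ie instead of i.e., since the pipeline does not restore either.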

Upvotes: 0

Matthew Jacobs

Reputation: 167

grep -oP "\.([^.\r\n]+\.)" inputfile > outputfile

If there are no line breaks in the original file then it's a little simpler:

grep -oE "\.([^.]+\.)" inputfile > outputfile
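
The patterns above match any span between two dots, whichever words it contains, and they need a dot in front of the sentence. A variant that anchors on the search word and also accepts the first sentence might look like the sketch below. The name word_sentences is hypothetical, and a grep that supports -o is assumed (GNU and BSD grep do).

```shell
#!/bin/sh
# word_sentences WORD  (hypothetical helper, for illustration only)
# Reads text on stdin and prints each dot-terminated span that contains WORD.
# Assumes WORD has no regex metacharacters and that grep supports -o.
word_sentences() {
    # Join the lines, then match: non-dot characters, WORD, non-dot
    # characters, and the closing dot. No leading dot is required, so a
    # match can start at the very beginning of the text.
    tr '\n' ' ' | grep -oE "[^.]*$1[^.]*\."
}
```

For example, word_sentences graph < inputfile > outputfile. Like any approach that splits on dots alone, it stops at the first dot of an abbreviation such as i.e., so the sample sentence would be cut off there.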

Upvotes: 0
