Reputation: 523
Given a word, I would like to extract every full sentence (from one "." to the next ".") that contains it into a document. For example, given this text:
Dijkstra's original algorithm does not use a min-priority queue. For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex. It can also be used for finding costs of shortest paths from a single vertex to a single destination vertex by stopping the algorithm once the shortest path to the destination vertex has been determined.
I would like to have the entire sentence that contains "graph":
For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.
Also, it would be useful to include the first sentence in the results if it contains "graph", even though there is no dot before it.
Upvotes: 3
Views: 3306
Reputation: 52040
A crude heuristic:
cat text |
tr '\n' ' ' |
sed 's|[[:alpha:]]\{3\}\.[[:blank:]]*|&\'$'\n''|g' |
grep -Fi 'graph'
tr: removes all end-of-lines in the input file (I don't know if this is required for you).
sed: puts each sentence on its own line, assuming a dot preceded by three letters denotes the end of a sentence. Depending on your input file, you might need to adjust this part to lower the "false positive" rate.
grep: keeps only the sentences containing the required word (case-insensitive).
Given your input file, this will produce the following result:
For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.
This answer has been made POSIX-compliant with the kind help of mklement0 (see comments below).
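As a rough sketch of the same heuristic, here is a variant that writes the newline literally inside the sed replacement instead of using the $'\n' quoting. The function name split_sentences and the sample text are made up for illustration. It also shows that the first sentence is kept even though no dot precedes it, and that "i.e." does not trigger a split because its dots are not preceded by three letters.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above): join all lines, then break
# the text after every dot that directly follows three letters.
split_sentences() {
    tr '\n' ' ' |
        sed 's/\([[:alpha:]]\{3\}\.\)[[:blank:]]*/\1\
/g'
}

# Made-up sample text: the first sentence contains the word, capitalised.
printf '%s\n' 'Graph search is old. A path (i.e. a walk) has a cost.' \
              'Every graph has vertices.' |
    split_sentences |
    grep -Fi 'graph'
# prints:
#   Graph search is old.
#   Every graph has vertices.
```

The closing /g' has to start in the first column, otherwise the indentation would become part of the replacement text.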
Upvotes: 1
Reputation: 74685
Assuming the text file dijk doesn't actually contain any newlines, you could do this in perl:
perl -MLingua::EN::Sentence=get_sentences -ne '
print "$_\n" for grep { /graph/ } @{get_sentences($_)}' dijk
The Lingua::EN::Sentence module is smart enough to deal with well-known abbreviations and you can add your own if necessary.
Output:
For a given source vertex (node) in the graph, the algorithm finds the path with lowest cost (i.e. the shortest path) between that vertex and every other vertex.
If the newlines do actually exist in the input, it should be possible to adapt the script without too much difficulty.
If there are newlines in the input, you could do this instead:
perl -MLingua::EN::Sentence=get_sentences -0777 -e '
$t = <>; # slurp the whole file
$t =~ tr{\n}{ }; # convert newlines to spaces
print "$_\n" for grep { /graph/ } @{get_sentences($t)}' dijk
Of course, by now this is looking a lot more like a full-blown perl script rather than a one-liner!
Alternatively, as mentioned by @mklement0, you could use the external tool tr to perform the translation and pass the result to the original script:
perl -MLingua::EN::Sentence=get_sentences -ne '
print "$_\n" for grep { /graph/ } @{get_sentences($_)}' <(tr '\n' ' ' < dijk)
Upvotes: 4
Reputation: 72805
Here's one way to do it.
tr '\n' ' ' < input.txt | tr '.' '\n' | grep graph > output.txt
It converts all newlines into spaces (so that the whole text is on a single line). It then converts every "." into a newline so that you get one sentence per line. It then greps for the relevant string and puts the matched sentences into the output file.
When run on your paragraph, it sort of works, but the "." in "i.e." confuses it. That can be rectified by changing a few fixed strings like "i.e." and "e.g." into "ie" and "eg" for the process.
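A minimal sketch of that workaround, assuming only "i.e." and "e.g." need protecting (any other abbreviation will still split a sentence). The function name sentences_with is hypothetical. Note that the output has no trailing full stop and shows "ie" / "eg" in place of the original abbreviations.

```shell
#!/bin/sh
# Hypothetical helper: print each sentence of FILE that contains WORD.
# Usage: sentences_with WORD FILE
sentences_with() {
    word=$1
    file=$2
    tr '\n' ' ' < "$file" |                         # whole text on one line
        sed -e 's/i\.e\./ie/g' -e 's/e\.g\./eg/g' |  # hide the abbreviation dots
        tr '.' '\n' |                               # one sentence per line
        sed 's/^ *//' |                             # drop the leading space
        grep -F -- "$word"                          # keep sentences with the word
}
```

It could then be called as: sentences_with graph input.txt > output.txt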
Upvotes: 0
Reputation: 167
grep -oP "\.([^.\r\n]+\.)" inputfile > outputfile
If there are no line breaks in the original file, then it's a little simpler:
grep -oE "\.([^.]+\.)" inputfile > outputfile
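A hedged variant of the same idea that also filters on the word: match a run of non-dot characters that contains the word, up to and including the next dot. No leading dot is required, so the first sentence can match too. The function name grep_sentences is made up, the word is interpolated into the pattern as a regular expression, and abbreviations such as "i.e." will still cut a sentence short.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print the sentences containing
# the word given as $1 (case-insensitive). Relies on grep -o, which GNU and
# BSD grep provide but POSIX does not require.
grep_sentences() {
    grep -oiE "[^.]*$1[^.]*\\." |
        sed 's/^[[:blank:]]*//'   # strip the space left over after the previous dot
}

# Made-up sample text for illustration.
printf '%s\n' 'The graph is small. It has two nodes. Each graph edge has a cost.' |
    grep_sentences graph
# prints:
#   The graph is small.
#   Each graph edge has a cost.
```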
Upvotes: 0