Reputation: 43
My AutoIt script parses text by sentences. Because they most likely end in a period, question mark or exclamation point, I used this to split text by sentence:
$LineArray = StringSplit($displayed_file, "!?.", 2)
The problem: it deletes the delimiters (the periods, question marks, and exclamation points at the end of sentences). For example, the string "One. Two. Three." is split into "One", "Two", and "Three".
How can I split into sentences while retaining the periods, question marks, and exclamation points that end these sentences?
Upvotes: 0
Views: 305
Reputation: 56180
With StringSplit(), the delimiters are consumed in the process (and so are lost from the result). Use StringRegExp() instead:
#include <array.au3>
$string="This is a text. It has several sentences. Really? Of Course!"
$a = stringregexp($string,"(?U)(.*[.?!])",3)
_ArrayDisplay($a)
To remove leading space(s), change the pattern to "(?U)[ ]*?(.*[.?!])". Or change it to "(?U) *?(.*[.?!] )" to split at [.!?] plus <space> (this requires adding a space to the last sentence):
#include <array.au3>
$string = "Do you know Pi? Yes! What's it? It's 3.14159! That's correct."
$a = StringRegExp($string & " ", "(?U)[ ]*?(.*[.?!] )", 3)
_ArrayDisplay($a)
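If you want to reuse this, the pattern can be wrapped in a small helper. The following is only a minimal sketch: the function name _SplitSentences is hypothetical (it is not part of any standard UDF), and it assumes sentences are separated by a single terminator followed by a space, as in the example above.

```autoit
#include <Array.au3>

; Hypothetical helper, not part of any UDF.
; Returns a 0-based array of sentences, or sets @error to 1 if nothing matched.
Func _SplitSentences($sText)
    ; Append a space so the last sentence also matches "[.?!] ".
    Local $aSentences = StringRegExp($sText & " ", "(?U)[ ]*?(.*[.?!] )", 3)
    If @error Then Return SetError(1, 0, 0)
    ; Each match still carries the trailing space; strip it (flag 2 = trailing whitespace).
    For $i = 0 To UBound($aSentences) - 1
        $aSentences[$i] = StringStripWS($aSentences[$i], 2)
    Next
    Return $aSentences
EndFunc

Local $aResult = _SplitSentences("Do you know Pi? Yes! It's 3.14159! That's correct.")
If @error Then
    ConsoleWrite("No sentences found" & @CRLF)
Else
    _ArrayDisplay($aResult, "Sentences")
EndIf
```

Checking @error matters here because StringRegExp() with flag 3 does not return an array when there is no match, so indexing the result blindly would fail.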
To preserve @CRLF (\r\n) inside sentences:
#include <array.au3>
$string = "Do you " & @CRLF & "know Pi? Yes! What's it? It's" & @CRLF & "3.14159! That's correct."
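; Inside a character class, \R is not a newline escape, so use an alternation (space or line break) instead.
$a = StringRegExp($string & " ", "(?s)(?U)[ ]*?(.*[.?!](?: |\R))", 3)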
_ArrayDisplay($a,"Sentences") ;_ArrayDisplay doesn't show @CRLF
For $i In $a
;MsgBox(0,"",$i)
ConsoleWrite(StringStripWS($i, 3) & @CRLF & "---------" & @CRLF)
Next
This does not keep @CRLF when the end of a line is also the end of a sentence (for example ...line end!" & @CRLF & "Next line...).
Upvotes: 0
Reputation: 2151
Try this:
#include<Array.au3>
Global $str = "One. Two. Three. This is a test! Does it work? Yes, man! "
$re = StringRegExp($str, '(.*?[.!?])', 3)
_ArrayDisplay($re)
This pattern omits the space at the beginning of each sentence:
#include<Array.au3>
Global $str = "One. Two. Three.This is a test! Does it work? Yes, man! "
$re = StringRegExp($str, '(\S.*?[.!?])', 3)
_ArrayDisplay($re)
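Note that a pattern that stops at the first [.!?] also splits inside decimal numbers such as 3.14159. As a hedged alternative (a lookahead variant, not taken from the answers above), you could only accept a terminator that is followed by whitespace or the end of the text. This sketch assumes the text is a single line; add (?s) if sentences may span line breaks.

```autoit
#include <Array.au3>

Local $sText = "Do you know Pi? Yes! It's 3.14159! That's correct."

; (\S.*?[.!?]) captures from the first non-space character up to a terminator.
; (?=\s|$) only accepts a terminator followed by whitespace or the end of the text,
; so the period inside 3.14159 does not end a sentence.
Local $aSentences = StringRegExp($sText, '(\S.*?[.!?])(?=\s|$)', 3)

If @error Then
    ConsoleWrite("No sentences found" & @CRLF)
Else
    _ArrayDisplay($aSentences, "Sentences")
EndIf
```

The trade-off is that sentences written without a space after the terminator (as in Three.This) are no longer separated, and no space has to be appended to the input.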
Upvotes: 0