Reputation: 15
I'm trying to find the longest sentence in a text and print its length in characters, including spaces and punctuation. The problem is that when I encounter characters like 'š' or 'á', each one is counted twice. I tried subtracting one in those cases, but that doesn't work either, because the subtraction also happens twice. Any idea how I could fix that? Here is the code for the counter.
for i:=1 to length(text) do
  case text[i] of
    '.','!','?': begin
      if len>p2 then p2:=len;
      len:=0
    end;
    else inc(len);
  end;
p2 holds the length of the longest sentence found so far, and len holds the length of the current sentence.
Upvotes: 1
Views: 220
Reputation: 125689
This works for me with ANSI characters, including those with diacritics. As you've not mentioned any specific character set, and your question is simply tagged as pascal, it should work for you as well. If you're dealing with other character sets, then you need to indicate which specific Pascal compiler you're using, as support for multi-byte characters differs between various Pascal dialects.
function LongestSentenceCharCount(const Text: string): Integer;
var
  Len: Integer;
  LongLen: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  LongLen := 0;
  for i := 1 to Len do
  begin
    if Text[i] in ['.', '!', '?'] then
    begin
      if CurrLen > LongLen then
        LongLen := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen);
  end;
  Result := LongLen;
end;
To deal with multi-byte character sets such as UTF-8 and Unicode:
Based on some code that Seppy Bloom (at the time Team Leader for RTL/VCL at Embarcadero) donated to Cary Jensen for his white paper (PDF) Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, you can use some of the normalization functionality available in Windows since Vista. I've adapted my function above to use that code from Seppy (included below), along with a sample app to demonstrate using it. The code was developed, compiled and tested in Delphi 10.1 Berlin, so if you're using a different compiler you'll have to adjust it, and clearly it won't work if you're not running under Windows Vista or later.
program Project1;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, WinAPI.Windows;

const
  NormalizationOther = 0;
  NormalizationC = 1;
  NormalizationD = 2;
  NormalizationKC = 5;
  NormalizationKD = 6;

function IsNormalizedString(NormForm: Integer; lpString: LPCWSTR;
  cwLength: Integer): BOOL; stdcall; external 'Normaliz.dll';

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR;
  cwSrcLength: Integer; lpDstString: LPWSTR;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function NormalizedStringLength(const Str: string): Integer;
var
  Buf: string;
begin
  if not IsNormalizedString(NormalizationC, PChar(Str), -1) then
  begin
    SetLength(Buf, NormalizeString(NormalizationC, PChar(Str),
      Length(Str), nil, 0));
    Result := NormalizeString(NormalizationC, PChar(Str),
      Length(Str), PChar(Buf), Length(Buf));
  end
  else
    Result := Length(Str);
end;

function LongestSentenceLen(const Text: string): Integer;
var
  Len: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  Result := 0;
  for i := 1 to Len do
  begin
    // Replaces 'if Text[i] in ['.', '!', '?']', which will work
    // but generates a compiler warning.
    if CharInSet(Text[i], ['.', '!', '?']) then
    begin
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen, NormalizedStringLength(Text[i]));
  end;
end;

var
  Test: string;
begin
  Test := 'Ahoj, jak se máš? Hello World.';
  WriteLn(Test);
  WriteLn(Format('Longest: %d', [LongestSentenceLen(Test)]));
  ReadLn;
end.
The output of the above is:
Ahoj, jak se más? Hello World.
Longest: 16
Upvotes: 2
Reputation: 263257
You haven't said how the input text is represented, but the symptoms you're seeing are consistent with UTF-8 input.
ASCII is a 7-bit character set that does not include any accented letters. Your variable text is presumably an array of characters. For a string like 'Ahoj, jak se mas?', each character occupies one slot in the array. For a string like 'Ahoj, jak se máš?', the 'á' and 'š' characters are outside the ASCII range, and each is represented as 2 bytes and therefore 2 slots in the array.
The Wikipedia article on UTF-8 explains how the UTF-8 encoding works.
I suggest temporarily adding something like:
    writeln('text[', i, '] = ''', text[i], ''' = ', ord(text[i]));
after the begin of your for loop so you can see the value of each character.
That explains the problem you're seeing, but not how to solve it. That depends on what kind of support your Pascal implementation has for non-ASCII text. As far as I know, the Pascal language itself has no such support, but some particular implementations might.
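If the text really is UTF-8 held in a one-byte-per-element string, one way to get by without any library support is to count code points instead of bytes. In UTF-8, every byte after the first byte of a multi-byte character is a continuation byte with the bit pattern 10xxxxxx, so skipping those bytes counts each character once. The following is only a minimal sketch under those assumptions, written for Free Pascal; the name LongestSentenceUtf8 is made up for illustration. It counts code points, so a letter stored as a base character plus a separate combining mark would still count as two.

```pascal
program LongestUtf8;
{$mode objfpc}{$H+}

{ Hypothetical helper: length of the longest sentence, counted in
  UTF-8 code points rather than bytes. Assumes Text holds UTF-8. }
function LongestSentenceUtf8(const Text: string): Integer;
var
  i, CurrLen: Integer;
begin
  Result := 0;
  CurrLen := 0;
  for i := 1 to Length(Text) do
  begin
    if Text[i] in ['.', '!', '?'] then
    begin
      { End of a sentence: keep the longest length seen so far. }
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    { Continuation bytes look like 10xxxxxx; count only the others. }
    else if (Ord(Text[i]) and $C0) <> $80 then
      Inc(CurrLen);
  end;
end;

begin
  { 'Ahoj, jak se máš? Hello World.' with á (C3 A1) and š (C5 A1)
    written as raw bytes, so the result does not depend on the
    encoding of the source file. }
  WriteLn(LongestSentenceUtf8('Ahoj, jak se m'#$C3#$A1#$C5#$A1'? Hello World.'));
end.
```

This prints 16: the two accented letters take two bytes each but are counted once. Like the loop in the question, it ignores a trailing sentence that has no terminating '.', '!' or '?'.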
Upvotes: 1
Reputation: 15
Lately I was only working on this inside the online compiler I mentioned. Everywhere else I tried (Free Pascal and Turbo Pascal) it works just fine.
Thank you for the help. I didn't think different compilers would make a difference.
Upvotes: 0