Reputation: 15

Some characters count twice

Right now I'm trying to find the longest sentence in a text and print out its number of characters, including spaces and the like. The problem is that when I encounter characters like 'š' or 'á', it counts them twice. I tried subtracting one in those cases, but that doesn't work either, because the subtraction happens twice too. Any idea how I could fix that? Here is the code for the counter.

for i:=1 to length(text) do
  case text[i] of
    '.','!','?': begin
                   if len>p2 then p2:=len;
                   len:=0
                 end;
    else inc(len);
  end;

p2 holds the length of the longest sentence found so far and len is the length of the current sentence.

Upvotes: 1

Answers (3)

Ken White

Reputation: 125689

This works for me with ANSI characters, including those with diacritics. As you've not mentioned any specific character set, and your question is simply tagged as pascal, it should work for you as well. If you're dealing with other character sets, then you need to indicate which specific Pascal compiler you're using, as support for multi-byte characters differs between various Pascal dialects.

function LongestSentenceCharCount(const Text: string): Integer;
var
  Len: Integer;
  LongLen: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  LongLen := 0;
  for I := 1 to Len do
  begin
    if Text[i] in ['.', '!', '?'] then
    begin
      if CurrLen > LongLen then
        LongLen := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen);

  end;
  Result := LongLen;
end;

To deal with multi-byte Unicode encodings such as UTF-8 or UTF-16:

Based on some code that Seppy Bloom (at the time Team Leader for RTL/VCL at Embarcadero) donated to Cary Jensen for his white paper (PDF) Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines, you can use some of the normalization functionality available in Windows Vista and later. I've adapted my function above to use that code from Seppy (included below), along with a sample app to demonstrate using it. The code was developed, compiled and tested in Delphi 10.1 Berlin, so if you're using a different compiler you'll have to adjust it, and clearly it won't work if you're not running under Windows Vista or higher.

program Project1;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, WinAPI.Windows;

const
  NormalizationOther = 0;
  NormalizationC     = 1;
  NormalizationD     = 2;
  NormalizationKC    = 5;
  NormalizationKD    = 6;

function IsNormalizedString(NormForm: Integer; lpString: LPCWSTR;
  cwLength: Integer): BOOL; stdcall; external 'Normaliz.dll';

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR;
  cwSrcLength: Integer; lpDstString: LPWSTR;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function NormalizedStringLength(const Str: string): Integer;
var
  Buf: string;
begin
  if not IsNormalizedString(NormalizationC, PChar(Str), -1) then
  begin
    SetLength(Buf, NormalizeString(NormalizationC, PChar(Str),
                                   Length(Str), nil, 0));
    Result := NormalizeString(NormalizationC, PChar(Str),
                                   Length(Str), PChar(Buf), Length(Buf));
  end
  else
    Result := Length(Str);
end;

function LongestSentenceLen(const Text: string): Integer;
var
  Len: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  Result := 0;
  for i := 1 to Len do
  begin
    // Replaces 'if Text[i] in ['.', '!', '?']', which will work
    // but generates a compiler warning.
    if CharInSet(Text[i], ['.', '!', '?']) then 
    begin
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen, NormalizedStringLength(Text[i]));
  end;
end;

var
  Test: string;

begin
  Test := 'Ahoj, jak se máš? Hello World.';
  WriteLn(Test);
  WriteLn(Format('Longest: %d', [LongestSentenceLen(Test)]));
  ReadLn;
end.

The output of the above is

Ahoj, jak se más? Hello World.
Longest: 16

Upvotes: 2

Keith Thompson

Reputation: 263257

You haven't said how the input text is represented, but the symptoms you're seeing are consistent with UTF-8 input.

ASCII is a 7-bit character set that does not include any accented letters. Your variable text is presumably an array of characters. For a string like Ahoj, jak se mas?, each character occupies one slot in the array. For a string like Ahoj, jak se máš?, the 'á' and 'š' characters are outside the ASCII range, and each is represented as 2 bytes and therefore 2 slots in the array.

The Wikipedia article on UTF-8 explains how the UTF-8 encoding works.

I suggest temporarily adding something like:

writeln('text[', i, '] = ''', text[i], ''' = ', ord(text[i]));

as the first statement in the body of your for loop (you will need to wrap the body in begin ... end) so you can see the value of each character.

That explains the problem you're seeing, but not how to solve it. That depends on what kind of support your Pascal implementation has for non-ASCII text. As far as I know, the Pascal language itself has no such support, but some particular implementations might.
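If the input really is UTF-8, one approach that needs no library support is to skip continuation bytes: in UTF-8 every byte after the first byte of a multi-byte character lies in the range $80..$BF, so ignoring those bytes counts each character once. The following is only a minimal sketch, assuming Free Pascal in objfpc mode and valid UTF-8 input. The function name LongestSentenceUtf8 and the sample constant are made up for illustration, and the accented letters are written as explicit byte values so that the result does not depend on the encoding of the source file.

```pascal
program CountCodePoints;

{$mode objfpc}{$H+}

{ Length of the longest sentence in Text, counted in UTF-8 code points
  rather than bytes. Assumption: Text holds valid UTF-8. }
function LongestSentenceUtf8(const Text: string): Integer;
var
  i, CurrLen: Integer;
begin
  Result := 0;
  CurrLen := 0;
  for i := 1 to Length(Text) do
    case Text[i] of
      '.', '!', '?':
        begin
          if CurrLen > Result then
            Result := CurrLen;
          CurrLen := 0;
        end;
      #$80..#$BF:
        ; { continuation byte: already counted with its lead byte }
    else
      Inc(CurrLen);
    end;
end;

const
  { 'Ahoj, jak se máš? Hello World.' with á and š given as UTF-8 bytes }
  Sample = 'Ahoj, jak se m'#$C3#$A1#$C5#$A1'? Hello World.';

begin
  WriteLn('Bytes:   ', Length(Sample));              { 32 }
  WriteLn('Longest: ', LongestSentenceUtf8(Sample)); { 16 }
end.
```

This counts code points, not what a reader perceives as characters: a letter followed by a separate combining accent would still count as two. For text like the example in the question, where each accented letter is a single code point, the two agree.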

Upvotes: 1

Piskot

Reputation: 15

Lately I was only working on this inside the online compiler I mentioned. Everywhere else I tried (Free Pascal and Turbo Pascal) it works just fine.

Thank you for the help. I didn't think different compilers would make a difference.

Upvotes: 0
