Oliver Cox

Reputation: 293

How do I correctly capitalize words that contain non-alphanumeric characters in Common Lisp?

Thanks in advance for the help!

Here's the situation:

  1. I want to capitalize the first word in a sentence. I have the individual words in a sentence, stored as strings. In the case of the first word, I capitalize it.
  2. I have tried to use the following seemingly standard approaches:
(string-capitalize str)
(format nil "~@(~A~)" str)
  3. Both of them, however, seem to "reset" after a special character, so if my sentence starts with the word "I'm" it is rendered as "I'M".
  4. I can of course write my own function to take the first character and convert it to the capital version, but I feel like I am missing something about the above standard approaches or the format function: the creators can't have overlooked such a common feature of text. And on the whole it feels better to use what is already in the standard library.

Thoughts?

Upvotes: 1

Views: 69

Answers (2)

ignis volens

Reputation: 9282

As others have pointed out, there's no general solution to this, in any language, which does not involve some hairy library. CL is no exception.

But the format trick does, in fact, work well enough in simple cases. Although the spec is not completely clear on this, I am pretty sure that the string capitalisation options (variations on ~( ... ~)) use the same definition of 'word' that string-capitalize does:

For the purposes of string-capitalize, a 'word' is defined to be a consecutive subsequence consisting of alphanumeric characters, delimited at each end either by a non-alphanumeric character or by an end of the string.

(From string-capitalize)

This means that, for instance, (format nil "~@(~A~)" "i'm") will treat the string "i'm" as two words and capitalize only the first (the @ variant lowercases everything after the first word), resulting in "I'm". And indeed it does:

> (format nil "~@(~A~)" "i'm")
"I'm"

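To make the difference between the variants concrete, here is a small sketch using only standard functions. The sample strings are my own, and the results in the comments are what the HyperSpec definitions imply: the : modifier capitalizes every word (as string-capitalize does), while @ capitalizes only the first word and lowercases everything after it.

```lisp
;; Standard CL only; the input strings are illustrative.
(string-capitalize "i'm here")      ; => "I'M Here"  (every 'word' capitalized)
(format nil "~:(~A~)" "i'm here")   ; => "I'M Here"  (same rule as string-capitalize)
(format nil "~@(~A~)" "i'm HERE")   ; => "I'm here"  (first word only, rest lowercased)
```

So if "I'M" shows up, it is worth checking whether ~:( or string-capitalize was used rather than ~@(.
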
Assuming your implementation's Unicode support is competent, this will work for non-ASCII characters:

> (let ((sentence '("štar" "means" "four" "in" "some" "Romani" "dialects")))
    (format nil "~@(~A~)~{ ~A~}" (first sentence) (rest sentence)))
"Štar means four in some Romani dialects"

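If lowercasing the rest of the string is not wanted, the hand-rolled route mentioned in the question can still lean on the standard library, since string-upcase accepts :start and :end. This is only a minimal sketch: capitalize-first is a hypothetical name of mine, not a standard function, and it assumes "first character" is the right unit, which will not hold for every script.

```lisp
;; capitalize-first is a hypothetical helper, not part of the standard.
(defun capitalize-first (string)
  "Return STRING with only its first character upcased; the rest is left as is."
  (if (zerop (length string))
      string                          ; nothing to upcase; also avoids :end 1 being out of bounds
      (string-upcase string :end 1))) ; upcase just the region [0, 1)

(capitalize-first "i'm")              ; => "I'm"
(capitalize-first "NASA launched")    ; => "NASA launched"
```

By contrast, (format nil "~@(~A~)" "NASA launched") would give "Nasa launched", because the @ variant lowercases everything after the first word.
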
Upvotes: 3

Shawn

Reputation: 52579

It's not standard, and locks you into the one implementation, but SBCL's sb-unicode package has a titlecase function that capitalizes each word in its argument, using Unicode rules to figure out the word and character breaks instead of string-capitalize's rules about what words are.

CL-USER> (use-package :sb-unicode)
T
CL-USER> (sb-unicode:titlecase "I'M")
"I'm"

You can also use the sb-unicode:words function to break a sentence up into component words more robustly than just doing things like splitting on whitespace:

(ql:quickload :str :silent t) ; for str:join
(use-package :sb-unicode)

(defun capitalize-sentence (string)
  "Capitalize the first word of `string` and lowercase the rest."
  (let ((words (sb-unicode:words string)))
    (if words
      (str:join "" (cons (sb-unicode:titlecase (car words))
                         (mapcar #'sb-unicode:lowercase (cdr words))))
      string)))
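
If loading str only for the join feels heavy, the pieces can also be glued back together with format. The following is a sketch under the same assumptions as the function above (SBCL only, and sb-unicode:words returning the separators as list elements too, which is why nothing is inserted between the pieces); capitalize-sentence-no-deps is a hypothetical name.

```lisp
;; SBCL only: relies on the sb-unicode package, like capitalize-sentence.
;; capitalize-sentence-no-deps is a hypothetical name for this variant.
(defun capitalize-sentence-no-deps (string)
  "Titlecase the first word of STRING and lowercase the rest, without str."
  (let ((words (sb-unicode:words string)))
    (if words
        ;; "~{~A~}" prints the list elements back to back, with no separator.
        (format nil "~{~A~}"
                (cons (sb-unicode:titlecase (first words))
                      (mapcar #'sb-unicode:lowercase (rest words))))
        string)))
```
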

Upvotes: 3
