Real Dreams

Reputation: 18038

Unicode character for marking

We are going to digitize a lot of books. We want to mark the places where lines break in the original book without influencing the flow of the digital book. Which invisible Unicode character can be used to mark such special places in a raw file?

(\n will be used to indicate the end of a paragraph.)

This  is  a  sentence 
in the original book that
I want to mark      line
break places.

What is the proper character to use in place of * here:

This is a sentence * in the original book that * I want to mark line * break places.

Upvotes: 1

Views: 962

Answers (2)

Jukka K. Korpela

Reputation: 201886

Unicode has no concept of a hidden character that represents a line break in some original but does not cause line break in rendering. Unicode encodes plain text data, and its control characters for line breaks have an effect when plain text is rendered.

What matters here is how the files will be used. If they need to be processable with plain text editors, then you need to decide: either the line breaks are replicated in default rendering, or they are omitted when creating the file. You can’t make them invisible. And different text editors, like Notepad and Emacs, may well use different line control conventions; one program’s end of line is another program’s end of paragraph.

If the files will only be processed by programs that you create, then you can use whatever conventions you like. The most logical one is this: “Line and Paragraph Separator. The Unicode Standard provides two unambiguous characters, U+2028 line separator and U+2029 paragraph separator, to separate lines and paragraphs. They are considered the default form of denoting line and paragraph boundaries in Unicode plain text. A new line is begun after each line separator. A new paragraph is begun after each paragraph separator. As these characters are separator codes, it is not necessary either to start the first line or paragraph or to end the last line or paragraph with them. Doing so would indicate that there was an empty paragraph or line following. The paragraph separator can be inserted between paragraphs of text. Its use allows the creation of plain text files, which can be laid out on a different line width at the receiving end. The line separator can be used to indicate an unconditional end of line.” http://www.unicode.org/versions/Unicode6.1.0/ch16.pdf (pages 6 and 7 in the PDF)
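As a minimal sketch of such a private convention, assuming the book is available as a list of paragraphs, each a list of original lines (the names `encode_book` and `decode_book` are hypothetical, not from any library), the two separators could be used like this:

```python
# Sketch only: store original line breaks as U+2028 and paragraph
# boundaries as U+2029. Function names are illustrative assumptions.
LINE_SEP = "\u2028"  # LINE SEPARATOR
PARA_SEP = "\u2029"  # PARAGRAPH SEPARATOR


def encode_book(paragraphs):
    """Join lines with U+2028 and paragraphs with U+2029.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    the lines as they appeared in the printed book.
    """
    return PARA_SEP.join(LINE_SEP.join(lines) for lines in paragraphs)


def decode_book(stored):
    """Recover the list-of-lists structure from the stored string."""
    return [para.split(LINE_SEP) for para in stored.split(PARA_SEP)]


book = [
    ["This is a sentence", "in the original book that",
     "I want to mark line", "break places."],
    ["A second paragraph."],
]
stored = encode_book(book)
restored = decode_book(stored)
```

Because the characters are separators rather than terminators, nothing is appended after the last line or paragraph, which matches the quoted text. Note that Python's `str.splitlines()` also treats U+2028 and U+2029 as line boundaries, so it should be avoided when the two need to be told apart.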

Beware that U+2028 and U+2029 are generally not understood by text editors. They are suitable for storing data in plain text format. When the text is to be rendered, the rendering software has the option of ignoring the original division into lines and treating U+2028 as equivalent to a space, except if preceded by a hyphen (which poses a problem that cannot be resolved without higher level information: a line that ends with “foo-” and is followed by a line beginning with “bar” could represent the word “foobar” as hyphenated for line breaking, or a hyphenated compound “foo-bar” or, in some cases, the combination “foo- bar”).
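A rough sketch of that rendering step, assuming the text was stored with U+2028 between original lines and U+2029 between paragraphs (the helpers `reflow` and `hyphen_breaks` are hypothetical names): each line separator becomes a space, and hyphenated line ends are only reported, since they cannot be resolved automatically.

```python
import re

LINE_SEP = "\u2028"  # LINE SEPARATOR
PARA_SEP = "\u2029"  # PARAGRAPH SEPARATOR


def reflow(stored):
    """Render stored text, ignoring the original division into lines.

    Each U+2028 (plus any whitespace around it) becomes a single space;
    paragraphs are emitted one per output line, separated by "\n".
    """
    pattern = r"\s*" + LINE_SEP + r"\s*"
    paragraphs = [
        re.sub(pattern, " ", para).strip()
        for para in stored.split(PARA_SEP)
    ]
    return "\n".join(paragraphs)


def hyphen_breaks(stored):
    """List (line_end, next_line_start) pairs where a line ends in a hyphen.

    These are the ambiguous cases ("foobar", "foo-bar" or "foo- bar")
    and are left for a human or a dictionary-based step to decide.
    """
    return re.findall(r"(\S+-)" + LINE_SEP + r"(\S+)", stored)


stored = (
    "This is a sentence\u2028in the original book that\u2028"
    "I want to mark line\u2028break places.\u2029"
    "Second paragraph with foo-\u2028bar in it."
)
rendered = reflow(stored)
ambiguous = hyphen_breaks(stored)
```

Here `reflow` deliberately leaves "foo- bar" in the output; `ambiguous` holds the pairs that would need a decision based on higher level information.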

Upvotes: 5

deceze

Reputation: 522626

Use the line feed character (LF, "\n", 0x0A) and/or maybe carriage return (CR, "\r", 0x0D).
I.e., the regular characters for this purpose.

Upvotes: 0
