Reputation: 1984
I'm sure folks will get a good laugh out of this one, but for the life of me I cannot find a separator that indicates when a new paragraph has begun in a string of text. Word and line? Easy peasy, but paragraph seems to be much harder to find. I've tried two line breaks in a row and the Unicode representations of paragraph break and line break, with no luck.
EDIT: I apologize for the vagueness of my original question. To answer some of the questions: it is a basic text file originally created on Windows. I'm testing some code for opening and analyzing its contents with the BlackBerry JDE 4.5 using the RIM Eclipse plugin. While the source of the file will be Windows (at least for the foreseeable future) and the content will be basic text, I have no control over how the files are created (they come from a third-party source, and I don't have access to the way they are created).
Upvotes: 8
Views: 18759
Reputation: 1222
First, your best bet would be to define a paragraph: whether it is a line break, a double line break, or a line break followed by a tab. Assuming that you have no control over the input and want to determine the number of paragraphs in various samples of text, any of these situations may exist. Furthermore, they might be used for the same purpose within the same document. So some analysis is needed for this, and keep in mind it won't be 100% accurate all the time.
Start by initializing the various possible paragraph breaks: each kind of single line break, and all of those, but twice, and all those variations with an additional tab character ('\t') on the end.
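As a rough illustration of that candidate set, here is a minimal sketch in Java SE. It assumes the "line breaks" in question are CRLF, LF, and CR; the class and method names (`ParagraphBreakCandidates`, `build`, `occurrences`) are made up for this example. On Java ME you would likely need `Vector` instead of `ArrayList`.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphBreakCandidates {
    // Assumption: these three sequences are the line breaks worth considering.
    static final String[] LINE_BREAKS = { "\r\n", "\n", "\r" };

    /** Builds the single, doubled, and tab-suffixed variants of each line break. */
    static List<String> build() {
        List<String> candidates = new ArrayList<String>();
        for (String lineBreak : LINE_BREAKS) {
            String[] bases = { lineBreak, lineBreak + lineBreak };
            for (String base : bases) {
                candidates.add(base);          // e.g. "\n" or "\n\n"
                candidates.add(base + "\t");   // same, followed by a tab
            }
        }
        return candidates;
    }

    /** Counts non-overlapping occurrences of one candidate in the text. */
    static int occurrences(String text, String candidate) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(candidate, from)) != -1) {
            count++;
            from += candidate.length();
        }
        return count;
    }
}
```

Counting how often each candidate occurs in a sample is one simple way to guess which convention a given file follows before deciding how to split it.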
The inefficient way to do this would be to load the input into a string and then call buffer.split().length to determine how many paragraphs there were. The efficient, scalable way would be to use a stream and go over the input, taking into account how long the paragraph is, and throwing out those paragraphs beneath a given "threshold". A more advanced algorithm might even switch what it considers to be a paragraph after it encounters a switch in the way line breaks are handled (several very short lines, or several very long ones, for example).
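The streaming idea could look something like the sketch below. It is only one possible reading of the description: it assumes a paragraph ends at two consecutive line breaks, treats CRLF as a single break, and uses a hypothetical `minChars` parameter as the "threshold". For simplicity it does not treat whitespace-only lines as blank, and the names are invented for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class StreamingParagraphCounter {
    /** Counts paragraphs, skipping any with fewer than minChars characters. */
    static int count(Reader in, int minChars) throws IOException {
        int min = Math.max(1, minChars);   // never count an empty paragraph
        int paragraphs = 0;
        int length = 0;                    // non-break characters in the current paragraph
        int breaksInARow = 0;              // consecutive line breaks just seen
        boolean lastWasCr = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n' && lastWasCr) {  // second half of a CRLF, already counted
                lastWasCr = false;
                continue;
            }
            lastWasCr = (c == '\r');
            if (c == '\r' || c == '\n') {
                breaksInARow++;
                if (breaksInARow == 2) {   // blank line: the paragraph ends here
                    if (length >= min) {
                        paragraphs++;
                    }
                    length = 0;
                }
            } else {
                breaksInARow = 0;
                length++;
            }
        }
        if (length >= min) {               // last paragraph, no trailing blank line
            paragraphs++;
        }
        return paragraphs;
    }

    /** Convenience overload for text that is already in memory. */
    static int count(String text, int minChars) {
        try {
            return count(new StringReader(text), minChars);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected from an open StringReader
        }
    }
}
```

Because it reads one character at a time and keeps only a few counters, memory use stays constant regardless of input size.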
And all of this assumes that you are dealing with unformatted text without section titles, etc. What it comes down to is that asking how many paragraphs are in a particular piece of text is like asking how many weeks are in a year. It's not exactly 52, but it's around there.
Upvotes: 2
Reputation: 718826
There is no such paragraph break character in common usage.
You might be able to get away with assuming that two or more line breaks in a row (with optional horizontal whitespace) indicate a paragraph break. But there are numerous exceptions to this "rule". For example, when a paragraph
and then continues on ... like this one. For that kind of thing, there is probably no solution.
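For platforms that do have `java.util.regex` (Java SE, not the Java ME environment the question ends up being about), the "two or more line breaks with optional horizontal whitespace" heuristic might be sketched as follows. The class name and the exact pattern are this example's own choices, not something the answer prescribes.

```java
import java.util.regex.Pattern;

final class BlankLineSplitter {
    // One line break, then one or more "optional spaces/tabs + line break".
    // The atomic groups (?>...) stop a single CRLF from being re-read as CR then LF.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?>\\r\\n|\\r|\\n)(?:[ \\t]*(?>\\r\\n|\\r|\\n))+");

    /** Splits text into paragraphs; returns an empty array for blank input. */
    static String[] split(String text) {
        String trimmed = text.trim();   // drop leading/trailing breaks and spaces
        if (trimmed.length() == 0) {
            return new String[0];
        }
        return PARAGRAPH_BREAK.split(trimmed);
    }
}
```

A lone CRLF, CR, or LF inside a paragraph is left alone, so hard-wrapped text stays in one piece. The exceptions described above still defeat it.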
EDIT per @Aiden's comment below. (It is now clear that this is not relevant to the OP, but it may be relevant to others who find the question via Google, etc)
Instead of trying to reverse engineer paragraphs from text, perhaps you should consider specifying that your input should be in (for example) Markdown syntax; i.e. as supported by StackOverflow. The Markdown Wiki includes links to markdown parser implementations in many languages, including Java.
(This assumes that you have some control over the input format of the text you are trying to parse into paragraphs, etcetera.)
Upvotes: 5
Reputation: 75232
Paragraphs in plain text documents are usually separated by two or more line separators. A line separator may be a linefeed (\n), a carriage-return (\r), or a carriage-return followed by a linefeed (\r\n). These three kinds of separator are typically associated with operating systems, but any application is free to write text using any kind of line separator. In fact, text that's been assembled from diverse sources (like a web page) may well contain two or more kinds of separator. When your app reads text, no matter what platform it's running on, it should always check for all three kinds of line separator.
BufferedReader#readLine() does that, but of course it only reads one line at a time. Simple prose will usually be returned as an alternating sequence of non-empty lines representing paragraphs, and empty lines representing the spaces between them. But don't count on it; watch for multiple empty lines, and be aware that "empty" lines may in fact contain whitespace characters like space (\u0020) and TAB (\u0009).
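A minimal sketch of that line-by-line approach, assuming `java.io.BufferedReader` is available on your platform. The class name is invented for this example, and a "paragraph" here is simply any run of lines that contain something other than whitespace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class LineBasedParagraphCounter {
    /** Counts runs of non-blank lines; readLine() accepts \n, \r, and \r\n. */
    static int count(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        int paragraphs = 0;
        boolean inParagraph = false;
        String line;
        while ((line = reader.readLine()) != null) {
            // trim() strips spaces and tabs, so whitespace-only lines count as blank
            boolean blank = line.trim().length() == 0;
            if (blank) {
                inParagraph = false;       // any number of blank lines ends the run
            } else if (!inParagraph) {
                paragraphs++;              // first non-blank line of a new paragraph
                inParagraph = true;
            }
        }
        return paragraphs;
    }

    /** Convenience overload for text that is already in memory. */
    static int count(String text) {
        try {
            return count(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected from an open StringReader
        }
    }
}
```

Tracking a single `inParagraph` flag handles multiple empty lines in a row without counting extra paragraphs.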
If you choose not to go with a BufferedReader, you may have to write the detection code from scratch. Java ME doesn't include regex support, so split() and java.util.Scanner are not available; and StringTokenizer makes no distinction between a single delimiter character and several in a row unless you use the returnDelims option. Then it returns the delimiters one character at a time, so you still have to write your own code to figure out what kind of separator you're looking at, if any.
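If you do end up writing the detection by hand, a sketch along these lines might serve as a starting point. It uses only `String.length()` and `String.charAt()`, so it avoids regex and `StringTokenizer` entirely. The names are hypothetical, and the rule it applies (a line with nothing but spaces or tabs ends the current paragraph) is an assumption of this example.

```java
final class ManualParagraphScanner {
    /** Length of the separator at index i: 2 for CRLF, 1 for a lone CR or LF, 0 if none. */
    static int separatorLength(String s, int i) {
        char c = s.charAt(i);
        if (c == '\r') {
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    }

    /** Counts paragraphs, where a blank (or whitespace-only) line ends a paragraph. */
    static int countParagraphs(String text) {
        int paragraphs = 0;
        boolean inParagraph = false;
        boolean lineHasText = false;   // current line contains a non-whitespace char
        int i = 0;
        while (i < text.length()) {
            int sep = separatorLength(text, i);
            if (sep > 0) {
                if (!lineHasText) {
                    inParagraph = false;   // the line that just ended was blank
                }
                lineHasText = false;
                i += sep;                  // skip CRLF as one unit
                continue;
            }
            char c = text.charAt(i);
            if (c != ' ' && c != '\t') {
                lineHasText = true;
                if (!inParagraph) {
                    paragraphs++;
                    inParagraph = true;
                }
            }
            i++;
        }
        return paragraphs;
    }
}
```

Keeping `separatorLength` separate makes it easy to reuse when you also need to know which kind of separator was found.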
Upvotes: 5
Reputation: 3549
I assume you have a text file and not a complex document like MS-Word or RTF.
The concept of a paragraph in a text document is not well defined. In most cases a new paragraph is recognized by the fact that, when you open the document in a text editor, you see the next block of text starting on a new line.
There are two special characters, viz. new-line (LF - '\n') and carriage-return (CR - '\r'), that cause the text to start on the next line. Which character is used depends on the operating system you use. Furthermore, sometimes a combination of both is used, like CRLF ('\r\n').
In Java you can determine the character or set of characters used to separate lines/paragraphs using System.getProperty("line.separator");. But this brings in a new problem. What if you create a text file in MS Windows and then open it in Unix? The line separator in the text file in this case is that of Windows, but Java is running on Unix.
My recommendation is:
IF length of text (document) is zero, THEN paragraphs = 0.
IF length of text (document) is NOT zero, THEN count the tokens, using '\n' and '\r' as line break characters.
Note that the exceptions pointed out by Stephen still apply here as well.
import java.util.StringTokenizer;

public class ParagraphTest {
    public static void main(String[] args) {
        // Sample text mixing LF, CR, and LF+CR sequences
        String document =
            "Hello world.\n" +
            "This is line 2.\n\r" +
            "Line 3 here.\r" +
            "Yet another line 4.\n\r\n\r" +
            "Few more lines 5.\r";
        printParaCount(document);
    }

    public static void printParaCount(String document) {
        // Each character in this string is a delimiter; consecutive
        // delimiters are collapsed, so empty lines produce no tokens.
        String lineBreakCharacters = "\r\n";
        StringTokenizer st = new StringTokenizer(
            document, lineBreakCharacters);
        System.out.println("ParaCount: " + st.countTokens());
    }
}
Output
ParaCount: 5
Upvotes: 2
Reputation: 1108722
String lineSeparator = System.getProperty("line.separator");
This returns the platform's default line separator.
Thus, e.g. the following should work:
String[] paragraphs = text.split(lineSeparator);
Upvotes: 2
Reputation: 8362
It is possible that instead of a line feed you need to look for a CR LF sequence (\r\n). Obviously the answer depends on the text format.
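If the format is not known in advance, one option is to inspect the text itself. The sketch below is an illustration under that assumption rather than an established recipe: it tallies CRLF, lone LF, and lone CR, then reports whichever occurs most often, with ties going to CRLF and then LF. The class and method names are made up for this example.

```java
final class LineSeparatorSniffer {
    /** Returns "\r\n", "\n", or "\r" (the most frequent), or null if there is no line break. */
    static String dominantSeparator(String text) {
        int crlf = 0;
        int lf = 0;
        int cr = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crlf++;
                    i++;           // consume the LF of this CRLF pair
                } else {
                    cr++;
                }
            } else if (c == '\n') {
                lf++;
            }
        }
        if (crlf == 0 && lf == 0 && cr == 0) {
            return null;           // nothing to go on
        }
        if (crlf >= lf && crlf >= cr) {
            return "\r\n";
        }
        return lf >= cr ? "\n" : "\r";
    }
}
```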
Upvotes: 3