Reputation: 4433
I have a section of a book, complete with punctuation, line breaks, etc., and I want to extract the first n words from the text and divide them into 5 parts. Regex mystifies me. This is what I am trying. It creates an array with a single element (index 0) containing all of the input text:
public static String getNumberWords2(String s, int nWords){
String[] m = s.split("([a-zA-Z_0-9]+\b.*?)", (nWords / 5));
return "Part One: \n" + m[1] + "\n\n" +
"Part Two: \n" + m[2] + "\n\n" +
"Part Three: \n" + m[3] + "\n\n" +
"Part Four: \n" + m[4] + "\n\n" +
"Part Five: \n" + m[5];
}
Thanks!
Upvotes: 1
Views: 4988
Reputation: 1485
(See below the break for the next go at this. Leaving this top part here because of thought process...)
Based on my reading of the split() javadoc, I think I know what's going on.
You want to split the string based on whitespace, up to n times.
String[] m = s.split("\\s+", nWords);
Then stitch them back together with token whitespace if you must:
StringBuffer strBuf = new StringBuffer();
for (int i = 0; i < nWords; i++) {
strBuf.append(m[i]).append(" ");
}
Finally, chop that into five equal strings:
String [] out = new String[5];
String str = strBuf.toString();
int length = str.length();
int chopLength = length / 5;
for (int i = 0; i < 5; i++) {
int startIndex = i * chopLength;
out[i] = str.substring(startIndex, startIndex + chopLength);
}
It's late at night for me, so you might want to check that one yourself for correctness. I think I got it somewhere in the area code of correct.
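For what it's worth, here is a small sketch of why the split() in the question leaves everything in one element, and what the limit argument does. The class and method names (SplitLimitDemo, firstWords) are made up for illustration; the behaviour shown is plain String.split. Inside a Java string literal, "\b" is a backspace character rather than the regex word boundary, so the question's pattern never matches ordinary text.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: returns at most nWords whitespace-separated words.
    // With limit = nWords + 1, split() puts the unsplit rest of the text in
    // the last element, which is dropped here.
    static String[] firstWords(String s, int nWords) {
        String[] parts = s.trim().split("\\s+", nWords + 1);
        return Arrays.copyOf(parts, Math.min(nWords, parts.length));
    }

    public static void main(String[] args) {
        // "\b" is a backspace character, so nothing is split: prints 1
        System.out.println("one two".split("\b").length);

        // prints [one, two, three]
        System.out.println(Arrays.toString(firstWords("one two three four five six", 3)));
    }
}
```

Note that an empty input still yields one empty "word", because split() on an empty string returns a single empty element.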
OK, here's try number 3. Having run it through a debugger, I can verify that the only problem left is the integer math of slicing a string whose length isn't a multiple of 5 into five pieces, and how best to deal with the remaining characters.
It ain't pretty, but it works.
String[] sliceAndDiceNTimes(String victim, int slices, int wordLimit) {
// Add one to the wordLimit here, because the rest of the input string
// (past the number of times split() does its magic) will be in the last
// array member
String [] words = victim.split("\\s", wordLimit + 1);
StringBuffer partialVictim = new StringBuffer();
for (int i = 0; i < wordLimit; i++) {
partialVictim.append(words[i]).append(' ');
}
String [] resultingSlices = new String[slices];
String recycledVictim = partialVictim.toString().trim();
int length = recycledVictim.length();
int chopLength = length / slices;
for (int i = 0; i < slices; i++) {
int chopStartIdx = i * chopLength;
resultingSlices[i] = recycledVictim.substring(chopStartIdx, chopStartIdx + chopLength);
}
return resultingSlices;
}
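One possible way to deal with the leftover characters, assuming it is acceptable for the first few slices to be one character longer than the rest. This is only a sketch; sliceEvenly is a hypothetical name, and like the method above it still cuts in the middle of words.

```java
class SliceDemo {
    // Cuts text into `slices` pieces whose lengths differ by at most one
    // character; the first (length % slices) pieces get the extra character.
    static String[] sliceEvenly(String text, int slices) {
        String[] out = new String[slices];
        int base = text.length() / slices;
        int extra = text.length() % slices;
        int start = 0;
        for (int i = 0; i < slices; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            out[i] = text.substring(start, end);
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // 12 characters into 5 slices: abc, def, gh, ij, kl
        for (String piece : sliceEvenly("abcdefghijkl", 5)) {
            System.out.println(piece);
        }
    }
}
```

No characters are dropped this way, so joining the slices gives back the input string.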
Important notes:
Upvotes: 0
Reputation: 383746
I'm just going to guess what you need here; hopefully this is close:
public static void main(String[] args) {
String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
"nisi ut aliquip ex ea commodo consequat. Rosebud.";
String[] words = text.split("\\s+");
final int N = words.length;
final int C = 5;
final int R = (N + C - 1) / C;
for (int r = 0; r < R; r++) {
for (int x = r, i = 0; (i < C) && (x < N); i++, x += R) {
System.out.format("%-15s", words[x]);
}
System.out.println();
}
}
This produces:
Lorem          sed            dolore         quis           ex
ipsum          do             magna          nostrud        ea
dolor          eiusmod        aliqua.        exercitation   commodo
sit            tempor         Ut             ullamco        consequat.
amet,          incididunt     enim           laboris        Rosebud.
consectetur    ut             ad             nisi
adipisicing    labore         minim          ut
elit,          et             veniam,        aliquip
This uses java.util.Scanner:
static String nextNwords(int n) {
return "(\\S+\\s*){N}".replace("N", String.valueOf(n));
}
static String[] splitFive(String text, final int N) {
Scanner sc = new Scanner(text);
String[] parts = new String[5];
for (int r = 0; r < 5; r++) {
parts[r] = sc.findInLine(nextNwords(N / 5 + (r < (N % 5) ? 1 : 0)));
}
return parts;
}
public static void main(String[] args) {
String text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, " +
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
"nisi ut aliquip ex ea commodo consequat. Rosebud.";
for (String part : splitFive(text, 23)) {
System.out.println(part);
}
}
This prints the first 23 words of text, split into five parts:
Lorem ipsum dolor sit amet,
consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore
et dolore magna aliqua. Ut
enim ad minim
Or if N is 7:
Lorem ipsum
dolor sit
amet,
consectetur
adipisicing
Or if N is 3:
Lorem
ipsum
dolor
<blank>
<blank>
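As far as I can tell from the Javadoc, Scanner.findInLine only searches up to the next line separator, so text that contains line breaks may need a different approach. Here is a rough alternative that uses split() with the same word distribution as the Scanner version above. The class name FiveParts is made up, and it assumes that collapsing all whitespace (including line breaks) to single spaces is acceptable.

```java
import java.util.Arrays;

class FiveParts {
    // Returns the first n words of text in five groups; the first (n % 5)
    // groups receive one extra word. Words are whitespace-separated tokens.
    static String[] splitFive(String text, int n) {
        String[] words = text.trim().split("\\s+");
        int total = Math.min(n, words.length);
        String[] parts = new String[5];
        int start = 0;
        for (int r = 0; r < 5; r++) {
            int count = total / 5 + (r < total % 5 ? 1 : 0);
            parts[r] = String.join(" ", Arrays.copyOfRange(words, start, start + count));
            start += count;
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet,\nconsectetur adipisicing elit, sed do";
        for (String part : splitFive(text, 7)) {
            System.out.println(part);
        }
    }
}
```

If the text has fewer than n words, the available words are spread over the five parts and any parts left over are empty strings.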
Upvotes: 0
Reputation: 4433
I have a really really ugly solution:
public static Object[] getNumberWords(String s, int nWords, int offset){
Object[] os = new Object[2];
Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(s);
m.region(offset, m.regionEnd());
int wc = 0;
String total = "";
// Collect at most nWords words ("<=" would take one extra).
while (wc < nWords && m.find()) {
String word = m.group();
total += word + " ";
wc++;
}
os[0] = total;
os[1] = total.lastIndexOf(" ") + offset;
return os;
}
String foo(String s, int n){
Object[] os = getNumberWords(s, n, 0);
String a = (String) os[0];
String m[] = new String[5];
int lastEndIndex = 0;
// Fill exactly five parts. Stepping a word counter by n / 5 up to n could
// run a sixth time (e.g. n = 24) and overrun m.
for (int indexCount = 0; indexCount < m.length; indexCount++) {
os = getNumberWords(a, (n / 5), lastEndIndex);
lastEndIndex = (Integer) os[1];
m[indexCount] = (String) os[0];
}
return "Part One: \n" + m[0] + "\n\n" +
"Part Two: \n" + m[1] + "\n\n" +
"Part Three: \n" + m[2] + "\n\n" +
"Part Four: \n" + m[3] + "\n\n" +
"Part Five: \n" + m[4];
}
Upvotes: -1
Reputation:
There is a better alternative made just for this: BreakIterator. That would be the most correct way to parse for words in Java.
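A minimal sketch of what that could look like with java.text.BreakIterator. The class and method names are hypothetical, Locale.ENGLISH is an assumption about the text, and treating a segment as a word only when it starts with a letter or digit is one simple filter among several possible ones.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Collects up to maxWords words. The word iterator also reports spaces
    // and punctuation as segments, so those are filtered out here.
    static List<String> firstWords(String text, int maxWords) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next();
             end != BreakIterator.DONE && words.size() < maxWords;
             start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, world, How]
        System.out.println(firstWords("Hello, world! How are you?", 3));
    }
}
```

Unlike splitting on whitespace, this leaves punctuation out of the words, so the original text between the boundaries would have to be kept if the parts should read like the source.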
Upvotes: 2
Reputation: 66886
I think the simplest and most efficient way is to repeatedly find a "word":
Pattern p = Pattern.compile("(\\w+)");
Matcher m = p.matcher(chapter);
while (m.find()) {
String word = m.group();
...
}
You can vary the definition of "word" by modifying the regex. What I wrote just uses regex's notion of word characters, and I wonder if it might be more appropriate than what you're trying to do. But it won't, for instance, include quote characters, which you may need to allow within a word.
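To illustrate varying the definition, here is a sketch that stops after n words and lets a straight apostrophe join word characters. The pattern is just one possible choice, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherWords {
    // A "word" here is a run of word characters, optionally continued by
    // apostrophe-joined runs, so "Don't" counts as a single word.
    private static final Pattern WORD = Pattern.compile("\\w+(?:'\\w+)*");

    // Returns the first n words found in chapter (fewer if it runs out).
    static List<String> firstWords(String chapter, int n) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(chapter);
        while (words.size() < n && m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Don't, panic, it's]
        System.out.println(firstWords("Don't panic, it's fine.", 3));
    }
}
```

Typographic (curly) apostrophes are not covered by this pattern and would need to be added to it.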
Upvotes: 5