Reputation: 169
I want to remove all the hyperlinks from a Word document while keeping the text. I have these two methods for reading Word documents with the .doc and .docx extensions.
private void readDocXExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.docx");
    try {
        XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile)));
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        System.out.println(context);
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
private void readDocExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.doc");
    POIFSFileSystem fs;
    try {
        fs = new POIFSFileSystem(new FileInputStream(inputFile));
        HWPFDocument document = new HWPFDocument(fs);
        WordExtractor wordExtractor = new WordExtractor(document);
        String[] paragraphs = wordExtractor.getParagraphText();
        System.out.println("Word document has " + paragraphs.length + " paragraphs");
        for (int i = 0; i < paragraphs.length; i++) {
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            System.out.println(paragraphs[i]);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is it possible to remove all the links from a Word document using the Apache POI library? If not, are there any other libraries that can do this?
Upvotes: 1
Views: 2492
Reputation: 9708
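My solution, at least for .docx files, would be to use regular expressions. It relies on the extractor writing each link target in angle brackets when setFetchHyperlinks(true) is set, so the bracketed parts can be matched and stripped: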
private void readDocXExtensionDocument() {
    Pattern p = Pattern.compile("\\<(.+?)\\>");
    File inputFile = new File(inputFolderDir, "test.docx");
    try {
        XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile)));
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        Matcher m = p.matcher(context);
        while (m.find()) {
            String link = m.group(0); // the bracketed part
            String textString = m.group(1); // the text of the link without the brackets
            // Use replace(), not replaceAll(): the link is literal text, and a URL
            // can contain regex metacharacters such as '?', '+' or '.'.
            context = context.replace(link, ""); // ordering important. Link then textString
            context = context.replace(textString, "");
        }
        System.out.println(context);
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The main caveat to this approach is that any other text in angle brackets, even if it is not a link, will be removed as well. If you know what kinds of links can appear in your documents, use a more specific regular expression than the one I provided.
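As a rough sketch of what a more specific expression could look like: the pattern below only matches bracketed text that starts with a URL scheme, so other angle-bracket content is left alone. It assumes the extracted text places each link target as " <url>" after the link label, and that the links use http, https, ftp or mailto; adjust the scheme list to your documents. The class and method names (LinkStripper, stripBracketedUrls) are made up for this example, and it works on a plain String so it can be tried without POI.

```java
import java.util.regex.Pattern;

final class LinkStripper {
    // Matches a bracketed target that begins with a known scheme,
    // e.g. "<https://example.com/page?id=1>". The optional leading space
    // absorbs the separator assumed to sit between the label and the bracket.
    private static final Pattern BRACKETED_URL =
            Pattern.compile(" ?<(?:https?://|ftp://|mailto:)[^<>\\s]+>");

    // Removes bracketed URLs and leaves all other text untouched.
    static String stripBracketedUrls(String text) {
        return BRACKETED_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "See the docs <https://poi.apache.org/> and note that a <b> tag stays.";
        System.out.println(stripBracketedUrls(sample));
    }
}
```

In the method above you would call stripBracketedUrls(extractor.getText()) instead of running the find/replace loop. Because the replacement is done by the matcher in one pass, there is no need to treat the URL as a regex or to worry about the order of the replacements.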
Upvotes: 2