Reputation: 823
I am trying to read a file in Java. This is the code:
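public void readFile(String fileName) {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException ex) {
        // report the failure instead of silently swallowing it
        ex.printStackTrace();
    }
}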
It works fine for a .txt file. However, for a .docx file it prints weird characters. How can I read a .docx file in Java?
Upvotes: 5
Views: 59317
Reputation: 41
I had a more complicated docx file, and the solutions using document.getParagraphs()
weren't working. What worked in my case was:
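import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;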
...
protected List<String> readDocxFile(String fileName, Path path) {
XWPFWordExtractor xwpfWordExtractor = null;
try (var doc = new XWPFDocument(Files.newInputStream(path.resolve(fileName)))) {
xwpfWordExtractor = new XWPFWordExtractor(doc);
var docText = xwpfWordExtractor.getText().split("\n");
return Arrays.asList(docText).stream().collect(Collectors.toList());
} catch (IOException e) {
log.error(String.format("Can not read file %s.", fileName));
return Collections.emptyList();
} finally {
if (xwpfWordExtractor != null) {
try {
xwpfWordExtractor.close();
} catch (IOException e) {
log.info("Not able to close xwpfWordExtractor.");
}
}
}
}
And I used these maven dependencies:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>5.2.3</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>5.2.3</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>5.2.3</version>
</dependency>
Upvotes: 0
Reputation: 551
You must have the following 6 JARs:
Code:
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class test {
    public static void readDocxFile(String fileName) {
        // try-with-resources closes the document and the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(fileName);
             XWPFDocument document = new XWPFDocument(fis)) {
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            // print the text of each body paragraph in document order
            for (XWPFParagraph paragraph : paragraphs) {
                System.out.println(paragraph.getText());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        readDocxFile("C:\\Users\\sp0c43734\\Desktop\\SwatiPisal.docx");
    }
}
Upvotes: 2
Reputation:
Internally, .docx files are organized as zipped XML files, whereas .doc is a binary file format. So you cannot read either one directly as plain text. Have a look at docx4j or Apache POI.
If you are trying to create or manipulate a .docx file, try docx4j, which is open source, or go for Apache POI.
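To illustrate the "zipped XML" point, here is a minimal sketch that uses only the JDK: it opens the .docx as a ZIP archive, parses word/document.xml, and joins the text of the w:t elements inside each w:p paragraph. The class name DocxZipSketch and the method readParagraphs are made up for this example. It ignores headers, footers, footnotes and the like, and may repeat text from paragraphs nested inside text boxes, so for real documents a library such as Apache POI or docx4j is still the better choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of POI or docx4j.
public class DocxZipSketch {
    // XML namespace used by WordprocessingML elements such as w:p and w:t
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static List<String> readParagraphs(Path docx) throws Exception {
        List<String> paragraphs = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // The main document body lives in this entry of the archive
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml is missing; not a .docx file?");
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document xml = factory.newDocumentBuilder().parse(in);
                NodeList paras = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paras.getLength(); i++) {
                    Element para = (Element) paras.item(i);
                    // A paragraph's text is split across one or more w:t elements
                    NodeList texts = para.getElementsByTagNameNS(W_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        sb.append(texts.item(j).getTextContent());
                    }
                    paragraphs.add(sb.toString());
                }
            }
        }
        return paragraphs;
    }
}
```

Calling DocxZipSketch.readParagraphs(Path.of("some.docx")) returns one string per paragraph, which you can then print line by line as in the question.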
Upvotes: 7
Reputation: 165
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public void readDocxFile() {
    File file = new File("C:/NetBeans Output/documentx.docx");
    // try-with-resources closes the document and the stream even if reading fails
    try (FileInputStream fis = new FileInputStream(file);
         XWPFDocument document = new XWPFDocument(fis)) {
        List<XWPFParagraph> paragraphs = document.getParagraphs();
        // print the text of each body paragraph in document order
        for (XWPFParagraph para : paragraphs) {
            System.out.println(para.getText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Upvotes: 14
Reputation: 7515
You cannot read a .docx or .doc file directly as text. You need an API to read Word files. Use Apache POI (http://poi.apache.org/). If you have any doubts, please refer to this thread on stackoverflow.com: How read Doc or Docx file in java?
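If you are not sure which of the two formats a file really is (the extension can be wrong), a rough sketch like the one below can check the first bytes before you pick an API. It relies on two well-known signatures: ZIP archives (and therefore .docx) start with the bytes 50 4B 03 04, and legacy OLE2 files (and therefore .doc) start with D0 CF 11 E0 A1 B1 1A E1. The class WordFormatSniffer, its Format enum and the sniff method are names invented for this example, and it needs Java 11 or later for readNBytes. Note that a ZIP signature only tells you the container type; an .xlsx or a plain .zip matches too.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Apache POI.
public class WordFormatSniffer {
    public enum Format { ZIP_BASED, OLE2_BASED, UNKNOWN }

    // Local file header signature of a ZIP archive ("PK\3\4")
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Header signature of an OLE2 compound document
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    public static Format sniff(Path file) throws IOException {
        byte[] header;
        try (InputStream in = Files.newInputStream(file)) {
            // Reads up to 8 bytes; returns fewer if the file is shorter
            header = in.readNBytes(8);
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP_BASED;   // candidate for XWPFDocument (.docx)
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2_BASED;  // candidate for HWPFDocument (.doc)
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A plain text file comes back as UNKNOWN, which is the case where the BufferedReader approach from the question is appropriate.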
Upvotes: 2