Mark Miller

Reputation: 13123

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.

I am using the line:

my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')

to try to read an MSWord file containing the following text:

A   20  1000    AA
B   30  1001    BB
C   10  1500    CC

I get a warning message that says:

Warning message: In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") : incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'

and my.data appears to be gibberish:

# [1] "PK\003\004\024" "¤l"             "ÈFÃË‹Átí"

I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and later scanned into PDF documents. The age of the original paper documents, and perhaps imperfections in the paper, typing and/or scanning process, have left some letters and numbers unclear. So far, converting the PDF files to MSWord has been the most successful at correctly translating the tables. Converting the MSWord files to Excel, rich text, etc. has not been very successful. Even after conversion to MSWord, the resulting files are very complex and contain numerous errors. I thought that reading the MSWord files into R might be the most efficient way to edit and correct them.

I am aware of the tm package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.

Thank you for any suggestions.

Upvotes: 14

Views: 21109

Answers (4)

Amit Kohli

Reputation: 2950

In case it helps anyone else: there is now a package, readtext, dedicated specifically to reading text data, including Word files (also the new .docx format). See https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
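
A minimal sketch of how that might look, assuming the readtext package has been installed from CRAN. The wrapper name read_word_text is made up for illustration. readtext() returns a data frame with one row per file and the extracted text in its text column.

```r
# Hypothetical helper around readtext::readtext(); assumes the package is installed.
read_word_text <- function(path) {
  if (!requireNamespace("readtext", quietly = TRUE)) {
    stop("Please install the 'readtext' package first: install.packages('readtext')")
  }
  doc <- readtext::readtext(path)
  # Split the single text string into lines so it can be handled like readLines() output
  strsplit(doc$text, "\n", fixed = TRUE)[[1]]
}

# Example call, using the path from the question:
# my.data <- read_word_text("c:/users/mark w miller/simple R programs/test_for_r.docx")
```

How well tables survive the extraction depends on the document, so the result is probably best treated as a starting point for manual correction.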

Upvotes: 8

Khaynes

Reputation: 1986

You can do this very easily with RDCOMClient. That said, some characters may not be read in correctly.

library(RDCOMClient)
# Create the connection to Word
wordApp <- COMCreate("Word.Application")
# Make Word visible so you can watch it run
wordApp[["Visible"]] <- TRUE

# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the document text
print(doc$range()$text())

# Close the document without saving, then quit Word
doc$Close(FALSE)
wordApp$Quit()

Upvotes: 1

Mark Miller

Reputation: 13123

I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.

  1. I converted the PDF to MSWord with Acrobat X Pro.

  2. The original tables had solid vertical lines separating the columns. These vertical lines turned out to disrupt the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from the MSWord file before creating the text file.

  3. I converted the MSWord file to a text file after deleting the vertical lines in Step 2.

  4. The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, and I will not have to re-enter all the data in the PDFs by hand, which saves many hours of work.
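
Once a text file like that exists, base R can load it and help locate cells that need correcting. The sketch below is only an illustration: it uses the three sample rows from the question with one deliberately introduced error (the letter O in place of a zero), and the column names are invented.

```r
# Sample rows from the question; the "3O" is a made-up OCR-style error
txt <- c("A   20  1000    AA",
         "B   3O  1001    BB",
         "C   10  1500    CC")

# Read every column as character so that damaged numbers are not lost or coerced.
# For a real file, replace text = txt with the path to the converted text file.
raw <- read.table(text = txt, header = FALSE, colClasses = "character",
                  col.names = c("id", "count", "value", "code"))

# Flag rows whose 'count' entry does not parse as a number
bad <- is.na(suppressWarnings(as.numeric(raw$count)))
raw[bad, ]
```

Rows flagged this way can be fixed by hand and the column converted with as.numeric() afterwards.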

Upvotes: 1

neilfws

Reputation: 33802

First, readLines() is not the correct solution, since a Word file is not a plain-text file.

The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.
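
For what it is worth, a .docx file is a ZIP archive of XML files, which is why the output in the question starts with "PK", and the main text sits in the archive member word/document.xml. The following is a rough base-R sketch for pulling out paragraph text without extra packages. The function names are invented for illustration, and the regex-based tag stripping is an assumption that works for simple documents, not a full parser: tables come out as one paragraph per cell, and empty paragraphs are dropped.

```r
# Turn the XML from word/document.xml into one string per paragraph.
docx_xml_to_paragraphs <- function(xml) {
  xml <- paste(xml, collapse = "")
  # Remove self-closing (empty) paragraphs so they cannot confuse the matching below
  xml <- gsub("<w:p( [^>]*)?/>", "", xml, perl = TRUE)
  # Each <w:p> ... </w:p> element is one paragraph
  paras <- regmatches(xml, gregexpr("<w:p[ >].*?</w:p>", xml, perl = TRUE))[[1]]
  # Tabs are stored as <w:tab/> elements; turn them into real tab characters
  paras <- gsub("<w:tab/>", "\t", paras, fixed = TRUE)
  # Drop all remaining tags, then decode the basic XML entities (&amp; last)
  paras <- gsub("<[^>]+>", "", paras)
  paras <- gsub("&lt;", "<", paras, fixed = TRUE)
  paras <- gsub("&gt;", ">", paras, fixed = TRUE)
  paras <- gsub("&quot;", "\"", paras, fixed = TRUE)
  paras <- gsub("&amp;", "&", paras, fixed = TRUE)
  paras
}

# Read the main document part straight out of the ZIP archive.
read_docx_text <- function(path) {
  con <- unz(path, "word/document.xml", encoding = "UTF-8")
  on.exit(close(con))
  docx_xml_to_paragraphs(readLines(con, warn = FALSE))
}

# my.data <- read_docx_text("c:/users/mark w miller/simple R programs/test_for_r.docx")
```

For anything beyond simple text, a proper XML parser would be a safer choice than regular expressions.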

The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.

Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.

Upvotes: 8
