wdefreit

Reputation: 73

path_to_connection error when reading a .docx file into R with officer and docxtractr

I have several hundred documents from a legal database in ".docx" format. I am trying to do some NLP work on the documents, but I can't get past step zero: reading them into R. I can't post the test document because of copyright. I keep getting an xml2 path_to_connection error when using officer and docxtractr and can't figure out how to fix it.

If I create a dummy docx, I can read it into R without an issue. However, the same steps applied to the downloaded documents fail. Initially, I thought the issue might be the filename, so I renamed one file to test.docx, and the issue remained. I then thought I might be running into a network issue that was preventing the temp directory from being written correctly on my work computer (Windows), but I ran into the same issue on a home computer (Mac). Next I dug into the underlying functions and realized that there is something weird going on with the paths.

On the work computer (Windows) I had a combination of backslashes and forward slashes in the path. I also noticed that the path contained subdirectories that were not there when I explored the path manually. I don't know if this is because the function failed partway through or because it was pointing to an incorrect location. The subdirectory "word" was missing in both the Mac and Windows tests. Here is the Mac reprex:

library(tidyverse)
library(officer)
#> Warning: package 'officer' was built under R version 4.4.1

test <- officer::read_docx("~/Downloads/test.docx")
#> Warning in read_core_properties(package_dir): No properties found. Using
#> properties from template file in officer.
#> Error in `path_to_connection()`:
#> ! 
#>   '/var/folders/q2/8qcfd1qd5rqfvhr2x9l2k24m0000gp/T//RtmpZ6jZ5U/file95017ae662ea/word/document.xml'
#>   does not exist.
reprex::reprex()
#> ℹ Non-interactive session, setting `html_preview = FALSE`.
#> CLIPR_ALLOW has not been set, so clipr will not run interactively
#> Error in switch(where, expr = stringify_expression(x_expr), clipboard = ingest_clipboard(), : EXPR must be a length 1 vector

I think the traceback shows it aborts on the path check:

11. signal_abort(cnd, .file)
10. rlang::abort(message, ..., call = call, use_cli_format = TRUE,.frame = .frame)
9. cli::cli_abort(msg, call = call)
8. check_path(path)
7. path_to_connection(x)
6. read_xml.character(file)
5. read_xml(file)
4. super$feed(file.path(private$package_dir, "word", main_file))
3. initialize(...)
2. docx_part$new(package_dir, main_file = "document.xml", cursor = "/w:document/w:body/*[1]",body_xpath = "/w:document/w:body")
1. officer::read_docx("~/Downloads/test.docx")

I think my error may have something to do with using the template files and mapping that back to the path when it can't find the document properties. I just don't know how to get around it. If I am not doing something incredibly stupid, is there another method I could use to read in the documents?
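
Since a .docx file is a zip archive, one way to check whether the downloaded file really lacks the part named in the error is to list its contents without extracting anything. The following is a minimal sketch using only base R; `list_docx_parts` and `docx_has_main_part` are hypothetical helper names introduced for illustration, not functions from officer or docxtractr.

```r
# List the files stored inside a .docx (a .docx is a zip archive).
# Returns character(0) if the file cannot be opened as a zip at all.
list_docx_parts <- function(path) {
  tryCatch(
    utils::unzip(path.expand(path), list = TRUE)$Name,
    error = function(e) character(0)
  )
}

# TRUE only if the archive contains the part that the traceback shows
# officer trying to read (word/document.xml).
docx_has_main_part <- function(path) {
  "word/document.xml" %in% list_docx_parts(path)
}

# Usage, with the path from the reprex above:
# list_docx_parts("~/Downloads/test.docx")
# docx_has_main_part("~/Downloads/test.docx")
```

If the second call returns FALSE, the listing from the first call shows what the archive contains instead, which may narrow down why the file cannot be parsed.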

Created on 2024-11-01 with reprex v2.1.1

Upvotes: 0

Views: 65

Answers (1)

wdefreit

Reputation: 73

OK, here is my solution. Just opening the document in Word and saving it under a new name removed whatever formatting issues were preventing xml2 from doing its thing with officer and readtext. I had over 800 documents, so doing this one at a time wouldn't work. Instead, I used RDCOMClient (Windows only) to open each file and re-save it under its base name with "1.docx" appended. This simple action allowed me to parse the documents into an R data frame.

library(tidyverse)
library(RDCOMClient)
library(R.utils)
library(here)

here()

#Word to Word conversion
the_source_path <- here("the_data", "enacted") 
the_new_path <- here("output", "new_docx") 
the_final_path <- here("output", "new_docx_final") 

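#copying files to a new directory because the file names will be changed
#(the_new_path must exist before the source folder can be copied into it)
dir.create(the_new_path, recursive=TRUE, showWarnings=FALSE)
file.copy(from = the_source_path, to = the_new_path, recursive=TRUE)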

#renaming files so there are no spaces in file names. necessary to use RDCOMClient
#"\\.docx$" matches only names that end in .docx (a bare ".docx" treats the dot as a wildcard)
old_names <- list.files(path=the_source_path, recursive=TRUE, pattern="\\.docx$")
new_names <- str_replace_all(old_names, "[:space:]", "_") |> 
  str_replace_all("-","_") |> 
  str_replace_all("\\(","")  |> 
  str_replace_all("\\)","") 

file.rename(paste0(the_new_path,"/enacted/", old_names),
            paste0(the_new_path,"/enacted/", basename(new_names)))

#creating the vector of file paths to pass to the conversion function
word_docs <- list.files(path=paste0(the_new_path,"/enacted/"), recursive=TRUE, pattern="\\.docx$")

#the function that opens each Word file and re-saves it as a new .docx
#(FileFormat=16 is Word's default document format, not PDF)
the_word_conversion <- function(x) {

  #make sure the output folder exists before Word tries to save into it
  dir.create(file.path(the_final_path, "enacted"), recursive=TRUE, showWarnings=FALSE)

  for(i in seq_along(x)){

    wordApp <- COMCreate("Word.Application")  # create COM object
    file <- getAbsolutePath(paste0(the_new_path, "/enacted/", x[i])) # convert to absolute path
    file <- str_replace_all(file, "/", "\\\\") # use backslashes in the path passed to Word
    wordApp[["Documents"]]$Open(Filename=file) #opens your docx in wordApp
    #re-save under the same base name with "1.docx" appended
    wordApp[["ActiveDocument"]]$SaveAs(paste0(the_final_path, "\\enacted\\", gsub(pattern = "\\.docx$", "", basename(x[i])),"1.docx"), FileFormat=16)
    wordApp$Quit() #quit wordApp
    Sys.sleep(5) #pause before the next file

  }
}

#running the function
the_word_conversion(word_docs)

#the re-saved copy of test.docx is written as test1.docx by the function above
t1 <- readtext::readtext(here("output/new_docx_final/enacted/test1.docx"))
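
With several hundred files, it can help to record which ones still fail to parse instead of stopping at the first error. The following is a minimal sketch in base R; `try_read_all` and its `reader` argument are hypothetical names introduced here, and `reader` stands in for whichever function is used to read a file (for example `officer::read_docx` or `readtext::readtext`).

```r
# Apply `reader` to each path and record whether it succeeded,
# capturing the error message instead of aborting the whole batch.
try_read_all <- function(paths, reader) {
  results <- lapply(paths, function(p) {
    tryCatch(
      {
        reader(p)  # result is discarded; only success or failure is kept
        list(ok = TRUE, message = NA_character_)
      },
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
  })
  data.frame(
    path = paths,
    ok = vapply(results, function(r) r$ok, logical(1)),
    message = vapply(results, function(r) r$message, character(1)),
    stringsAsFactors = FALSE
  )
}

# Usage (assumes the re-saved files live in the folder used above):
# docs <- list.files(here("output", "new_docx_final", "enacted"),
#                    pattern = "\\.docx$", full.names = TRUE)
# status <- try_read_all(docs, officer::read_docx)
# status[!status$ok, ]   # files that still fail, with their error messages
```

This only reports status. Once the failing files are known, the successful ones can be read in a second pass with the reader of choice.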

Upvotes: 0
