Reputation: 493
I have an excel file which contains some data inside an Text box. The goal is to read the file as it is with the row number intact. I dont want to read what is inside the text box. When I read the file using readxl and openxlsx package, all the rows with text box are skipped.
For example, in the image attached, when I read the file, the dataframe contains only row 9 to 14. Row 9 in the excel is row 1 in the dataframe. Is there any way I can read the file so that my data frame also has 14 rows. It does not matter to if first 8 rows contains any value or not.
testEmptyRow1 <- openxlsx::read.xlsx(filtered_list[1], skipEmptyRows=FALSE)# not working
Edit 1- link to a smaple file
Edit 2 I dont want to read content from the text boxes. The goal is to have the same number of rows inside the dataframe. It does not matter if the first 8 rows,see attached image, are empty or have null values.
Upvotes: 0
Views: 688
Reputation: 47036
Below is a simplified version of one of Emmanuel Hamel's functions
get_excel_textbox <- function(filename) {
xpath <- file.path(tempdir(), "texel", basename(filename))
dir.create(xpath, FALSE, TRUE)
unzip(filename, exdir=xpath)
fdraw <- list.files(file.path(xpath, "xl", "drawings"), pattern = "\\.xml$", full.names=TRUE)
out <- sapply(fdraw, function(f) {
xml <- xml2::read_xml(f)
text <- xml2::xml_text(xml, trim = TRUE)
gsub("\\d{12,200}", "", text)
})
unlink(xpath)
out
}
Upvotes: 0
Reputation: 2213
Here is another approach that can be considered to extract a text box in the first Excel sheet :
library(RDCOMClient)
xlApp <- COMCreate("Excel.Application")
path_To_Excel_File <- "D:\\excel_File.xlsx"
xlWbk <- xlApp$Workbooks()$Open(path_To_Excel_File)
xlWbk$Worksheets(1)$Shapes(1)$TextFrame()$Characters()$Text()
Upvotes: 0
Reputation: 2213
I have been able to extract the text from an Excel textbox with the following function :
extract_Textbox_From_Excel <- function(dir, excel_File_Name)
{
library(stringr)
library(xml2)
setwd(dir)
file_Name_No_Ext <- str_replace(string = excel_File_Name, pattern = "\\.xlsx", replacement = "")
file_Name_Zip <- paste0(file_Name_No_Ext, ".zip")
dir.create(tmp <- tempfile())
file.copy(from = paste0(dir, excel_File_Name), to = paste0(tmp, "\\", excel_File_Name))
setwd(tmp)
file.rename(from = excel_File_Name, to = file_Name_Zip)
unzip(zipfile = file_Name_Zip)
setwd(paste0(tmp, "\\xl\\drawings\\"))
drawing_files <- list.files(pattern = "\\.xml")
nb_Drawing_Files <- length(drawing_files)
vector_Text <- character(nb_Drawing_Files)
for(i in 1 : nb_Drawing_Files)
{
xml_Text <- read_xml(drawing_files[i])
text <- xml_text(xml_Text, trim = TRUE)
text <- str_replace_all(text, pattern = "\\d{12,200}", replacement = "")
vector_Text[i] <- text
}
vector_Text <- vector_Text[str_detect(vector_Text, "[:alpha:]")]
return(vector_Text)
}
Upvotes: 3
Reputation: 263301
Excel xlsx files are really zipped collections of xml files. I don't have Excel because M@#$%^&S were assholes about not letting me restore an installation when one of my hard-drives crashed on my system that was capable of running it, but when I save a LibreOffice file that had a text box in it as an xlsx file on my Unix box, I can then unzip it and see this:
The contents of the drawing1.xml file has the text "This is a test" that is in the box:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xdr:wsDr xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><xdr:twoCellAnchor editAs="absolute"><xdr:from><xdr:col>6</xdr:col><xdr:colOff>713880</xdr:colOff><xdr:row>9</xdr:row><xdr:rowOff>114840</xdr:rowOff></xdr:from><xdr:to><xdr:col>7</xdr:col><xdr:colOff>777240</xdr:colOff><xdr:row>11</xdr:row><xdr:rowOff>129600</xdr:rowOff></xdr:to><xdr:sp><xdr:nvSpPr><xdr:cNvPr id="0" name="TextShape 1"/><xdr:cNvSpPr txBox="1"/></xdr:nvSpPr><xdr:spPr><a:xfrm rot="1800000"><a:off x="5446440" y="1774080"/><a:ext cx="876240" cy="339840"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom><a:noFill/><a:ln><a:noFill/></a:ln></xdr:spPr><xdr:txBody><a:bodyPr lIns="0" rIns="0" tIns="0" bIns="0"></a:bodyPr><a:p><a:r><a:rPr b="0" lang="en-US" sz="1200" spc="-1" strike="noStrike"><a:latin typeface="Times New Roman"/></a:rPr><a:t>This is a test</a:t></a:r><a:endParaRPr b="0" lang="en-US" sz="1200" spc="-1" strike="noStrike"><a:latin typeface="Times New Roman"/></a:endParaRPr></a:p></xdr:txBody></xdr:sp><xdr:clientData/></xdr:twoCellAnchor></xdr:wsDr>
I very much doubt that there are functions in any of the R packages that read Excel files which are designed to extract text content from such text boxes. But a bit of searching brings up an extended discussion that may give you some clues:
https://community.rstudio.com/t/best-r-package-to-create-textboxes-in-ms-excel/947/10
And here is couple of links that shows how to do it in different languages, but with enough detail to help the clueful user:
How to read the text in textbox by using openpyxl
Upvotes: 0