rajan.sngh
rajan.sngh

Reputation: 493

How to read excel in R with Text Box

I have an excel file which contains some data inside an Text box. The goal is to read the file as it is with the row number intact. I dont want to read what is inside the text box. When I read the file using readxl and openxlsx package, all the rows with text box are skipped.

For example, in the image attached, when I read the file, the dataframe contains only row 9 to 14. Row 9 in the excel is row 1 in the dataframe. Is there any way I can read the file so that my data frame also has 14 rows. It does not matter to if first 8 rows contains any value or not. enter image description here

testEmptyRow1 <- openxlsx::read.xlsx(filtered_list[1], skipEmptyRows=FALSE)# not working

Edit 1- link to a smaple file

R reads the file as - enter image description here

Edit 2 I dont want to read content from the text boxes. The goal is to have the same number of rows inside the dataframe. It does not matter if the first 8 rows,see attached image, are empty or have null values.

Upvotes: 0

Views: 688

Answers (4)

Robert Hijmans
Robert Hijmans

Reputation: 47036

Below is a simplified version of one of Emmanuel Hamel's functions

get_excel_textbox <- function(filename) {
    xpath <- file.path(tempdir(), "texel", basename(filename))
    dir.create(xpath, FALSE, TRUE)
    unzip(filename, exdir=xpath)
    fdraw <- list.files(file.path(xpath, "xl", "drawings"), pattern = "\\.xml$", full.names=TRUE)
    out <- sapply(fdraw, function(f) {
        xml <- xml2::read_xml(f)
        text <- xml2::xml_text(xml, trim = TRUE)
        gsub("\\d{12,200}", "", text)
    })
    unlink(xpath)
    out
}

Upvotes: 0

Emmanuel Hamel
Emmanuel Hamel

Reputation: 2213

Here is another approach that can be considered to extract a text box in the first Excel sheet :

library(RDCOMClient)
xlApp <- COMCreate("Excel.Application")
path_To_Excel_File <- "D:\\excel_File.xlsx"
xlWbk <- xlApp$Workbooks()$Open(path_To_Excel_File)
xlWbk$Worksheets(1)$Shapes(1)$TextFrame()$Characters()$Text()

Upvotes: 0

Emmanuel Hamel
Emmanuel Hamel

Reputation: 2213

I have been able to extract the text from an Excel textbox with the following function :

extract_Textbox_From_Excel <- function(dir, excel_File_Name)
{
  library(stringr)
  library(xml2)
  setwd(dir)
  file_Name_No_Ext <- str_replace(string = excel_File_Name, pattern = "\\.xlsx", replacement = "")
  file_Name_Zip <- paste0(file_Name_No_Ext, ".zip")
  dir.create(tmp <- tempfile())
  file.copy(from = paste0(dir, excel_File_Name), to = paste0(tmp, "\\", excel_File_Name))
  setwd(tmp)
  file.rename(from = excel_File_Name, to = file_Name_Zip)
  unzip(zipfile = file_Name_Zip)
  setwd(paste0(tmp, "\\xl\\drawings\\"))
  drawing_files <- list.files(pattern = "\\.xml")
  nb_Drawing_Files <- length(drawing_files)
  vector_Text <- character(nb_Drawing_Files)

  for(i in 1 : nb_Drawing_Files)
  {
    xml_Text <- read_xml(drawing_files[i])
    text <- xml_text(xml_Text, trim = TRUE)
    text <- str_replace_all(text, pattern = "\\d{12,200}", replacement = "")
    vector_Text[i] <- text
  }

  vector_Text <- vector_Text[str_detect(vector_Text, "[:alpha:]")]

  return(vector_Text)
}

Upvotes: 3

IRTFM
IRTFM

Reputation: 263301

Excel xlsx files are really zipped collections of xml files. I don't have Excel because M@#$%^&S were assholes about not letting me restore an installation when one of my hard-drives crashed on my system that was capable of running it, but when I save a LibreOffice file that had a text box in it as an xlsx file on my Unix box, I can then unzip it and see this:

enter image description here

The contents of the drawing1.xml file has the text "This is a test" that is in the box:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xdr:wsDr xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><xdr:twoCellAnchor editAs="absolute"><xdr:from><xdr:col>6</xdr:col><xdr:colOff>713880</xdr:colOff><xdr:row>9</xdr:row><xdr:rowOff>114840</xdr:rowOff></xdr:from><xdr:to><xdr:col>7</xdr:col><xdr:colOff>777240</xdr:colOff><xdr:row>11</xdr:row><xdr:rowOff>129600</xdr:rowOff></xdr:to><xdr:sp><xdr:nvSpPr><xdr:cNvPr id="0" name="TextShape 1"/><xdr:cNvSpPr txBox="1"/></xdr:nvSpPr><xdr:spPr><a:xfrm rot="1800000"><a:off x="5446440" y="1774080"/><a:ext cx="876240" cy="339840"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom><a:noFill/><a:ln><a:noFill/></a:ln></xdr:spPr><xdr:txBody><a:bodyPr lIns="0" rIns="0" tIns="0" bIns="0"></a:bodyPr><a:p><a:r><a:rPr b="0" lang="en-US" sz="1200" spc="-1" strike="noStrike"><a:latin typeface="Times New Roman"/></a:rPr><a:t>This is a test</a:t></a:r><a:endParaRPr b="0" lang="en-US" sz="1200" spc="-1" strike="noStrike"><a:latin typeface="Times New Roman"/></a:endParaRPr></a:p></xdr:txBody></xdr:sp><xdr:clientData/></xdr:twoCellAnchor></xdr:wsDr>

I very much doubt that there are functions in any of the R packages that read Excel files which are designed to extract text content from such text boxes. But a bit of searching brings up an extended discussion that may give you some clues:

https://community.rstudio.com/t/best-r-package-to-create-textboxes-in-ms-excel/947/10

And here is couple of links that shows how to do it in different languages, but with enough detail to help the clueful user:

How to read the text in textbox by using openpyxl

https://social.msdn.microsoft.com/Forums/office/en-US/874624c6-c923-4f2f-a60e-333bc9f24949/looking-for-code-to-update-a-textbox-on-a-spreadsheetml-doc-client-using-vbnet-but-i-can-read-and?forum=oxmlsdk

Upvotes: 0

Related Questions