Reputation: 31
I am working in R Markdown to produce a report that extracts images from Word and PowerPoint files and displays them.
To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.
I have two issues: how to include an extracted image in my report, and how to locate an image in a Word document, whose summary has no media_file column.
I have been able to locate an image in a pptx using this function: pptx_summary creates a data frame with a media_file column, which holds the file path of each image element. That path is then passed as the path argument of media_extract to locate the image. See the example code from the package documentation below:
library(officer)
example_pptx <- system.file(package = "officer",
                            "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
However, when I run media_extract it returns 'TRUE', which matches the example output, but I am unsure how to add the image to my report. I've tried assigning the result of media_extract to a variable, e.g.
image <- media_extract(doc, path = media_file, target = png_file)
but this returns 'FALSE'.
How do I include the image as an image in my report?
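A possible explanation for the 'FALSE', offered as an assumption rather than something confirmed by the officer documentation: if media_extract copies the file with base R's file.copy() and its default overwrite = FALSE, a second call with the same target would refuse to overwrite the file written by the first call and report FALSE. The sketch below reproduces that behaviour with file.copy() alone; the file names are made up.

```r
# Base R only: file.copy() does not overwrite an existing target by default.
src <- tempfile(fileext = ".txt")
writeLines("placeholder content", src)
target <- tempfile(fileext = ".txt")

first_copy <- file.copy(src, target)   # target does not exist yet
second_copy <- file.copy(src, target)  # target exists now, so nothing is copied

print(c(first = first_copy, second = second_copy))
```

If that is the cause, the image was already written to png_file by the first call, and a fresh tempfile() target would be expected to succeed again.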
The second issue I'm having is how to locate an image in Word. The documentation for media_extract says it can be used to extract images from both .docx and .pptx, but I have only managed to get it to work for the latter. I haven't been able to create a file path for .docx.
The file path comes from either docx_summary or pptx_summary, depending on the file type; each creates a data frame summarising the file. The pptx_summary data frame includes a media_file column, which holds the file path of the image. The docx_summary data frame doesn't include this column. Another Stack Overflow post proposed a solution using the word/media/ subdir, which seemed to work, but I'm not sure what this means or how to use it.
How do I extract an image from a Word doc, using the word/media/ subdir as the media path?
Upvotes: 2
Views: 610
Reputation: 31
I have continued to research the second issue and found an answer, so thought I would share!
The difficulty I was having extracting images from docx was due to the absence of a media_file column in the summary data frame (produced using docx_summary), which is used to locate the desired image. This column is present in the data frame produced for pptx by pptx_summary and is used in the example code from the package documentation.
In the absence of this column you instead need to locate the image by its path inside the document (a docx is a zip archive of XML and media files), which looks like: media_path <- "/word/media/image3.png"
If you want to see what this structure looks like, you can right-click your document > 7-Zip > Extract files... and a folder containing the document contents will be created; otherwise just change the image number to select the desired image. Note: images sometimes have names that do not follow the imageN.png pattern, so you may need to extract the files to find the name of the desired image.
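As an alternative to 7-Zip, the archive contents can be listed from R. This is a minimal sketch using base R's utils::unzip() with list = TRUE, which reads the file listing without extracting anything. The helper names filter_media_entries and list_docx_media are made up for illustration, and the path in the usage comment is hypothetical.

```r
# Keep only archive entries that live under word/media/.
filter_media_entries <- function(entries) {
  entries[startsWith(entries, "word/media/")]
}

# List the media files stored inside a .docx (a zip archive).
list_docx_media <- function(docx_path) {
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  filter_media_entries(entries)
}

# Hypothetical usage:
# list_docx_media("/Users/user.name/Documents/mydoc.docx")
```

Each returned name is a candidate value for the path argument of media_extract.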
Example using media_extract with docx:
# extracting an image from a Word doc using the officer package
library(officer)
report <- read_docx("/Users/user.name/Documents/mydoc.docx")
png_file <- tempfile(fileext = ".png")
media_file <- "/word/media/image3.png"
media_extract(report, path = media_file, target = png_file)
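Because embedded images are not always PNGs, it may help to give the target file the same extension as the media file. A small sketch using tools::file_ext(); the helper name media_target is made up for illustration.

```r
# Build a temporary target path that keeps the media file's extension.
media_target <- function(media_file) {
  ext <- tools::file_ext(media_file)
  tempfile(fileext = if (nzchar(ext)) paste0(".", ext) else "")
}

target <- media_target("/word/media/image3.png")
print(basename(target))
```

The result can then be used as the target argument, e.g. media_extract(report, path = media_file, target = media_target(media_file)).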
Upvotes: 0
Reputation: 10695
media_extract() is a function that copies the media file to the location you choose. We can show the extracted images in R Markdown with at least 3 methods: knitr::include_graphics(), Markdown image syntax, and magick::image_read().
They are illustrated below:
---
title: "media_extract usage"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(officer)
library(flextable)
example_pptx <- system.file(package = "officer",
"doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
```
## include_graphics
```{r out.width="200px"}
knitr::include_graphics(png_file)
```
## markdown
You can't use `tempfile()` here - path is better when defined as relative.
Let's write it to "./file.png".
```{r results='hide'}
media_extract(doc, path = media_file, target = "file.png")
```
![](file.png){style="width:200px;"}
## magick
```{r out.width="200px"}
magick::image_read(png_file)
```
Upvotes: 0