l.iles

Reputation: 31

How to extract images from Word and PowerPoint using media_extract in R?

I am working in R Markdown to produce a report that extracts and displays images from Word and PowerPoint files.

To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.

I have two issues:

  1. How to view or use the image after I have located it.
  2. In Word, how to locate the image without the media_file column.

I have been able to locate an image in a .pptx file: the pptx_summary function creates a data frame with a media_file column, which holds a file path for image elements. That path is then passed as the path argument to media_extract to locate the image. See the example code from the package documentation below:

library(officer)

example_pptx <- system.file(package = "officer",
  "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)

However, when I run media_extract it returns 'TRUE', which matches the example output, but I am unsure how to add the image to my report. I've tried assigning the result of media_extract to a variable, e.g.

image <- media_extract(doc, path = media_file, target = png_file)

but this returns 'FALSE'.

How do I include the image as an image in my report?

The second issue I'm having is how to locate an image in Word. The documentation for media_extract says it can be used to extract images from both .docx and .pptx, but I have only managed to get it to work for the latter. I haven't been able to create a file path for .docx.

The file path is obtained from either docx_summary or pptx_summary, depending on the file type; each creates a data frame summarising the file. The pptx_summary data frame includes a media_file column, which holds a file path for the image. The docx_summary data frame doesn't include this column. Another Stack Overflow post proposed a solution using the word/media/ subdirectory, which seemed to work, but I'm not sure what this means or how to use it.

How do I extract an image from a word doc, using word/media/ subdir as the media path?

Upvotes: 2

Views: 610

Answers (2)

l.iles

Reputation: 31

I have continued to research the second issue and found an answer, so thought I would share!

The difficulty I was having extracting images from docx was due to the absence of a media_file column in the summary data frame produced by docx_summary; this column is used to locate the desired image. It is present in the data frame produced by pptx_summary and is used in the example code from the package documentation.

In the absence of this column, you instead need to locate the image by its path inside the document's internal directory structure (a .docx file is a zip archive of XML and media files), which looks like: media_path <- "/word/media/image3.png"

If you want to see what this structure looks like, you can right-click your document > 7-Zip > Extract files..., and a folder containing the document contents will be created. Otherwise, just change the image number to select the desired image. Note: sometimes images have names that do not follow the image<number>.png pattern, so you may need to extract the files to find the name of the desired image.
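
As an alternative to extracting the archive by hand, the image names can be listed from R. The sketch below relies only on base R's `utils::unzip()` with `list = TRUE`, which reads the archive index without extracting anything. The helper name `list_docx_media` is hypothetical (it is not part of officer), and the leading slash is added only to match the path form used in this answer.

```r
# Hypothetical helper (not part of officer): list the image files stored
# under word/media/ inside a .docx, which is a zip archive.
list_docx_media <- function(docx_path) {
  # list = TRUE returns a data frame describing the archive contents
  archive_files <- utils::unzip(docx_path, list = TRUE)$Name

  # Keep entries below word/media/ and drop bare directory entries
  media_files <- grep("^word/media/", archive_files, value = TRUE)
  media_files <- media_files[!grepl("/$", media_files)]

  # paste0() would turn an empty vector into "/", so return early
  if (length(media_files) == 0) {
    return(character(0))
  }

  # Prepend "/" to match the "/word/media/image3.png" form used here
  paste0("/", media_files)
}
```

Calling `list_docx_media("/Users/user.name/Documents/mydoc.docx")` on your own document should then show which names are available to pass as `path` to media_extract.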

Example using media_extract with docx.

# Extracting an image from a Word document using the officer package
library(officer)

report <- read_docx("/Users/user.name/Documents/mydoc.docx")

png_file <- tempfile(fileext = ".png")

media_file <- "/word/media/image3.png"

media_extract(report, path = media_file, target = png_file)

Upvotes: 0

David Gohel

Reputation: 10695

media_extract() is a function that copies the media file to the location you choose. We can show the extracted images in R Markdown with at least 3 methods:

  • knitr::include_graphics()
  • regular markdown
  • magick::image_read()

They are illustrated below:

---
title: "media_extract usage"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(officer)
library(flextable)
example_pptx <- system.file(package = "officer",
  "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)
```

## include_graphics

```{r out.width="200px"}
knitr::include_graphics(png_file)
```

## markdown

You can't use `tempfile()` here - path is better when defined as relative.
Let's write it to "./file.png". 

```{r results='hide'}
media_extract(doc, path = media_file, target = "file.png")
```

![](file.png){style="width:200px;"}

## magick

```{r out.width="200px"}
magick::image_read(png_file)
```
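
A note on the `FALSE` the question reports when `media_extract()` is called a second time with the same `target`: one plausible cause, assuming `media_extract()` copies the file with base R's `file.copy()` and its default `overwrite = FALSE` (an assumption about officer's internals, not something confirmed above), is that the target file already existed from the first call. The sketch below shows only the base R behaviour, with throwaway temp files standing in for the media file and the target:

```r
# Stand-ins for the media file and the extraction target
src <- tempfile(fileext = ".png")
dst <- tempfile(fileext = ".png")
writeBin(as.raw(1:10), src)

# First copy succeeds because dst does not exist yet
first <- file.copy(src, dst)

# Second copy is refused: dst exists and overwrite defaults to FALSE
second <- file.copy(src, dst)

# Removing the old target (or using a fresh tempfile) lets the copy succeed
unlink(dst)
third <- file.copy(src, dst)

c(first = first, second = second, third = third)
```

If that assumption holds, assigning the result of a single call (rather than calling twice with the same target) would return `TRUE`, and the image itself is the file at `png_file`, not the returned logical.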

Upvotes: 0
