Reputation: 1107
I want to write some documents with officer, and I have some predefined styles in my Word document that I load with read_docx(). I can look at the styles, but I especially want to know which font type and font size each style has, and I cannot find that.
This is all I can find:
Document <- read_docx(FILEPATH)
head(Document$styles)
  style_type style_id style_name is_custom is_default
1  paragraph   Normal     Normal     FALSE       TRUE
2  paragraph Heading1  heading 1     FALSE      FALSE
3  paragraph Heading2  heading 2     FALSE      FALSE
4  paragraph Heading3  heading 3     FALSE      FALSE
5  paragraph Heading4  heading 4     FALSE      FALSE
6  paragraph Heading5  heading 5     FALSE      FALSE
Unfortunately there is no column with the font size or font type.
I really need the font size (for example 10) and font type (for example "Times New Roman") of heading 1 in R, because the style argument of the function body_add_par is not enough for my purposes. Is there a way to get this?
Edit: A solution that does not use officer would also be great.
Upvotes: 1
Views: 1628
Reputation: 173813
I couldn't find a way to do this in officer. In fact, in the end I had to parse the xml contents of the docx to get the fonts.
It turns out that not all styles have a font set. Some inherit from other styles, and some just take the default value given by Word. Parsing the XML is fairly involved, so the code below is a bit messy.
First you need to unzip the docx to get its style XML. If you have officer installed, you will also have the required zip package, so we'll use that:
library(zip)
doc_path <- "my_file_path.docx"
unzip(doc_path, files = "word/styles.xml", exdir = path.expand("~/"))
Now we need to parse the XML. As pointed out in the comments by @TobiSonne, the sz values are in half-points, not points, so we need to halve them to get the fonts' point sizes:
library(xml2)      # read_xml(), xml_find_all(), xml_find_first(), xml_attr(), xml_new_root()
library(magrittr)  # %>%
font_table <- read_xml(path.expand("~/word/styles.xml")) %>%
  xml_find_all("//w:style") %>%
  lapply(xml_new_root) %>%  # one document per style, so "//" only searches that style
  lapply(function(x) data.frame(
    name     = x %>% xml_find_first("//w:name") %>% xml_attr("val"),
    based_on = x %>% xml_find_first("//w:basedOn") %>% xml_attr("val"),
    font     = x %>% xml_find_first("//w:rFonts") %>% xml_attr("ascii"),
    size     = x %>% xml_find_first("//w:sz") %>% xml_attr("val") %>% as.numeric() %>% `/`(2),
    stringsAsFactors = FALSE)) %>%
  {do.call("rbind", .)}
This gives us the font table, but many values are missing because styles inherit from other styles or from the document defaults. Here we read the document defaults and fill them in:
defaults <- read_xml(path.expand("~/word/styles.xml")) %>%
  xml_find_first("//w:docDefaults//w:rPr") %>%
  xml_new_root()
default_size <- xml_find_first(defaults, "//w:sz") %>%
  xml_attr("val") %>%
  as.numeric() %>%
  `/`(2)
default_font <- xml_find_first(defaults, "//w:rFonts") %>% xml_attr("ascii")
# Fall back to the theme font reference when no explicit font name is set
if (is.na(default_font))
  default_font <- xml_find_first(defaults, "//w:rFonts") %>% xml_attr("asciiTheme")
# Only styles with no size and no parent style get the default size
font_table$size[is.na(font_table$size) & is.na(font_table$based_on)] <- default_size
font_table$font[is.na(font_table$font)] <- default_font
font_table$based_on[is.na(font_table$based_on)] <- "default"
Now we have:
font_table
#>                      name             based_on            font size
#>  1                 Normal              default      minorHAnsi   12
#>  2              heading 2               Normal      minorHAnsi   13
#>  3 Default Paragraph Font              default      minorHAnsi   12
#>  4           Normal Table              default      minorHAnsi   12
#>  5                No List              default      minorHAnsi   12
#>  6             Table Grid          TableNormal      minorHAnsi <NA>
#>  7         List Paragraph               Normal      minorHAnsi <NA>
#>  8           Normal (Web)               Normal Times New Roman <NA>
#>  9           Balloon Text               Normal          Tahoma    8
#> 10      Balloon Text Char DefaultParagraphFont          Tahoma    8
#> 11         Heading 2 Char DefaultParagraphFont      minorHAnsi   13
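The table above still has missing sizes for styles that inherit from another style. Below is a minimal sketch of how those could be resolved by walking up the based_on chain. It assumes an extra id column, which is not in the table above: based_on refers to a style's ID (the w:styleId attribute of each w:style) rather than its display name, so that attribute would have to be extracted as well. The data frame and the resolve_size() helper are made up for illustration.

```r
# Toy stand-in for font_table. The `id` column is an assumption: it would hold
# each style's w:styleId, which is what based_on points at.
styles <- data.frame(
  id       = c("Normal", "Heading2", "ListParagraph", "Caption", "BalloonText"),
  based_on = c(NA, "Normal", "Normal", "ListParagraph", "Normal"),
  size     = c(12, 13, NA, NA, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each style, follow based_on until a size is found.
# max_depth guards against circular references in a malformed styles file.
resolve_size <- function(styles, max_depth = 20) {
  vapply(seq_len(nrow(styles)), function(i) {
    row <- i
    for (step in seq_len(max_depth)) {
      if (!is.na(styles$size[row])) return(styles$size[row])
      row <- match(styles$based_on[row], styles$id)  # jump to the parent style
      if (is.na(row)) break                          # no (known) parent: give up
    }
    NA_real_
  }, numeric(1))
}

styles$size <- resolve_size(styles)
print(styles)
```

Styles whose chain ends without any explicit size stay NA here, so the document default would still need to be applied afterwards. The same walk could be used for the font column with a character return type.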
Upvotes: 3