Flo

Reputation: 31

Exclude sections from word count in R Markdown

I am writing a paper with Rmarkdown (exported to PDF via LaTeX) and I need to count the number of words in the main text. With LaTeX documents, I use texcount from the command line, specifying the section that I want to exclude from word count using the following tags in my tex document:

%TC:ignore 
The part that is to be ignored (e.g., Appendix)
%TC:endignore 

How can I include LaTeX comments in my Rmd file to avoid manually adding the %TC lines in my tex file each time that I regenerate it?

Here is my MWE .Rmd:

---
output:
  pdf_document: 
    keep_tex: yes
---

Words that I want to count. 

<!-- TC:ignore -->
Words that I want to exclude but cannot, 
because the comments do not appear in the `.tex` file. 
<!-- TC:endignore --> 

%TC:ignore 
Other words that I want to exclude but this does not work either 
because `%` is not interpreted as \LaTeX comment. 
%TC:endignore 

#%TC:ignore
Another attempt; it does not work because `#` is for sections, 
not comments. 
#%TC:endignore

Once I have knitted the .Rmd file and have the output .tex file, I'd type:

texcount MWE.tex

and the answer should be 6 words. Thanks!

UPDATE 1:

On Twitter, @amesoudi suggested using an RStudio add-in (WordCountAddIn) to count words in my Rmd document. The add-in is available at https://github.com/benmarwick/wordcountaddin . However, this is not automated and there is still some pointing and clicking involved.

UPDATE 2:
Another solution would be to

Upvotes: 3

Views: 2618

Answers (2)

Omar Wasow

Reputation: 2020

Combining texcount + knitr + R allows for dynamic in-text word count estimation. However, if we use the %TC:ignore tags, the tex file will be written with the escaped \%TC:ignore. That extra backslash is easy to remove in R, but we need a two-step process because our tex file is written after the R code runs. In short, we need to run a second batch of R code after knitting to clean up the escaped \%TC:ignore commands and, finally, knit the updated tex file to pdf. The code below does the following:

  1. Identify the file name.
  2. Search for the escaped \%TC:ignore and \%TC:endignore commands and replace them with %TC:ignore and %TC:endignore.
  3. Run a system call to the TeXcount Perl script and return a limited set of word count stats. You may want to adjust the texcount options.
  4. Extract the various word counts, sum them into a final count, and add a comma (if appropriate). This count is then saved so it can be referenced inline elsewhere.
  5. Because knitting overwrites the corrected tex file, manually run a code chunk that again replaces the escaped \%TC:ignore and \%TC:endignore commands with %TC:ignore and %TC:endignore, then knit the corrected tex file to pdf.
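The escaping fix in steps 2 and 5 hinges on R's double layer of backslash escaping; a standalone sketch on a toy string (not the answer's file-based code):

```r
# pandoc writes the tag into the tex file as "\%TC:ignore";
# in the R pattern below, "\\\\" reaches the regex engine as a single
# literal backslash, so gsub strips the escape back off
line  <- "\\%TC:ignore"   # i.e. the characters \%TC:ignore
fixed <- gsub("\\\\%TC:ignore", "%TC:ignore", line)
fixed                     # "%TC:ignore"
```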

The word count will always be for the second-to-last compile, but compiling twice solves that (much as with bibtex or table/figure references). I believe the code to call Perl from within R varies by platform, so you may need to adjust the system() command below for non-Macs.

Note: code below builds on a simpler version I wrote here: https://tex.stackexchange.com/questions/534/is-there-any-way-to-do-a-correct-word-count-of-a-latex-document/239703#239703

```{r word_count, cache = FALSE, eval = TRUE, include = FALSE}

# adds comma for printing numbers nicely
# from scales package by Hadley Wickham
add_comma <- function(x, ...) {
  format(x, ..., big.mark = ",", scientific = FALSE, trim = TRUE)
  }

# identify file name manually or automatically during knit
name_of_file  <- "your_file_name.Rmd"   # manually enter file name
#name_of_file <- knitr::current_input() # automatically get file name

# extract name, drop extension
name_of_file_no_ext <- strsplit(name_of_file, "\\.")[[1]][1] 
name_of_file_tex    <- paste(name_of_file_no_ext, ".tex", sep = "") # add .tex extension
    

# Markdown escapes %TC:ignore in tex file to \%TC:ignore
# To fix, read tex file, replace with unescaped version
tex <- readLines(name_of_file_tex)
tex <- gsub("\\\\%TC:ignore",    "%TC:ignore",    tex)
tex <- gsub("\\\\%TC:endignore", "%TC:endignore", tex)

fileConn <- file(name_of_file_tex)
writeLines(tex, fileConn)
close(fileConn)


# paste together a texcount system command
# you may want other texcount options (see below)
systemcall <- paste("system('texcount -incbib -sum ", name_of_file_tex, "', intern=TRUE)", sep = "") 

# See TeXcount manual for more details on customizing query: http://app.uio.no/ifi/texcount/DOC/TeXcount_2_2.pdf
# -inc Parse included files(as separate files).
# -incbib Include bibliography in count, include bbl file if needed.
# -total Only give total sum, no per file sums.
# -sum[=n,n,...] Produces total sum, default being all words and formulae, but customizable to any weighted sum of the seven counts (list of weights for text words, header words, caption words, headers, floats, inlined formulae, displayed formulae).

# run texcount on last compiled .tex file
texcount.out <- eval(parse(text = systemcall)) 

# write texcount.out to texcount.txt
fileConn <- file("texcount.txt")
writeLines(texcount.out, fileConn)
close(fileConn)

# Alternatively manually write name of myfile.tex, 
# uncomment and modify line below
# texcount.out <- system("texcount -total -sum myfile.tex", intern=TRUE)

# extract relevant rows (depending on what matters for your count)
words_in_text_row      <- grep("Words in text", texcount.out, value = TRUE)
words_in_headers_row   <- grep("Words in headers", texcount.out, value = TRUE)
words_outside_text_row <- grep("Words outside text", texcount.out, value = TRUE)

pattern <- "(\\d)+" # regex pattern for digits

count1 <- regmatches(words_in_text_row, regexpr(pattern, words_in_text_row)) # extract digits
count2 <- regmatches(words_in_headers_row, regexpr(pattern, words_in_headers_row)) # extract digits
count3 <- regmatches(words_outside_text_row, regexpr(pattern, words_outside_text_row)) # extract digits

# can include or exclude types of text as you want
count <- as.numeric(count1) + as.numeric(count2) + as.numeric(count3)
    
count <- add_comma(as.numeric(count)) # add comma

# write count so it can be read on next knit
fileConn <- file("count.txt")
writeLines(count, fileConn)
close(fileConn)

# optional, when run interactively print count 
# (or use `count` variable elsewhere like title page)
count
```

The above chunk calculates the word count on the prior tex file and ignores anything marked by %TC:ignore, BUT the corrected tex file gets overwritten in the knitting process, so you'll see the text %TC:ignore in the resulting pdf. The word count is correct, but we need a second step to again remove %TC:ignore from the tex file and create a pdf WITHOUT it. This step should be done interactively (that is, not through knitting).

```{r tex-to-pdf, eval = FALSE, include = FALSE}
# Because the tex file is created AFTER R code is executed
# you need to run this chunk manually / interactively to
# clean the escaped `\%TC:ignore` commands
# and create a new corrected pdf

# Markdown escapes %TC:ignore in tex file to \%TC:ignore
# To fix, read tex file, replace with unescaped version
tex <- readLines(name_of_file_tex)
tex <- gsub("\\\\%TC:ignore",    "%TC:ignore",    tex)
tex <- gsub("\\\\%TC:endignore", "%TC:endignore", tex)

fileConn <- file(name_of_file_tex)
writeLines(tex, fileConn)
close(fileConn)

# finally, knit updated tex to pdf
tools::texi2pdf(name_of_file_tex)  # file name set in the word_count chunk

```

To use count.txt in the title page, I have the following in the abstract part of my YAML (again, this means you need to knit twice for count.txt to be current):

abstract: |
  \noindent Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  \newline \newline
  Keywords: keyword1, keyword2, keyword3
  \newline \newline
  Word Count: `r readr::read_file(here::here("count.txt"))`
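The mechanism is just a plain-text round trip; a minimal base-R sketch (file name count.txt as in the answer, count value hypothetical):

```r
# one knit writes the formatted count ...
writeLines("12,345", "count.txt")
# ... and the next knit reads it back for the inline `r` expression
# (the YAML above uses readr::read_file(); base readLines() works too)
readLines("count.txt")   # "12,345"
```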

Upvotes: 0

Michael Harper

Reputation: 15369

Rather than doing a word count on the output LaTeX file, we can actually just use the .Rmd file directly.

This code is similar in its approach to the wordcountaddin you mentioned, but any text between the tags <!---TC:ignore---> and <!---TC:endignore---> will not be included in the count:

library(stringr)
library(tidyverse)

RmdWords <- function(file) {

  # Collapse the file into a single string of text
  file_string <- file %>%
    readLines() %>%
    paste0(collapse = " ") %>%
    # Remove YAML header
    str_replace_all("^--- .*?--- ", "") %>%
    # Remove code blocks and inline code
    str_replace_all("```.*?```", "") %>%
    str_replace_all("`.*?`", "") %>%
    # Remove LaTeX math
    str_replace_all("[^\\\\]\\$\\$.*?[^\\\\]\\$\\$", "") %>%
    str_replace_all("[^\\\\]\\$.*?[^\\\\]\\$", "") %>%
    # Delete text between the tags
    str_replace_all("TC:ignore.*?TC:endignore", "") %>%
    str_replace_all("[[:punct:]]", " ") %>%
    # Collapse runs of spaces to one space
    # (replacing "  " with "" would glue adjacent words together)
    str_replace_all(" +", " ") %>%
    str_replace_all("<", "") %>%
    str_replace_all(">", "")

  # Compute the counts
  word_count <- str_count(file_string, "\\S+")
  char_count <- str_replace_all(file_string, " ", "") %>% str_count()

  return(list(num_words = word_count, num_char = char_count, word_list = file_string))
}

The function returns a list of three items:

  • num_words: the number of words in the file
  • num_char: the number of characters
  • word_list: the cleaned text string from which the counts were computed
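The core tag-stripping-then-counting idea can be seen on a toy string; a base-R sketch (hypothetical text, without the YAML/code/math stripping the function above also performs):

```r
txt <- "Keep these words <!---TC:ignore---> drop these <!---TC:endignore---> end"
txt <- gsub("TC:ignore.*?TC:endignore", "", txt, perl = TRUE)  # drop tagged span
txt <- gsub("[[:punct:]]", " ", txt)                           # punctuation to spaces
length(gregexpr("\\S+", txt)[[1]])                             # 4 words remain
```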

If you want to display the results within the compiled report, you can write the R code inline as follows:

```{r}
words <- RmdWords("MWE.Rmd")
```
There are seven words with 34 characters.

<!-- TC:ignore -->
Words that I want to exclude but cannot, 
because the comments do not appear in the `.tex` file. 
<!-- TC:endignore --> 

<!-- TC:ignore -->
Word Count: `r words$num_words` \newline
Character Count: `r words$num_char`
<!-- TC:endignore --> 


Note: some of the original script was adapted from http://www.questionflow.org/2017/10/13/how-to-scrape-pdf-and-rmd-to-get-inspiration/

Upvotes: 2
