Reputation: 31
I am writing a paper with Rmarkdown (exported to PDF via LaTeX) and I need to count the number of words in the main text. With LaTeX documents, I use texcount
from the command line, specifying the section that I want to exclude from word count using the following tags in my tex document:
%TC:ignore
The part that is to be ignored (e.g., Appendix)
%TC:endignore
How can I include LaTeX comments in my Rmd file to avoid manually adding the %TC
lines in my tex file each time that I regenerate it?
Here is my MWE `.Rmd`:

```
---
output:
  pdf_document:
    keep_tex: yes
---

Words that I want to count.

<!-- TC:ignore -->
Words that I want to exclude but cannot,
because the comments do not appear in the `.tex` file.
<!-- TC:endignore -->

%TC:ignore
Other words that I want to exclude but this does not work either
because `%` is not interpreted as \LaTeX comment.
%TC:endignore

#%TC:ignore
Another attempt; it does not work because `#` is for sections,
not comments.
#%TC:endignore
```
Once I have knitted the `.Rmd` file and have the output `.tex` file, I'd type:

```
texcount MWE.tex
```

and the answer should be 6 words. Thanks!
UPDATE 1:
On Twitter, @amesoudi suggested using an RStudio add-in (WordCountAddIn) to count words in my Rmd document. The add-in is available at https://github.com/benmarwick/wordcountaddin . However, this is not automated and there is still some pointing and clicking involved.
UPDATE 2:
Another solution would be to:

1. use a specific expression to identify what should be LaTeX comments, e.g. `LATEXCOMMENT%TC:ignore`, in the `.Rmd` file;
2. have a script (e.g. `sed`) that automatically removes the `LATEXCOMMENT` expressions in the generated `.tex` document.
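That post-processing step could also be done in R rather than `sed`. A minimal sketch of the idea, using the hypothetical `LATEXCOMMENT` sentinel from the update (an in-memory character vector stands in for the `.tex` file here; in practice you would `readLines()` the file and `writeLines()` it back):

```r
# Sketch of the "UPDATE 2" idea: lines in the generated .tex carry a
# LATEXCOMMENT sentinel; stripping it turns them into real LaTeX
# comments that texcount will honor.
tex <- c("Words that I want to count.",
         "LATEXCOMMENT%TC:ignore",
         "The part that is to be ignored.",
         "LATEXCOMMENT%TC:endignore")
tex <- sub("^LATEXCOMMENT", "", tex)   # drop the sentinel prefix
tex
```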
Upvotes: 3
Views: 2618
Reputation: 2020
Combining texcount + knitr + R allows for dynamic in-text word count estimation. However, if we use the `%TC:ignore` markers, the tex file will be written with them escaped as `\%TC:ignore`. That extra backslash is easy to remove in R, but we need a two-step process because our tex file is written after the R code runs. In short, we need to run a second batch of R code after knitting to clean up the escaped `\%TC:ignore` commands and, finally, knit the updated tex file to pdf. The code below does the following:

1. finds the escaped `\%TC:ignore` and `\%TC:endignore` commands and replaces them with `%TC:ignore` and `%TC:endignore`, then runs texcount on the corrected tex file;
2. in a second, interactive chunk, again finds the escaped `\%TC:ignore` and `\%TC:endignore` commands and replaces them with `%TC:ignore` and `%TC:endignore`, and finally knits the corrected tex file to pdf.

The word count will always be for the second-to-last compile, but compiling twice will solve that (much as with bibtex or table/figure references). I believe the code to call Perl from within R varies by platform, so you may need to adjust the `system()` command below for non-Macs.
Note: code below builds on a simpler version I wrote here: https://tex.stackexchange.com/questions/534/is-there-any-way-to-do-a-correct-word-count-of-a-latex-document/239703#239703
```{r word_count, cache = FALSE, eval = TRUE, include = FALSE}
# adds comma for printing numbers nicely
# from scales package by Hadley Wickham
add_comma <- function(x, ...) {
format(x, ..., big.mark = ",", scientific = FALSE, trim = TRUE)
}
# identify file name manually or automatically during knit
name_of_file <- "your_file_name.Rmd" # manually enter file name
#name_of_file <- knitr::current_input() # automatically get file name
# extract name, drop extension
name_of_file_no_ext <- strsplit(name_of_file, "\\.")[[1]][1]
name_of_file_tex <- paste(name_of_file_no_ext, ".tex", sep = "") # add .tex extension
# Markdown escapes %TC:ignore in tex file to \%TC:ignore
# To fix, read tex file, replace with unescaped version
tex <- readLines(name_of_file_tex)
tex <- gsub("\\\\%TC:ignore", "%TC:ignore", tex)
tex <- gsub("\\\\%TC:endignore", "%TC:endignore", tex)
fileConn <- file(name_of_file_tex)
writeLines(tex, fileConn)
close(fileConn)
# paste together a texcount system command
# you may want other texcount options (see below)
systemcall <- paste("system('texcount -incbib -sum ", name_of_file_tex, "', intern=TRUE)", sep = "")
# See TeXcount manual for more details on customizing query: http://app.uio.no/ifi/texcount/DOC/TeXcount_2_2.pdf
# -inc Parse included files (as separate files).
# -incbib Include bibliography in count, include bbl file if needed.
# -total Only give total sum, no per file sums.
# -sum[=n,n,...] Produces total sum, default being all words and formulae, but customizable to any weighted sum of the seven counts (list of weights for text words, header words, caption words, headers, floats, inlined formulae, displayed formulae).
# run texcount on last compiled .tex file
texcount.out <- eval(parse(text = systemcall))
# write texcount.out to texcount.txt
fileConn <- file("texcount.txt")
writeLines(texcount.out, fileConn)
close(fileConn)
# Alternatively manually write name of myfile.tex,
# uncomment and modify line below
# texcount.out <- system("texcount -total -sum myfile.tex", intern=TRUE)
# extract relevant rows (depending on what matters for your count)
words_in_text_row <- grep("Words in text", texcount.out, value = TRUE)
words_in_headers_row <- grep("Words in headers", texcount.out, value = TRUE)
words_outside_text_row <- grep("Words outside text", texcount.out, value = TRUE)
pattern <- "(\\d)+" # regex pattern for digits
count1 <- regmatches(words_in_text_row, regexpr(pattern, words_in_text_row)) # extract digits
count2 <- regmatches(words_in_headers_row, regexpr(pattern, words_in_headers_row)) # extract digits
count3 <- regmatches(words_outside_text_row, regexpr(pattern, words_outside_text_row)) # extract digits
# can include or exclude types of text as you want
count <- as.numeric(count1) + as.numeric(count2) + as.numeric(count3)
count <- add_comma(as.numeric(count)) # add comma
# write count so it can be read on next knit
fileConn <- file("count.txt")
writeLines(count, fileConn)
close(fileConn)
# optional, when run interactively print count
# (or use `count` variable elsewhere like title page)
count
```
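The count-extraction step in that chunk can be sanity-checked in isolation. The sample line below merely mimics the shape of a texcount summary line (the exact wording may vary by texcount version):

```r
# Illustrative line shaped like texcount's summary output.
line <- "Words in text: 1234"
pattern <- "(\\d)+"                                # first run of digits
count <- regmatches(line, regexpr(pattern, line))  # extract that run
count                                              # -> "1234"
```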
The above chunk calculates the word count on the prior tex file and ignores anything marked by `%TC:ignore`, BUT the corrected tex file gets overwritten in the knitting process, so you'll see the text `%TC:ignore` in the resulting pdf. The word count is correct, but we need a second step to again remove `%TC:ignore` from the tex file and create a pdf WITHOUT `%TC:ignore`. This step should be done interactively (that is, not through knitting).
```{r tex-to-pdf, eval = FALSE, include = FALSE}
# Because the tex file is created AFTER R code is executed
# you need to run this chunk manually / interactively to
# clean the escaped `\%TC:ignore` commands
# and create a new corrected pdf
# Markdown escapes %TC:ignore in tex file to \%TC:ignore
# To fix, read tex file, replace with unescaped version
tex <- readLines(name_of_file_tex)
tex <- gsub("\\\\%TC:ignore", "%TC:ignore", tex)
tex <- gsub("\\\\%TC:endignore", "%TC:endignore", tex)
fileConn <- file(name_of_file_tex)
writeLines(tex, fileConn)
close(fileConn)
# finally, knit the updated tex to pdf
tools::texi2pdf(name_of_file_tex)
```
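The quadruple backslash in those `gsub()` calls is easy to get wrong, so here is the substitution in isolation (a standalone check, not part of the original answer):

```r
# In the .tex file the marker arrives escaped as "\%TC:ignore".
# The R string "\\\\%TC:ignore" denotes the regex \\%TC:ignore,
# which matches a literal backslash followed by %TC:ignore.
line <- "\\%TC:ignore"    # i.e. the file contains \%TC:ignore
fixed <- gsub("\\\\%TC:ignore", "%TC:ignore", line)
fixed                     # -> "%TC:ignore"
```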
To use `count.txt` in the title page, I have the following in the abstract part of my YAML (again, this means you need to knit twice for `count.txt` to be current):
```
abstract: |
  \noindent Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  \newline \newline
  Keywords: keyword1, keyword2, keyword3
  \newline \newline
  Word Count: `r readr::read_file(here::here("count.txt"))`
```
Upvotes: 0
Reputation: 15369
Rather than doing a word count on the output LaTeX file, we can actually just use the `.Rmd` file directly. This code is similar in its approach to the wordcountaddin you mentioned, but any text between the tags `<!---TC:ignore--->` and `<!---TC:endignore--->` will not be included in the count:
```r
library(stringr)
library(tidyverse)

RmdWords <- function(file) {
  # Create a single string of text
  file_string <- file %>%
    readLines() %>%
    paste0(collapse = " ") %>%
    # Remove YAML header
    str_replace_all("^<--- .*?--- ", "") %>%
    str_replace_all("^--- .*?--- ", "") %>%
    # Remove code chunks and inline code
    str_replace_all("```.*?```", "") %>%
    str_replace_all("`.*?`", "") %>%
    # Remove LaTeX math
    str_replace_all("[^\\\\]\\$\\$.*?[^\\\\]\\$\\$", "") %>%
    str_replace_all("[^\\\\]\\$.*?[^\\\\]\\$", "") %>%
    # Delete text between the ignore tags
    str_replace_all("TC:ignore.*?TC:endignore", "") %>%
    # Replace punctuation with spaces, then collapse runs of whitespace
    str_replace_all("[[:punct:]]", " ") %>%
    str_replace_all("\\s+", " ") %>%
    str_replace_all("<", "") %>%
    str_replace_all(">", "")
  # Save several different results
  word_count <- str_count(file_string, "\\S+")
  char_count <- str_replace_all(string = file_string, " ", "") %>% str_count()
  return(list(num_words = word_count, num_char = char_count, word_list = file_string))
}
```
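The tag-stripping step is the heart of that function. Here is a base-R equivalent of that one `str_replace_all()` call, runnable on its own (the sample string is mine, not from the question):

```r
# Non-greedy ".*?" deletes everything between the first TC:ignore and
# the next TC:endignore, tags included (perl = TRUE enables lazy
# quantifiers in base R's regex engine).
txt <- "Keep these words <!---TC:ignore---> but drop these <!---TC:endignore---> and keep these too."
cleaned <- gsub("TC:ignore.*?TC:endignore", "", txt, perl = TRUE)
cleaned
```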
The function returns a list of three items: `num_words` (the word count), `num_char` (the character count), and `word_list` (the cleaned text string the counts were computed from).
If you want to display the results within the compiled report, you can write the R code inline as follows:
````
```{r}
words <- RmdWords("MWE.Rmd")
```

There are seven words with 34 characters.

<!-- TC:ignore -->
Words that I want to exclude but cannot,
because the comments do not appear in the `.tex` file.
<!-- TC:endignore -->

<!-- TC:ignore -->
Word Count: `r words$num_words` \newline
Character Count: `r words$num_char`
<!-- TC:endignore -->
````
Note: some of the original script was adapted from http://www.questionflow.org/2017/10/13/how-to-scrape-pdf-and-rmd-to-get-inspiration/
Upvotes: 2