Adam_G

Reputation: 7879

Convert HTML to R Markdown

Is there a way to take an HTML file, such as https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html, and convert it to an executable R Markdown file (.Rmd)?

Upvotes: 5

Views: 15844

Answers (5)

Mark Neal

Reputation: 1206

You can get about 98% of the way there by:

  1. Opening a new R Markdown file (in RStudio v1.4+),
  2. Clicking the "Switch to visual markdown editor" button*,
  3. Selecting and copying the HTML output from the browser,
  4. Pasting it into your R Markdown file.

To get the last 2%, you will want to ensure the R code chunks are recognised:

  1. Click the "Switch to source editor" button (same button as above).
  2. Find and replace <!-- --> with ```{r}, then close each code chunk with ```.

And ensure any data the code requires is available. Good luck!

*To switch into visual mode for a markdown document, use the button with the compass icon at the top-right of the editor toolbar - described here: https://blog.rstudio.com/2020/09/30/rstudio-v1-4-preview-visual-markdown-editing/

Upvotes: 6

:~$ ## convert .html to .md :
:~$ pandoc Assessment-Week2B.html -o Assessment-Week2B.md
:~$ 
:~$ ## rename .md to .Rmd (same case as the file opened below)
:~$ mv Assessment-Week2B.md Assessment-Week2B.Rmd
:~$ 
:~$ ## edit via RStudio
:~$ rstudio Assessment-Week2B.Rmd

I tried editing the file in the terminal, but editing it in RStudio is easier.
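
The convert-and-rename steps above can be folded into one helper. This is only a sketch: the function names `rmd_name` and `html_to_rmd` are made up for illustration, and it assumes pandoc is installed and the input name ends in `.html`.

```shell
# Hypothetical helper: derive the .Rmd file name for an .html input.
rmd_name() {
  printf '%s\n' "${1%.html}.Rmd"
}

# Hypothetical helper: convert one HTML file straight to .Rmd.
# Writing to the .Rmd name directly makes the separate mv step unnecessary.
html_to_rmd() {
  if ! command -v pandoc >/dev/null 2>&1; then
    echo "pandoc not found on PATH" >&2
    return 1
  fi
  pandoc -f html -t markdown -o "$(rmd_name "$1")" "$1"
}

# Example call (not run here):
#   html_to_rmd Assessment-Week2B.html
```

The result is still plain markdown with an .Rmd extension, so the code chunks need to be marked up by hand afterwards.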

Upvotes: 1

MLdish

Reputation: 71

Here is the solution I use:

  • convert .html to .md :
pandoc ./test.html -o test.md
  • rename .md to .rmd
mv test.md test.rmd
  • post-process the code to organize chunk and paragraphs
# chunks r marker: replace ' {\.sourceCode \.r}' by '{r}'
sed -i 's/ {\.sourceCode \.r/{r/' test.rmd
# delete lines beginning with ':::'
sed -i '/^:::/d' test.rmd
# delete lines beginning '![](data:image' (html plot)
sed -i '/^\!\[\](data:image/d' test.rmd
# delete paragraph separator lines
sed -i '/^=====/d' test.rmd
sed -i '/^-----/d' test.rmd
# replace paragraph marks
#'[1]{.header-section-number}' by '#'
sed -i 's/\[[0-9]\+\]{\.header-section-number}/#/' test.rmd
#'[1.1]{.header-section-number}' by '##'
sed -i 's/\[[0-9]\+\.[0-9]\+\]{\.header-section-number}/##/' test.rmd
#'[1.1.1]{.header-section-number}' by '###'
sed -i 's/\[[0-9]\+\.[0-9]\+\.[0-9]\+\]{\.header-section-number}/###/' test.rmd
  • add YAML header
echo "$(echo -e "\n" | cat - test.rmd)" > test.rmd
echo "$(echo '---' | cat - test.rmd)" > test.rmd
echo "$(echo 'title: '\"'test'\" | cat - test.rmd)" > test.rmd
echo "$(echo '---' | cat - test.rmd)" > test.rmd

Of course, you can put these lines in a .sh script to simplify the task.
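
As a rough idea of what such a script could look like, here is a sketch that applies the same substitutions as a filter on stdin instead of editing the file in place. The function names `clean_rmd` and `yaml_header` are hypothetical, and the patterns are rewritten for `sed -E` (extended regular expressions) on the assumption that this avoids relying on GNU-only `\+` and `-i`.

```shell
# Hypothetical filter: reads pandoc's markdown on stdin and writes
# cleaned-up R Markdown on stdout, mirroring the sed steps listed above.
clean_rmd() {
  sed -E \
    -e 's/ \{\.sourceCode \.r\}/{r}/' \
    -e '/^:::/d' \
    -e '/^!\[\]\(data:image/d' \
    -e '/^(=====|-----)/d' \
    -e 's/\[[0-9]+\.[0-9]+\.[0-9]+\]\{\.header-section-number\}/###/' \
    -e 's/\[[0-9]+\.[0-9]+\]\{\.header-section-number\}/##/' \
    -e 's/\[[0-9]+\]\{\.header-section-number\}/#/'
}

# Hypothetical helper: print a minimal YAML header with the given title.
yaml_header() {
  printf '%s\n' '---' "title: \"$1\"" '---' ''
}

# Example call (not run here; assumes pandoc is installed):
#   { yaml_header "test"; pandoc test.html -t markdown | clean_rmd; } > test.rmd
```

Writing the header first and piping the body after it replaces the four echo lines, since nothing has to be prepended to an existing file.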

Upvotes: 7

Dirk is no longer here

Reputation: 368241

In short, no.

The pandoc binary is almost pure awesomeness, and I use it, e.g., to convert the HTML output from an Rd file back into markdown (to be included in other markdown documents).

But that uses pandoc for what it knows: convert from markdown to html etc. pandoc itself knows nothing about R. So apart from the metaphysical difficulty of getting the code back from the output it created, you have a tool mismatch.

So in sum: you probably want the original source code, as you cannot recreate the Rmd from the HTML output it produces.

Upvotes: 5

G. Grothendieck

Reputation: 269586

If a markdown file (.md) is sufficient, then download and install pandoc if you don't already have it. Then run this from the command line, or use system("pandoc ...") or shell("pandoc ...") from within R.

pandoc https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html -o out.md

For a particular file, it would be possible to post-process the source code and output sections, but that would take some additional effort, possibly substantial.
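
One small piece of that post-processing could look like the sketch below. It rests on two assumptions that may not hold for every file: that the printed R output keeps knitr's default `##` prefix, and that pandoc writes those output blocks as four-space indented code. The function name `drop_knitr_output` is made up for illustration.

```shell
# Hypothetical filter: drop lines that look like knitr output, i.e. a
# four-space indent followed by "##". Markdown headings such as
# "## Title" start at column 0 and are left alone.
drop_knitr_output() {
  sed -E '/^    ##( |$)/d'
}

# Example call (not run here):
#   drop_knitr_output < out.md > out-no-output.md
```

Marking up the remaining source blocks as R chunks would still need to be done separately.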

Upvotes: 5
