Reputation: 1378
There are numerous ways to convert a data frame into a Markdown table. But how does one convert back to a dataframe, given a Markdown table
Given a table of a form:
Table Header | Second Header
------------- | -------------
Table Cell | Cell 2
Cell 3 | Cell 4
Or, even worse, in a form
Table Header | Second Header \n------------- | ------------- \nTable Cell | Cell 2 \nCell 3 | Cell 4
how does one get it into a data frame?
Upvotes: 12
Views: 3629
Reputation: 100
I wrote a package specifically dedicated to reading these tables: {readMDTable). It can read markdown tables as tibbles from strings, files, and URLs.
Upvotes: 1
Reputation: 43354
I wrote a couple functions to deal with these issues, though I suspect this is more a problem for people writing SO answers than anyone else. Regardless:
# base R version
read.markdown <- function(file, stringsAsFactors = FALSE, strip.white = TRUE, ...){
if (length(file) > 1) {
lines <- file
} else if (grepl('\n', file)) {
con <- textConnection(file)
lines <- readLines(con)
close(con)
} else {
lines <- readLines(file)
}
lines <- lines[!grepl('^[[:blank:]+-=:_|]*$', lines)]
lines <- gsub('(^\\s*?\\|)|(\\|\\s*?$)', '', lines)
read.delim(text = paste(lines, collapse = '\n'), sep = '|',
stringsAsFactors = stringsAsFactors, strip.white = strip.white, ...)
}
# readr version
read_markdown <- function(file, trim_ws = TRUE, ...){
if (length(file) > 1) {
lines <- file
} else {
lines <- readr::read_lines(file)
}
lines <- lines[!grepl('^[[:blank:]+-=:_|]*$', lines)]
lines <- gsub('(^\\s*?\\|)|(\\|\\s*?$)', '', lines)
readr::read_delim(paste(lines, collapse = '\n'), delim = '|',
trim_ws = trim_ws, ...)
}
They can handle many variations of markdown tables, and are happy with a single string:
read.markdown('Table Header | Second Header \n------------- | ------------- \nTable Cell | Cell 2 \nCell 3 | Cell 4 ')
#> Table.Header Second.Header
#> 1 Table Cell Cell 2
#> 2 Cell 3 Cell 4
a vector of lines (like one gets from clipr::read_clip
):
clipr::write_clip(
' | | mpg | cyl | disp | hp |
|:-------------------:|:-----:|:-----:|:------:|:----:|
| Mazda RX4 | 21 | 6 | 160 | 110 |
| Mazda RX4 Wag | 21 | 6 | 160 | 110 |
| Datsun 710 | 22.8 | 4 | 108 | 93 |'
)
read.markdown(clipr::read_clip())
#> X mpg cyl disp hp
#> 1 Mazda RX4 21.0 6 160 110
#> 2 Mazda RX4 Wag 21.0 6 160 110
#> 3 Datsun 710 22.8 4 108 93
or a filename (though the file can only contain the table and whitespace):
tmp <- tempfile()
writeLines(
' +---------------------+-------+-------+--------+------+--------+
| | mpg | cyl | disp | hp | drat |
+=====================+=======+=======+========+======+========+
| Mazda RX4 | 21 | 6 | 160 | 110 | 3.9 |
+---------------------+-------+-------+--------+------+--------+
| Mazda RX4 Wag | 21 | 6 | 160 | 110 | 3.9 |
+---------------------+-------+-------+--------+------+--------+
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 |
+---------------------+-------+-------+--------+------+--------+',
tmp
)
read_markdown(tmp)
#> Warning: Missing column names filled in: 'X1' [1]
#> # A tibble: 3 x 6
#> X1 mpg cyl disp hp drat
#> <chr> <dbl> <int> <int> <int> <dbl>
#> 1 Mazda RX4 21.0 6 160 110 3.90
#> 2 Mazda RX4 Wag 21.0 6 160 110 3.90
#> 3 Datsun 710 22.8 4 108 93 3.85
Edit: I've put these functions in a package, should anyone find them useful.
Upvotes: 16
Reputation: 1378
I typed up the question, but then realized that answer is quite simple.
Function read_delim
in readr
package handles this easily:
library(readr)
library(dplyr)
object <- 'Table Header | Second Header \n------------- | ------------- \nTable Cell | Cell 2 \nCell 3 | Cell 4'
data_frame <- read_delim(object, delim = '|')
# A tibble: 3 x 2
`Table Header ` ` Second Header `
<chr> <chr>
1 ------------- -------------
2 Table Cell Cell 2
3 Cell 3 Cell 4
One would only need to filter out the '-------' row. Et voila.
Hope this solution helps someone.
Upvotes: 9