Reputation: 76412

What can R do about a messy data format?

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

Upvotes: 50

Answers (6)

dww

Reputation: 31452

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('[\\+-]+\\n', '', x), drop = c(1,7))

#          Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01    A   A1        0                        0
# 2: 2018-06-03    A   A2        0                        1
# 3: 2018-06-03    A   A3        0                        2
# 4: 2018-06-03    A   A4        1                        1
# 5: 2018-06-03    A   A5        2                        1
# 6: 2018-06-04    A   A6        0                        3
# 7: 2018-06-01    B   B1        0                        1
# 8: 2018-06-02    B   B2        0                        2
# 9: 2018-06-03    B   B3        0                        3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

Upvotes: 35

AkselA

Reputation: 8846

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.

I've since come across a couple more cases and added them to the test suite.

I've also included a table drawn using box-drawing characters. You don't come across this much these days, but for completeness' sake it's here.

x1 <- "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
"

x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
    Date    | Emp1 | Case | Priority | PriorityCountinLast7days 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
 2018-06-01 | A    | A|1  |        0 |                        0 
 2018-06-03 | A    | A|2  |        0 |                        1 
 2018-06-02 | B    | B|2  |        0 |                        2 
 2018-06-03 | B    | B|3  |        0 |                        3 
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"

x3 <- "
 Maths | English | Science | History | Class

  0.1  |  0.2    |  0.3    |  0.2    |  Y2

  0.9  |  0.5    |  0.7    |  0.4    |  Y1

  0.2  |  0.4    |  0.6    |  0.2    |  Y2

  0.9  |  0.5    |  0.2    |  0.7    |  Y1
"

x4 <- "
       Season   |   Team  | W | AHWO
-------------------------------------
1  |  2017/2018 |  TeamA  | 2 | 1.75
2  |  2017/2018 |  TeamB  | 1 | 1.85
3  |  2017/2018 |  TeamC  | 1 | 1.70
4  |  2016/2017 |  TeamA  | 1 | 1.49
5  |  2016/2017 |  TeamB  | 3 | 1.51
6  |  2016/2017 |  TeamC  | 2 | N/A
"

x5 <- "
    A   B   C
  ┌───┬───┬───┐
A │ 5 │ 1 │ 3 │
  ├───┼───┼───┤
B │ 2 │ 5 │ 3 │
  ├───┼───┼───┤
C │ 3 │ 4 │ 4 │
  └───┴───┴───┘
"

x6 <- "
------------------------------------------------------------
|date              |Material          |Description         |
|----------------------------------------------------------|
|10/04/2013        |WM.5597394        |PNEUMATIC           |
|11/07/2013        |GB.D040790        |RING                |
------------------------------------------------------------
------------------------------------------------------------
|date              |Material          |Description         |
|----------------------------------------------------------|
|08/06/2013        |WM.4M01004A05     |TOUCHEUR            |
|08/06/2013        |WM.4M010108-1     |LEVER               |
------------------------------------------------------------
"

My go at a function

f <- function(x=x6, header=TRUE, rem.dup.header=header, 
  na.strings=c("NA", "N/A"), stringsAsFactors=FALSE, ...) {

    # read each row as a character string
    x <- scan(text=x, what="character", sep="\n", quiet=TRUE)

    # keep only lines containing alphanumerics
    x <- x[grep("[[:alnum:]]", x)]
    
    # remove vertical bars with trailing or leading space
    x <- gsub("\\|? | \\|?", " ", x)

    # remove vertical bars at beginning and end of string
    x <- gsub("\\|?$|^\\|?", "", x)

    # remove vertical box-drawing characters
    x <- gsub("\U2502|\U2503|\U2505|\U2507|\U250A|\U250B", " ", x)
    
    if (rem.dup.header) {
        dup.header <- x == x[1]
        dup.header[1] <- FALSE
        x <- x[!dup.header]
    }

    # read the result as a table
    read.table(text=paste(x, collapse="\n"), header=header, 
      na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)    
}


lapply(c(x1, x2, x3, x4, x5, x6), f)

Output

[[1]]
        Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01    A   A1        0                        0
2 2018-06-03    A   A2        0                        1
3 2018-06-02    B   B2        0                        2
4 2018-06-03    B   B3        0                        3

[[2]]
        Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01    A  A|1        0                        0
2 2018-06-03    A  A|2        0                        1
3 2018-06-02    B  B|2        0                        2
4 2018-06-03    B  B|3        0                        3

[[3]]
  Maths English Science History Class
1   0.1     0.2     0.3     0.2    Y2
2   0.9     0.5     0.7     0.4    Y1
3   0.2     0.4     0.6     0.2    Y2
4   0.9     0.5     0.2     0.7    Y1

[[4]]
     Season  Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2016/2017 TeamA 1 1.49
5 2016/2017 TeamB 3 1.51
6 2016/2017 TeamC 2   NA

[[5]]
  A B C
A 5 1 3
B 2 5 3
C 3 4 4

[[6]]
        date      Material Description
1 10/04/2013    WM.5597394   PNEUMATIC
2 11/07/2013    GB.D040790        RING
3 08/06/2013 WM.4M01004A05    TOUCHEUR
4 08/06/2013 WM.4M010108-1       LEVER

x3 is from here (will have to look at the edit history).
x4 is from here
x6 is from here

Upvotes: 12

Rui Barradas

Reputation: 76412

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
   X       Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01    A   A1        0                        0  NA
2 NA 2018-06-03    A   A2        0                        1  NA
3 NA 2018-06-03    A   A3        0                        2  NA
4 NA 2018-06-03    A   A4        1                        1  NA
5 NA 2018-06-03    A   A5        2                        1  NA
6 NA 2018-06-04    A   A6        0                        3  NA
7 NA 2018-06-01    B   B1        0                        1  NA
8 NA 2018-06-02    B   B2        0                        2  NA
9 NA 2018-06-03    B   B3        0                        3  NA

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))]
dat
          Date   Emp1   Case Priority PriorityCountinLast7days
1  2018-06-01   A      A1           0                        0
2  2018-06-03   A      A2           0                        1
3  2018-06-03   A      A3           0                        2
4  2018-06-03   A      A4           1                        1
5  2018-06-03   A      A5           2                        1
6  2018-06-04   A      A6           0                        3
7  2018-06-01   B      B1           0                        1
8  2018-06-02   B      B2           0                        2
9  2018-06-03   B      B3           0                        3

That wasn't too hard, much better.
In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame':   9 obs. of  5 variables:
 $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority                : int  0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

Upvotes: 24

Pawel Stradowski

Reputation: 827

As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

dput(mtcars  %>% head(10), file = 'reproducible.txt')

The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).

df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

Upvotes: -3

Sal-laS

Reputation: 11649

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

Copy the dataset into the Notepad file.
Replace all | characters with ,
Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

But, if you mean use the R to fully understand it in one step, then I have no idea.

Upvotes: 3

Zheyuan Li

Reputation: 73315

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)

## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
#      [,1]         [,2]   [,3]   [,4]       [,5]                      
# [1,] "Date"       "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A"    "A1"   "0"        "0"                       
# [3,] "2018-06-03" "A"    "A2"   "0"        "1"                       
# [4,] "2018-06-03" "A"    "A3"   "0"        "2"                       
# [5,] "2018-06-03" "A"    "A4"   "1"        "1"                       
# [6,] "2018-06-03" "A"    "A5"   "2"        "1"                       
# [7,] "2018-06-04" "A"    "A6"   "0"        "3"                       
# [8,] "2018-06-01" "B"    "B1"   "0"        "1"                       
# [9,] "2018-06-02" "B"    "B2"   "0"        "2"                       
#[10,] "2018-06-03" "B"    "B3"   "0"        "3"

## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
#        Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01    A   A1        0                        0
#2 2018-06-03    A   A2        0                        1
#3 2018-06-03    A   A3        0                        2
#4 2018-06-03    A   A4        1                        1
#5 2018-06-03    A   A5        2                        1
#6 2018-06-04    A   A6        0                        3
#7 2018-06-01    B   B1        0                        1
#8 2018-06-02    B   B2        0                        2
#9 2018-06-03    B   B3        0                        3

## or maybe just use `type.convert` on some columns?
dat[] <- lapply(dat, type.convert)

Upvotes: 6

What can R do about a messy data format?

Answers (6)

Related Questions