Use R to clean raw, old data

Question

So I have some quite old, raw data that looks like the following:

        1
    *******
    *******
    *******
    *******
      S  H
     HHHHH
        2
    *******
    JSH   K
    *******
    *******
    *******
    *******

The first row holds a single number, 1, which is an ID. Rows 2 to 7 are each supposed to have 7 elements, corresponding to 7 categories, say a1, a2, a3, a4, a5, a6, a7. Row 8 is again an ID. So for each individual we have 6 rows of data.

At the end of the day, I want output that looks like this:

   ID a1 a2 a3 a4 a5 a6 a7
1   1  *  *  *  *  *  *  *
2   1  *  *  *  *  *  *  *
3   1  *  *  *  *  *  *  *
4   1  *  *  *  *  *  *  *
5   1        S        H
6   1     H  H  H  H  H
7   2  *  *  *  *  *  *  *
8   2  J  S  H           K
9   2  *  *  *  *  *  *  *
10  2  *  *  *  *  *  *  *
11  2  *  *  *  *  *  *  *
12  2  *  *  *  *  *  *  *

The file does not have a filename extension (like .csv or .txt). So the first question is how to read it into R while keeping the layout of the data unchanged.

I tried read.csv(), but it turns the 6th row into

SH

which assigns S to the 1st category instead of the 3rd, and H to the 2nd category instead of the 6th. And ultimately, how can I generate the desired data frame?

Daniel Anderson · Accepted Answer

Seems to me you're probably looking for read.fwf, which reads fixed-width fields. Below is the method I used. Of course, you'd want to drop the textConnection part and replace it with the path to your file, but otherwise I think this should work.

d <- read.fwf(textConnection(
"    1  
*******
*******
*******
*******
  S  H 
 HHHHH 
    2  
*******
JSH   K
*******
*******
*******
*******"), 
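    widths = rep(1, 7),          # seven one-character columns
    na.strings = " ",            # a blank cell becomes NA
    stringsAsFactors = FALSE)

# Every 7th row, starting with the first, is an ID line; the ID sits in column 5
id_rows <- seq(1, nrow(d), 7)
id <- as.numeric(d[id_rows, 5])
id <- rep(id, each = 6)

# Drop the ID lines and keep the six data rows that follow each of them
d <- d[-id_rows, ]
d <- cbind(id, d)
names(d)[-1] <- paste0("a", 1:7)
d

   id   a1   a2   a3   a4   a5   a6   a7
2   1    *    *    *    *    *    *    *
3   1    *    *    *    *    *    *    *
4   1    *    *    *    *    *    *    *
5   1    *    *    *    *    *    *    *
6   1 <NA> <NA>    S <NA> <NA>    H <NA>
7   1 <NA>    H    H    H    H    H <NA>
9   2    *    *    *    *    *    *    *
10  2    J    S    H <NA> <NA> <NA>    K
11  2    *    *    *    *    *    *    *
12  2    *    *    *    *    *    *    *
13  2    *    *    *    *    *    *    *
14  2    *    *    *    *    *    *    *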
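
If you'd rather read straight from the file, here is a rough sketch of the same idea wrapped in a function. The missing extension shouldn't matter, since read.fwf only needs a path. Note that read_blocks and its arguments n_cols and rows_per_id are names I made up for illustration, and the temp file just stands in for your real file. I'm also assuming every ID is followed by exactly 6 data rows, and I paste the non-blank cells of each ID line together so the ID doesn't have to sit in column 5.

```r
# Hypothetical helper (not part of base R): reads a file made of blocks,
# each block being one ID line followed by `rows_per_id` data lines of
# `n_cols` one-character fields.
read_blocks <- function(path, n_cols = 7, rows_per_id = 6) {
  raw <- read.fwf(path,
                  widths = rep(1, n_cols),
                  na.strings = " ",          # blank cell -> NA
                  colClasses = "character")  # keep every cell as text

  block <- rows_per_id + 1                   # ID line + data lines
  stopifnot(nrow(raw) %% block == 0)         # assumes whole blocks only
  id_rows <- seq(1, nrow(raw), by = block)

  # Collapse the non-missing cells of each ID line into one string
  ids <- apply(raw[id_rows, , drop = FALSE], 1,
               function(x) paste(x[!is.na(x)], collapse = ""))

  out <- cbind(id = rep(as.numeric(ids), each = rows_per_id),
               raw[-id_rows, , drop = FALSE])
  names(out)[-1] <- paste0("a", seq_len(n_cols))
  rownames(out) <- NULL                      # renumber rows 1..n
  out
}

# Demo: write the sample to a temp file with no extension, then read it back
path <- tempfile()
writeLines(c("    1  ", rep("*******", 4), "  S  H ", " HHHHH ",
             "    2  ", "*******", "JSH   K", rep("*******", 4)), path)
res <- read_blocks(path)
res
```

The stopifnot check is there so that a file with a truncated or extra line fails loudly instead of silently assigning the wrong ID to a row.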
