skyindeer

Reputation: 175

Use R to clean old, raw data

I have some quite old, raw data that looks like the following:

        1
    *******
    *******
    *******
    *******
      S  H
     HHHHH
        2
    *******
    JSH   K
    *******
    *******
    *******
    *******

The first row holds a single number, 1, which is an ID. Rows 2 to 7 are each supposed to have 7 elements, corresponding to 7 categories, say a1, a2, a3, a4, a5, a6, a7. Row 8 is again an ID. So each individual has one ID row followed by 6 data rows.

In the end, I want output that looks like this:

   ID   a1   a2 a3   a4   a5   a6   a7
1   1    *    *  *    *    *    *    *
2   1    *    *  *    *    *    *    *
3   1    *    *  *    *    *    *    *
4   1    *    *  *    *    *    *    *
5   1 <NA> <NA>  S <NA> <NA>    H <NA>
6   1 <NA>    H  H    H    H    H <NA>
7   2    *    *  *    *    *    *    *
8   2    J    S  H <NA> <NA> <NA>    K
9   2    *    *  *    *    *    *    *
10  2    *    *  *    *    *    *    *
11  2    *    *  *    *    *    *    *
12  2    *    *  *    *    *    *    *

The data file does not have a filename extension (like .csv or .txt). So the first question is how to read it into R while keeping the column layout unchanged.

I tried read.csv(), but it turns the 6th row into

SH

which assigns S to the 1st category instead of the 3rd, and H to the 2nd category instead of the 6th. Ultimately, how can I generate the desired data frame?

Upvotes: 1

Views: 48

Answers (2)

Daniel Anderson

Reputation: 2424

Seems to me you're probably looking for read.fwf. Below is the method I used. Of course, you'd want to get rid of the textConnection part and replace it with the path to your file but otherwise I think this should work.

d <- read.fwf(textConnection(
"    1  
*******
*******
*******
*******
  S  H 
 HHHHH 
    2  
*******
JSH   K
*******
*******
*******
*******"), 
    widths = rep(1, 7),          # seven columns, one character each
    na.strings = " ",            # a blank position becomes NA
    stringsAsFactors = FALSE)

# The ID lines are rows 1, 8, 15, ...; the ID itself sits in column 5
id_rows <- seq(1, nrow(d), 7)
id <- as.numeric(d[id_rows, 5])
id <- rep(id, each = 6)

# Drop the ID lines and keep the six data rows of each block
d <- d[-id_rows, ]
d <- cbind(id, d)
names(d)[-1] <- paste0("a", 1:7)
d

   id   a1   a2   a3   a4   a5   a6   a7
2   1    *    *    *    *    *    *    *
3   1    *    *    *    *    *    *    *
4   1    *    *    *    *    *    *    *
5   1    *    *    *    *    *    *    *
6   1 <NA> <NA>    S <NA> <NA>    H <NA>
7   1 <NA>    H    H    H    H    H <NA>
9   2    *    *    *    *    *    *    *
10  2    J    S    H <NA> <NA> <NA>    K
11  2    *    *    *    *    *    *    *
12  2    *    *    *    *    *    *    *
13  2    *    *    *    *    *    *    *
14  2    *    *    *    *    *    *    *
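
As for the missing filename extension: R's file readers take a path and do not look at the extension, so the same call should work on the real file. Below is a minimal sketch of that, using a temporary file with no extension as a stand-in for the actual data. The names path and raw are just illustrative, and the sketch assumes every line is padded to 7 characters, as in the example above.

```r
# Hypothetical stand-in for the real file: tempfile() yields a path with no extension
path <- tempfile()
writeLines(c("    1  ", "*******", "*******", "*******", "*******",
             "  S  H ", " HHHHH ",
             "    2  ", "*******", "JSH   K", "*******", "*******",
             "*******", "*******"), path)

# Same read.fwf call as above, but pointed at a file path instead of a textConnection
raw <- read.fwf(path, widths = rep(1, 7), na.strings = " ",
                stringsAsFactors = FALSE)
dim(raw)
```

If the real file has lines with missing trailing blanks, those positions may come back as empty strings rather than NA, so they might need to be converted separately.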

Upvotes: 2

Bea

Reputation: 1110

Try readLines. It reads the file into a character vector, one element per line. You can then split each string into single characters with strsplit(x, ""):

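# Read the file as-is; no filename extension is needed
v1 <- readLines("C:/User/Yourfolder/test_text")
# Split each line into single characters (assumes every line has the same width)
v2 <- t(sapply(v1, function(x) unlist(strsplit(x, ""))))
rownames(v2) <- seq_along(v1)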

output:

   [,1] [,2] [,3] [,4] [,5] [,6] [,7]
1  " "  " "  " "  " "  "1"  " "  " " 
2  "*"  "*"  "*"  "*"  "*"  "*"  "*" 
3  "*"  "*"  "*"  "*"  "*"  "*"  "*" 
4  "*"  "*"  "*"  "*"  "*"  "*"  "*" 
5  "*"  "*"  "*"  "*"  "*"  "*"  "*" 
6  " "  " "  "S"  " "  " "  "H"  " " 
7  " "  "H"  "H"  "H"  "H"  "H"  " " 
8  " "  " "  " "  " "  "2"  " "  " " 
9  "*"  "*"  "*"  "*"  "*"  "*"  "*" 
10 "J"  "S"  "H"  " "  " "  " "  "K" 
11 "*"  "*"  "*"  "*"  "*"  "*"  "*" 
12 "*"  "*"  "*"  "*"  "*"  "*"  "*" 
13 "*"  "*"  "*"  "*"  "*"  "*"  "*" 
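
To get from this character matrix to the data frame in the question, one possible continuation is sketched below. It is not part of the approach above as originally written: it swaps strsplit for substring so that lines with missing trailing blanks still yield 7 columns, and it assumes each block is one ID line followed by 6 data lines. The names width, rows_per_block, m, id_rows and result are illustrative, and the inline vector stands in for readLines on the real file.

```r
# Sample lines standing in for readLines("path/to/file"); some lines
# deliberately lack trailing blanks to show that padding is not required.
v1 <- c("    1", "*******", "*******", "*******", "*******",
        "  S  H", " HHHHH",
        "    2", "*******", "JSH   K", "*******", "*******",
        "*******", "*******")

width <- 7           # assumed number of categories per row
rows_per_block <- 7  # assumed: 1 ID line followed by 6 data lines

# Split every line into exactly `width` single characters;
# substring() returns "" for positions past the end of a short line.
m <- t(vapply(v1, function(x) substring(x, 1:width, 1:width),
              character(width), USE.NAMES = FALSE))

# Treat blanks (and missing trailing positions) as NA
m[m %in% c("", " ")] <- NA

# ID lines sit at positions 1, 8, 15, ...
id_rows <- seq(1, nrow(m), by = rows_per_block)
id <- as.numeric(trimws(v1[id_rows]))

# Repeat each ID for its data rows and attach the category columns
result <- data.frame(ID = rep(id, each = rows_per_block - 1),
                     m[-id_rows, , drop = FALSE],
                     stringsAsFactors = FALSE)
names(result)[-1] <- paste0("a", 1:width)
rownames(result) <- NULL
result
```

If the real file has blocks of a different length, or IDs that are not purely numeric, rows_per_block and the as.numeric step would need adjusting.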

Upvotes: 0
