Reputation: 175
I have some quite old raw data that looks like the following:
    1
*******
*******
*******
*******
  S  H
 HHHHH
    2
*******
JSH   K
*******
*******
*******
*******
The first row contains a single number, 1, which is an ID. Each of the following rows (2 to 7) is supposed to have 7 elements, corresponding to 7 categories, say a1, a2, a3, a4, a5, a6, a7. Row 8 is again an ID. So for each individual we have one ID row followed by 6 data rows.
At the end of the day, I want output that looks like this:
   ID   a1   a2 a3   a4   a5   a6   a7
1   1    *    *  *    *    *    *    *
2   1    *    *  *    *    *    *    *
3   1    *    *  *    *    *    *    *
4   1    *    *  *    *    *    *    *
5   1 <NA> <NA>  S <NA> <NA>    H <NA>
6   1 <NA>    H  H    H    H    H <NA>
7   2    *    *  *    *    *    *    *
8   2    J    S  H <NA> <NA> <NA>    K
9   2    *    *  *    *    *    *    *
10  2    *    *  *    *    *    *    *
11  2    *    *  *    *    *    *    *
12  2    *    *  *    *    *    *    *
The data file has no filename extension (such as .csv or .txt). So my first question is how to read it into R while keeping the layout, including the blanks, unchanged.
I tried read.csv(), but it turns the 6th row into
SH
which assigns S to the 1st category instead of the 3rd, and H to the 2nd category instead of the 6th. Ultimately, how can I generate the desired data frame?
Upvotes: 1
Views: 48
Reputation: 2424
It seems to me you're probably looking for read.fwf. Below is the method I used. Of course, you'd want to drop the textConnection part and replace it with the path to your file, but otherwise I think this should work.
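d <- read.fwf(textConnection(
"    1  
*******
*******
*******
*******
  S  H 
 HHHHH 
    2  
*******
JSH   K
*******
*******
*******
*******"),
  widths = rep(1, 7),             # seven one-character columns
  na.strings = c(" ", "", "NA"),  # blank cells become NA, also when a line has lost its trailing blanks
  stringsAsFactors = FALSE)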
# The ID sits in column 5 of every 7th row (rows 1, 8, ...)
id_rows <- seq(1, nrow(d), 7)
id <- as.numeric(d[id_rows, 5])
id <- rep(id, each = 6)
# Drop the ID rows (note the minus sign), keeping the six data rows per ID
d <- d[-id_rows, ]
d <- cbind(id, d)
names(d)[-1] <- paste0("a", 1:7)
d
   id   a1   a2 a3   a4   a5   a6   a7
2   1    *    *  *    *    *    *    *
3   1    *    *  *    *    *    *    *
4   1    *    *  *    *    *    *    *
5   1    *    *  *    *    *    *    *
6   1 <NA> <NA>  S <NA> <NA>    H <NA>
7   1 <NA>    H  H    H    H    H <NA>
9   2    *    *  *    *    *    *    *
10  2    J    S  H <NA> <NA> <NA>    K
11  2    *    *  *    *    *    *    *
12  2    *    *  *    *    *    *    *
13  2    *    *  *    *    *    *    *
14  2    *    *  *    *    *    *    *
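If the number of data rows per ID is not guaranteed to be exactly six, a variation on the same idea is to find the ID rows by their content instead of by position. This is only a minimal sketch: it assumes an ID row contains nothing but digits and blanks, and that the file starts with an ID row. The names raw, txt, is_id and res are made up for illustration, and raw stands in for the lines of your file.

```r
# Sample lines copied from the question (stand-in for the real file)
raw <- c("    1  ", "*******", "*******", "*******", "*******",
         "  S  H ", " HHHHH ",
         "    2  ", "*******", "JSH   K", "*******", "*******",
         "*******", "*******")

# Read seven one-character columns as text; blank cells become NA
d <- read.fwf(textConnection(raw), widths = rep(1, 7),
              na.strings = c(" ", "", "NA"), colClasses = "character")

# Rebuild the non-blank text of every row;
# assumption: a row is an ID row if that text consists of digits only
txt <- apply(d, 1, function(r) paste(r[!is.na(r)], collapse = ""))
is_id <- grepl("^[0-9]+$", txt)

# Carry each ID forward over the rows below it, then drop the ID rows
id <- as.numeric(txt[is_id])[cumsum(is_id)]
res <- data.frame(id = id[!is_id], d[!is_id, ])
names(res)[-1] <- paste0("a", 1:7)
rownames(res) <- NULL
res
```

With this version a block may have any number of data rows, at the price of misreading a data row that happens to contain only digits.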
Upvotes: 2
Reputation: 1110
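Try readLines. It will read the file into a character vector. You can then split each string into single characters with strsplit(x, ""):
# Read the file as-is, one string per line
v1 = readLines("C:/User/Yourfolder/test_text")
# Split every line into single characters: rows are lines, columns are positions
v2 = t(sapply(v1, function(x) unlist(strsplit(x, ""))))
rownames(v2) = seq_along(v1)
Output: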
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
1 " " " " " " " " "1" " " " "
2 "*" "*" "*" "*" "*" "*" "*"
3 "*" "*" "*" "*" "*" "*" "*"
4 "*" "*" "*" "*" "*" "*" "*"
5 "*" "*" "*" "*" "*" "*" "*"
6 " " " " "S" " " " " "H" " "
7 " " "H" "H" "H" "H" "H" " "
8 " " " " " " " " "2" " " " "
9 "*" "*" "*" "*" "*" "*" "*"
10 "J" "S" "H" " " " " " " "K"
11 "*" "*" "*" "*" "*" "*" "*"
12 "*" "*" "*" "*" "*" "*" "*"
13 "*" "*" "*" "*" "*" "*" "*"
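To get from that character matrix to the data frame asked for in the question, something along these lines might work. It is a rough sketch rather than a tested solution for the real file: it assumes every block is one ID line followed by exactly six data lines, and it pads short lines first, because the split only gives a rectangular result when every line has 7 characters. The vector v1 below is a stand-in for the result of readLines, and out is a name made up for illustration.

```r
# Stand-in for readLines(...); some lines have lost their trailing blanks
v1 <- c("    1", "*******", "*******", "*******", "*******",
        "  S  H", " HHHHH",
        "    2", "*******", "JSH   K", "*******", "*******",
        "*******", "*******")

# Pad every line with blanks on the right to exactly 7 characters
v1 <- sprintf("%-7s", v1)

# One row per line, one column per character position; blanks become NA
v2 <- do.call(rbind, strsplit(v1, ""))
v2[v2 == " "] <- NA

# Assumption: ID lines sit at rows 1, 8, 15, ... (one ID plus six data rows)
id_rows <- seq(1, nrow(v2), by = 7)
id <- as.numeric(trimws(v1[id_rows]))

# Repeat each ID over its six data rows and drop the ID rows themselves
out <- data.frame(id = rep(id, each = 6), v2[-id_rows, ],
                  stringsAsFactors = FALSE)
names(out)[-1] <- paste0("a", 1:7)
out
```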
Upvotes: 0