user6072192
user6072192

Reputation:

Formatting of data in R?

I am currently working with a data file that contains thousands of records and i have to format them using R.

This is how my data frame looks like now

    ROWID  CAT SERIALNO SUB PRODUCTNAME HOMENUM Start.X Start.Y End.X End.Y
1 111111111 CATA       10  43  PRODUCT A1      NA      NA      NA    NA    NA
2         1            NA  NA                  NA      NA      NA    NA    NA
3         2    3       NA  NA                  NA      NA      NA    NA    NA
4         4    5       NA  NA                  NA      NA      NA    NA    NA
5 555555555 CATB       13  76  PRODUCT A2      NA      NA      NA    NA    NA
6         6            NA  NA                  NA      NA      NA    NA    NA
7         7    8       NA  NA                  NA      NA      NA    NA    NA
8         9   10       NA  NA                  NA      NA      NA    NA    NA

My desired format

    ROWID  CAT SERIALNO SUB PRODUCTNAME HOMENUM Start.X Start.Y End.X End.Y
1 111111111 CATA       10  43  PRODUCT A1       1       2       3     4     5
2 555555555 CATB       13  76  PRODUCT A2       6       7       8     9    10

As seen from the first screenshot above, the values for the last 4 columns are in row 2,3,4 and 6,7,8 respectively

I tried using the t() function but it does not seems to produce what i require and using the fix() function to manually arrange the data would be impossible since i am dealing with a large datafile.

Is there anyway to achieve the desired format using R ?

EDIT: Results of dput()

structure(list(V1 = structure(c(9L, 2L, 1L, 3L, 4L, 5L, 6L, 7L, 
8L), .Label = c("1", "111111111", "2", "4", "555555555", "6", 
"7", "9", "ROWID"), class = "factor"), V2 = structure(c(6L, 7L, 
1L, 3L, 4L, 8L, 1L, 5L, 2L), .Label = c("", "10", "3", "5", "8", 
"CAT", "CATA", "CATB"), class = "factor"), V3 = structure(c(4L, 
2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L), .Label = c("", "10", "13", "SERIALNO"
), class = "factor"), V4 = structure(c(4L, 2L, 1L, 1L, 1L, 3L, 
1L, 1L, 1L), .Label = c("", "43", "76", "SUB"), class = "factor"), 
    V5 = structure(c(4L, 2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L), .Label = c("", 
    "PRODUCT A1", "PRODUCT A2", "PRODUCTNAME"), class = "factor"), 
    V6 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "HOMENUM"), class = "factor"), V7 = structure(c(2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Start X"), class = "factor"), 
    V8 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "Start Y"), class = "factor"), V9 = structure(c(2L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "End X"), class = "factor"), 
    V10 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "End Y"), class = "factor")), .Names = c("V1", "V2", "V3", 
"V4", "V5", "V6", "V7", "V8", "V9", "V10"), class = "data.frame", row.names = c(NA, 
-9L))

Upvotes: 1

Views: 129

Answers (1)

Roland
Roland

Reputation: 132576

I suspect that you should work on your data import first. Your import didn't use header = TRUE, so that we have to fix the column names first:

names(DF) <- as.character(unlist(DF[1,]))
DF <- DF[-1,]

Then we can select every fifth row:

DF1 <- DF[seq_len(nrow(DF)) %% 4 == 1L,]

Now we can select the first two columns of all other rows and transpose them:

temp <- t(DF[seq_len(nrow(DF)) %% 4 != 1L, 1:2])

We remove empty cells from the resulting character matrix and turn the resulting character vector into a five-column matrix, which is then assigned to the last five columns of the new data.frame:

DF1[, 6:10] <- matrix(temp[temp != ""], ncol = 5, byrow = TRUE)

Finally, we convert the column types, so that number are actually numbers and not characters:

DF1[] <- lapply(DF1, function(x) type.convert(as.character(x), as.is = TRUE))
print(DF1)
#      ROWID  CAT SERIALNO SUB PRODUCTNAME HOMENUM Start X Start Y End X End Y
#2 111111111 CATA       10  43  PRODUCT A1       1       2       3     4     5
#6 555555555 CATB       13  76  PRODUCT A2       6       7       8     9    10

Upvotes: 1

Related Questions