homer3018
homer3018

Reputation: 329

Split a single column into a multiple columns dataframe

So I've solved my initial problem using a suggestion from G. Grothendieck, thanks again, exactly the clean way of doing that I was after. Initial post is here. Now reality is that my file is just a little more subtle .

It actually looks like this (see also the data section at end of post for reproducible format):

A1
100
200
txt 
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.

Preliminary data wrangling looks like this :

type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))

raw_data<- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)

raw_data$Occurences <- 0

giving :

  V1     V2     V3   V4      V5   V6          V7
1 A1    100    200 txt     <NA> <NA>        <NA>
2 A2 String    300  400 txt txt  txt txt txt txt
3 A3 String String  150     250 <NA>        <NA>
4 A2   <NA>   <NA> <NA>    <NA> <NA>        <NA>

The issue is that df[3,4] should be in df[3,2], and I should state "2" in a new column. Same goes on line 2, where df[2,3] should be in df[2,2] and state "1" in that same addiotional column. In other words, I'm chasing this :

  V1  V2  V3      V4   V5          V6 Occurences
1 A1 100 200    txt  <NA>        <NA>          0
2 A2 300 400 txt txt  txt txt txt txt          1
3 A3 150 250    <NA> <NA>        <NA>          2
4 A2  NA  NA    <NA> <NA>        <NA>          0

The STRING is right after A something, sometimes it does not occur and sometimes just one time or several occurences. This is what I did to handle the problem :

#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df$Occurences = df$Occurences + 1
      for (j in 2:ncol(df)-1){
            if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  if(length(z) > 0){formatString(df)}
}

This function is supposed to process exclusively the lines where STRING is found in column 2, it would increment the last column (Occurences) and then shift all the values one column to the left so that they all go back to where they are expected to be. The IS.NA thing is just to try and stop the loop when we're starting to see NAs. Once we've processed these lines, we look again if STRING is in the column 2, and if so call again the function.

Now my problem is that the function looks like it is processing (takes like 20 sec on almost 19k observations and 261 columns, not really sure it's the best in terms of processing time), expect my dataframe was not updated at the end of the loop. However z got updated so it seems to be working the way it should.

What am I missing ?

The data

Data in reproductible form :

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -14L), class = "data.frame")

Upvotes: 1

Views: 391

Answers (2)

homer3018
homer3018

Reputation: 329

So, because my example was not as accurate as I thought, the data.table approach does not suit my need. Actually the STRING is not necessarily between integer.

I came up with this function :

#Count "STRING" occurences and readjust values in expected columns
formatSearching <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df[i,"String_occurences"] = df[i,"String_occurences"] + 1
      for (j in 2:(ncol(raw_data)-1)){
        if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  #if(length(z) > 0){formatSearching(df)} This somehow does work, but does not update df...
  return(df)
}

Because of the comment at the end, I'm calling it like this :

raw_data <- formatSearching(raw_data)
while(length(z) > 0){raw_data <- formatSearching(raw_data)}

There are several issues with all this. First my intent was not to put a while in the middle of the processing of my data, but have a fully functional recursive function. I am probably missing an assignment somewhere so that my RAW_DATA dataframe gets updated.

Secondly, this process is taking time, especially the while step. It is possible that on some lines, I'll have up to 10 occurrences, but on others just 1. I am convinced that we could do better and by better i mean faster, more efficient.

Now this is doing the job the way I want to, I'm just trying to gain some processing speed.

Thank you all.

Upvotes: 0

Uwe
Uwe

Reputation: 42544

Opposed to the OP's approach, the general idea of this suggestion is to do all data cleaning in long form and to reshape the data from to long to wide form as last step.

It appears that the main objective is to have the integer values aligned in columns V2 and V3 of the final wide format table while keeping track of the number of removed STRING rows between the group header and the appearence of the first integer row within each group.

Therefore, the data.table (development version 1.9.7) approach below is looking for the first row containing an integer value within each group rather than to remove any rows containing explicitely the character string STRING. Thus, the approach is more flexible.

In addition, it is assumed that the same group header may appear multiple times.

library(data.table)

# read data (to make it a reproducible example)
dt <- fread("A1
            100
            200
            txt 
            A2
            STRING
            300
            400
            txt txt
            txt
            txt txt txt
            A3
            STRING
            STRING
            150
            250
            A2
            ", header = FALSE, sep = "\n")

# Identify group headers by regular expression and push them down
dt <- dt[V1 %like% "^A[1-3]$", grp := V1][, grp := zoo::na.locf(grp)]
# Count groups in case of multiple appearances of the same group headers
dt[V1 == grp, grp_cnt := .I][, grp_cnt := zoo::na.locf(grp_cnt)]

# Remove "STRING" rows
# Add row count within each individual group
dt[, id := seq_len(.N), by = grp_cnt]

# find first occurrence of an integer in each group by regex
first_int <- dt[V1 %like% "^\\d+$", .(min_id = min(id)), by = grp_cnt]

# non-equi join to start each group with a row containing the first integer
# (requires data.table development version 1.9.7)
dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")]

# compute Occurences as the number of dropped "STRING" rows
dt[, Occurences := id - 2L]

print(dt)
#    grp_cnt          V1 grp id
# 1:       1         100  A1  2
# 2:       1         200  A1  3
# 3:       1         txt  A1  4
# 4:       2         300  A2  3
# 5:       2         400  A2  4
# 6:       2     txt txt  A2  5
# 7:       2         txt  A2  6
# 8:       2 txt txt txt  A2  7
# 9:       3         150  A3  4
#10:       3         250  A3  5

# prepare for dcast: add column names for each group
# (one added to have the column names in line with Q)
dt[, col := paste0("V", seq_len(.N) + 1), by = grp]

# reshape from long to wide form
z <- dcast(dt, grp_cnt + grp + Occurences ~ col, value.var = "V1")[, grp_cnt := NULL]

# do type conversion on the new columns
new_cols <- dt[, unique(col)]
z[, (new_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = new_cols]

print(z)
#   grp Occurences  V2  V3      V4  V5          V6
#1:  A1          0 100 200     txt  NA          NA
#2:  A2          1 300 400 txt txt txt txt txt txt
#3:  A3          2 150 250      NA  NA          NA

str(z)
#Classes ‘data.table’ and 'data.frame': 3 obs. of  7 variables:
# $ grp       : chr  "A1" "A2" "A3"
# $ Occurences: int  0 1 2
# $ V2        : int  100 300 150
# $ V3        : int  200 400 250
# $ V4        : chr  "txt" "txt txt" NA
# $ V5        : chr  NA "txt" NA
# $ V6        : chr  NA "txt txt txt" NA
# - attr(*, ".internal.selfref")=<externalptr> 

Note, that the second appearance of A2 has been dropped because there aren't any rows below the second A2 (in the original one-column file) which contain an integer value.

If the production data contain other group headers than A1, A2, and A3 the regular expression to identify group headers has to be amended accordingly.

Column names are in line with the OP's expected result (except V1 is called grp for clarity). Column order is slightly different which shouldn't be relevant.

Upvotes: 1

Related Questions