Reputation: 329
So I've solved my initial problem using a suggestion from G. Grothendieck, thanks again, exactly the clean way of doing that I was after. Initial post is here. Now reality is that my file is just a little more subtle .
It actually looks like this (see also the data section at end of post for reproducible format):
A1
100
200
txt
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.
Preliminary data wrangling looks like this :
type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))
raw_data<- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)
raw_data$Occurences <- 0
giving :
V1 V2 V3 V4 V5 V6 V7
1 A1 100 200 txt <NA> <NA> <NA>
2 A2 String 300 400 txt txt txt txt txt txt
3 A3 String String 150 250 <NA> <NA>
4 A2 <NA> <NA> <NA> <NA> <NA> <NA>
The issue is that df[3,4] should be in df[3,2], and I should state "2" in a new column. Same goes on line 2, where df[2,3] should be in df[2,2] and state "1" in that same addiotional column. In other words, I'm chasing this :
V1 V2 V3 V4 V5 V6 Occurences
1 A1 100 200 txt <NA> <NA> 0
2 A2 300 400 txt txt txt txt txt txt 1
3 A3 150 250 <NA> <NA> <NA> 2
4 A2 NA NA <NA> <NA> <NA> 0
The STRING is right after A something, sometimes it does not occur and sometimes just one time or several occurences. This is what I did to handle the problem :
#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
z <<- which(df[,2] %in% "STRING")
if (length(z) > 0){
for (i in z){
df$Occurences = df$Occurences + 1
for (j in 2:ncol(df)-1){
if (is.na(df[i,j]) | is.na(df[i,j+1])){
df[i,j] = NA
} else {
df[i,j] = df[i,j+1]
}
}
}
}
z <<- which(df[,2] %in% "STRING")
if(length(z) > 0){formatString(df)}
}
This function is supposed to process exclusively the lines where STRING is found in column 2, it would increment the last column (Occurences) and then shift all the values one column to the left so that they all go back to where they are expected to be. The IS.NA thing is just to try and stop the loop when we're starting to see NAs. Once we've processed these lines, we look again if STRING is in the column 2, and if so call again the function.
Now my problem is that the function looks like it is processing (takes like 20 sec on almost 19k observations and 261 columns, not really sure it's the best in terms of processing time), expect my dataframe was not updated at the end of the loop. However z got updated so it seems to be working the way it should.
What am I missing ?
Data in reproductible form :
DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -14L), class = "data.frame")
Upvotes: 1
Views: 391
Reputation: 329
So, because my example was not as accurate as I thought, the data.table approach does not suit my need. Actually the STRING is not necessarily between integer.
I came up with this function :
#Count "STRING" occurences and readjust values in expected columns
formatSearching <- function(df) {
z <<- which(df[,2] %in% "STRING")
if (length(z) > 0){
for (i in z){
df[i,"String_occurences"] = df[i,"String_occurences"] + 1
for (j in 2:(ncol(raw_data)-1)){
if (is.na(df[i,j]) | is.na(df[i,j+1])){
df[i,j] = NA
} else {
df[i,j] = df[i,j+1]
}
}
}
}
z <<- which(df[,2] %in% "STRING")
#if(length(z) > 0){formatSearching(df)} This somehow does work, but does not update df...
return(df)
}
Because of the comment at the end, I'm calling it like this :
raw_data <- formatSearching(raw_data)
while(length(z) > 0){raw_data <- formatSearching(raw_data)}
There are several issues with all this. First my intent was not to put a while in the middle of the processing of my data, but have a fully functional recursive function. I am probably missing an assignment somewhere so that my RAW_DATA dataframe gets updated.
Secondly, this process is taking time, especially the while step. It is possible that on some lines, I'll have up to 10 occurrences, but on others just 1. I am convinced that we could do better and by better i mean faster, more efficient.
Now this is doing the job the way I want to, I'm just trying to gain some processing speed.
Thank you all.
Upvotes: 0
Reputation: 42544
Opposed to the OP's approach, the general idea of this suggestion is to do all data cleaning in long form and to reshape the data from to long to wide form as last step.
It appears that the main objective is to have the integer values aligned in columns V2
and V3
of the final wide format table while keeping track of the number of removed STRING
rows between the group header and the appearence of the first integer row within each group.
Therefore, the data.table
(development version 1.9.7) approach below is looking for the first row containing an integer value within each group rather than to remove any rows containing explicitely the character string STRING
. Thus, the approach is more flexible.
In addition, it is assumed that the same group header may appear multiple times.
library(data.table)
# read data (to make it a reproducible example)
dt <- fread("A1
100
200
txt
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
", header = FALSE, sep = "\n")
# Identify group headers by regular expression and push them down
dt <- dt[V1 %like% "^A[1-3]$", grp := V1][, grp := zoo::na.locf(grp)]
# Count groups in case of multiple appearances of the same group headers
dt[V1 == grp, grp_cnt := .I][, grp_cnt := zoo::na.locf(grp_cnt)]
# Remove "STRING" rows
# Add row count within each individual group
dt[, id := seq_len(.N), by = grp_cnt]
# find first occurrence of an integer in each group by regex
first_int <- dt[V1 %like% "^\\d+$", .(min_id = min(id)), by = grp_cnt]
# non-equi join to start each group with a row containing the first integer
# (requires data.table development version 1.9.7)
dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")]
# compute Occurences as the number of dropped "STRING" rows
dt[, Occurences := id - 2L]
print(dt)
# grp_cnt V1 grp id
# 1: 1 100 A1 2
# 2: 1 200 A1 3
# 3: 1 txt A1 4
# 4: 2 300 A2 3
# 5: 2 400 A2 4
# 6: 2 txt txt A2 5
# 7: 2 txt A2 6
# 8: 2 txt txt txt A2 7
# 9: 3 150 A3 4
#10: 3 250 A3 5
# prepare for dcast: add column names for each group
# (one added to have the column names in line with Q)
dt[, col := paste0("V", seq_len(.N) + 1), by = grp]
# reshape from long to wide form
z <- dcast(dt, grp_cnt + grp + Occurences ~ col, value.var = "V1")[, grp_cnt := NULL]
# do type conversion on the new columns
new_cols <- dt[, unique(col)]
z[, (new_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = new_cols]
print(z)
# grp Occurences V2 V3 V4 V5 V6
#1: A1 0 100 200 txt NA NA
#2: A2 1 300 400 txt txt txt txt txt txt
#3: A3 2 150 250 NA NA NA
str(z)
#Classes ‘data.table’ and 'data.frame': 3 obs. of 7 variables:
# $ grp : chr "A1" "A2" "A3"
# $ Occurences: int 0 1 2
# $ V2 : int 100 300 150
# $ V3 : int 200 400 250
# $ V4 : chr "txt" "txt txt" NA
# $ V5 : chr NA "txt" NA
# $ V6 : chr NA "txt txt txt" NA
# - attr(*, ".internal.selfref")=<externalptr>
Note, that the second appearance of A2
has been dropped because there aren't any rows below the second A2
(in the original one-column file) which contain an integer value.
If the production data contain other group headers than A1
, A2
, and A3
the regular expression to identify group headers has to be amended accordingly.
Column names are in line with the OP's expected result (except V1
is called grp
for clarity). Column order is slightly different which shouldn't be relevant.
Upvotes: 1