Reputation: 693
I know many posts already exist on this topic, but I'm still having trouble understanding the error. The error I receive is "number of items to replace is not a multiple of replacement length".
When I run my code on a smaller practice data frame, it runs perfectly without any error. Once I attempt to run the code on my actual large data frame, I get the error. Is this because of the size of my dataframe, or am I missing the point completely?
I've seen others have created a temporary matrix or vector before running their for loops, and I did this once, but do I need to do it twice because I'm using two for loops? I don't understand why this would be the case though, especially if it works on the smaller dataframe.
df
Des.GeneSymbol <- c("A1BG", "A1BG", "A1BG", "A1BG", "A1BG", "A1BG-AS1", "A1BG-AS1", "A1BG-AS1", "A1BG-AS1", "admin.batch_number", "admin.file_uuid", "admin.month_of_dcc_upload", "admin.patient_withdrawal", "admin.project", "admin.year_of_dcc_upload", "patient.age_at_initial", "patient.anatomic_neoplasm", "patient.axillary_lymph_node")
Des.Description <- c("CHR19-", "1", "1", "1", "Missense_Mutation", "CHR19+", "503538", "503538", "503538", "admin.file_uuid", "admin.month_of_dcc_upload", "admin.project", "admin.year_of_dcc_upload", "admin.patient_withdrawal", "patient.age_at_initial", "patient.anatomic_neoplasm", "admin.batch_number","patient.axillary_lymph_node")
df <- data.frame(Des.GeneSymbol, Des.Description, row.names = 1:length(Des.GeneSymbol), stringsAsFactors = FALSE)
colnames(df) <- c("Des.GeneSymbol", "Des.Description")
df created
Des.GeneSymbol Des.Description
1 A1BG CHR19-
2 A1BG 1
3 A1BG 1
4 A1BG 1
5 A1BG Missense_Mutation
6 A1BG-AS1 CHR19+
7 A1BG-AS1 503538
8 A1BG-AS1 503538
9 A1BG-AS1 503538
10 admin.batch_number admin.file_uuid
11 admin.file_uuid admin.month_of_dcc_upload
12 admin.month_of_dcc_upload admin.project
13 admin.patient_withdrawal admin.year_of_dcc_upload
14 admin.project admin.patient_withdrawal
15 admin.year_of_dcc_upload patient.age_at_initial
16 patient.age_at_initial patient.anatomic_neoplasm
17 patient.anatomic_neoplasm admin.batch_number
18 patient.axillary_lymph_node patient.axillary_lymph_node
The code I've written replaces values from Des.GeneSymbol that are also in Des.Description with "-".
remove_description <- df[, "Des.Description"]
count <- 1
for (cell in df[, "Des.GeneSymbol"]) {
  for (value in remove_description) {
    if (cell == value) {
      df[, "Des.GeneSymbol"][count] <- "-"
      break
    }
  }
  count <- count + 1
}
Desired output:
Des.GeneSymbol Des.Description
1 A1BG CHR19-
2 A1BG 1
3 A1BG 1
4 A1BG 1
5 A1BG Missense_Mutation
6 A1BG-AS1 CHR19+
7 A1BG-AS1 503538
8 A1BG-AS1 503538
9 A1BG-AS1 503538
10 - admin.file_uuid
11 - admin.month_of_dcc_upload
12 - admin.project
13 - admin.year_of_dcc_upload
14 - admin.patient_withdrawal
15 - patient.age_at_initial
16 - patient.anatomic_neoplasm
17 - admin.batch_number
18 - patient.axillary_lymph_node
When I run this on the actual dataframe, count increments to approximately 120,000. The error won't be given on this small dataframe, only on my larger one.
Can someone please explain why this might be happening?
EDIT: Matched the dataframe with the data I've provided. I've also mixed the values in Des.Description to more accurately display what is needed.
Upvotes: 1
Views: 569
Reputation: 42592
EDIT: The OP has clarified that his production data are more complex than his initial, simplified sample data set. He has updated the question and the sample data set accordingly. The solution for the simplified data set is left below as a reference, but its output has been removed because it no longer matches the updated, more realistic sample data set.
It's difficult to understand how the double for loop is working. Therefore I suggest a "one-liner" using ifelse() from base R:
df$Des.GeneSymbol <- ifelse(df$Des.GeneSymbol == df$Des.Description, "-", df$Des.GeneSymbol)
df
The example originally provided by the OP suggested that the replacement happens within each row. Therefore, the ifelse() call compares and replaces rowwise.
In case of a large data.frame it might be more efficient to switch over to data.table:
library(data.table)
setDT(df)[Des.GeneSymbol == Des.Description, Des.GeneSymbol := "-"][]
There are two advantages here:
- ifelse() returns a full vector.
- data.table updates in place, i.e., without copying the whole, potentially large object.
However, it is still a rowwise comparison, which does not reflect the OP's production data.
The OP has clarified that the matching items may not be located in the same row. This is the reason why the OP used a double for loop in his code.
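To make the difference concrete, here is a minimal sketch on two hypothetical toy vectors (gene and desc are made-up names, not the OP's data). It contrasts the element-wise == used above with a membership test via %in%, which is what the double for loop effectively computes:

```r
# Hypothetical toy vectors for illustration only (not the OP's data)
gene <- c("a", "b", "c", "d")
desc <- c("x", "c", "b", "d")

# Element-wise: position i of gene is compared only with position i of desc
same_row <- gene == desc

# Membership: each value of gene is looked up anywhere in desc
anywhere <- gene %in% desc

print(same_row)  # FALSE FALSE FALSE  TRUE
print(anywhere)  # FALSE  TRUE  TRUE  TRUE
```

Only "d" sits in the same position in both vectors, so the rowwise comparison misses "b" and "c" even though both occur in desc. Note also that %in% never returns NA, whereas == yields NA as soon as one side is missing.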
With the clarification and the updated data set, we need a completely different approach to find matching items elsewhere in the data.frame. The code below uses a self-join to find matches in other rows and an update on join to replace the matches.
library(data.table)
setDT(df)[df, on = c("Des.GeneSymbol==Des.Description"), Des.GeneSymbol := "-"][]
Des.GeneSymbol Des.Description
1: A1BG CHR19-
2: A1BG 1
3: A1BG 1
4: A1BG 1
5: A1BG Missense_Mutation
6: A1BG-AS1 CHR19+
7: A1BG-AS1 503538
8: A1BG-AS1 503538
9: A1BG-AS1 503538
10: - admin.file_uuid
11: - admin.month_of_dcc_upload
12: - admin.project
13: - admin.year_of_dcc_upload
14: - admin.patient_withdrawal
15: - patient.age_at_initial
16: - patient.anatomic_neoplasm
17: - admin.batch_number
18: - patient.axillary_lymph_node
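If adding data.table is not an option, the same cross-row replacement can probably be done in base R as well. The following is only a sketch, assuming the updated sample data from the question; it swaps the self-join for a vectorised %in% lookup and has not been tried on the OP's production data:

```r
# Rebuild the updated sample data from the question
Des.GeneSymbol <- c(rep("A1BG", 5), rep("A1BG-AS1", 4),
                    "admin.batch_number", "admin.file_uuid",
                    "admin.month_of_dcc_upload", "admin.patient_withdrawal",
                    "admin.project", "admin.year_of_dcc_upload",
                    "patient.age_at_initial", "patient.anatomic_neoplasm",
                    "patient.axillary_lymph_node")
Des.Description <- c("CHR19-", "1", "1", "1", "Missense_Mutation",
                     "CHR19+", "503538", "503538", "503538",
                     "admin.file_uuid", "admin.month_of_dcc_upload",
                     "admin.project", "admin.year_of_dcc_upload",
                     "admin.patient_withdrawal", "patient.age_at_initial",
                     "patient.anatomic_neoplasm", "admin.batch_number",
                     "patient.axillary_lymph_node")
df <- data.frame(Des.GeneSymbol, Des.Description, stringsAsFactors = FALSE)

# TRUE for every gene symbol that occurs anywhere in Des.Description
hit <- df$Des.GeneSymbol %in% df$Des.Description

# Replace only the matching entries; all other rows are left untouched
df$Des.GeneSymbol[hit] <- "-"
df
```

Because the logical index hit has exactly one element per row and the replacement is a single value, the assignment cannot run into a length mismatch between the items to replace and the replacement.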
Upvotes: 2