DarkHark

Reputation: 693

number of items to replace is not a multiple of replacement length for a data frame

I know many posts already exist on this topic, but I'm still having trouble understanding the error. The error I receive is "number of items to replace is not a multiple of replacement length".

When I run my code on a smaller practice data frame, it runs perfectly without any error. Once I attempt to run the code on my actual large data frame, I get the error. Is this because of the size of my dataframe, or am I missing the point completely?

I've seen that others create a temporary matrix or vector before running their for loops, and I did this once. Do I need to do it twice because I'm using two for loops? I don't understand why that would be necessary, especially since the code works on the smaller dataframe.

df

Des.GeneSymbol <- c("A1BG", "A1BG", "A1BG", "A1BG", "A1BG", "A1BG-AS1", "A1BG-AS1", "A1BG-AS1", "A1BG-AS1", "admin.batch_number", "admin.file_uuid", "admin.month_of_dcc_upload", "admin.patient_withdrawal", "admin.project", "admin.year_of_dcc_upload", "patient.age_at_initial", "patient.anatomic_neoplasm", "patient.axillary_lymph_node") 
Des.Description <- c("CHR19-", "1", "1", "1", "Missense_Mutation", "CHR19+", "503538", "503538", "503538", "admin.file_uuid", "admin.month_of_dcc_upload", "admin.project", "admin.year_of_dcc_upload", "admin.patient_withdrawal", "patient.age_at_initial", "patient.anatomic_neoplasm", "admin.batch_number","patient.axillary_lymph_node") 
df <- data.frame(Des.GeneSymbol, Des.Description, row.names = 1:length(Des.GeneSymbol), stringsAsFactors = FALSE)
colnames(df) <- c("Des.GeneSymbol", "Des.Description")

df created

    Des.GeneSymbol               Des.Description
1       A1BG                        CHR19-
2       A1BG                        1
3       A1BG                        1
4       A1BG                        1
5       A1BG                        Missense_Mutation
6       A1BG-AS1                    CHR19+
7       A1BG-AS1                    503538
8       A1BG-AS1                    503538
9       A1BG-AS1                    503538
10      admin.batch_number          admin.file_uuid
11      admin.file_uuid             admin.month_of_dcc_upload
12      admin.month_of_dcc_upload   admin.project
13      admin.patient_withdrawal    admin.year_of_dcc_upload
14      admin.project               admin.patient_withdrawal
15      admin.year_of_dcc_upload    patient.age_at_initial
16      patient.age_at_initial      patient.anatomic_neoplasm
17      patient.anatomic_neoplasm   admin.batch_number
18      patient.axillary_lymph_node patient.axillary_lymph_node

The code I've written replaces values from Des.GeneSymbol that are also in Des.Description with "-".

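# All values that should be blanked out if they appear in Des.GeneSymbol
remove_description <- df[, "Des.Description"]
count <- 1  # row index of the current cell
for (cell in df[, "Des.GeneSymbol"]) {
  for (value in remove_description) {
    if (cell == value) {
      # Match found somewhere in Des.Description: overwrite this row
      df[, "Des.GeneSymbol"][count] <- "-"
      break
    }
  }
  count <- count + 1
}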

Desired output:

    Des.GeneSymbol       Des.Description
1       A1BG                CHR19-
2       A1BG                1
3       A1BG                1
4       A1BG                1
5       A1BG                Missense_Mutation
6       A1BG-AS1            CHR19+
7       A1BG-AS1            503538
8       A1BG-AS1            503538
9       A1BG-AS1            503538
10         -                admin.file_uuid
11         -                admin.month_of_dcc_upload
12         -                admin.project
13         -                admin.year_of_dcc_upload
14         -                admin.patient_withdrawal
15         -                patient.age_at_initial
16         -                patient.anatomic_neoplasm
17         -                admin.batch_number
18         -                patient.axillary_lymph_node

When I run this on the actual dataframe, count increments to approximately 120,000. The error won't be given on this small dataframe, only on my larger one.

Can someone please explain why this might be happening?

EDIT: Matched the dataframe with the data I've provided. I've also mixed the values in Des.Description to more accurately display what is needed.

Upvotes: 1

Answers (1)

Uwe

Reputation: 42592

EDIT: The OP has clarified that his production data are more complex than his initial, simplified sample data set, and has updated the question and sample data set accordingly. The solutions for the simplified data set are left below for reference, but their output has been removed because it no longer matches the updated, more realistic sample data set.

Solutions for simplified case

It's difficult to understand how the double for loop is working. Therefore I suggest a "one-liner" using ifelse() from base R:

df$Des.GeneSymbol <- ifelse(df$Des.GeneSymbol == df$Des.Description, "-", df$Des.GeneSymbol)
df

The example provided by the OP suggests that the replacement happens within each row. Therefore, ifelse() compares and replaces rowwise.

In case of a large data.frame it might be more efficient to switch over to data.table:

library(data.table)
setDT(df)[Des.GeneSymbol == Des.Description, Des.GeneSymbol := "-"][]

There are two advantages here:

  1. Only the rows that fulfil the condition are modified, whereas ifelse() builds and assigns a complete new vector.
  2. data.table updates in place, i.e., without copying the whole, potentially large object.

However, it is still a rowwise comparison, which does not reflect the OP's production data.
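To make the limitation concrete, here is a minimal sketch with two made-up vectors (x and y are hypothetical and not taken from the OP's data). It contrasts the elementwise == comparison with a membership test via %in%, which looks for each value anywhere in the other vector:

```r
# Hypothetical toy vectors, only for illustration
x <- c("a", "b", "c")
y <- c("b", "a", "c")

# == compares position by position (row 1 with row 1, row 2 with row 2, ...)
rowwise <- x == y

# %in% asks whether each element of x occurs anywhere in y
anywhere <- x %in% y

print(rowwise)   # FALSE FALSE  TRUE
print(anywhere)  # TRUE TRUE TRUE
```

Only the third element matches in the same position, although every value of x occurs somewhere in y.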

Updated solution

The OP has clarified that the matching items may not be located in the same row. This is the reason why the OP used a double for loop in his code.

With the clarification and the updated data set, we need a completely different approach to find matching items elsewhere in the data.frame. The code below uses a self-join to find matches in other rows and an update on join to replace the matches.

library(data.table)

setDT(df)[df, on = c("Des.GeneSymbol==Des.Description"), Des.GeneSymbol := "-"][]
    Des.GeneSymbol             Des.Description
 1:           A1BG                      CHR19-
 2:           A1BG                           1
 3:           A1BG                           1
 4:           A1BG                           1
 5:           A1BG           Missense_Mutation
 6:       A1BG-AS1                      CHR19+
 7:       A1BG-AS1                      503538
 8:       A1BG-AS1                      503538
 9:       A1BG-AS1                      503538
10:              -             admin.file_uuid
11:              -   admin.month_of_dcc_upload
12:              -               admin.project
13:              -    admin.year_of_dcc_upload
14:              -    admin.patient_withdrawal
15:              -      patient.age_at_initial
16:              -   patient.anatomic_neoplasm
17:              -          admin.batch_number
18:              - patient.axillary_lymph_node
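For readers who prefer to stay in base R, the same "match anywhere in the other column" logic can be sketched with %in%. This is an alternative to the join, not the approach used above, and the data frame below is a shortened, hypothetical stand-in for the OP's data that reuses only the two column names:

```r
# Shortened stand-in for the OP's data (assumption: character columns, no factors)
df <- data.frame(
  Des.GeneSymbol  = c("A1BG", "A1BG-AS1", "admin.batch_number", "admin.file_uuid"),
  Des.Description = c("CHR19-", "admin.file_uuid", "503538", "admin.batch_number"),
  stringsAsFactors = FALSE
)

# TRUE for every gene symbol that occurs in any row of Des.Description
hit <- df$Des.GeneSymbol %in% df$Des.Description

# Overwrite only the matching rows
df$Des.GeneSymbol[hit] <- "-"

print(df$Des.GeneSymbol)  # "A1BG" "A1BG-AS1" "-" "-"
```

Because the logical index hit always has one entry per row, the assignment needs no explicit loop or counter.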

Upvotes: 2
