Running into R error with matching data frame columns

Question

I have one data frame that looks like (gwas.data):

              SNP CHR        BP A1 A2 zscore      P CEUmaf    MAF
1       rs1000000  12 125456933  A  G  1.441 0.1496 0.3729 0.2401
563090 rs10000010   4  21227772  T  C  0.068 0.9455  0.575 0.4934
563091 rs10000023   4  95952929  T  G  1.217 0.2236 0.5917 0.3852
563092  rs1000003   3  99825597  A  G -0.306 0.7597  0.875 0.1794
563093 rs10000033   4 139819348  T  C  1.050 0.2935 0.4917 0.4789
2      rs10000037   4  38600725  A  G  0.072 0.9428 0.2833 0.2296

I have another that looks like this (correct.orientation):

        CHR        SNP A1 A2    MAF NCHROBS
6952148  12  rs1000000  A  G 0.2401     758
2272221   4 rs10000010  C  T 0.4934     758
2524810   4 rs10000023  G  T 0.3852     758
1838654   3  rs1000003  G  A 0.1794     758
2675630   4 rs10000033  C  T 0.4789     758
2338861   4 rs10000037  A  G 0.2296     758

I'm trying to write a program that replaces gwas.data$MAF with (1 - MAF) if A1 and A2 are switched between the two data frames. I'm trying to use this code, which I am borrowing from someone else:

    flip <- gwas.data$A1 == correct.orientation$A2 & gwas.data$A2 == correct.orientation$A1
    dont.flip <- gwas.data$A1 == correct.orientation$A1 & gwas.data$A2 == correct.orientation$A2

    for ( i in 1 : nrow ( gwas.data ) ) {
        if ( flip [ i ] ) {
            gwas.data$A1 [ i ] <- correct.orientation$A1 [ i ]
            gwas.data$A2 [ i ] <- correct.orientation$A2 [ i ]
            gwas.data$zscore [ i ] <- - gwas.data$zscore [ i ]
            gwas.data$MAF [ i ] <- 1 - gwas.data$MAF [ i ]
        } else if ( dont.flip [ i ] ) {
            #do nothing
        } else {
            stop ( "Strand Issue")      
        }
    }

I'm running into the error at the first line:

    flip <- gwas.data$A1 == correct.orientation$A2 & gwas.data$A2 == correct.orientation$A1

The error is:

    Error in Ops.factor(gwas.data$A1, correct.orientation$A2) : level sets of factors are different

How can I fix this?

Parfait · Accepted Answer

Consider forgoing the for loop and using the base R merge() function on the two data frames. A little data management is needed first: 1) temporarily convert the factors to characters (or set stringsAsFactors=FALSE in read.csv() or read.table()), and 2) add suffixes to the repeated column names. Once the new MAF is calculated with ifelse(), split the merged data frame and reset the column names and data types to the original structure:

    # CONVERT FACTORS TO CHARACTER
    gwas.data[, c("A1", "A2")] <- lapply(gwas.data[, c("A1", "A2")], as.character)
    # SUFFIXING COL NAMES TO IDENTIFY IN MERGED DF
    names(gwas.data) <- paste0(names(gwas.data), "_A")

    # CONVERT FACTORS TO CHARACTER
    correct.orientation[, c("A1", "A2")] <- lapply(correct.orientation[, c("A1", "A2")], as.character)
    # SUFFIXING COL NAMES TO IDENTIFY IN MERGED DF
    names(correct.orientation) <- paste0(names(correct.orientation), "_B")

    # MERGE DATA FRAMES (ASSUMING SNP IS UNIQUE IDENTIFIER)
    comparedf <- merge(gwas.data, correct.orientation, by.x = "SNP_A", by.y = "SNP_B", all = TRUE)

    # FLAG ROWS WHERE A1 AND A2 ARE SWAPPED BETWEEN THE TWO DATA FRAMES
    flip <- (comparedf$A1_A == comparedf$A2_B) &
            (comparedf$A2_A == comparedf$A1_B)

    # CALCULATE NEW MAF AND ZSCORE
    comparedf$MAF_A <- ifelse(flip, 1 - comparedf$MAF_A, comparedf$MAF_A)
    comparedf$zscore_A <- ifelse(flip, -1 * comparedf$zscore_A, comparedf$zscore_A)

    # SPLIT MERGE BACK TO ORIGINAL STRUCTURE
    newgwas.data <- comparedf[, names(gwas.data)]
    # REMOVE SUFFIX (ANCHORED SO ONLY A TRAILING "_A" IS STRIPPED)
    names(newgwas.data) <- sub("_A$", "", names(newgwas.data))
    # RESET FACTORS
    newgwas.data$A1 <- as.factor(newgwas.data$A1)
    newgwas.data$A2 <- as.factor(newgwas.data$A2)
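
If you would rather keep the two data frames separate, a match()-based variant may also work. This is only a sketch on toy data trimmed from the question (three SNPs, with the allele columns built as factors on purpose). The names err, idx, g.A1, g.A2, r.A1, r.A2, ok and flipped are made up for illustration, and it assumes SNP ids are unique in correct.orientation. It first reproduces the factor error, then compares the alleles as character strings after aligning the rows by SNP:

```r
# Toy versions of the two data frames, with factor allele columns
gwas.data <- data.frame(
  SNP    = c("rs1000000", "rs10000010", "rs10000037"),
  A1     = factor(c("A", "T", "A")),
  A2     = factor(c("G", "C", "G")),
  zscore = c(1.441, 0.068, 0.072),
  MAF    = c(0.2401, 0.4934, 0.2296),
  stringsAsFactors = FALSE
)
correct.orientation <- data.frame(
  SNP = c("rs10000010", "rs1000000", "rs10000037"),  # deliberately in a different row order
  A1  = factor(c("C", "A", "A")),
  A2  = factor(c("T", "G", "G")),
  stringsAsFactors = FALSE
)

# Comparing two factors with different level sets raises the error from the
# question; capture the message instead of stopping the script
err <- tryCatch(gwas.data$A1 == correct.orientation$A2,
                error = function(e) conditionMessage(e))

# Comparing as character strings sidesteps the factor-level check
g.A1 <- as.character(gwas.data$A1)
g.A2 <- as.character(gwas.data$A2)

# Align the reference rows to gwas.data by SNP (assumes unique SNP ids)
idx  <- match(gwas.data$SNP, correct.orientation$SNP)
r.A1 <- as.character(correct.orientation$A1)[idx]
r.A2 <- as.character(correct.orientation$A2)[idx]

flip      <- g.A1 == r.A2 & g.A2 == r.A1
dont.flip <- g.A1 == r.A1 & g.A2 == r.A2

# Rows that are neither swapped nor identical (including SNPs missing from
# the reference, which compare as NA) are treated as a strand issue
ok <- flip | dont.flip
if (!all(ok %in% TRUE)) stop("Strand Issue")

# Apply the flip to a copy, vectorized instead of looping over rows
flipped        <- gwas.data
flipped$A1     <- factor(ifelse(flip, r.A1, g.A1))
flipped$A2     <- factor(ifelse(flip, r.A2, g.A2))
flipped$zscore <- ifelse(flip, -gwas.data$zscore, gwas.data$zscore)
flipped$MAF    <- ifelse(flip, 1 - gwas.data$MAF, gwas.data$MAF)
```

Because the rows are aligned with match() rather than by position, the two data frames do not have to be in the same order or have the same number of rows. Whether stopping on the first unmatched SNP is the right behavior depends on your data; dropping or flagging those rows instead would be a reasonable change.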
