Reputation: 31
please can you help me again?
I have a data frame that contains 4 columns, which are either a gene symbol or a rank that I have assigned the gene symbol like this:
mb_rank mb_gene ts_rank ts_gene
[1] 1 BIRCA 1 MYCN
[2] 2 MYCN 2 MOB4
[3] 3 ATXN1 3 ABHD17C
[4] 4 ABHD17C 4 AEBP2
5 etc... for up to 6000 rows in some data sets.
the ts columns are usually a lot longer than the mb columns.
I want to arrange the data so that non-duplicates are removed thereby leaving only genes that appear in both columns of the data frame e.g.
mb_rank mb_gene ts_rank ts_gene
[1] 2 MYCN 1 MYCN
[2] 4 ABHD17C 3 ABHD17C
In this example of the desired outcome, the non-duplicated genes have been removed leaving only genes that appeared in both lists to begin with.
I have tried many things like:
`df[df$mb_gene %in% df$ts_gene,]`
but it doesn't work and seems to hit and miss some gene
2) I attempted to write an IF
function but my skills are to limited.
I hope I have described this well enough but if I can clarify anything please ask, I'm really stuck. Thanks in advance!
Upvotes: 0
Views: 207
Reputation: 189
Without more details, it's hard to know about edge cases. In any case, it sounds like a relational table join. Have you tried:
d1 = select(df, c(mb_rank, mb_gene))
d2 = select(df, c(ts_rank, ts_gene))
merge(d1, d2, by.x="mb_gene", by.y="ts_gene")
Upvotes: 0
Reputation: 83
Use: df_new is your new dataframe.
df_new = df[df['mb_gene'] == df['ts_gene']]
Upvotes: 0
Reputation: 160417
In a data.frame
, typically a row is a complete observation, meaning that all data in it correlates (somehow) with the rest. In a survey, one row is either one person (all questions) or one question for one person. In your data here, though, your first row BIRCA
and MYCN
are completely separate, meaning you want to remove one without removing the other. In a "data-science-y" view, this to me suggests your data is improperly shaped.
In order to do what you want, we need to split them into separate frames.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank mb_gene ts_rank ts_gene
1 BIRCA 1 MYCN
2 MYCN 2 MOB4
3 ATXN1 3 ABHD17C
4 ABHD17C 4 AEBP2")
df1 <- df[,1:2]
df2 <- df[,3:4]
df1
# mb_rank mb_gene
# 1 1 BIRCA
# 2 2 MYCN
# 3 3 ATXN1
# 4 4 ABHD17C
df2
# ts_rank ts_gene
# 1 1 MYCN
# 2 2 MOB4
# 3 3 ABHD17C
# 4 4 AEBP2
From here, we can use intersect
to find genes in common:
incommon <- intersect(df1$mb_gene, df2$ts_gene)
df1[df1$mb_gene %in% incommon,]
# mb_rank mb_gene
# 2 2 MYCN
# 4 4 ABHD17C
df2[df2$ts_gene %in% incommon,]
# ts_rank ts_gene
# 1 1 MYCN
# 3 3 ABHD17C
If you are 100% certain that you will always have the same number of rows in each, then you can merely cbind
these together:
cbind(
df1[df1$mb_gene %in% incommon,],
df2[df2$ts_gene %in% incommon,]
)
# mb_rank mb_gene ts_rank ts_gene
# 2 2 MYCN 1 MYCN
# 4 4 ABHD17C 3 ABHD17C
However, if there is a chance that there will be different numbers in each, then you will run into problems. If the number of one is a multiple of the other, you will get "recycling" of data and a warning, but you will still get data (which I think is a mistake):
cbind(
df1[df1$mb_gene %in% incommon,],
df2
)
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# mb_rank mb_gene ts_rank ts_gene
# 1 2 MYCN 1 MYCN
# 2 4 ABHD17C 2 MOB4
# 3 2 MYCN 3 ABHD17C
# 4 4 ABHD17C 4 AEBP2
If not a multiple, though, you'll just get an error:
cbind(
df1[df1$mb_gene %in% incommon,],
df2[1:3,]
)
# Error in data.frame(..., check.names = FALSE) :
# arguments imply differing number of rows: 2, 3
I suggest that you think about this storage structure, as I believe it defeats assumptions that some tools make about rows of a frame.
Upvotes: 1