Chemist learns to code
Chemist learns to code

Reputation: 487

Partially working for the statement used for checking if column value in one column exists in another column

I am checking if a value in the column catalytic.site.AXK.start.postion is in another column AxK.postion2 using the following statement. If so, create another column catalytic.site.matched.sequence.position indicating a value of TRUE. It seems that it works for some values, but not for all. Does anyone know why this happens and how to fix it? Thanks!

df4$catalytic.site.matched.sequence.position[which(df4$catalytic.site.AXK.start.postion %in% df4$AxK.postion2)] <- TRUE

enter image description here

structure(list(Entry = c("Q86TB3", "Q96GD4", "Q00537", "Q8IZL9", 
"Q13131", "Q96L96"), Entry.name = c("ALPK2_HUMAN", "AURKB_HUMAN", 
"CDK17_HUMAN", "CDK20_HUMAN", "AAPK1_HUMAN", "ALPK3_HUMAN"), 
    Gene.names...primary.. = c("ALPK2", "AURKB", "CDK17", "CDK20", 
    "PRKAA1", "ALPK3"), Protein.names = c("Alpha-protein kinase 2 (EC 2.7.11.1) (Heart alpha-protein kinase)", 
    "Aurora kinase B (EC 2.7.11.1) (Aurora 1) (Aurora- and IPL1-like midbody-associated protein 1) (AIM-1) (Aurora/IPL1-related kinase 2) (ARK-2) (Aurora-related kinase 2) (STK-1) (Serine/threonine-protein kinase 12) (Serine/threonine-protein kinase 5) (Serine/threonine-protein kinase aurora-B)", 
    "Cyclin-dependent kinase 17 (EC 2.7.11.22) (Cell division protein kinase 17) (PCTAIRE-motif protein kinase 2) (Serine/threonine-protein kinase PCTAIRE-2)", 
    "Cyclin-dependent kinase 20 (EC 2.7.11.22) (CDK-activating kinase p42) (CAK-kinase p42) (Cell cycle-related kinase) (Cell division protein kinase 20) (Cyclin-dependent protein kinase H) (Cyclin-kinase-activating kinase p42)", 
    "5'-AMP-activated protein kinase catalytic subunit alpha-1 (AMPK subunit alpha-1) (EC 2.7.11.1) (Acetyl-CoA carboxylase kinase) (ACACA kinase) (EC 2.7.11.27) (Hydroxymethylglutaryl-CoA reductase kinase) (HMGCR kinase) (EC 2.7.11.31) (Tau-protein kinase PRKAA1) (EC 2.7.11.26)", 
    "Alpha-protein kinase 3 (EC 2.7.11.1) (Muscle alpha-protein kinase)"
    ), Binding.site = c(NA, "BINDING 106;  /note=\"ATP\";  /evidence=\"ECO:0000255|PROSITE-ProRule:PRU00159\"", 
    "BINDING 221;  /note=\"ATP\";  /evidence=\"ECO:0000255|PROSITE-ProRule:PRU00159\"", 
    "BINDING 33;  /note=\"ATP\";  /evidence=\"ECO:0000255|PROSITE-ProRule:PRU00159\"", 
    "BINDING 56;  /note=\"ATP\";  /evidence=\"ECO:0000255|PROSITE-ProRule:PRU00159\"", 
    NA), Site = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Sequence = c("MKDSEGPQRPPLCFLSTLLSQKVPEKSDAVLRCIISGQPKPEVTWYKNGQAIDGSGIISNYEFFENQYIHVLHLSCCTKNDAAVYQISAKNSFGMICCSASVEVECSSENPQLSPNLEDDRDRGWKHETGTHEEERANQIDEKEHPYKEEESISPGTPRSADSSPSKSNHSLSLQSLGNLDISVSSSENPLGVKGTRHTGEAYDPSNTEEIANGLLFLNSSHIYEKQDRCCHKTVHSMASKFTDGDLNNDGPHDEGLRSSQQNPKVQKYISFSLPLSEATAHIYPGDSAVANKQPSPQLSSEDSDSDYELCPEITLTYTEEFSDDDLEYLECSDVMTDYSNAVWQRNLLGTEHVFLLESDDEEMEFGEHCLGGCEHFLSGMGCGSRVSGDAGPMVATAGFCGHHSQPQEVGVRSSRVSKHGPSSPQTGMTLILGPHQDGTSSVTEQGRYKLPTAPEAAENDYPGIQGETRDSHQAREEFASDNLLNMDESVRETEMKLLSGESENSGMSQCWETAADKRVGGKDLWSKRGSRKSARVRQPGMKGNPKKPNANLRESTTEGTLHLCSAKESAEPPLTQSDKRETSHTTAAATGRSSHADARECAISTQAEQEAKTLQTSTDSVSKEGNTNCKGEGMQVNTLFETSQVPDWSDPPQVQVQETVRETISCSQMPAFSEPAGEESPFTGTTTISFSNLGGVHKENASLAQHSEVKPCTCGPQHEEKQDRDGNIPDNFREDLKYEQSISEANDETMSPGVFSRHLPKDARADFREPVAVSVASPEPTDTALTLENVCDEPRDREAVCAMECFEAGDQGTCFDTIDSLVGRPVDKYSPQEICSVDTELAEGQNKVSDLCSSNDKTLEVFFQTQVSETSVSTCKSSKDGNSVMSPLFTSTFTLNISHTASEGATGENLAKVENSTYPLASTVHAGQEQPSPSNSGGLDETQLLSSENNPLVQFKEGGDKSPSPSAADTTATPASYSSIVSFPWEKPTTLTANNECFQATRETEDTSTVTIATEVHPAKYLAVSIPEDKHAGGTEERFPRASHEKVSQFPSQVQLDHILSGATIKSTKELLCRAPSVPGVPHHVLQLPEGEGFCSNSPLQVDNLSGDKSQTVDRADFRSYEENFQERGSETKQGVQQQSLSQQGSLSAPDFQQSLPTTSAAQEERNLVPTAHSPASSREGAGQRSGWGTRVSVVAETAGEEDSQALSNVPSLSDILLEESKEYRPGNWEAGNKLKIITLEASASEIWPPRQLTNSESKASDGGLIIPDKVWAVPDSLKADAVVPELAPSEIAALAHSPEDAESALADSRESHKGEEPTISVHWRSLSSRGFSQPRLLESSVDPVDEKELSVTDSLSAASETGGKENVNNVSQDQEEKQLKMDHTAFFKKFLTCPKILESSVDPIDEISVIEYTRAGKPEPSETTPQGAREGGQSNDGNMGHEAEIQPAILQVPCLQGTILSENRISRSQEGSMKQEAEQIQPEEAKTAIWQVLQPSEGGERIPSGCSIGQIQESSDGSLGEAEQSKKDKAELISPTSPLSSCLPIMTHASLGVDTHNSTGQIHDVPENDIVEPRKRQYVFPVSQKRGTIENERGKPLPSSPDLTRFPCTSSPEGNVTDFLISHKMEEPKIEVLQIGETKPPSSSSSSAKTLAFISGERELEKAPKLLQDPCQKGTLGCAKKSREREKSLEARAGKSPGTLTAVTGSEEVKRKPEAPGSGHLAEGVKKKILSRVAALRLKLEEKENIRKNSAFLKKMPKLETSLSHTEEKQDPKKPSCKREGRAPVLLKKIQAEMFPEHSGNVKLSCQFAEIHEDSTICWTKDSKSIAQVQRSAGDNSTVSFAIVQASPKDQGLYYCCIKNSYGKVTAEFNLTAEVLKQLSSRQDTKGCEEIEFSQLIFKEDFLHDSYFGGRLRGQIATEELHFGEGVHRKAFRSTVMHGLMPVFKPGHACVLKVHNAIAYGTRNNDELIQRNYKLAAQECYVQNTARYYAKIYAAEAQPLEGFGEVPEIIPIFLIHRPENNIPYATVEEELIGEFVKYSIRDGKEINFLRRESEAGQKCCTFQHWVYQKTSGCLLVTDMQGVGMKLTDVGIATLAKGYKGFKGNCSMTFIDQFKALHQCNKYCKMLGLKSLQNNNQKQKQPSIGKSKVQTNSMTIKKAGPETPGEKKT", 
    "MAQKENSYPWPYGRQTAPSGLSTLPQRVLRKEPVTPSALVLMSRSNVQPTAAPGQKVMENSSGTPDILTRHFTIDDFEIGRPLGKGKFGNVYLAREKKSHFIVALKVLFKSQIEKEGVEHQLRREIEIQAHLHHPNILRLYNYFYDRRRIYLILEYAPRGELYKELQKSCTFDEQRTATIMEELADALMYCHGKKVIHRDIKPENLLLGLKGELKIADFGWSVHAPSLRRKTMCGTLDYLPPEMIEGRMHNEKVDLWCIGVLCYELLVGNPPFESASHNETYRRIVKVDLKFPASVPMGAQDLISKLLRHNPSERLPLAQVSAHPWVRANSRRVLPPSALQSVA", 
    "MKKFKRRLSLTLRGSQTIDESLSELAEQMTIEENSSKDNEPIVKNGRPPTSHSMHSFLHQYTGSFKKPPLRRPHSVIGGSLGSFMAMPRNGSRLDIVHENLKMGSDGESDQASGTSSDEVQSPTGVCLRNRIHRRISMEDLNKRLSLPADIRIPDGYLEKLQINSPPFDQPMSRRSRRASLSEIGFGKMETYIKLEKLGEGTYATVYKGRSKLTENLVALKEIRLEHEEGAPCTAIREVSLLKDLKHANIVTLHDIVHTDKSLTLVFEYLDKDLKQYMDDCGNIMSMHNVKLFLYQILRGLAYCHRRKVLHRDLKPQNLLINEKGELKLADFGLARAKSVPTKTYSNEVVTLWYRPPDVLLGSSEYSTQIDMWGVGCIFFEMASGRPLFPGSTVEDELHLIFRLLGTPSQETWPGISSNEEFKNYNFPKYKPQPLINHAPRLDSEGIELITKFLQYESKKRVSAEEAMKHVYFRSLGPRIHALPESVSIFSLKEIQLQKDPGFRNSSYPETGHGKNRRQSMLF", 
    "MDQYCILGRIGEGAHGIVFKAKHVETGEIVALKKVALRRLEDGFPNQALREIKALQEMEDNQYVVQLKAVFPHGGGFVLAFEFMLSDLAEVVRHAQRPLAQAQVKSYLQMLLKGVAFCHANNIVHRDLKPANLLISASGQLKIADFGLARVFSPDGSRLYTHQVATRWYRAPELLYGARQYDQGVDLWSVGCIMGELLNGSPLFPGKNDIEQLCYVLRILGTPNPQVWPELTELPDYNKISFKEQVPMPLEEVLPDVSPQALDLLGQFLLYPPHQRIAASKALLHQYFFTAPLPAHPSELPIPQRLGGPAPKAHPGPPHIHDFHVDRPLEESLLNPELIRPFILEG", 
    "MRRLSSWRKMATAEKQKHDGRVKIGHYILGDTLGVGTFGKVKVGKHELTGHKVAVKILNRQKIRSLDVVGKIRREIQNLKLFRHPHIIKLYQVISTPSDIFMVMEYVSGGELFDYICKNGRLDEKESRRLFQQILSGVDYCHRHMVVHRDLKPENVLLDAHMNAKIADFGLSNMMSDGEFLRTSCGSPNYAAPEVISGRLYAGPEVDIWSSGVILYALLCGTLPFDDDHVPTLFKKICDGIFYTPQYLNPSVISLLKHMLQVDPMKRATIKDIREHEWFKQDLPKYLFPEDPSYSSTMIDDEALKEVCEKFECSEEEVLSCLYNRNHQDPLAVAYHLIIDNRRIMNEAKDFYLATSPPDSFLDDHHLTRPHPERVPFLVAETPRARHTLDELNPQKSKHQGVRKAKWHLGIRSQSRPNDIMAEVCRAIKQLDYEWKVVNPYYLRVRRKNPVTSTYSKMSLQLYQVDSRTYLLDFRSIDDEITEAKSGTATPQRSGSVSNYRSCQRSDSDAEAQGKSSEVSLTSSVTSLDSSPVDLTPRPGSHTIEFFEMCANLIKILAQ", 
    "MEVAWLVYVLGQQPLARQGEGQSRLVPGRGLVLWLPGLPRSSPSWPAVDLAPLAPARPRGPLICHTGHEQAGREPGPGSSTKGPVLHDQDTRCAFLPRPPGPLQTRRYCRHQGRQGSGLGAGPGAGTWAPAPPGVSKPRCPGRARPGEGQQQVTTARPPAINRGARQPRAGAAAAGRGPGAGAWRTGEAAASAGPAVGEGGAMGSRRAPSRGWGAGGRSGAGGDGEDDGPVWIPSPASRSYLLSVRPETSLSSNRLSHPSSGRSTFCSIIAQLTEETQPLFETTLKSRSVSEDSDVRFTCIVTGYPEPEVTWYKDDTELDRYCGLPKYEITHQGNRHTLQLYRCREEDAAIYQASAQNSKGIVSCSGVLEVGTMTEYKIHQRWFAKLKRKAAAKLREIEQSWKHEKAVPGEVDTLRKLSPDRFQRKRRLSGAQAPGPSVPTREPEGGTLAAWQEGETETAQHSGLGLINSFASGEVTTNGEAAPENGEDGEHGLLTYICDAMELGPQRALKEESGAKKKKKDEESKQGLRKPELEKAAQSRRSSENCIPSSDEPDSCGTQGPVGVEQVQTQPRGRAARGPGSSGTDSTRKPASAVGTPDKAQKAPGPGPGQEVYFSLKDMYLENTQAVRPLGEEGPQTLSVRAPGESPKGKAPLRARSEGVPGAPGQPTHSLTPQPTRPFNRKRFAPPKPKGEATTDSKPISSLSQAPECGAQSLGKAPPQASVQVPTPPARRRHGTRDSTLQGQAGHRTPGEVLECQTTTAPTMSASSSSDVASIGVSTSGSQGIIEPMDMETQEDGRTSANQRTGSKKNVQADGKIQVDGRTRGDGTQTAQRTRADRKTQVDAGTQESKRPQSDRSAQKGMMTQGRAETQLETTQAGEKIQEDRKAQADKGTQEDRRMQGEKGMQGEKGTQSEGSAPTAMEGQSEQEVATSLGPPSRTPKLPPTAGPRAPLNIECFVQTPEGSCFPKKPGCLPRSEEAVVTASRNHEQTVLGPLSGNLMLPAQPPHEGSVEQVGGERCRGPQSSGPVEAKQEDSPFQCPKEERPGGVPCMDQGGCPLAGLSQEVPTMPSLPGTGLTASPKAGPCSTPTSQHGSTATFLPSEDQVLMSSAPTLHLGLGTPTQSHPPETMATSSEGACAQVPDVEGRTPGPRSCDPGLIDSLKNYLLLLLKLSSTETSGAGGESQVGAATGGLVPSATLTPTVEVAGLSPRTSRRILERVENNHLVQSAQTLLLSPCTSRRLTGLLDREVQAGRQALAAARGSWGPGPSSLTVPAIVVDEEDPGLASEGASEGEGEVSPEGPGLLGASQESSMAGRLGEAGGQAAPGQGPSAESIAQEPSQEEKFPGEALTGLPAATPEELALGARRKRFLPKVRAAGDGEATTPEERESPTVSPRGPRKSLVPGSPGTPGRERRSPTQGRKASMLEVPRAEEELAAGDLGPSPKAGGLDTEVALDEGKQETLAKPRKAKDLLKAPQVIRKIRVEQFPDASGSLKLWCQFFNILSDSVLTWAKDQRPVGEVGRSAGDEGPAALAIVQASPVDCGVYRCTIHNEHGSASTDFCLSPEVLSGFISREEGEVGEEIEMTPMVFAKGLADSGCWGDKLFGRLVSEELRGGGYGCGLRKASQAKVIYGLEPIFESGRTCIIKVSSLLVFGPSSETSLVGRNYDVTIQGCKIQNMSREYCKIFAAEARAAPGFGEVPEIIPLYLIYRPANNIPYATLEEDLGKPLESYCSREWGCAEAPTASGSSEAMQKCQTFQHWLYQWTNGSFLVTDLAGVDWKMTDVQIATKLRGYQGLKESCFPALLDRFASSHQCNAYCELLGLTPLKGPEAAHPQAKAKGSKSPSAGRKGSQLSPQPQKKGLPSPQGTRKSAPSSKATPQASEPVTTQLLGQPPTQEEGSKAQGMR"
    ), catalytic.site = c(NA, 106, 221, 33, 56, NA), catalytic.sequence = c("AGK", 
    "ALK", "AMK", "APK", "AIK", "ATK"), AxK.postion2 = c("239, 291, 516, 1417, 1665, 1681, 1695", 
    "2, 104", "219, 467", "31, 279, 310", "13, 54, 303, 427", 
    "392, 509, 516, 601, 859, 890, 1788"), catalytic.site.AXK.start.postion = c(NA, 
    "104", "219", "31", "54", NA), catalytic.site.matched.sequence.position = c(NA, 
    NA, NA, TRUE, TRUE, NA)), row.names = c(84L, 91L, 116L, 118L, 
128L, 133L), class = "data.frame")

Upvotes: 1

Views: 41

Answers (1)

Ben
Ben

Reputation: 30474

Focusing on your 2 columns of interest:

                             AxK.postion2 catalytic.site.AXK.start.postion
84  239, 291, 516, 1417, 1665, 1681, 1695                             <NA>
91                                 2, 104                              104
116                              219, 467                              219
118                          31, 279, 310                               31
128                      13, 53, 303, 427                               54
133    392, 509, 516, 601, 859, 890, 1788                             <NA>

It appears both are strings. The first column AxK.position2 includes comma-separated strings of numbers.

If you use %in%, you are essentially looking to see if an element is inside a vector. In that case, you need to separate your string into individual elements (creating a vector of them).

For example,

R> "5" %in% c("3, 4, 5")
[1] FALSE

R> "5" %in% c("3", "4", "5")
[1] TRUE

To split up the string in each row, use strsplit:

R> strsplit(df4$AxK.postion2, ', ')

[[1]]
[1] "239"  "291"  "516"  "1417" "1665" "1681" "1695"

[[2]]
[1] "2"   "104"

[[3]]
[1] "219" "467"

[[4]]
[1] "31"  "279" "310"

[[5]]
[1] "13"  "53"  "303" "427"

[[6]]
[1] "392"  "509"  "516"  "601"  "859"  "890"  "1788"

Note the space after the comma, since you have spaces in your strings after commas. Otherwise, you should remove the whitespace.

Since the result is a list, you can unlist, resulting in a vector for each row.

Ultimately, this should give you the correct new column:

df4$catalytic.site.matched.sequence.position <- 
  df4$catalytic.site.AXK.start.postion %in% unlist(strsplit(df4$AxK.postion2, ', '))

Upvotes: 1

Related Questions