Reputation: 912
I have the following problem:
I know I could just make a huge if-cascade but I guess that's not only ugly and hard to maintain, but also slow.
What is a fast, easy to maintain way to implement that? Some kind of lookup table perhaps, or a matrix for the combinations? Any code samples would be greatly appreciated. I would have used Biojava, but the current version I am already using does not offer that functionality (or I haven't found it yet...).
Update: there seems to be a bit of confusion here. The consensus symbol is a single char, that stands for a single char in both sequences.
String1 and String2 are, for example, "ACGT" and "ACCT"; they mismatch at position 2 (counting from zero). So I want the consensus string to be "ACST", because S stands for "either C or G".
I want to make a method like this:
char getConsensus(char a, char b)
Update 2: some of the proposed methods work if I only have 2 sequences. I might need to do several iterations of these "consensifications", so the input alphabet could increase from "ACGT" to "ACGTRYKMSWBDHVN" which would make some of the proposed approaches quite unwieldy to write and maintain.
Upvotes: 2
Views: 877
Reputation: 912
A possible solution using enums, inspired by pablochan, with a little input from biostar.stackexchange.com:
enum lut {
AA('A'), AC('M'), AG('R'), AT('W'), AR('R'), AY('H'), AK('D'), AM('M'), AS('V'), AW('W'), AB('N'), AD('D'), AH('H'), AV('V'), AN('N'),
CA('M'), CC('C'), CG('S'), CT('Y'), CR('V'), CY('Y'), CK('B'), CM('M'), CS('S'), CW('H'), CB('B'), CD('N'), CH('H'), CV('V'), CN('N'),
GA('R'), GC('S'), GG('G'), GT('K'), GR('R'), GY('B'), GK('K'), GM('V'), GS('S'), GW('D'), GB('B'), GD('D'), GH('N'), GV('V'), GN('N'),
TA('W'), TC('Y'), TG('K'), TT('T'), TR('D'), TY('Y'), TK('K'), TM('H'), TS('B'), TW('W'), TB('B'), TD('D'), TH('H'), TV('N'), TN('N'),
RA('R'), RC('V'), RG('R'), RT('D'), RR('R'), RY('N'), RK('D'), RM('V'), RS('V'), RW('D'), RB('N'), RD('D'), RH('N'), RV('V'), RN('N'),
YA('H'), YC('Y'), YG('B'), YT('Y'), YR('N'), YY('Y'), YK('B'), YM('H'), YS('B'), YW('H'), YB('B'), YD('N'), YH('H'), YV('N'), YN('N'),
KA('D'), KC('B'), KG('K'), KT('K'), KR('D'), KY('B'), KK('K'), KM('N'), KS('B'), KW('D'), KB('B'), KD('D'), KH('N'), KV('N'), KN('N'),
MA('M'), MC('M'), MG('V'), MT('H'), MR('V'), MY('H'), MK('N'), MM('M'), MS('V'), MW('H'), MB('N'), MD('N'), MH('H'), MV('V'), MN('N'),
SA('V'), SC('S'), SG('S'), ST('B'), SR('V'), SY('B'), SK('B'), SM('V'), SS('S'), SW('N'), SB('B'), SD('N'), SH('N'), SV('V'), SN('N'),
WA('W'), WC('H'), WG('D'), WT('W'), WR('D'), WY('H'), WK('D'), WM('H'), WS('N'), WW('W'), WB('N'), WD('D'), WH('H'), WV('N'), WN('N'),
BA('N'), BC('B'), BG('B'), BT('B'), BR('N'), BY('B'), BK('B'), BM('N'), BS('B'), BW('N'), BB('B'), BD('N'), BH('N'), BV('N'), BN('N'),
DA('D'), DC('N'), DG('D'), DT('D'), DR('D'), DY('N'), DK('D'), DM('N'), DS('N'), DW('D'), DB('N'), DD('D'), DH('N'), DV('N'), DN('N'),
HA('H'), HC('H'), HG('N'), HT('H'), HR('N'), HY('H'), HK('N'), HM('H'), HS('N'), HW('H'), HB('N'), HD('N'), HH('H'), HV('N'), HN('N'),
VA('V'), VC('V'), VG('V'), VT('N'), VR('V'), VY('N'), VK('N'), VM('V'), VS('V'), VW('N'), VB('N'), VD('N'), VH('N'), VV('V'), VN('N'),
NA('N'), NC('N'), NG('N'), NT('N'), NR('N'), NY('N'), NK('N'), NM('N'), NS('N'), NW('N'), NB('N'), ND('N'), NH('N'), NV('N'), NN('N');
// consensus symbol for this pair, set once by the constructor
private final char consensusChar;
lut(char c) {
this.consensusChar = c;
}
char getConsensusChar() {
return consensusChar;
}
}
char getConsensus(char a, char b) {
return lut.valueOf("" + a + b).getConsensusChar();
}
Upvotes: 0
Reputation: 86391
A simple, fast solution is to use bitwise-OR.
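At startup, initialize two tables: one that maps each character to its bitwise representation, and one that maps each bitwise representation back to a character.
To get the consensus for a single position: look up the bitwise representation of each of the two characters in the first table, combine the two values with bitwise-OR, and look up the result in the second table.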
Here's a simple bitwise representation to get you started:
private static final int A = 1 << 3;
private static final int C = 1 << 2;
private static final int G = 1 << 1;
private static final int T = 1 << 0;
Set the members of the first table like this:
characterToBitwiseTable[ 'd' ] = A | G | T;
characterToBitwiseTable[ 'D' ] = A | G | T;
Set the members of the second table like this:
bitwiseToCharacterTable[ A | G | T ] = 'd';
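Putting the two tables together, a minimal sketch of the whole approach might look like the following. The class name `BitwiseConsensus`, the `register` helper, the table sizes, and the choice to store upper-case symbols in the reverse table are assumptions made for illustration; the symbol definitions assume the standard IUPAC nucleotide codes.

```java
class BitwiseConsensus {
    private static final int A = 1 << 3;
    private static final int C = 1 << 2;
    private static final int G = 1 << 1;
    private static final int T = 1 << 0;

    // First table: character -> bit set (ASCII only; unknown characters stay 0).
    private static final int[] characterToBitwiseTable = new int[128];
    // Second table: bit set (0..15) -> character.
    private static final char[] bitwiseToCharacterTable = new char[16];

    static {
        register('A', A);
        register('C', C);
        register('G', G);
        register('T', T);
        register('R', A | G);
        register('Y', C | T);
        register('K', G | T);
        register('M', A | C);
        register('S', C | G);
        register('W', A | T);
        register('B', C | G | T);
        register('D', A | G | T);
        register('H', A | C | T);
        register('V', A | C | G);
        register('N', A | C | G | T);
    }

    // Hypothetical helper: fills both tables, accepting upper and lower case as input.
    private static void register(char symbol, int bits) {
        characterToBitwiseTable[symbol] = bits;
        characterToBitwiseTable[Character.toLowerCase(symbol)] = bits;
        bitwiseToCharacterTable[bits] = symbol;
    }

    // Consensus of two symbols: OR the bit sets, then map back to a character.
    // Note: an unknown ASCII character contributes no bits and is silently ignored.
    static char getConsensus(char a, char b) {
        if (a >= 128 || b >= 128) {
            throw new IllegalArgumentException("Non-ASCII symbol: " + a + " / " + b);
        }
        return bitwiseToCharacterTable[characterToBitwiseTable[a] | characterToBitwiseTable[b]];
    }

    // Position-by-position consensus of two equally long sequences.
    static String getConsensus(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("Sequences must have the same length");
        }
        StringBuilder consensus = new StringBuilder(s1.length());
        for (int i = 0; i < s1.length(); i++) {
            consensus.append(getConsensus(s1.charAt(i), s2.charAt(i)));
        }
        return consensus.toString();
    }
}
```

Because the result of one call is itself a valid input symbol, repeated "consensification" over the full 15-letter alphabet needs no extra table entries.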
Upvotes: 2
Reputation: 4543
Considering that multiple sequences are read at once, I would:
There are probably ways to optimize the second and the first steps.
Upvotes: 0
Reputation: 5715
You can just use a HashMap<String, String> which maps the conflicts/differences to the consensus symbols. You can either "hard code" (fill in the code of your app) or fill it during the startup of your app from some outside source (a file, database etc.). Then you just use it whenever you have a difference.
String consensusSymbol = consensusMap.get(differenceString);
EDIT: To accommodate your API request ;]
Map<String, Character> consensusMap; // let's assume this is filled somewhere
...
char getConsensus(char a, char b) {
return consensusMap.get("" + a + b);
}
I realize this looks crude but I think you get the point. This might be slightly slower than a lookup table but it's also a lot easier to maintain.
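For illustration, a hard-coded version of the map could be sketched as follows. The class name `MapConsensus` and the `put` helper are hypothetical, only the six pairs of plain bases are listed (a complete table would need one entry per pair of the 15 symbols), and returning the input directly when both characters are equal is a shortcut added here, not part of the answer above.

```java
import java.util.HashMap;
import java.util.Map;

class MapConsensus {
    private static final Map<String, Character> CONSENSUS_MAP = new HashMap<>();

    static {
        // Abbreviated table: pairs of unambiguous bases only (standard IUPAC codes assumed).
        put('A', 'C', 'M');
        put('A', 'G', 'R');
        put('A', 'T', 'W');
        put('C', 'G', 'S');
        put('C', 'T', 'Y');
        put('G', 'T', 'K');
    }

    // Hypothetical helper: stores both orderings so lookups are symmetric.
    private static void put(char a, char b, char consensus) {
        CONSENSUS_MAP.put("" + a + b, consensus);
        CONSENSUS_MAP.put("" + b + a, consensus);
    }

    static char getConsensus(char a, char b) {
        if (a == b) {
            return a; // identical symbols need no table entry
        }
        Character consensus = CONSENSUS_MAP.get("" + a + b);
        if (consensus == null) {
            // Fail loudly instead of hitting a NullPointerException on unboxing.
            throw new IllegalArgumentException("No consensus entry for " + a + b);
        }
        return consensus;
    }
}
```

The explicit null check matters because `Map.get` returns null for a missing key, and unboxing that null to `char` would otherwise fail with a less helpful error.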
YET ANOTHER EDIT:
If you really want something super fast and you actually use the char type, you can just create a 2d table and index it with characters (since they're interpreted as numbers).
char lookup[][] = new char[256][256]; // all "english" letters will be below 256
//... fill it... e. g. lookup['A']['C'] = 'M';
char consensus = lookup['A']['C'];
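One way to avoid typing all 225 entries by hand is to generate the table at startup from the definition of each symbol. The sketch below is one possible way to do that, not the only one; the class name `TableConsensus`, the `CODES` array and the two helper methods are hypothetical, and the base sets assume the standard IUPAC nucleotide codes.

```java
class TableConsensus {
    // Each row: { symbol, bases it stands for }, with bases always written in ACGT order.
    private static final String[][] CODES = {
        {"A", "A"}, {"C", "C"}, {"G", "G"}, {"T", "T"},
        {"R", "AG"}, {"Y", "CT"}, {"K", "GT"}, {"M", "AC"}, {"S", "CG"}, {"W", "AT"},
        {"B", "CGT"}, {"D", "AGT"}, {"H", "ACT"}, {"V", "ACG"}, {"N", "ACGT"}
    };

    // Indexed directly by character code; entries for unknown characters stay '\0'.
    private static final char[][] lookup = new char[256][256];

    static {
        // Fill one entry per ordered pair of symbols from the union of their base sets.
        for (String[] x : CODES) {
            for (String[] y : CODES) {
                lookup[x[0].charAt(0)][y[0].charAt(0)] = symbolFor(union(x[1], y[1]));
            }
        }
    }

    // Union of two base sets, emitted in ACGT order so it can be compared with CODES.
    private static String union(String p, String q) {
        StringBuilder bases = new StringBuilder();
        for (char base : "ACGT".toCharArray()) {
            if (p.indexOf(base) >= 0 || q.indexOf(base) >= 0) {
                bases.append(base);
            }
        }
        return bases.toString();
    }

    // Reverse search: which symbol stands for exactly this base set?
    private static char symbolFor(String bases) {
        for (String[] code : CODES) {
            if (code[1].equals(bases)) {
                return code[0].charAt(0);
            }
        }
        throw new IllegalStateException("No symbol for base set " + bases);
    }

    // Characters above 255 would throw ArrayIndexOutOfBoundsException here.
    static char getConsensus(char a, char b) {
        return lookup[a][b];
    }
}
```

The generation cost is paid once when the class is loaded; each later lookup is just two array index operations.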
Upvotes: 2
Reputation: 10285
The possible combinations are around 20, so there is no real performance issue. If you do not wish to write a big if-else block, the fastest solution would be to build a tree data structure: http://en.wikipedia.org/wiki/Tree_data_structure.
In a tree, you store all the possible combinations. You then input the string, and the lookup traverses the tree to find the longest matching sequence for a symbol.
Do you want an illustrated example?
PS: All artificial intelligence software uses the tree approach, which is the fastest and the most suitable.
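The tree lookup described in this answer could be sketched as a small prefix tree (trie). This is only one reading of the suggestion; the class name `ConsensusTrie`, its methods and the example keys in the usage note are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ConsensusTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Character symbol; // null unless a stored combination ends at this node
    }

    private final Node root = new Node();

    // Stores a combination (for example "AC") together with its consensus symbol.
    void put(String combination, char symbol) {
        Node node = root;
        for (int i = 0; i < combination.length(); i++) {
            node = node.children.computeIfAbsent(combination.charAt(i), k -> new Node());
        }
        node.symbol = symbol;
    }

    // Returns the symbol of the longest stored combination that is a prefix of
    // the input, or null if no stored combination matches.
    Character longestMatch(String input) {
        Node node = root;
        Character best = null;
        for (int i = 0; i < input.length(); i++) {
            node = node.children.get(input.charAt(i));
            if (node == null) {
                break; // no stored combination continues with this character
            }
            if (node.symbol != null) {
                best = node.symbol;
            }
        }
        return best;
    }
}
```

After `put("AC", 'M')` and `put("ACG", 'V')`, a call to `longestMatch("ACGT")` walks A, C, G and returns the symbol stored for the longer combination. For fixed two-character keys, a trie does the same job as a map or a 2d table with more bookkeeping.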
Upvotes: 0
Reputation: 121710
Given that they are all unique symbols, I'd go for an Enum:
public enum ConsensusSymbol
{
A("A"), // simple case
// ....
GTUC("B"),
// etc
// last entry:
AGCTU("N");
// Not sure what X means?
private final String symbol;
ConsensusSymbol(final String symbol)
{
this.symbol = symbol;
}
public String getSymbol()
{
return symbol;
}
}
Then, when you encounter a difference, use .valueOf():
final ConsensusSymbol symbol;
try {
symbol = ConsensusSymbol.valueOf("THESEQUENCE");
} catch (IllegalArgumentException e) { // Unknown sequence
// TODO
}
For instance, if you encounter GTUC as a String, ConsensusSymbol.valueOf("GTUC") will return the GTUC enum value, and calling getSymbol() on that value will return "B".
Upvotes: 0