Luiz Fernando
Luiz Fernando

Reputation: 11

Knuth Morris Pratt Algorithm

Given a string of DNA characters (accepted characters: A, C, G, T) and a disease, with gene x (consider a pattern P to be found) also using the same alphabet. The task is to search the DNA strand using the provided gene, check the probability of the disease with gene x by verifying matches of the gene pattern within the DNA strand.

I need to modify the KMP algorithm for two things:

Always find at least one matching character from the pattern in the strand, even if it is not a full match, and recreate the shift table without that character. Count how many combinations of the gene from the DNA were found. Of course, they must follow the restriction of the minimum size for the match to be considered valid

For example:

enter image description here

Here is the code of KMP:

void calcular_tabela(int *k, char *P, int len) {
    int j = -1;
    k[0] = -1;
    for (int i = 1; i < len; i++) {
        while (j >= 0 && P[j + 1] != P[i])
            j = k[j];
        if (P[j + 1] == P[i])
            j++;
        k[i] = j;
    }
}

void KMP(int *R, int *count, char *T, char *P, float *compatibildade, int tamSubcadeia) {
    int n = strlen(T), m = strlen(P);
    int *k = malloc(m * sizeof(int));
    int caracteres_iguais = 0;
    int aux = 0;
    // Calcula a tabela para o padrão inteiro inicialmente
    calcular_tabela(k, P, m);
    
    // Debug: imprime a tabela inicial
    printf("TABELA INICIAL:\n");
    for (int i = 0; i < m; i++) {
        printf("pos %d: %d\n", i, k[i]);
    }
 
    for (int i = 0, j = -1; i < n; i++) {
        // Processamento padrão do KMP
        while (j >= 0 && P[j + 1] != T[i]) j = k[j];

        if (P[j + 1] == T[i])
            j++;
        
        if (j >= tamSubcadeia) { // Encontrou correspondência parcial ou completa
            caracteres_iguais += j + 1;
            printf("passou, j= %d\n", j);
            calcular_tabela(k, P + j, m-j);
            j = k[j];
        } else {
            calcular_tabela(k, P + j, m-j);
            j = k[j];
        }

        // (Opcional) Debug: imprime a tabela alterada a cada iteração
        printf("TABELA ALTERADA:\n");
        for (int i = 0; i < (m - caracteres_iguais); i++) {
            printf("caractere %c pos %d: %d\n", P[caracteres_iguais + i], i, k[i]);
        }
        printf("\n");
    }
    
    float pct = ((float)caracteres_iguais / m) * 100;
    printf("%d combinações: %f\n", caracteres_iguais, pct);
    *compatibildade = pct;
    
    free(k);
}

I don't know how I can change the KMP so it does the question it asks. Currently it works, but only in case the gene pattern has only 1 character type "GGG, for example".

Hope some KMP specialist see this code

Recreate the table when it found a pattern, or a character

------------------------------ATUALIZAÇÃO-------------

Thank you for answering me, the question of DNA can be abstracted, you can consider a chain with the alphabet A C G T, and a standard to use KMP.

There are 2 cases in which "cutting" must be done in the pattern string: when the match size meets the minimum size of the substring (work >= sizeSubstring) And when there is a match and it does not have the minimum size to be considered. In this case, it is also necessary to remove these characters from the string, but not iterate through the variable to find the combinations (equal characters).

I made some changes to make it work in the example, but in others it is no longer correct.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

void cut_string(char *test, char r)
{
    int i = 1, j = 0;

    while(test[i] != '\0')
    {
        test[j] = test[i];
        j++;
        i++;
    }
    test[j] = '\0';
    printf("New test string: %s\n", test);  /* Show the next test string for illustration   */
}

void KMP(int *count, char *T, char *P, float *compatibildade, int tamSubcadeia)
{
    int step_1 = 0;             /* Increment work variable for the test string              */
    int step_2 = 0;             /* Increment work variable for the DNA strand               */
    int step_3 = 0;             /* Work variable to reset the testing continuation          */
    int work = 0;               /* String match work variable                               */
    int m = strlen(P);          /* DNA string length used to calculate percentage           */
    int caracteres_iguais = 0;  /* Number of valid substring matches >= tamSubcadeia        */
    bool found = false;         /* Work variable if when a valid pattern match was found    */
    bool new_letter = false;    /* Test variable to ensure string match contains two chars  */
    float pct;                  /* Match percentage work variable                           */
    char sub_string[100];       /* Work field to print out when a substring match is found  */

    printf("%s", T);
    while(1)
    {
        if (P[step_1] == T[step_2])
        {
            if (step_3 == 0)
                step_3 = step_2 + 1;

            sub_string[work] = P[step_1];

            if (work > 0 && sub_string[work] != sub_string[0])
                new_letter = true;

            work++;

            if (work > 0 && new_letter) /* Valid length match   */
            {
                if (work >= tamSubcadeia){
                    sub_string[work] = '\0';
                    printf("%s\n", sub_string);
                    found = true;
                    new_letter = false;    /* Increment number of matches      */
                } else {
                    sub_string[work] = '\0';
                    printf("%s\n", sub_string);
                    new_letter = false;    /* Increment number of matches      */
                }
            }

            step_1++;
            step_2++;

            if (T[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
        else
        {
            if (found)
            {
                cut_string(P, P[0]);    /* Remove first character from test */
                found = false;
                caracteres_iguais++;
                work = 0;
                step_1 = 0;
                step_2 = step_3;
                step_3 = 0;
                continue;
            }

            step_2++;

            if (T[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
    }

    pct = ((float)caracteres_iguais / m) * 100; /* Calculate matches pct    */
    *compatibildade = pct;
    *count = caracteres_iguais;

    return;
}

int main()
{
    int count = 0, tamSubcadeia = 3;
    float compatibildade = 0.0;
    char T[100], P[100];

    strcpy(P, "AATTTGGCCC");
    strcpy(T, "AAAAAAAAAATTTTTTTTTTTGGGGGGGGG");

    KMP(&count, T, P, &compatibildade, tamSubcadeia);   /* Removed "R" - did not see its use    */

    printf("%d combinações: %f\n", count, compatibildade);

    return 0;
}

E aqui estão alguns outros exemplos de padrão que vc pode tentar testar também.

GGGGGGGGGG -for the DNA in the code AAAATTTCGTTAAATTTGAACATAGGGATA - new DNA

And here are some other examples that you can try to test too.

I'm going to take some time to rest, I'm no longer thinking after 10 hours of coding. Thanks for the help!

Upvotes: 1

Views: 98

Answers (1)

NoDakker
NoDakker

Reputation: 4816

First off, I am not a geneticist, but a retired programmer/analyst, so some of your detail about genetics was unclear to me. And reviewing the Knuth Morris Pratt algorithm with how you wanted to use it to search out substring matches within a DNA strand, I slowly worked my way back to how to analyze the DNA strand using the spirit of the KMP algorithm if not strictly following the letter of the law, so to speak.

First, when trying out your functions, I added in a simple prototype setup in the "main" function to replicate the DNA strand and the test strand.

int main()
{
    int count = 0, tamSubcadeia = 3, R = 0;
    float compatibildade = 0.0;
    char T[100], P[100];

    strcpy(T, "AATTTGGCCC");
    strcpy(P, "AAAAAAAAAATTTTTTTTTTTGGGGGGGGG");

    KMP(&R, &count, T, P, &compatibildade, tamSubcadeia);

    return 0;
}

Which yielded a percentage of zero for terminal output.

0 combinações: 0.000000

So analyzing the idea of searching the DNA strand for all or part of the test strand, iteratively removing the first character of the test string to continue testing, I came up with the following refactored code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

void cut_string(char *test, char r)
{
    int i = 1, j = 0;

    while(test[i] != '\0')
    {
        test[j] = test[i];
        j++;
        i++;
    }
    test[j] = '\0';
    printf("New test string: %s\n", test);  /* Show the next test string for illustration   */
}

void KMP(int *count, char *T, char *P, float *compatibildade, int tamSubcadeia)
{
    int step_1 = 0;             /* Increment work variable for the test string              */
    int step_2 = 0;             /* Increment work variable for the DNA strand               */
    int step_3 = 0;             /* Work variable to reset the testing continuation          */
    int work = 0;               /* String match work variable                               */
    int m = strlen(P);          /* DNA string length used to calculate percentage           */
    int caracteres_iguais = 0;  /* Number of valid substring matches >= tamSubcadeia        */
    bool found = false;         /* Work variable if when a valid pattern match was found    */
    bool new_letter = false;    /* Test variable to ensure string match contains two chars  */
    float pct;                  /* Match percentage work variable                           */
    char sub_string[100];       /* Work field to print out when a substring match is found  */

    while(1)
    {
        if (T[step_1] == P[step_2])
        {
            if (step_3 == 0)
                step_3 = step_2 + 1;

            sub_string[work] = T[step_1];

            if (work > 0 && sub_string[work] != sub_string[0])
                new_letter = true;

            work++;

            if (work >= tamSubcadeia && new_letter) /* Valid length match   */
            {
                sub_string[work] = '\0';
                printf("%s\n", sub_string);
                found = true;
                new_letter = false;
                caracteres_iguais++;    /* Increment number of matches      */
            }

            step_1++;
            step_2++;

            if (P[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
        else
        {
            if (found)
            {
                cut_string(T, T[0]);    /* Remove first character from test */
                found = false;
                work = 0;
                step_1 = 0;
                step_2 = step_3;
                step_3 = 0;
                continue;
            }

            step_2++;

            if (P[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
    }

    pct = ((float)caracteres_iguais / m) * 100; /* Calculate matches pct    */
    *compatibildade = pct;
    *count = caracteres_iguais;

    return;
}

int main()
{
    int count = 0, tamSubcadeia = 3;
    float compatibildade = 0.0;
    char T[100], P[100];

    strcpy(T, "AATTTGGCCC");
    strcpy(P, "AAAAAAAAAATTTTTTTTTTTGGGGGGGGG");

    KMP(&count, T, P, &compatibildade, tamSubcadeia);   /* Removed "R" - did not see its use    */

    printf("%d combinações: %f\n", count, compatibildade);

    return 0;
}

Following are some of the pertinent points in the refactoring:

  • In lieu of calling the "calcular_tabela" function, I defined a function called "cut_string" to iteratively remove the first character of the test string once no more matches were found that included the initial character.
  • As matches were found that were the same length or longer as the minimum substring length, a match counter was incremented.
  • Once all of the subsequent matches were found with the test string as is, the "cut_string" function was called, and substring testing continued with the next character in the DNA strand.
  • Once the testing reached the end of the DNA strand, a percentage was calculated and printed.

Following is the terminal output using the DNA strand example and test string example noted above.

craig@Vera:~/C_Programs/Console/Genetics/bin/Release$ ./Genetics 
AAT
AATT
AATTT
New test string: ATTTGGCCC
ATT
ATTT
New test string: TTTGGCCC
TTTG
TTTGG
New test string: TTGGCCC
TTG
TTGG
New test string: TGGCCC
TGG
New test string: GGCCC
10 combinações: 33.333336

The first thing to note is that the percentage the refactored program calculated is different than the 50% value noted in your example, which probably is reflective of my inexperience in genetics. So that might be something you might look at

Upvotes: 0

Related Questions