Knuth Morris Pratt Algorithm

Question

Given a string of DNA characters (accepted characters: A, C, G, T) and a disease, with gene x (consider a pattern P to be found) also using the same alphabet. The task is to search the DNA strand using the provided gene, check the probability of the disease with gene x by verifying matches of the gene pattern within the DNA strand.

I need to modify the KMP algorithm for two things:

Always find at least one matching character from the pattern in the strand, even if it is not a full match, and recreate the shift table without that character. Count how many combinations of the gene from the DNA were found. Of course, they must follow the restriction of the minimum size for the match to be considered valid

For example:

Here is the code of KMP:

void calcular_tabela(int *k, char *P, int len) {
    int j = -1;
    k[0] = -1;
    for (int i = 1; i < len; i++) {
        while (j >= 0 && P[j + 1] != P[i])
            j = k[j];
        if (P[j + 1] == P[i])
            j++;
        k[i] = j;
    }
}

void KMP(int *R, int *count, char *T, char *P, float *compatibildade, int tamSubcadeia) {
    int n = strlen(T), m = strlen(P);
    int *k = malloc(m * sizeof(int));
    int caracteres_iguais = 0;
    int aux = 0;
    // Calcula a tabela para o padrão inteiro inicialmente
    calcular_tabela(k, P, m);
    
    // Debug: imprime a tabela inicial
    printf("TABELA INICIAL:
");
    for (int i = 0; i < m; i++) {
        printf("pos %d: %d
", i, k[i]);
    }
 
    for (int i = 0, j = -1; i < n; i++) {
        // Processamento padrão do KMP
        while (j >= 0 && P[j + 1] != T[i]) j = k[j];

        if (P[j + 1] == T[i])
            j++;
        
        if (j >= tamSubcadeia) { // Encontrou correspondência parcial ou completa
            caracteres_iguais += j + 1;
            printf("passou, j= %d
", j);
            calcular_tabela(k, P + j, m-j);
            j = k[j];
        } else {
            calcular_tabela(k, P + j, m-j);
            j = k[j];
        }

        // (Opcional) Debug: imprime a tabela alterada a cada iteração
        printf("TABELA ALTERADA:
");
        for (int i = 0; i < (m - caracteres_iguais); i++) {
            printf("caractere %c pos %d: %d
", P[caracteres_iguais + i], i, k[i]);
        }
        printf("
");
    }
    
    float pct = ((float)caracteres_iguais / m) * 100;
    printf("%d combinações: %f
", caracteres_iguais, pct);
    *compatibildade = pct;
    
    free(k);
}

I don't know how I can change the KMP so it does the question it asks. Currently it works, but only in case the gene pattern has only 1 character type "GGG, for example".

Hope some KMP specialist see this code

Recreate the table when it found a pattern, or a character

------------------------------ATUALIZAÇÃO-------------

Thank you for answering me, the question of DNA can be abstracted, you can consider a chain with the alphabet A C G T, and a standard to use KMP.

There are 2 cases in which "cutting" must be done in the pattern string: when the match size meets the minimum size of the substring (work >= sizeSubstring) And when there is a match and it does not have the minimum size to be considered. In this case, it is also necessary to remove these characters from the string, but not iterate through the variable to find the combinations (equal characters).

I made some changes to make it work in the example, but in others it is no longer correct.

#include 
#include 
#include 
#include 

void cut_string(char *test, char r)
{
    int i = 1, j = 0;

    while(test[i] != '\0')
    {
        test[j] = test[i];
        j++;
        i++;
    }
    test[j] = '\0';
    printf("New test string: %s
", test);  /* Show the next test string for illustration   */
}

void KMP(int *count, char *T, char *P, float *compatibildade, int tamSubcadeia)
{
    int step_1 = 0;             /* Increment work variable for the test string              */
    int step_2 = 0;             /* Increment work variable for the DNA strand               */
    int step_3 = 0;             /* Work variable to reset the testing continuation          */
    int work = 0;               /* String match work variable                               */
    int m = strlen(P);          /* DNA string length used to calculate percentage           */
    int caracteres_iguais = 0;  /* Number of valid substring matches >= tamSubcadeia        */
    bool found = false;         /* Work variable if when a valid pattern match was found    */
    bool new_letter = false;    /* Test variable to ensure string match contains two chars  */
    float pct;                  /* Match percentage work variable                           */
    char sub_string[100];       /* Work field to print out when a substring match is found  */

    printf("%s", T);
    while(1)
    {
        if (P[step_1] == T[step_2])
        {
            if (step_3 == 0)
                step_3 = step_2 + 1;

            sub_string[work] = P[step_1];

            if (work > 0 && sub_string[work] != sub_string[0])
                new_letter = true;

            work++;

            if (work > 0 && new_letter) /* Valid length match   */
            {
                if (work >= tamSubcadeia){
                    sub_string[work] = '\0';
                    printf("%s
", sub_string);
                    found = true;
                    new_letter = false;    /* Increment number of matches      */
                } else {
                    sub_string[work] = '\0';
                    printf("%s
", sub_string);
                    new_letter = false;    /* Increment number of matches      */
                }
            }

            step_1++;
            step_2++;

            if (T[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
        else
        {
            if (found)
            {
                cut_string(P, P[0]);    /* Remove first character from test */
                found = false;
                caracteres_iguais++;
                work = 0;
                step_1 = 0;
                step_2 = step_3;
                step_3 = 0;
                continue;
            }

            step_2++;

            if (T[step_2] == '\0')  /* Quit when end of DNA strand found    */
                break;
        }
    }

    pct = ((float)caracteres_iguais / m) * 100; /* Calculate matches pct    */
    *compatibildade = pct;
    *count = caracteres_iguais;

    return;
}

int main()
{
    int count = 0, tamSubcadeia = 3;
    float compatibildade = 0.0;
    char T[100], P[100];

    strcpy(P, "AATTTGGCCC");
    strcpy(T, "AAAAAAAAAATTTTTTTTTTTGGGGGGGGG");

    KMP(&count, T, P, &compatibildade, tamSubcadeia);   /* Removed "R" - did not see its use    */

    printf("%d combinações: %f
", count, compatibildade);

    return 0;
}

E aqui estão alguns outros exemplos de padrão que vc pode tentar testar também.

GGGGGGGGGG -for the DNA in the code AAAATTTCGTTAAATTTGAACATAGGGATA - new DNA

new patterns for the new DNA-- TTTTTTGGGG CATAGGGATT

And here are some other examples that you can try to test too.

I'm going to take some time to rest, I'm no longer thinking after 10 hours of coding. Thanks for the help!

Knuth Morris Pratt Algorithm

Answers (1)

Related Questions