Avinash
Avinash

Reputation: 13257

trie data structure and printing all the sub-strings

I need to print all the unique sub-strings. So I build a trie but not able to figure out how could i print all the sub-strings. For example if the input is aab and aac then I expect it to print "a", "aa", "aab", "aac", "ab", "ac", "b", "c".

What essentially I need to find out a way to get a unique substrings from set of strings. I am thinking trie is good way since building a trie would take O(n)

Following is my code to build a trie.

#include <string>
#include <iostream>
#include <vector>

struct trie_node {
    trie_node *(next[26]);

    trie_node() {
        for ( int i = 0; i < 26; ++i) {
            next[i] = (trie_node*)0;
        }
    }
};

trie_node *root;
char cur_substring[2000];
void build_trie(std::string& input) {
    trie_node *ptrie = root;
    for ( std::string::iterator it = input.begin(); it != input.end(); ++it) {
        int i = *it - 'a';
        if (ptrie->next[i] == (trie_node*)0)
            ptrie->next[i] = new trie_node;
        ptrie = ptrie->next[i];
    }
}

void print_sub_strings(trie_node *p_trie, int pos) {
    for (int i = 0; i < 26; i++) {
        if (p_trie->next[i] != (trie_node*)0) {
            cur_substring[pos] = i + 'a';
            print_sub_strings(p_trie->next[i], pos + 1 );
        }
    }
}

UPDATE 1

Based on the input I got I re-wrote my code, but it also does not seems to work.

#include <string>
#include <iostream>
#include <vector>

const int ALPHABET_SIZE = 26;
char text[2000];
int LEN;

struct trie_node_t { 
    trie_node_t*child_list[ALPHABET_SIZE]; 
    trie_node_t() {
        for(int index = 0; index < ALPHABET_SIZE; index++)
            child_list[index] = (trie_node_t*)0;
    }
};

class Trie {
public:
    Trie():m_root(new trie_node_t) {
    }

    ~Trie() {
        _delete(m_root);
    }

    void _insert(int pos) {
        int lcv, index; 
        trie_node_t* t = m_root;
        for(lcv = pos; lcv < LEN; lcv++) {
            index = text[lcv] - 'a';
            if (t->child_list[index] == (trie_node_t*)0) {
                t->child_list[index] = new trie_node_t;
            }
            t = t->child_list[index];
        }
    }
    void insert() {
        for ( int i = 0; i < LEN; i++) {
            _insert(i);
        }
    }

    void iterate() {
        _iterate(m_root, "");
    }

    void _iterate(trie_node_t *t, std::string prefix) {        
        for (int i = 0; i < ALPHABET_SIZE; i++) {
            if (t->child_list[i] != (trie_node_t*)0) {
                prefix += 'a' + i;
                std::cout << prefix << std::endl;
                _iterate(t->child_list[i], prefix);
            }   
        }
    }   
private: 
    int node_count;
    trie_node_t* m_root;

    void _delete (trie_node_t* t) {
        int index; 
        if (t != (trie_node_t*)0) {
            for(index = 0; index < ALPHABET_SIZE; index++)
                _delete(t->child_list[index]);
            delete t;
        }
    }    
};

int main ( int argc, char** argv) {
    Trie *pTrie =  new Trie();

    strcpy(text,"aab");
    LEN = strlen(text);
    pTrie->insert();

    strcpy(text,"aac");
    LEN = strlen(text);
    pTrie->insert();

    pTrie->iterate();
}

Output is

a
aa
aab
aabc
aab
aabc
ab
abc
Press any key to continue . . .

Upvotes: 3

Views: 3923

Answers (3)

LiKao
LiKao

Reputation: 10658

If you want to get all the substrings of the string (including those that do not start at the first letter), you have to store suffixes of the string in the trie.

I.e. what you do, you store the complete string, then you store the string without the first letter, then without the second letter etc. This way the trie handles the removal of repeating substrings correctly and when you traverse it you will get all the correct substrings. However note that this is not O(n), which as others have pointed out correctly is an impossible bound for this kind of problem.

The usual use case for this kind of datastructure however is fast retrieval of substrings. If you store at each leave the position where you started the suffixes (there might be multiple ones), you can easily find all occurences of any substring in an arbitrarily long substring. This is where tries play their strength for retrieval tasks, such as full text searches.

EDIT:

After your update, you are appending to the local prefix variable within the loop, so when you investigate the next child in the loop, it will have the wrong value. The additional values, which should not be there in your example are caused by this. You have to create a new prefix variable in each iteration and pass that one around. You can find some corrected code with additional debug outputs here.

Upvotes: 0

Ddavid
Ddavid

Reputation: 385

Trie stores different strings, but it doesn't care about their sub-strings which are not starting from the first letter. Each string stored in Trie starts from the root to a non-root node. You can try to pick up a sub-string from a non-root node to another non-root node, but it can't ensure that the sub-string is unique.

For example, a string "abab" is stored. You can get unique strings a, ab, aba, abab from root to a non-root node. If you try to pick up string starting from non-root nodes, you will get

  • a b ab
  • a ba b
  • a bab
  • ab a b
  • ab ab
  • aba b

where a, ab and b have already existed. You can try to store all sub-strings ends at the last letter to avoid this. For example, when a new string "abcdab" is coming, you need to store "abcdab", "bcdab", "cdab", "dab", "ab" and "b" in Trie. Anyway, this makes the time complexity becomes O(n^2), not O(n).

Upvotes: 1

Shamim Hafiz - MSFT
Shamim Hafiz - MSFT

Reputation: 22104

Well, if you properly built the Trie structure by considering every string, traversing the Trie would give a you every possible sub-string of the source strings. Duplicates are automatically handled due to the structure of Tries.

Upvotes: 0

Related Questions