Avishek Bhattacharya
Avishek Bhattacharya

Reputation: 6974

Longest Common Subsequence between very large strings

I am trying solve the Longest Common subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).

I am trying to do this to calculate the overlap between 2 strings.

This is well know Dynamic programming problem. However, In my case the strings are is too huge. When I tried to use the 2D matrix to memoize, I ran into memory out of bound problem.

One solution could be using sparse matrix instead but I am little concerned about the performance overhead with that.

Also I want to perform this algorithm across multiple strings. And it will be okay to provide approximate answer since I am only trying to measure the overlap between 2 strings.

EDIT: After some investigation I found the following alternatives

  1. Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm

Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf

  1. Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf

  2. Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf

  3. LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf

  4. Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635

Upvotes: 1

Views: 1015

Answers (1)

MathBunny
MathBunny

Reputation: 956

To reduce memory complexity, you don't need to store the entire 2D table. You can only store the row above and current row and thus you can reduce the memory consumption by O(N) if you store the maximum in another data-structure. This results in O(N) memory usage, but time complexity remains O(N^2).

Upvotes: 1

Related Questions