Reputation: 2974

How can I use a very large string (500 million characters) in my program?

I have a .txt file that contains a 500 million digit binary representation of Pi.

I need to use a string representation of that in my program. I also need to be able to search it for substrings and the like - in other words, I need to be able to treat it like a normal sized string. I'll be trying to find a lot of substrings so speed is necessary.

My initial logic was to simply copy and paste the string directly into the program and use it as a static variable.. But I was unable to actually open the .txt file, so I couldn't copy and paste. My next attempt was to read the entire string from the file, but I can't do this in a static method and it takes WAAAY too long (I actually don't know exactly how long it takes, I closed the program eventually).

Is it possible to do this? Any help would be appreciated.

Edit: Potentially relevant information:

With this code:

/// <summary>
    /// Gets a 500 million binary digit representation of Pi.
    /// </summary>
    public static string GetPi()
    {
        //as per http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx
        StreamReader piStream = new StreamReader(@"C:\binaryPi.txt");
        string pi = "";
        string line;

        while ((line = piStream.ReadLine()) != null)
        {
            pi += line;
        }

        return pi;
    }

I get an OutOfMemoryException.. So scanning the file actually doesn't seem possible, unless I'm missing something..

Upvotes: 2

Answers (5)

HABO

Reputation: 15816

Read the text file once in an application that converts it to an array of bits, one segment at a time, and then write a new file containing the array persisted in binary. Thereafter, just use the real binary file.

To search you can create a bit mask of the target pattern and slide it along the bit array, one bit at a time, performing a bitwise XOR to compare the bits and a bitwise AND to filter out bits you don't care about. If anything left is nonzero then you don't have a match.

Experiment to determine how the performance differs between datatypes. For example, you could use bytes and search 8-bits at a time or integers and search 32-bits at a time. If your pattern is smaller than the selected datatype then the bitwise AND discards the extra bits. Larger patterns are handled by finding an initial match, then trying to match the next segment of the pattern and so on.

EDIT: An optimization that may help. Let's say you have a long, e.g. greater than 128-bit, pattern. Construct an array of 64 64-bit values from the pattern: bits 0-63, 1-64, 2-65, ... . You can then make a fast pass through trying to match any of the array values to each long integer value in the pi array. Where matches occur, check any prior bits for matches as needed, then test the subsequent bits. The idea is to make the best use of aligned memory accesses.

Depending on the pattern length it may be worthwhile to assemble a two dimensional array of shifted values such that you can easily continue matching a shifted pattern without recomputing the values. (Just make a turn at the match and pick up the next pattern value wih the same shift.) You would need to allow for masking unused bits at each end. Note that you want to make the most frequent array access occur on the shortest stride to make the best use of cache.

The BigInteger structure may be of some use in fiddling about.

Upvotes: 1

Guffa

Reputation: 700372

I would suggest that you make a custom class that can handle that kind of data.

If the content of the file is a representation of the binary form of pi, then it's just zeroes and ones. If you store each bit in an actual bit, then each binary digit uses 1/8 of a byte, while if you store it as text, each bit will use two bytes. By storing in a more compact form, you will use 1/16 of the memory.

Your class would then have to handle how you search for bit patterns in the data. That would be the tricky part, but if you create eight different versions of the search pattern, shifted to match the eight possible positions in a byte, the search could be even more efficient than searching in a string.

Edit:

Here's a start...

public class BitList {

  private byte[] _data;
  private int _count;

  public BitList(string fileName) {
    using (FileStream s = File.OpenRead(fileName)) {
      _data = new byte[(s.Length + 7) / 8];
      _count = 0;
      int len;
      byte[] buffer = new byte[4096];
      while ((len = s.Read(buffer, 0, buffer.Length)) > 0) {
        for (int i = 0; i < len; i++) {
          switch (buffer[i]) {
            case 48: Add(0); break;
            case 49: Add(1); break;
          }
        }
      }
    }
  }

  public void Add(int bit) {
    _data[_count / 8] |= (byte)(bit << (_count % 8));
    _count++;
  }

  public int this[int index] {
    get {
      return (_data[index / 8] >> (index % 8)) & 1;
    }
  }

}

(Note: This code is NOT TESTED, but you should at least get the principle.)

Upvotes: 7

jorel

Reputation: 808

According to MSDN the maximum size of a String object in memory is 2 GB, or about 1 billion characters. To allocate a string of that size you would probably need a 64bit OS. Since you are dealing with digits only try to use some other data type than strings.

Upvotes: 0

David Thielen

Reputation: 32926

If speed of finding multiple different sub-strings is key, then here is an approach that may give you better results.

Leave it in the text file. Scan it and build a tree where the top of the tree has 10 nodes holding the digits 0..9. Each of those nodes holds 10 nodes, the sequence of their digit and then 0..9. Top level is 0..9, next is 00..09, .., 90..99. Next is 000..009, ... 990..999.

And at each level you also store the offset in the text file of every occurrence that matches its sequence - only if it has no children. The no children rule is to save a lot of memory and by definition every child node contains offsets where the parent sequence exists. In other words, if you are looking for "123456" then an occurrence of "123456789" is a match.

This would use a horrendous amount of memory but it would be very fast on the lookups.

Adding: There are a lot of tricks you can implement to minimize the memory usage. Store the numbers as a nibble (4 bits). Instead of objects for elements in your tree, store offsets from a base and put everything in a small number of large fixed size arrays. You can do this because you create the tree once and then it is read only.

Upvotes: 0

Fredou

Reputation: 20120

so with the information available, i would just declare a bitarray (initial size as file.length) then i would open the file and reading it by chunk of maybe 4096 then you look trough these 4096 characters

in the loop you just do a simple if text = 1 then set true else set false

do this until you reach the end of the file then you have the full thing into a huge bitarray variable

from that point on you just need to find your pattern

Upvotes: 2

How can I use a very large string (500 million characters) in my program?

Answers (5)

Edit:

Related Questions