Reputation: 5914

Need help on an algorithm

I need help with an algorithm. I have randomly generated six-digit numbers, for example:

123654 109431

There are approximately 1 million of them saved in a file, one per line. I have to filter them according to the rule described below.

Take a number and compare it to all the others digit by digit. If another number is identical except that one digit is bigger by one, delete that other number. Let me show it with numbers.

Our number is 123456. Increase the first digit by 1, so the number becomes 223456. Delete all the 223456s from the file. Increase the second digit by 1, so the number becomes 133456. Delete all the 133456s from the file, and so on...

I can do it just as I describe but I need it to be "FAST".

Can anyone help me with this?

Thanks.

Upvotes: 6

Answers (10)

cdhowie

Reputation: 169353

This algorithm will keep a lot of numbers around in memory, but it will process the file one number at a time so you don't actually need to read it all in at once. You only need to supply an IEnumerable<int> for it to operate on.

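    public static IEnumerable<int> FilterInts(IEnumerable<int> ints)
    {
        // Every variant produced by the numbers seen so far.
        var removed = new HashSet<int>();

        foreach (var i in ints)
        {
            // Work on the zero-padded six-digit form so each position can be edited.
            var iStr = i.ToString("000000").ToCharArray();

            for (int j = 0; j < iStr.Length; j++)
            {
                var c = iStr[j];

                // Increment the digit at position j; note that 9 wraps around to 0 here.
                if (c == '9')
                    iStr[j] = '0';
                else
                    iStr[j] = (char)(c + 1);

                removed.Add(int.Parse(new string(iStr)));

                // Restore the original digit before moving to the next position.
                iStr[j] = c;
            }

            // Only variants of numbers read so far (including this one) are known at this point.
            if (!removed.Contains(i))
                yield return i;
        }
    }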

You can use this method to create an IEnumerable<int> from the file:

    public static IEnumerable<int> ReadIntsFrom(string path)
    {
        using (var reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return int.Parse(line);
        }
    }

Upvotes: 1

Ravindra Sane

Reputation: 472

This still sounds like a homework question... The fastest sort on a million numbers is n log(n), that is 1000000 × log10(1000000) = 6 × 1000000, which is the same as comparing 6 numbers to each of the million numbers. So a direct comparison will be faster than sort-and-remove, because after sorting you still have to compare in order to remove. Unless, of course, my calculations have entirely missed the target.

Something else comes to mind: when you pick up the number, read it as hex rather than base 10. Then some bitwise operators may help somehow. I'm still thinking about what can be done with this and will update if it works.

EDIT: I'm currently thinking along the lines of Gray code. 123456 (our original number) and 223456 or 133456 differ in only one digit, and a Gray code converter would catch that fast. It's late at night here, so if someone else finds this useful and can give a solution...

Upvotes: 0

egrunin

Reputation: 25083

All the suggestions (so far) require six comparisons per input line, which is not necessary. The numbers are coming in as strings, so use string comparisons.

Start with @Armen Tsirunyan's idea:

Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457.

But instead of single comparisons, make that into a string:

    string arg = "223456 133456 124456 123556 123466 123457";

Then read through the input (either from file or in memory). In C#, roughly:

    foreach (string s in theBigListOfNumbers)
        if (arg.IndexOf(s, StringComparison.Ordinal) == -1)
            Console.WriteLine(s);

This is just one comparison per input line, no dictionaries, maps, iterators, etc.

Edited to add:

In x86 instruction set processors (not just the Intel brand), substring searches like this are very fast. To search for a character within a string, for example, is just one machine instruction.

I'll have to ask others to weigh in on alternate architectures.

Upvotes: 1

thattolleyguy

Reputation: 813

Read all your numbers from the file and store them in a map where the number is the key and a boolean is the value signifying that the value hasn't been deleted. (True means exists, false means deleted).

Then iterate through your keys. For each key, set the map to false for the values you would be deleting from the list.

Iterate through your list one more time and get all the keys where the value is true. This is the list of remaining numbers.

    public List<int> FilterNumbers(string fileName)
    {
        // Map each number to a flag: true = still present, false = deleted.
        Dictionary<int, bool> numbers = new Dictionary<int, bool>();
        using (StreamReader sr = File.OpenText(fileName))
        {
            string s;
            while ((s = sr.ReadLine()) != null)
            {
                // The indexer tolerates duplicate lines; Add would throw on them.
                numbers[Int32.Parse(s)] = true;
            }
        }

        // Adding a place value increments one digit (and carries when that digit is 9).
        int[] increments = { 100000, 10000, 1000, 100, 10, 1 };

        // Iterate over a snapshot of the keys so the dictionary can be updated in the loop.
        foreach (int number in new List<int>(numbers.Keys))
        {
            if (numbers[number])
            {
                foreach (int increment in increments)
                {
                    if (numbers.ContainsKey(number + increment))
                        numbers[number + increment] = false;
                }
            }
        }

        // Keep only the numbers still flagged as present.
        List<int> validNumbers = new List<int>();
        foreach (KeyValuePair<int, bool> entry in numbers)
        {
            if (entry.Value)
                validNumbers.Add(entry.Key);
        }
        return validNumbers;
    }

This may need testing, as I don't have a C# compiler on this computer and I'm a bit rusty. The algorithm takes a fair amount of memory, but it runs in linear time.

EDIT: This runs into problems whenever one of the digits is 9. I'll update the code later.
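One way the 9 case might be handled is to inspect each digit before adding its place value. The sketch below assumes that a 9 has no successor and is simply skipped (the question does not say whether a 9 should wrap to 0). `DigitSuccessors` and `Of` are hypothetical names, not part of the code above.

```csharp
using System.Collections.Generic;

public static class DigitSuccessors
{
    // Yields every number obtained by adding 1 to exactly one digit of a
    // six-digit number, from the leftmost digit to the rightmost.
    // Assumption: a digit equal to 9 is skipped rather than wrapped to 0.
    public static IEnumerable<int> Of(int number)
    {
        for (int place = 100000; place >= 1; place /= 10)
        {
            int digit = (number / place) % 10;
            if (digit < 9)
                yield return number + place;
        }
    }
}
```

For 123456 this yields 223456, 133456, 124456, 123556, 123466 and 123457; for 199999 it yields only 299999, because adding a place value to any of the 9s would carry into the neighbouring digit.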

Upvotes: 0

veljkoz

Reputation: 8512

How about this: process the numbers one by one, storing them in two hash tables, NumbersOK and NumbersNotOK.

1. Take one number.
2. If it's not in NumbersNotOK, place it in the NumbersOK hash.
3. Compute its single-digit-increment variants and add them to the NumbersNotOK hash.
4. Remove any NumbersOK members that match one of the variants.
5. Repeat from step 1 until the end of the file.
6. Save NumbersOK to the file.

This way you'll pass over the list just once. Hash tables are made for exactly this kind of purpose, and lookups will be very fast (no expensive comparison methods).

This algorithm is not complete, as it doesn't handle repeated numbers, but that can be handled with some tweaking...
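The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: `SinglePassFilter` and `Filter` are hypothetical names, a digit equal to 9 is skipped rather than wrapped, and duplicates collapse into a single entry because the result is a set.

```csharp
using System.Collections.Generic;

public static class SinglePassFilter
{
    // Returns the numbers that are not a single-digit-increment variant
    // of any number in the input, reading the input exactly once.
    public static HashSet<int> Filter(IEnumerable<int> numbers)
    {
        var numbersOk = new HashSet<int>();
        var numbersNotOk = new HashSet<int>();

        foreach (int n in numbers)
        {
            // Step 2: keep the number unless an earlier number already ruled it out.
            if (!numbersNotOk.Contains(n))
                numbersOk.Add(n);

            // Steps 3 and 4: record each variant and evict it if it was kept earlier.
            for (int place = 1; place <= 100000; place *= 10)
            {
                if ((n / place) % 10 == 9)
                    continue; // assumption: a 9 has no successor

                int variant = n + place;
                numbersNotOk.Add(variant);
                numbersOk.Remove(variant);
            }
        }

        return numbersOk;
    }
}
```

Because a variant is both recorded and evicted, the result does not depend on whether 223456 appears before or after 123456 in the file.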

Upvotes: 0

Tim M.

Reputation: 54397

This sounds like a potential case for a multidimensional array, and possibly also unsafe C# code, so that you can use pointer math to iterate through such a large quantity of numbers.

I would have to dig into it further, but I would also probably use a Dictionary for non-linear lookups, if you are comparing numbers that aren't in sequence.

Upvotes: 0

Armen Tsirunyan

Reputation: 133112

First of all, since there are only around 1 million numbers, you had better run the algorithm in RAM, not on disk. That is, first load the contents into an array, then modify the array, then write the results back to the file.

I would suggest the following straightforward algorithm. Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457. Now pass over the array and, if a number is NOT one of these, write it to another array. Alternatively, if it is one of these numbers, delete it (recommended if your data structure has O(1) removal).
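A minimal sketch of the copy-to-another-array variant, for a single chosen number, might look like this. `TargetFilter`, `TargetsFor` and `RemoveTargets` are hypothetical names, and skipping digits equal to 9 is an assumption the question does not settle.

```csharp
using System.Collections.Generic;

public static class TargetFilter
{
    // Precalculates the target numbers: the given number with exactly one digit increased by 1.
    public static HashSet<int> TargetsFor(int number)
    {
        var targets = new HashSet<int>();
        for (int place = 100000; place >= 1; place /= 10)
        {
            // Assumption: a digit equal to 9 produces no target.
            if ((number / place) % 10 != 9)
                targets.Add(number + place);
        }
        return targets;
    }

    // Copies every number that is not a target into a new list, preserving order.
    public static List<int> RemoveTargets(IEnumerable<int> numbers, int number)
    {
        HashSet<int> targets = TargetsFor(number);
        var kept = new List<int>();
        foreach (int n in numbers)
        {
            if (!targets.Contains(n))
                kept.Add(n);
        }
        return kept;
    }
}
```

Calling `TargetFilter.RemoveTargets(new[] { 123456, 223456, 111111, 123457 }, 123456)` keeps 123456 and 111111. The set lookup keeps the work per number constant, so one pass over the array is enough for each chosen number.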

Upvotes: 5

Bill Carey

Reputation: 1415

It seems like the rule you're describing is this: for the target number abcdef, you want to find all numbers that contain a+1, b+1, c+1, d+1, e+1, or f+1 in the appropriate place. You can do this in O(n) by looping over the lines in the file and comparing each of the six digits to the corresponding incremented digit of the target number. If no digits match, write the number to an output file.
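Under that reading, the per-line check could be sketched as below. Note that this reading is broader than the example in the question: a candidate is dropped as soon as any one position holds the target's digit plus one, whatever the other digits are. `IncrementedDigitCheck`, `Matches` and `Keep` are hypothetical names, and both strings are assumed to have the same length.

```csharp
using System.Collections.Generic;

public static class IncrementedDigitCheck
{
    // True when some position of candidate holds the target's digit plus one.
    // A target digit of '9' never matches, because '9' + 1 is not a digit character.
    public static bool Matches(string target, string candidate)
    {
        for (int i = 0; i < target.Length; i++)
        {
            if (candidate[i] == target[i] + 1)
                return true;
        }
        return false;
    }

    // Streams through the lines and yields only those with no matching position.
    public static IEnumerable<string> Keep(string target, IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            if (!Matches(target, line))
                yield return line;
        }
    }
}
```

With target "123456", the line "223456" is dropped, and so is "200000", since its first digit is 1 + 1; "123456" itself is kept.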

Upvotes: 0