Reputation: 351
I'm trying to solve Project Euler problem #92 with Haskell. I started learning Haskell recently. It's the first Project Euler problem I've tried to solve with Haskell, but my piece of code doesn't terminate even in 10 minutes. I know you don't give me the answer directly, but again I should warn I find answer with c++ doesn't give answer of Euler or new logic to solve Euler. I'm just curious why that guy doesn't work fast and what should I do to make it faster?
{--EULER 92--}
import Data.List
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult 0 = 0
giveResult x = (square (mod x 10)) + (giveResult (div x 10))
square x = x*x
a=[1..10000000]
main = putStrLn(show (sum (map myFirstFunction a)))
Upvotes: 3
Views: 834
Reputation: 183873
The biggest speedup can of course be gained from using a better algorithm. I'm not going deep into that here, though.
So let's focus on improving the used algorithm without really changing it.
You never give any type signature, therefore the type defaults to the arbitrary precision Integer
. Everything here fits easily in an Int
, there's no danger of overflow, so let's use that. Adding a type signature myFirstFunction :: Int -> Int
helps: time drops from Total time 13.77s ( 13.79s elapsed)
to Total time 6.24s ( 6.24s elapsed)
and total allocation drops by a factor of about 15. Not bad for such a simple change.
You use div
and mod
. These always compute a non-negative remainder and the corresponding quotient, so they need some extra checks in case some negative numbers are involved. The functions quot
and rem
map to the machine division instructions, they don't involve such checks and therefore are somewhat faster. If you compile via the LLVM backend (-fllvm
), that also takes advantage of the fact that you always divide by a single known number (10), and converts the division into multiplication and bit-shift. Time now: Total time 1.56s ( 1.56s elapsed)
.
Instead of using quot
and rem
separately, let's use the quotRem
function that computes both at once, so that we don't repeat the division (even with multiplication+shift that takes a little time):
giveResult x = case x `quotRem` 10 of
(q,r) -> r*r + giveResult q
That doesn't gain much, but a little: Total time 1.49s ( 1.49s elapsed)
.
You're using a list a = [1 .. 10000000]
, and map
the function over that list and then sum
the resulting list. That's idiomatic, neat and short, but not super fast, since allocating all those list cells and garbage collecting them takes time too - not very much, since GHC is very good at that, but transforming it into a loop
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
gains us a little still: Total time 1.34s ( 1.34s elapsed)
and the allocation dropped from 880,051,856 bytes allocated in the heap
for the last list version to 51,840 bytes allocated in the heap
.
giveResult
is recursive, and therefore cannot be inlined. The same holds for myFirstFunction
, hence each computation needs two function calls (at least). We can avoid that by rewriting giveResult
to a non-recursive wrapper and a recursive local loop,
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
so that that can be inlined: Total time 1.04s ( 1.04s elapsed)
.
Those were the most obvious points, further improvements - apart from the memoisation mentioned by hammar in the comments - would need some thinking.
We are now at
module Main (main) where
myFirstFunction :: Int -> Int
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)
giveResult :: Int -> Int
giveResult x = go 0 x
where
go acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (acc + r*r) q
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + myFirstFunction n) (n+1)
With -O2 -fllvm
, that runs in 1.04 seconds here, but with the native code generator (only -O2
), it takes 3.5 seconds. That difference is due to the fact that GHC itself doesn't convert the division into a multiplication and bit-shift. If we do it by hand, we get pretty much the same performance from the native code generator.
Because we know something that the compiler doesn't, namely that we never deal with negative numbers here, and the numbers don't become large, we can even generate a better multiply-and-shift (that would produce wrong results for negative or large dividends) than the compiler and take the time down to 0.9 seconds for the native code generator and 0.73 seconds for the LLVM backend:
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
Note: That requires that Int
is a 64-bit type, it won't work with 32-bit Int
s, it will produce wrong results for negative n
, and the multiplication will overflow for large n
. We're entering dirty-hack territory. We can alleviate the dirtyness by using Word
instead of Int
, that leaves only the overflow (which doesn't occur for n <= 10737418236
with Word
resp n <= 5368709118
for Int
, so here we are comfortably in the safe zone). The times aren't affected.
The corresponding C programme
#include <stdio.h>
unsigned int myFirstFunction(unsigned int i);
unsigned int giveResult(unsigned int i);
int main(void) {
unsigned int sum = 0;
for(unsigned int i = 1; i <= 10000000; ++i) {
sum += myFirstFunction(i);
}
printf("%u\n",sum);
return 0;
}
unsigned int myFirstFunction(unsigned int i) {
if (i == 1) return 0;
if (i == 89) return 1;
return myFirstFunction(giveResult(i));
}
unsigned int giveResult(unsigned int i) {
unsigned int acc = 0, r, q;
while(i) {
q = (i*0x66666667UL) >> 34;
r = i - q*10;
i = q;
acc += r*r;
}
return acc;
}
performs similarly, compiled with gcc -O3
, it runs in 0.78 seconds, and with clang -O3
in 0.71.
That's pretty much the end without changing the algorithm.
Now, a minor change of algorithm is memoisation. If we build a lookup table for the numbers <= 7*9²
, we need only one computation of the sum of the squares of the digits for each number rather than iterating that until we reach 1 or 89, so let's memoise,
module Main (main) where
import Data.Array.Unboxed
import Data.Array.IArray
import Data.Array.Base (unsafeAt)
import Data.Bits
qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
where
q = (n * 0x66666667) `unsafeShiftR` 34
r = n - 10 * q
digitSquareSum :: Int -> Int
digitSquareSum = go 0
where
go acc 0 = acc
go acc n = case qr10 n of
(q,r) -> go (acc + r*r) q
table :: UArray Int Int
table = array (0,567) $ assocs helper
where
helper :: Array Int Int
helper = array (0,567) [(i, f i) | i <- [0 .. 567]]
f 0 = 0
f 1 = 0
f 89 = 1
f n = helper ! digitSquareSum n
endPoint :: Int -> Int
endPoint n = table `unsafeAt` digitSquareSum n
main :: IO ()
main = print $ go 0 1
where
go acc n
| n > 10000000 = acc
| otherwise = go (acc + endPoint n) (n+1)
Doing the memoisation by hand instead of using a library makes the code longer, but we can tailor it to our needs. We can use an unboxed array, and we can omit the bounds check on the array access. Both significantly speed the computation up. The time is now 0.18 seconds for the native code generator, and 0.13 seconds for the LLVM backend. The corresponding C programme runs in 0.16 seconds compiled with gcc -O3
, and 0.145 seconds compiled with clang -O3
(Haskell beats C, w00t!).
The used algorithm however doesn't scale too well, a bit worse than linear, and for an upper bound of 108 (with suitably adapted memoisation limit), it runs in 1.5 seconds (ghc -O2 -fllvm
), resp. 1.64 seconds (clang -O3
) and 1.87 seconds (gcc -O3
) [2.02 seconds for the native code generator].
Using a different algorithm that counts the numbers whose sequence ends in 1 by partitioning such numbers into a sum of squares of digits (The only numbers that directly produce 1 are powers of 10. We can write
10 = 1×3² + 1×1²
10 = 2×2² + 2×1²
10 = 1×2² + 6×1²
10 = 10×1²
From the first, we obtain 13, 31, 103, 130, 301, 310, 1003, 1030, 1300, 3001, 3010, 3100, ... From the second, we obtain 1122, 1212, 1221, 2112, 2121, 2211, 11022, 11202, ... From the third 1111112, 1111121, ...
Only 13, 31, 103, 130, 301, 310 are possible sums of squares of the digits of numbers <= 10^10
, so only those need to be investigated further. We can write
100 = 1×9² + 1×4² + 3×1²
...
100 = 1×8² + 1×6²
...
The first of these partitions generates no children since it requires five nonzero digits, the other explicitly given generates the two children 68 and 86 (also 608 if the limit is 108, more for larger limits)), we can get better scaling and a faster algorithm.
The fairly unoptimised programme I wrote way back when to solve this problem runs (input is exponent of 10 of the limit)
$ time ./problem92 7
8581146
real 0m0.010s
user 0m0.008s
sys 0m0.002s
$ time ./problem92 8
85744333
real 0m0.022s
user 0m0.018s
sys 0m0.003s
$ time ./problem92 9
854325192
real 0m0.040s
user 0m0.033s
sys 0m0.006s
$ time ./problem92 10
8507390852
real 0m0.074s
user 0m0.069s
sys 0m0.004s
in a different league.
Upvotes: 23
Reputation: 139830
First off, I took the liberty of cleaning up your code a little:
endsAt89 1 = 0
endsAt89 89 = 1
endsAt89 n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = (n `mod` 10)^2 + sumOfSquareDigits (n `div` 10)
main = print . sum $ map endsAt89 [1..10^7]
On my crappy netbook is 1 min 13 sec. Let's see if we can improve that.
Since the numbers are small, we can start by using machine-sized Int
instead of arbitrary-size Integer
. This is just a matter of adding type signatures, e.g.
sumOfSquareDigits :: Int -> Int
That improves the run time drastically to 20 seconds.
Since the numbers are all positive, we can replace div
and mod
with the slightly faster quot
and rem
, or even both in one go with quotRem
:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = r^2 + sumOfSquareDigits q
where (q, r) = quotRem x 10
Run time is now 17 seconds. Making it tail recursive shaves off another second:
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
For further improvements, we can notice that sumOfSquareDigits
returns at most 567 = 7 * 9^2
for the given input numbers, so we can memoize for small numbers to reduce the number of iterations needed. Here's my final version (using the data-memocombinators package for the memoization):
{-# LANGUAGE BangPatterns #-}
import qualified Data.MemoCombinators as Memo
endsAt89 :: Int -> Int
endsAt89 = Memo.arrayRange (1, 7*9^2) endsAt89'
where
endsAt89' 1 = 0
endsAt89' 89 = 1
endsAt89' n = endsAt89 (sumOfSquareDigits n)
sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
where
loop 0 !s = s
loop n !s = loop q (s + r^2)
where (q, r) = quotRem n 10
main = print . sum $ map endsAt89 [1..10^7]
This runs in just under 9 seconds on my machine.
Upvotes: 9