JD Long
JD Long

Reputation: 60756

R Function for returning ALL factors

My normal search foo is failing me. I'm trying to find an R function that returns ALL of the factors of an integer. There are at least 2 packages with factorize() functions: gmp and conf.design, however these functions return only prime factors. I'd like a function that returns all factors.

Obviously searching for this is made difficult since R has a construct called factors which puts a lot of noise in the search.

Upvotes: 40

Views: 20897

Answers (7)

ThomasIsCoding
ThomasIsCoding

Reputation: 102625

With base R, you can define the following functions

  • If you want PRIME factors only
primeFactors <- function(n) {
    res <- c()
    k <- 2
    repeat {
        if (n %% k == 0) {
            res <- append(res, k)
            n <- n / k
        } else {
            k <- k + 1 + (k > 2)
        }
        if (n == 1) {
            return(res)
        }
    }
}
  • If you want ALL factors (including both negative and positive factors)
allFactors <- function(n) {
    v <- trunc(c(-sqrt(n):-1, 1:sqrt(n)))
    f <- v[n %% v == 0]
    unique(sort(c(f, n / f)))
}

Example

> n <- 6

> primeFactors(n)
[1] 2 3

> allFactors(n)
[1] -6 -3 -2 -1  1  2  3  6

and

> n <- 2459745082

> primeFactors(n)
[1]     2    43  1123 25469

> allFactors(n)
 [1] -2459745082 -1229872541   -57203374   -28601687    -2190334    -1095167
 [7]      -96578      -50938      -48289      -25469       -2246       -1123
[13]         -86         -43          -2          -1           1           2
[19]          43          86        1123        2246       25469       48289
[25]       50938       96578     1095167     2190334    28601687    57203374
[31]  1229872541  2459745082

Upvotes: 2

Joseph Wood
Joseph Wood

Reputation: 7608

A lot has changed in the R language since this question was originally asked. In version 0.6-3 of the numbers package, the function divisors was included that is very useful for getting all of the factors of a number. It will meet the needs of most users, however if you are looking for raw speed or you are working with larger numbers, you will need an alternative method. I have authored two packages (partially inspired by this question, I might add) that contain highly optimized functions aimed at problems just like this. The first one is RcppAlgos and the other is RcppBigIntAlgos (formerly called bigIntegerAlgos).

RcppAlgos

RcppAlgos contains two functions for obtaining divisors of numbers less than 2^53 - 1 : divisorsRcpp (a vectorized function for quickly obtaining the complete factorization of many numbers) & divisorsSieve (quickly generates the complete factorization over a range). First up, we factor many random numbers using divisorsRcpp:

library(gmp)  ## for all_divisors by @GeorgeDontas
library(RcppAlgos)
library(numbers)
options(scipen = 999)
set.seed(42)
testSamp <- sample(10^10, 10)

## vectorized so you can pass the entire vector as an argument
testRcpp <- divisorsRcpp(testSamp)
testDontas <- lapply(testSamp, all_divisors)

identical(lapply(testDontas, as.numeric), testRcpp)
#> [1] TRUE

And now, factor many numbers over a range using divisorsSieve:

identical(lapply(testDontas, as.numeric), testRcpp)
#> [1] TRUE

system.time(testSieve <- divisorsSieve(10^13, 10^13 + 10^5))
#>    user  system elapsed 
#>   0.064   0.008   0.072

system.time(testDontasSieve <- lapply((10^13):(10^13 + 10^5), all_divisors))
#>    user  system elapsed 
#>  27.145   0.126  27.274

identical(lapply(testDontasSieve, asNumeric), testSieve)
#> [1] TRUE

Both divisorsRcpp and divisorsSieve are nice functions that are flexible and efficient, however they are limited to 2^53 - 1.

RcppBigIntAlgos

The RcppBigIntAlgos package (formerly called bigIntegerAlgos prior to version 0.2.0) links directly to the C library gmp and features divisorsBig which is designed for very large numbers.

library(RcppBigIntAlgos)
#> 
#> Attaching package: 'RcppBigIntAlgos'
#> The following object is masked from 'package:RcppAlgos':
#> 
#>     stdThreadMax
## testSamp is defined above... N.B. divisorsBig is not quite as
## efficient as divisorsRcpp. This is so because divisorsRcpp
## can take advantage of more efficient data types.
testBig <- divisorsBig(testSamp)

identical(testDontas, testBig)
#> [1] TRUE

And here are the benchmark as defined in my original post (N.B. MyFactors is replaced by divisorsRcpp and divisorsBig).

library(rbenchmark)
set.seed(199)
samp <- sample(10^9, 10^5)
benchmark(RcppAlgos=divisorsRcpp(samp),
          RcppBigIntAlgos=divisorsBig(samp),
          DontasDivs=lapply(samp, all_divisors),
          replications=10,
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative")
#>              test replications elapsed relative
#> 1       RcppAlgos           10   1.680    1.000
#> 2 RcppBigIntAlgos           10   4.976    2.962
#> 3      DontasDivs           10 251.170  149.506

set.seed(97)
samp <- sample(10^6, 10^4)
benchmark(RcppAlgos=divisorsRcpp(samp),
          RcppBigIntAlgos=divisorsBig(samp),
          numbers=lapply(samp, divisors),      ## From the numbers package
          DontasDivs=lapply(samp, all_divisors),
          CottonDivs=lapply(samp, get_all_factors),
          ChaseDivs=lapply(samp, FUN),
          replications=5,
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative")
#>              test replications elapsed relative
#> 1       RcppAlgos            5   0.044    1.000
#> 2 RcppBigIntAlgos            5   0.123    2.795
#> 3         numbers            5   5.383  122.341
#> 4      DontasDivs            5   9.792  222.545
#> 5      CottonDivs            5  22.638  514.500
#> 6       ChaseDivs            5  99.635 2264.432

The next benchmarks demonstrate the true power of the underlying algorithm in the divisorsBig function. The number being factored is a power of 10, so the prime factoring step can almost be completely ignored (e.g. system.time(factorize(pow.bigz(10,30))) registers 0 on my machine). Thus, the difference in timing is due solely to how quickly the prime factors can be combined to produce all factors.

library(microbenchmark)
powTen <- pow.bigz(10, 30)

microbenchmark(
    algos  = divisorsBig(powTen),
    Dontas = all_divisors(powTen),
    unit   = "relative"
)
#> Unit: relative
#>    expr      min       lq     mean   median       uq      max  neval cld
#>   algos  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000    100  a
#>  Dontas 41.49166 39.63744 41.52777 42.59824 42.18948 56.24977    100   b


## Negative numbers show an even greater increase in efficiency
negPowTen <- powTen * -1

microbenchmark(
    algos  = divisorsBig(negPowTen),
    Dontas = all_divisors(negPowTen),
    unit   = "relative"
)
#> Unit: relative
#>    expr      min       lq     mean   median       uq      max  neval cld
#>   algos  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000    100  a 
#>  Dontas 56.99954 55.95423 56.15268 56.99724 56.42193 42.90249    100   b

Very Large Numbers

With divisorsBig, obtaining the complete factorization with very large inputs is no problem. The algorithm dynamically adjusts based off of the input and applies different algorithms in different situations. We can also take advantage of multithreading if Lenstra's Elliptic Curve method or the Quadratic Sieve is utilized.

Here are some examples using n5 and n9 defined in this answer.

n5 <- as.bigz("94968915845307373740134800567566911")
system.time(print(divisorsBig(n5)))
#> Big Integer ('bigz') object of length 4:
#> [1] 1                                   216366620575959221                 
#> [3] 438925910071081891                  94968915845307373740134800567566911
#>    user  system elapsed 
#>   0.086   0.002   0.088

n9 <- prod(nextprime(urand.bigz(2, 82, 42)))
#> Seed default initialisation
#> Seed initialisation
system.time(print(divisorsBig(n9, nThreads = 4)))
#> Big Integer ('bigz') object of length 4:
#> [1] 1                                                 
#> [2] 2128750292720207278230259                         
#> [3] 4721136619794898059404993                         
#> [4] 10050120961360479179164300841596861740399588283187
#>    user  system elapsed 
#>   0.921   0.010   0.383

Here is an example provided by @Dontas with one large prime and one smaller prime:

x <- pow.bigz(2, 256) + 1
divisorsBig(x, showStats = TRUE, nThreads = 8)
#> 
#> Summary Statistics for Factoring:
#>     115792089237316195423570985008687907853269984665640564039457584007913129639937
#> 
#> |  Pollard Rho Time  |
#> |--------------------|
#> |        320ms       |
#> 
#> |  Lenstra ECM Time  |  Number of Curves  |
#> |--------------------|--------------------|
#> |        929ms       |        2584        |
#> 
#> |     Total Time     |
#> |--------------------|
#> |      1s 249ms      |
#>
#> Big Integer ('bigz') object of length 4:
#> [1] 1                                                                             
#> [2] 1238926361552897                                                              
#> [3] 93461639715357977769163558199606896584051237541638188580280321                
#> [4] 115792089237316195423570985008687907853269984665640564039457584007913129639937

Compare this to finding the prime factorization using gmp::factorize:

system.time(factorize(x))
#>    user  system elapsed 
#>   6.393   0.021   6.414

Lastly, here is an example with a large semiprime (N.B. since we know it's a semiprime, we skip the extended Pollard's rho algorithm as well as Lentra's elliptic curve method).

## https://members.loria.fr/PZimmermann/records/rsa.html
rsa79 <- as.bigz("7293469445285646172092483905177589838606665884410340391954917800303813280275279")
divisorsBig(
    rsa79, nThreads = 8, showStats = TRUE,
    skipPolRho = TRUE, skipECM = TRUE
)
#> 
#> Summary Statistics for Factoring:
#>     7293469445285646172092483905177589838606665884410340391954917800303813280275279
#> 
#> |      MPQS Time     | Complete | Polynomials |   Smooths  |  Partials  |
#> |--------------------|----------|-------------|------------|------------|
#> |     1m 37s 26ms    |   100%   |    91221    |    5651    |    7096    |
#> 
#> |  Mat Algebra Time  |    Mat Dimension   |
#> |--------------------|--------------------|
#> |      5s 296ms      |    12625 x 12747   |
#> 
#> |     Total Time     |
#> |--------------------|
#> |    1m 42s 628ms    |
#>
#> Big Integer ('bigz') object of length 4:
#> [1] 1                                                                              
#> [2] 848184382919488993608481009313734808977                                        
#> [3] 8598919753958678882400042972133646037727                                       
#> [4] 7293469445285646172092483905177589838606665884410340391954917800303813280275279

Upvotes: 8

Joseph Wood
Joseph Wood

Reputation: 7608

Update

This is now implemented in the package RcppBigIntAlgos. See this answer for more details.

Original Post

The algorithm has been fully updated and now implements multiple polynomials as well as some clever sieving techniques that eliminates millions of checks. In addition to the original links, this paper along with this post from primo were very helpful for this last stage (many kudos to primo). Primo does a great job of explaining the guts of the QS in a relatively short space and also wrote a pretty amazing algorithm (it will factor the number at the bottom, 38! + 1, in under 2 secs!! Insane!!).

As promised, below is my humble R implementation of the Quadratic Sieve. I have been working on this algorithm sporadically since I promised it in late January. I will not try to explain it fully (unless requested... also, the links below do a very good job) as it is very complicated and hopefully, my function names speak for themselves. This has proved to be one of the most challenging algorithms I have ever attempted to execute as it is demanding both from a programmer's point of view as well as mathematically. I have read countless papers and ultimately, I found these five to be the most helpful (QSieve1, QSieve2, QSieve3, QSieve4, QSieve5).

N.B. This algorithm, as it stands, does not serve very well as a general prime factorization algorithm. If it was optimized further, it would need to be accompanied by a section of code that factors out smaller primes (i.e. less than 10^5 as suggested by this post), then call QuadSieveAll, check to see if these are primes, and if not, call QuadSieveAll on both of these factors, etc. until you are left with all primes (all of these steps are not that difficult). However, the main point of this post is to highlight the heart of the Quadratic Sieve, so the examples below are all semiprimes (even though it will factor most odd numbers not containing a square… Also, I haven’t seen an example of the QS that didn’t demonstrate a non-semiprime). I know the OP was looking for a method to return all factors and not the prime factorization, but this algorithm (if optimized further) coupled with one of the algorithms above would be a force to reckon with as a general factoring algorithm (especially given that the OP was needing something for Project Euler, which usually requires much more than brute force methods). By the way, the MyIntToBit function is a variation of this answer and the PrimeSieve is from a post that @Dontas appeared on a while back (Kudos on that as well).

QuadSieveMultiPolysAll <- function(MyN, fudge1=0L, fudge2=0L, LenB=0L) {
### 'MyN' is the number to be factored; 'fudge1' is an arbitrary number
### that is used to determine the size of your prime base for sieving;
### 'fudge2' is used to set a threshold for sieving;
### 'LenB' is a the size of the sieving interval. The last three
### arguments are optional (they are determined based off of the
### size of MyN if left blank)

### The first 8 functions are helper functions

    PrimeSieve <- function(n) {
        n <- as.integer(n)
        if (n > 1e9) stop("n too large")
        primes <- rep(TRUE, n)
        primes[1] <- FALSE
        last.prime <- 2L
        fsqr <- floor(sqrt(n))
        while (last.prime <= fsqr) {
            primes[seq.int(last.prime^2, n, last.prime)] <- FALSE
            sel <- which(primes[(last.prime + 1):(fsqr + 1)])
            if (any(sel)) {
                last.prime <- last.prime + min(sel)
            } else {
                last.prime <- fsqr + 1
            }
        }
        MyPs <- which(primes)
        rm(primes)
        gc()
        MyPs
    }

    MyIntToBit <- function(x, dig) {
        i <- 0L
        string <- numeric(dig)
        while (x > 0) {
            string[dig - i] <- x %% 2L
            x <- x %/% 2L
            i <- i + 1L
        }
        string
    }

    ExpBySquaringBig <- function(x, n, p) {
        if (n == 1) {
            MyAns <- mod.bigz(x,p)
        } else if (mod.bigz(n,2)==0) {
            MyAns <- ExpBySquaringBig(mod.bigz(pow.bigz(x,2),p),div.bigz(n,2),p)
        } else {
            MyAns <- mod.bigz(mul.bigz(x,ExpBySquaringBig(mod.bigz(
                pow.bigz(x,2),p), div.bigz(sub.bigz(n,1),2),p)),p)
        }
        MyAns
    }

    TonelliShanks <- function(a,p) {
        P1 <- sub.bigz(p,1); j <- 0L; s <- P1
        while (mod.bigz(s,2)==0L) {s <- s/2; j <- j+1L}
        if (j==1L) {
            MyAns1 <- ExpBySquaringBig(a,(p+1L)/4,p)
            MyAns2 <- mod.bigz(-1 * ExpBySquaringBig(a,(p+1L)/4,p),p)
        } else {
            n <- 2L
            Legendre2 <- ExpBySquaringBig(n,P1/2,p)
            while (Legendre2==1L) {n <- n+1L; Legendre2 <- ExpBySquaringBig(n,P1/2,p)}
            x <- ExpBySquaringBig(a,(s+1L)/2,p)
            b <- ExpBySquaringBig(a,s,p)
            g <- ExpBySquaringBig(n,s,p)
            r <- j; m <- 1L
            Test <- mod.bigz(b,p)
            while (!(Test==1L) && !(m==0L)) {
                m <- 0L
                Test <- mod.bigz(b,p)
                while (!(Test==1L)) {m <- m+1L; Test <- ExpBySquaringBig(b,pow.bigz(2,m),p)}
                if (!m==0) {
                    x <- mod.bigz(x * ExpBySquaringBig(g,pow.bigz(2,r-m-1L),p),p)
                    g <- ExpBySquaringBig(g,pow.bigz(2,r-m),p)
                    b <- mod.bigz(b*g,p); r <- m
                }; Test <- 0L
            }; MyAns1 <- x; MyAns2 <- mod.bigz(p-x,p)
        }
        c(MyAns1, MyAns2)
    }

    SieveLists <- function(facLim, FBase, vecLen, sieveD, MInt) {
        vLen <- ceiling(vecLen/2); SecondHalf <- (vLen+1L):vecLen
        MInt1 <- MInt[1:vLen]; MInt2 <- MInt[SecondHalf]
        tl <- vector("list",length=facLim)
        
        for (m in 3:facLim) {
            st1 <- mod.bigz(MInt1[1],FBase[m])
            m1 <- 1L+as.integer(mod.bigz(sieveD[[m]][1] - st1,FBase[m]))
            m2 <- 1L+as.integer(mod.bigz(sieveD[[m]][2] - st1,FBase[m]))
            sl1 <- seq.int(m1,vLen,FBase[m])
            sl2 <- seq.int(m2,vLen,FBase[m])
            tl1 <- list(sl1,sl2)
            st2 <- mod.bigz(MInt2[1],FBase[m])
            m3 <- vLen+1L+as.integer(mod.bigz(sieveD[[m]][1] - st2,FBase[m]))
            m4 <- vLen+1L+as.integer(mod.bigz(sieveD[[m]][2] - st2,FBase[m]))
            sl3 <- seq.int(m3,vecLen,FBase[m])
            sl4 <- seq.int(m4,vecLen,FBase[m])
            tl2 <- list(sl3,sl4)
            tl[[m]] <- list(tl1,tl2)
        }
        tl
    }

    SieverMod <- function(facLim, FBase, vecLen, SD, MInt, FList, LogFB, Lim, myCol) {
        
        MyLogs <- rep(0,nrow(SD))
        
        for (m in 3:facLim) {
            MyBool <- rep(FALSE,vecLen)
            MyBool[c(FList[[m]][[1]][[1]],FList[[m]][[2]][[1]])] <- TRUE
            MyBool[c(FList[[m]][[1]][[2]],FList[[m]][[2]][[2]])] <- TRUE
            temp <- which(MyBool)
            MyLogs[temp] <- MyLogs[temp] + LogFB[m]
        }
        
        MySieve <- which(MyLogs > Lim)
        MInt <- MInt[MySieve]; NewSD <- SD[MySieve,]
        newLen <- length(MySieve); GoForIT <- FALSE
        
        MyMat <- matrix(integer(0),nrow=newLen,ncol=myCol)
        MyMat[which(NewSD[,1L] < 0),1L] <- 1L; MyMat[which(NewSD[,1L] > 0),1L] <- 0L
        if ((myCol-1L) - (facLim+1L) > 0L) {MyMat[,((facLim+2L):(myCol-1L))] <- 0L}
        if (newLen==1L) {MyMat <- matrix(MyMat,nrow=1,byrow=TRUE)}
        
        if (newLen > 0L) {
            GoForIT <- TRUE
            for (m in 1:facLim) {
                vec <- rep(0L,newLen)
                temp <- which((NewSD[,1L]%%FBase[m])==0L)
                NewSD[temp,] <- NewSD[temp,]/FBase[m]; vec[temp] <- 1L
                test <- temp[which((NewSD[temp,]%%FBase[m])==0L)]
                while (length(test)>0L) {
                    NewSD[test,] <- NewSD[test,]/FBase[m]
                    vec[test] <- (vec[test]+1L)
                    test <- test[which((NewSD[test,]%%FBase[m])==0L)]
                }
                MyMat[,m+1L] <- vec
            }
        }
        
        list(MyMat,NewSD,MInt,GoForIT)
    }

    reduceMatrix <- function(mat) {
        tempMin <- 0L; n1 <- ncol(mat); n2 <- nrow(mat)
        mymax <- 1L
        for (i in 1:n1) {
            temp <- which(mat[,i]==1L)
            t <- which(temp >= mymax)
            if (length(temp)>0L && length(t)>0L) {
                MyMin <- min(temp[t])
                if (!(MyMin==mymax)) {
                    vec <- mat[MyMin,]
                    mat[MyMin,] <- mat[mymax,]
                    mat[mymax,] <- vec
                }
                t <- t[-1]; temp <- temp[t]
                for (j in temp) {mat[j,] <- (mat[j,]+mat[mymax,])%%2L}
                mymax <- mymax+1L
            }
        }
        
        if (mymax<n2) {simpMat <- mat[-(mymax:n2),]} else {simpMat <- mat}
        lenSimp <- nrow(simpMat)
        if (is.null(lenSimp)) {lenSimp <- 0L}
        mycols <- 1:n1
        
        if (lenSimp>1L) {
            ## "Diagonalizing" Matrix
            for (i in 1:lenSimp) {
                if (all(simpMat[i,]==0L)) {simpMat <- simpMat[-i,]; next}
                if (!simpMat[i,i]==1L) {
                    t <- min(which(simpMat[i,]==1L))
                    vec <- simpMat[,i]; tempCol <- mycols[i]
                    simpMat[,i] <- simpMat[,t]; mycols[i] <- mycols[t]
                    simpMat[,t] <- vec; mycols[t] <- tempCol
                }
            }
            
            lenSimp <- nrow(simpMat); MyList <- vector("list",length=n1)
            MyFree <- mycols[which((1:n1)>lenSimp)];  for (i in MyFree) {MyList[[i]] <- i}
            if (is.null(lenSimp)) {lenSimp <- 0L}
            
            if (lenSimp>1L) {
                for (i in lenSimp:1L) {
                    t <- which(simpMat[i,]==1L)
                    if (length(t)==1L) {
                        simpMat[ ,t] <- 0L
                        MyList[[mycols[i]]] <- 0L
                    } else {
                        t1 <- t[t>i]
                        if (all(t1 > lenSimp)) {
                            MyList[[mycols[i]]] <- MyList[[mycols[t1[1]]]]
                            if (length(t1)>1) {
                                for (j in 2:length(t1)) {MyList[[mycols[i]]] <- c(MyList[[mycols[i]]], MyList[[mycols[t1[j]]]])}
                            }
                        }
                        else {
                            for (j in t1) {
                                if (length(MyList[[mycols[i]]])==0L) {MyList[[mycols[i]]] <- MyList[[mycols[j]]]}
                                else {
                                    e1 <- which(MyList[[mycols[i]]]%in%MyList[[mycols[j]]])
                                    if (length(e1)==0) {
                                        MyList[[mycols[i]]] <- c(MyList[[mycols[i]]],MyList[[mycols[j]]])
                                    } else {
                                        e2 <- which(!MyList[[mycols[j]]]%in%MyList[[mycols[i]]])
                                        MyList[[mycols[i]]] <- MyList[[mycols[i]]][-e1]
                                        if (length(e2)>0L) {MyList[[mycols[i]]] <- c(MyList[[mycols[i]]], MyList[[mycols[j]]][e2])}
                                    }
                                }
                            }
                        }
                    }
                }
                TheList <- lapply(MyList, function(x) {if (length(x)==0L) {0} else {x}})
                list(TheList,MyFree)
            } else {
                list(NULL,NULL)
            }
        } else {
            list(NULL,NULL)
        }
    }

    GetFacs <- function(vec1, vec2, n) {
        x <- mod.bigz(prod.bigz(vec1),n)
        y <- mod.bigz(prod.bigz(vec2),n)
        MyAns <- c(gcd.bigz(x-y,n),gcd.bigz(x+y,n))
        MyAns[sort.list(asNumeric(MyAns))]
    }

    SolutionSearch <- function(mymat, M2, n, FB) {
        
        colTest <- which(apply(mymat, 2, sum) == 0)
        if (length(colTest) > 0) {solmat <- mymat[ ,-colTest]} else {solmat <- mymat}
        
        if (length(nrow(solmat)) > 0) {
            nullMat <- reduceMatrix(t(solmat %% 2L))
            listSol <- nullMat[[1]]; freeVar <- nullMat[[2]]; LF <- length(freeVar)
        } else {LF <- 0L}
        
        if (LF > 0L) {
            for (i in 2:min(10^8,(2^LF + 1L))) {
                PosAns <- MyIntToBit(i, LF)
                posVec <- sapply(listSol, function(x) {
                    t <- which(freeVar %in% x)
                    if (length(t)==0L) {
                        0
                    } else {
                        sum(PosAns[t])%%2L
                    }
                })
                ansVec <- which(posVec==1L)
                if (length(ansVec)>0) {
                    
                    if (length(ansVec) > 1L) {
                        myY <- apply(mymat[ansVec,],2,sum)
                    } else {
                        myY <- mymat[ansVec,]
                    }
                    
                    if (sum(myY %% 2) < 1) {
                        myY <- as.integer(myY/2)
                        myY <- pow.bigz(FB,myY[-1])
                        temp <- GetFacs(M2[ansVec], myY, n)
                        if (!(1==temp[1]) && !(1==temp[2])) {
                            return(temp)
                        }
                    }
                }
            }
        }
    }
    
### Below is the main portion of the Quadratic Sieve

    BegTime <- Sys.time(); MyNum <- as.bigz(MyN); DigCount <- nchar(as.character(MyN))
    P <- PrimeSieve(10^5)
    SqrtInt <- .mpfr2bigz(trunc(sqrt(mpfr(MyNum,sizeinbase(MyNum,b=2)+5L))))
    
    if (DigCount < 24) {
        DigSize <- c(4,10,15,20,23)
        f_Pos <- c(0.5,0.25,0.15,0.1,0.05)
        MSize <- c(5000,7000,10000,12500,15000)
        
        if (fudge1==0L) {
            LM1 <- lm(f_Pos ~ DigSize)
            m1 <- summary(LM1)$coefficients[2,1]
            b1 <- summary(LM1)$coefficients[1,1]
            fudge1 <- DigCount*m1 + b1
        }
        
        if (LenB==0L) {
            LM2 <- lm(MSize ~ DigSize)
            m2 <- summary(LM2)$coefficients[2,1]
            b2 <- summary(LM2)$coefficients[1,1]
            LenB <- ceiling(DigCount*m2 + b2)
        }
        
        LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
        B <- P[P<=LimB]; B <- B[-1]
        facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
        LenFBase <- length(facBase)+1L
    } else if (DigCount < 67) {
        ## These values were obtained from "The Multiple Polynomial
        ## Quadratic Sieve" by Robert D. Silverman
        DigSize <- c(24,30,36,42,48,54,60,66)
        FBSize <- c(100,200,400,900,1200,2000,3000,4500)
        MSize <- c(5,25,25,50,100,250,350,500)
        
        LM1 <- loess(FBSize ~ DigSize)
        LM2 <- loess(MSize ~ DigSize)
        
        if (fudge1==0L) {
            fudge1 <- -0.4
            LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
            myTarget <- ceiling(predict(LM1, DigCount))
            
            while (LimB < myTarget) {
                LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
                fudge1 <- fudge1+0.001
            }
            B <- P[P<=LimB]; B <- B[-1]
            facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
            LenFBase <- length(facBase)+1L
            
            while (LenFBase < myTarget) {
                fudge1 <- fudge1+0.005
                LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
                myind <- which(P==max(B))+1L
                myset <- tempP <- P[myind]
                while (tempP < LimB) {
                    myind <- myind + 1L
                    tempP <- P[myind]
                    myset <- c(myset, tempP)
                }
                
                for (p in myset) {
                    t <- ExpBySquaringBig(MyNum,(p-1)/2,p)==1L
                    if (t) {facBase <- c(facBase,p)}
                }
                B <- c(B, myset)
                LenFBase <- length(facBase)+1L
            }
        } else {
            LimB <- trunc(exp((.5+fudge1)*sqrt(log(MyNum)*log(log(MyNum)))))
            B <- P[P<=LimB]; B <- B[-1]
            facBase <- P[which(sapply(B, function(x) ExpBySquaringBig(MyNum,(x-1)/2,x)==1L))+1L]
            LenFBase <- length(facBase)+1L
        }
        if (LenB==0L) {LenB <- 1000*ceiling(predict(LM2, DigCount))}
    } else {
        return("The number you've entered is currently too big for this algorithm!!")
    }
    
    SieveDist <- lapply(facBase, function(x) TonelliShanks(MyNum,x))
    SieveDist <- c(1L,SieveDist); SieveDist[[1]] <- c(SieveDist[[1]],1L); facBase <- c(2L,facBase)
    Lower <- -LenB; Upper <- LenB; LenB2 <- 2*LenB+1L; MyInterval <- Lower:Upper
    M <- MyInterval + SqrtInt ## Set that will be tested
    SqrDiff <- matrix(sub.bigz(pow.bigz(M,2),MyNum),nrow=length(M),ncol=1L)
    maxM <- max(MyInterval)
    LnFB <- log(facBase)
    
    ## N.B. primo uses 0.735, as his siever
    ## is more efficient than the one employed here
    if (fudge2==0L) {
        if (DigCount < 8) {
            fudge2 <- 0
        } else if (DigCount < 12) {
            fudge2 <- .7
        } else if (DigCount < 20) {
            fudge2 <- 1.3
        } else {
            fudge2 <- 1.6
        }
    }
    
    TheCut <- log10(maxM*sqrt(2*asNumeric(MyNum)))*fudge2
    myPrimes <- as.bigz(facBase)
    
    CoolList <- SieveLists(LenFBase, facBase, LenB2, SieveDist, MyInterval)
    GetMatrix <- SieverMod(LenFBase, facBase, LenB2, SqrDiff, M, CoolList, LnFB, TheCut, LenFBase+1L)
    
    if (GetMatrix[[4]]) {
        newmat <- GetMatrix[[1]]; NewSD <- GetMatrix[[2]]; M <- GetMatrix[[3]]
        NonSplitFacs <- which(abs(NewSD[,1L])>1L)
        newmat <- newmat[-NonSplitFacs, ]
        M <- M[-NonSplitFacs]
        lenM <- length(M)
        
        if (class(newmat) == "matrix") {
            if (nrow(newmat) > 0) {
                PosAns <- SolutionSearch(newmat,M,MyNum,myPrimes)
            } else {
                PosAns <- vector()
            }
        } else {
            newmat <- matrix(newmat, nrow = 1)
            PosAns <- vector()
        }
    } else {
        newmat <- matrix(integer(0),ncol=(LenFBase+1L))
        PosAns <- vector()
    }
    
    Atemp <- .mpfr2bigz(trunc(sqrt(sqrt(mpfr(2*MyNum))/maxM)))
    if (Atemp < max(facBase)) {Atemp <- max(facBase)}; myPoly <- 0L
    
    while (length(PosAns)==0L) {LegTest <- TRUE
        while (LegTest) {
            Atemp <- nextprime(Atemp)
            Legendre <- asNumeric(ExpBySquaringBig(MyNum,(Atemp-1L)/2,Atemp))
            if (Legendre == 1) {LegTest <- FALSE}
        }
    
        A <- Atemp^2
        Btemp <- max(TonelliShanks(MyNum, Atemp))
        B2 <- (Btemp + (MyNum - Btemp^2) * inv.bigz(2*Btemp,Atemp))%%A
        C <- as.bigz((B2^2 - MyNum)/A)
        myPoly <- myPoly + 1L
    
        polySieveD <- lapply(1:LenFBase, function(x) {
            AInv <- inv.bigz(A,facBase[x])
            asNumeric(c(((SieveDist[[x]][1]-B2)*AInv)%%facBase[x],
                        ((SieveDist[[x]][2]-B2)*AInv)%%facBase[x]))
        })
    
        M1 <- A*MyInterval + B2
        SqrDiff <- matrix(A*pow.bigz(MyInterval,2) + 2*B2*MyInterval + C,nrow=length(M1),ncol=1L)
        CoolList <- SieveLists(LenFBase, facBase, LenB2, polySieveD, MyInterval)
        myPrimes <- c(myPrimes,Atemp)
        LenP <- length(myPrimes)
        GetMatrix <- SieverMod(LenFBase, facBase, LenB2, SqrDiff, M1, CoolList, LnFB, TheCut, LenP+1L)
    
        if (GetMatrix[[4]]) {
            n2mat <- GetMatrix[[1]]; N2SD <- GetMatrix[[2]]; M1 <- GetMatrix[[3]]
            n2mat[,LenP+1L] <- rep(2L,nrow(N2SD))
            if (length(N2SD) > 0) {NonSplitFacs <- which(abs(N2SD[,1L])>1L)} else {NonSplitFacs <- LenB2}
            if (length(NonSplitFacs)<2*LenB) {
                M1 <- M1[-NonSplitFacs]; lenM1 <- length(M1)
                n2mat <- n2mat[-NonSplitFacs,]
                if (lenM1==1L) {n2mat <- matrix(n2mat,nrow=1)}
                if (ncol(newmat) < (LenP+1L)) {
                    numCol <- (LenP + 1L) - ncol(newmat)
                    newmat <-     cbind(newmat,matrix(rep(0L,numCol*nrow(newmat)),ncol=numCol))
                }
                newmat <- rbind(newmat,n2mat); lenM <- lenM+lenM1; M <- c(M,M1)
                if (class(newmat) == "matrix") {
                    if (nrow(newmat) > 0) {
                        PosAns <- SolutionSearch(newmat,M,MyNum,myPrimes)
                    }
                }
            }
        }
    }
    
    EndTime <- Sys.time()
    TotTime <- EndTime - BegTime
    print(format(TotTime))
    return(PosAns)
}

With Old QS algorithm

> library(gmp)
> library(Rmpfr)

> n3 <- prod(nextprime(urand.bigz(2, 40, 17)))
> system.time(t5 <- QuadSieveAll(n3,0.1,myps))
  user  system elapsed 
164.72    0.77  165.63 
> system.time(t6 <- factorize(n3))
user  system elapsed 
0.1     0.0     0.1 
> all(t5[sort.list(asNumeric(t5))]==t6[sort.list(asNumeric(t6))])
[1] TRUE

With New Muli-Polynomial QS algorithm

> QuadSieveMultiPolysAll(n3)
[1] "4.952 secs"
Big Integer ('bigz') object of length 2:
[1] 342086446909 483830424611

> n4 <- prod(nextprime(urand.bigz(2,50,5)))
> QuadSieveMultiPolysAll(n4)   ## With old algo, it took over 4 hours
[1] "1.131717 mins"
Big Integer ('bigz') object of length 2:
[1] 166543958545561 880194119571287

> n5 <- as.bigz("94968915845307373740134800567566911")   ## 35 digits
> QuadSieveMultiPolysAll(n5)
[1] "3.813167 mins"
Big Integer ('bigz') object of length 2:
[1] 216366620575959221 438925910071081891

> system.time(factorize(n5))   ## It appears we are reaching the limits of factorize
   user  system elapsed 
 131.97    0.00  131.98

Side note: The number n5 above is a very interesting number. Check it out here

The Breaking Point!!!!

> n6 <- factorialZ(38) + 1L   ## 45 digits
> QuadSieveMultiPolysAll(n6)
[1] "22.79092 mins"
Big Integer ('bigz') object of length 2:
[1] 14029308060317546154181 37280713718589679646221

> system.time(factorize(n6))   ## Shut it down after 2 days of running

Latest Triumph (50 digits)

> n9 <- prod(nextprime(urand.bigz(2,82,42)))
> QuadSieveMultiPolysAll(n9)
[1] "12.9297 hours"
Big Integer ('bigz') object of length 2:
[1] 2128750292720207278230259 4721136619794898059404993

## Based off of some crude test, factorize(n9) would take more than a year.

It should be noted that the QS generally doesn't perform as well as the Pollard's rho algorithm on smaller numbers and the power of the QS starts to become apparent as the numbers get larger.

Upvotes: 11

Joseph Wood
Joseph Wood

Reputation: 7608

Major Update

Below is my latest R factorization algorithm. It is way faster and pays homage to the rle function.

Algorithm 3 (Updated)

library(gmp)
MyFactors <- function(MyN) {
    myRle <- function (x1) {
        n1 <- length(x1)
        y1 <- x1[-1L] != x1[-n1]
        i <- c(which(y1), n1)
        list(lengths = diff(c(0L, i)), values = x1[i], uni = sum(y1)+1L)
    }

    if (MyN==1L) return(MyN)
    else {
        pfacs <- myRle(factorize(MyN))
        unip <- pfacs$values
        pv <- pfacs$lengths
        n <- pfacs$uni
        myf <- unip[1L]^(0L:pv[1L])
        if (n > 1L) {
            for (j in 2L:n) {
                myf <- c(myf, do.call(c,lapply(unip[j]^(1L:pv[j]), function(x) x*myf)))
            }
        }
    }
    myf[order(asNumeric(myf))]  ## 'order' is faster than 'sort.list'
}

Below are the new benchmarks (As Dirk Eddelbuettel says here, "Can't argue with empirics."):

Case 1 (large prime factors)

set.seed(100)
myList <- lapply(1:10^3, function(x) sample(10^6, 10^5))
benchmark(SortList=lapply(myList, function(x) sort.list(x)),
            OrderFun=lapply(myList, function(x) order(x)),
            replications=3,
            columns = c("test", "replications", "elapsed", "relative"))
      test replications elapsed relative
2 OrderFun            3   59.41    1.000
1 SortList            3   61.52    1.036

## The times are limited by "gmp::factorize" and since it relies on
## pseudo-random numbers, the times can vary (i.e. one pseudo random
## number may lead to a factorization faster than others). With this
## in mind, any differences less than a half of second
## (or so) should be viewed as the same. 
x <- pow.bigz(2,256)+1
system.time(z1 <- MyFactors(x))
user  system elapsed
14.94    0.00   14.94
system.time(z2 <- all_divisors(x))      ## system.time(factorize(x))
user  system elapsed                    ##  user  system elapsed
14.94    0.00   14.96                   ## 14.94    0.00   14.94 
all(z1==z2)
[1] TRUE

x <- as.bigz("12345678987654321321")
system.time(x1 <- MyFactors(x^2))
user  system elapsed 
20.66    0.02   20.71
system.time(x2 <- all_divisors(x^2))    ## system.time(factorize(x^2))
user  system elapsed                    ##  user  system elapsed
20.69    0.00   20.69                   ## 20.67    0.00   20.67
all(x1==x2)
[1] TRUE

Case 2 (smaller numbers)

set.seed(199)
samp <- sample(10^9, 10^5)
benchmark(JosephDivs=sapply(samp, MyFactors),
            DontasDivs=sapply(samp, all_divisors),
            OldDontas=sapply(samp, Oldall_divisors),
            replications=10,
            columns = c("test", "replications", "elapsed", "relative"),
            order = "relative")
        test replications elapsed relative
1 JosephDivs           10  470.31    1.000
2 DontasDivs           10  567.10    1.206  ## with vapply(..., USE.NAMES = FALSE)
3  OldDontas           10  626.19    1.331  ## with sapply

Case 3 (for complete thoroughness)

set.seed(97)
samp <- sample(10^6, 10^4)
benchmark(JosephDivs=sapply(samp, MyFactors),
            DontasDivs=sapply(samp, all_divisors),
            CottonDivs=sapply(samp, get_all_factors),
            ChaseDivs=sapply(samp, FUN),
            replications=5,
            columns = c("test", "replications", "elapsed", "relative"),
            order = "relative")
        test replications elapsed relative
1 JosephDivs            5   22.68    1.000
2 DontasDivs            5   27.66    1.220
3 CottonDivs            5  126.66    5.585
4  ChaseDivs            5  554.25   24.438


Original Post

The algorithm by @RichieCotton is a very nice R implementation. The brute force method will only get you so far and fails with large numbers. I have provided three algorithms that will meet different needs. The first one (is the original algorithm I posted in Jan 15 and has been updated slightly), is a stand-alone factorization algorithm which offers a combinatorial approach that is efficient, accurate, and can be easily translated into other languages. The second algorithm is more of a sieve that is very fast and extremely useful when you need the factorization of thousands of numbers quickly. The third is a short (posted above), yet powerful stand-alone algorithm that is superior for any number less than 2^70 (I scrapped almost everything from my original code). I drew inspiration from Richie Cotton's use of the plyr::count function (it inspired me to write my own rle function that has a very similar return as plyr::count), George Dontas's clean way of handling the trivial case (i.e. if (n==1) return(1)), and the solution provided by @Zelazny7 to a question I had regarding bigz vectors.

Algorithm 1 (original)

library(gmp)
factor2 <- function(MyN) {
    if (MyN == 1) return(1L)
    else {
        max_p_div <- factorize(MyN)
        prime_vec <- max_p_div <- max_p_div[sort.list(asNumeric(max_p_div))]
        my_factors <- powers <- as.bigz(vector())
        uni_p <- unique(prime_vec); maxp <- max(prime_vec)
        for (i in 1:length(uni_p)) {
            temp_size <- length(which(prime_vec == uni_p[i]))
            powers <- c(powers, pow.bigz(uni_p[i], 1:temp_size))
        }
        my_factors <- c(as.bigz(1L), my_factors, powers)
        temp_facs <- powers; r <- 2L
        temp_facs2 <- max_p_div2 <- as.bigz(vector())
        while (r <= length(uni_p)) {
            for (i in 1:length(temp_facs)) {
                a <- which(prime_vec >  max_p_div[i])
                temp <- mul.bigz(temp_facs[i], powers[a])
                temp_facs2 <- c(temp_facs2, temp)
                max_p_div2 <- c(max_p_div2, prime_vec[a])
            }
            my_sort <- sort.list(asNumeric(max_p_div2))
            temp_facs <- temp_facs2[my_sort]
            max_p_div <- max_p_div2[my_sort]
            my_factors <- c(my_factors, temp_facs)
            temp_facs2 <- max_p_div2 <- as.bigz(vector()); r <- r+1L
        }
    }
    my_factors[sort.list(asNumeric(my_factors))]
}

Algorithm 2 (sieve)

EfficientFactorList <- function(n) {
    MyFactsList <- lapply(1:n, function(x) 1)
    for (j in 2:n) {
        for (r in seq.int(j, n, j)) {MyFactsList[[r]] <- c(MyFactsList[[r]], j)}
    }; MyFactsList}

It gives the factorization of every number between 1 and 100,000 in less than 2 seconds. To give you an idea of the efficiency of this algorithm, the time to factor 1 - 100,000 using the brute force method takes close to 3 minutes.

system.time(t1 <- EfficientFactorList(10^5))
user  system elapsed 
1.04    0.00    1.05 
system.time(t2 <- sapply(1:10^5, MyFactors))
user  system elapsed 
39.21    0.00   39.23 
system.time(t3 <- sapply(1:10^5, all_divisors))
user  system elapsed 
49.03    0.02   49.05

TheTest <- sapply(1:10^5, function(x) all(t2[[x]]==t3[[x]]) && all(asNumeric(t2[[x]])==t1[[x]]) && all(asNumeric(t3[[x]])==t1[[x]]))
all(TheTest)
[1] TRUE



Final Thoughts

@Dontas’s original comment about factoring large numbers got me thinking, what about really really large numbers… like numbers greater than 2^200. You will see that whichever algorithm you choose on this page, they will all take a very long time because most of them rely on gmp::factorize which uses the Pollard-Rho algorithm. From this question, this algorithm is only reasonable for numbers less than 2^70. I am currently working on my own factorize algorithm which will implement the Quadratic Sieve, which should take all of these algorithms to the next level.

Upvotes: 8

Yorgos
Yorgos

Reputation: 30485

The following approach deliver correct results, even in cases of really big numbers (which should be passed as strings). And it's really fast.

# TEST
# x <- as.bigz("12345678987654321")
# all_divisors(x)
# all_divisors(x*x)

# x <- pow.bigz(2,89)-1
# all_divisors(x)

library(gmp)
  options(scipen =30)

  sort_listz <- function(z) {
  #==========================
    z <- z[order(as.numeric(z))] # sort(z)
  } # function  sort_listz  


  mult_listz <- function(x,y) {
   do.call('c', lapply(y, function(i) i*x)) 
  } 


  all_divisors <- function(x) {
  #==========================  
  if (abs(x)<=1) return(x) 
  else {

    factorsz <- as.bigz(factorize(as.bigz(x))) # factorize returns up to
    # e.g. x= 12345678987654321  factors: 3 3 3 3 37 37 333667 333667

    factorsz <- sort_listz(factorsz) # vector of primes, sorted

    prime_factorsz <- unique(factorsz)
    #prime_ekt <- sapply(prime_factorsz, function(i) length( factorsz [factorsz==i]))
    prime_ekt <- vapply(prime_factorsz, function(i) sum(factorsz==i), integer(1), USE.NAMES=FALSE)
    spz <- vector() # keep all divisors 
    all <-1
    n <- length(prime_factorsz)
    for (i in 1:n) {
      pr <- prime_factorsz[i]
      pe <- prime_ekt[i]
      all <- all*(pe+1) #counts all divisors 

      prz <- as.bigz(pr)
      pse <- vector(mode="raw",length=pe+1) 
      pse <- c( as.bigz(1), prz)

      if (pe>1) {
        for (k in 2:pe) {
          prz <- prz*pr
          pse[k+1] <- prz
        } # for k
      } # if pe>1

      if (i>1) {
       spz <- mult_listz (spz, pse)         
      } else {
       spz <- pse;
      } # if i>1
    } #for n
    spz <- sort_listz (spz)

    return (spz)
  }  
  } # function  factors_all_divisors  

  #====================================

Refined version, very fast. Code remains simple, readable & clean.

TEST

#Test 4 (big prime factor)
x <- pow.bigz(2,256)+1 # = 1238926361552897 * 93461639715357977769163558199606896584051237541638188580280321
 system.time(z2 <- all_divisors(x))
#   user  system elapsed 
 #  19.27    1.27   20.56


 #Test 5 (big prime factor)
x <- as.bigz("12345678987654321321") # = 3 * 19 * 216590859432531953

 system.time(x2 <- all_divisors(x^2))
#user  system elapsed 
 #25.65    0.00   25.67  

Upvotes: 7

Richie Cotton
Richie Cotton

Reputation: 121157

You can get all factors from the prime factors. gmp calculates these very quickly.

library(gmp)
library(plyr)

get_all_factors <- function(n)
{
  prime_factor_tables <- lapply(
    setNames(n, n), 
    function(i)
    {
      if(i == 1) return(data.frame(x = 1L, freq = 1L))
      plyr::count(as.integer(gmp::factorize(i)))
    }
  )
  lapply(
    prime_factor_tables, 
    function(pft)
    {
      powers <- plyr::alply(pft, 1, function(row) row$x ^ seq.int(0L, row$freq))
      power_grid <- do.call(expand.grid, powers)
      sort(unique(apply(power_grid, 1, prod)))
    }
  )
}

get_all_factors(c(1, 7, 60, 663, 2520, 75600, 15876000, 174636000, 403409160000))

Upvotes: 15

Chase
Chase

Reputation: 69221

To follow up on my comment (thanks to @Ramnath for my typo), the brute force method seems to work reasonably well here on my 64 bit 8 gig machine:

FUN <- function(x) {
    x <- as.integer(x)
    div <- seq_len(abs(x))
    factors <- div[x %% div == 0L]
    factors <- list(neg = -factors, pos = factors)
    return(factors)
}

A few examples:

> FUN(100)
$neg
[1]   -1   -2   -4   -5  -10  -20  -25  -50 -100

$pos
[1]   1   2   4   5  10  20  25  50 100

> FUN(-42)
$neg
[1]  -1  -2  -3  -6  -7 -14 -21 -42

$pos
[1]  1  2  3  6  7 14 21 42

#and big number

> system.time(FUN(1e8))
   user  system elapsed 
   1.95    0.18    2.14 

Upvotes: 27

Related Questions