Monday, March 16, 2015

Eratosthenes sieve - generating primes in a range





When we want to find all prime numbers up to some integer N algorithm called Eratosthenes sieve comes into play. Why? Well for one it's super fast. It runs in time O(n*log(log(n))) and O(n) space. But before I explain the algorithm and show some implementation in c/c++ I would like to discuss some simple observations in generating prime numbers.

By definition 2 is a prime.
Number k is a prime if k is not divisible by any number (but 1) up to k-1 or gcd(k, xi)=1 where xi is in range [2,k-1] .

So in order to check weather some number k is prime, we need to make sure that all numbers up to and including k-1 do not divide k.

case 1: -- (1)

Bool isPrime(k){ if(k==1) return false;
For(i=2; i < k; i++)
  If(k % i == 0)
    Return false;
Return true;
}



For example k = 11
i = 2:  11 mod 2 = 1 but 1 != 0
i = 3:  11 mod 3 = 2 but 2 != 0
.
.
.
i = 10: 11 mod 10 != 0, since this is last i in our loop, our function returns true. Which is correct. 11 is a prime number.

Let's try for some number which is not prime, let try for 4.
k = 4
i = 2: 4 mod 2 = 0, and function returns false, so for is not prime.
Let's optimize our function notice the following:

For k = 18, let'.s factor it.
18 = 2 * 9
18 = 3 * 6
18 = 6 * 3
18 = 9 * 2

Notice that after some point by commutation of product, pattern repeats. Let's check after what number as a function of k we can stop checking and be sure we've checked enough to claim no more checking is needed.

For k = 18, i has to go up to and including 3, which seems to be floor(sqrt(k)) - 1 but let's see what happens when k is square of prime number.

For k = 49
49 = 7*7, 7 is a prime, so iterator i has to be 7 in order to conclude that 49 is composite.

Now we see that in cases when k is perfect square our iterator i has to go up to and including sqrt(k).

So range we have to check is while i<=sqrt(k).

This is nice improvement for identifying whether some k is prime.
Our loop in conclusion becomes:

for(i=2; i <= sqrt(k); i++)              -- (2)
.
.
.

Of course we can optimize further, none of the even numbers can be prime, so we don't need to check them at all.
So our loop can start at 3 and go up to sqrt(k) in step of 2. Like this:

for(i=3; i <= sqrt(k); i+=2)               -- (3)

Our implementation of function for primality test can be optimized even further by the fact that any prime number can be of form (6k+1) k = -1,1,2,3,4 this optimization would optimize primality test by a factor of 3. In any case purpose of this text is to take another approach that is sieving primes up to some range, so now will look into idea of Eratosthens sieve.

However before we do that, we should note something about time complexity of stated functions.

Let's analyse time complexity:

Note that analysis of time complexity when dealing with prime numbers needs to take into account size of the input. Analysis we present is dependent of sole value of k or n, that is assumption is that modulo, division operations can be done in O(1) time which is not true for very large numbers. 

This should be taken with reserve. Since assumption in not true for when dealing with big integers! 

When checking if some number k is prime in first case we need to check from 2 to k-1 so time depends how large k is. And it depends linearly of k so time complexity is O(k) for single k.

In second case we check up to and including sqrt(k) so in big O notation time complexity is O(sqrt(k)).

What happens to time complexity if we want generate all primes up to some n?

Well let's see, for every positive integer k<=n we need to apply one of our functions. (Naive or improved)
So in first case with naive function, we have to call function isPrime n-2 times. 
Let's see the pattern here for example if n = 9:

isPrime(2) + isPrime(3) + ... + isPrime(8) + isPrime(9) generally we have [sum(n-k)] for k = 0 : n-2

[ (n + n-1 + n - 2 ... + n-n+2) ] asymptotic value of this sum is practically sum of first n natural numbers which can be evaluated to n(n+1)/ 2. so complete time complexity is of rang O(n^2). Which is far too slow.

if we apply discussed improved function which checks only sqrt(k) numbers, time complexity is better let's see how much: 

we have n-1 times to call improved function for some value k <= n. Our complexity is dependent of n and k in following way:

[ sum(sqrt(n-k)) ], k = 0 : n-2, so we have (n-1)[ (sqrt(n) + sqrt(n-1) + ... + sqrt(n-n+2) ]  

time complexity of this version is a little better, in order to estimate this we need to know how to expand this sequence of roots. For that purpose check out this link: math.stackexchange.com It would be something of a rang O(n^1.5) 

In case 3, time complexity is faster by factor ln(n). It becomes O(n^1.5/ln(n))

In order to improve time complexity we need another approach. Instead of for every k<=n checking whether it is prime and printing it, we will assume all of k up to n are prime and then discard all composite. That will leave us with all of primes up to n. This is central idea of Eratosthenes sieve algorithm.

Let's go back to definition:

2 is a prime. So by thinking, ok if 2 is prime then none of it's multiples are not prime.
So 4, 6, 8, 10, 12 ... are not so we discard them.

By same logic we come to next number 3 and discard it's multiples.
So 6, 9, 12, 15 ... Not prime so we discard them too.

Notice that as we go along every next k which is not discarded must be prime.


For example when we discard multiples of 2 and 3, next number which is not discarded is 5 because, 4 is discarded as a multiple of 2.

In this fashion we discard all multiples up to n and we're left with primes up to n.  

up to what k we have to sieve, well we know that for upper bound n we need to check sqrt(n) values and that is precisely how much we need to sieve to be sure, we've discarded all composite numbers up to n. 

The algorithm follows:

primes is array which will contain our primes. 
void primeSieve(int *primes, int n) {
bool *sieve = (bool*)malloc(n+1);

// assume 2 .. n are primes 
for(int i = 2; i <= n; i++)
  sieve[i] = true;

sieve[0] = false;
sieve[1] = false;

int count = 0;
for(i = 2; i <=sqrt(n); i++) {
   if(sieve[i] == false)
     continue;

// put primes in array we've passed as parameter of the function.
primes[count++] = i;

  for(int j = i*i; j <= n; j+=i)
     sieve[j] = false;
}

}


Running time of this algorithm is O(n*log(log(n))) but regarding input size it's complexity is exponential. 
Space complexity is O(n).

Finally when sieving primes, memory turns out to be very important factor for large n and so we have something called segmented sieve version. 

General idea is to sieve primes in segments as name suggest. 

[101  102 103 104 105 106 107 108 109 110] -- segment to be sieved. 

for each prime (precalculated) find first multiple such that lower end of the range is > then that multiple.

m = 101 , n = 110

for prime  = 2, 101 - 101 mod 2 = 100
now we discard all multiples of 2 starting from 100 up to 110 -- (100,110] 

we cross off 100, 102, .. ,110

for prime = 3, 101 - 101 mod 3  = 99

we cross off 102, 105, 108

and so on ... until we check all previous primes.

To put your skills at practice regarding segmented version of sieve, check out prime1 problem at spoj.

Hope that clears up some things for you, happy programming.   



No comments:

Post a Comment