# Fast implementation of the

segmented sieve of Eratosthenes

## Introduction

In order to verify as quickly as possible the Goldbach
conjecture in the vicinity of *10^17* (and higher) I implemented a cache-friendly
segmented sieve of Eratosthenes. My implementation appears to be quite fast, specially when
testing intervals near *10^18*. In this page I report the main ideas behind this
implementation, present some speed measurements, and provide C source code of a
proof-of-concept implementation of my ideas. In these (brief) explanations it is assumed that
the reader has some experience with this subject.

## Description of the algorithm

In the classical sieve of Eratosthenes one starts with the numbers *2*, *3*,
*4*, ..., *N*, and successively discard those that are multiples of the primes up to
*sqrt(N)*; these primes will be called sieving primes. On a computer, the character of each
integer (either prime or composite), or better yet, the character of each integer that is not a
multiple of the very first primes (say, *2*, *3* and *5*), can be represented by
one bit of information. Because each computer has a finite amount of physical memory, if *N*
is too large one is forced to subdivide the interval *[2,N]* in manageable subintervals.
This subdivision gives rise to the so-called segmented sieve of Eratosthenes.

In order to identify all prime numbers in the subinterval *[P,Q]* it is necessary to
discard all multiples of the sieving primes that belong to this subinterval. Assuming a constant
memory access time, the amount of time required to do this is of the form
*A (Q-P) log(log(Q))+B pi(sqrt(Q))* plus some less significant terms, where
*pi(sqrt(Q))* is the number of sieving primes and *A* and *B* are positive
constants. The first term is related to the number of expected multiples of the sieving primes
belonging to *[P,Q]*, and the second is related to the overhead incurred in the computation
of the first multiple, not smaller than *P*, of each sieving prime.

If *Q* is large when compared to *Q-P*, the overhead term *B pi(sqrt(Q))* will
dominate the execution time. To reduce its effect the popular approach is to increase the
subinterval size, i.e., increase *Q-P*, so that each sieving prime will have a few multiples
in the subinterval. When *Q* is quite large, say *Q=10^18*, this is currently not
feasible, and the *B pi(sqrt(Q))* term will dominate the execution time. Furthermore, a
large interval requires a large amount of memory, which will be accessed in an essentially random
way. The cache of modern processors will thus not be used efficiently, making the *A*
constant rather large.

In *2001* I devised an **apparently new** (and fast) algorithm to generate primes,
which is based on the observation that usually many consecutive subintervals will be sieved.
The main idea is to keep, for each subinterval, a list of the sieving primes that have multiples
in that subinterval. In an initialization phase, performed only once (at the start of the
computation), the sieving primes are distributed among the lists. To do this, the first multiple
(belonging to the interval being sieved) of each sieving prime is computed, the subinterval where
this multiple lies is identified, and the prime and respective multiple are placed in the
appropriate list. After the initialization, each subinterval is considered in turn. The list of
primes of the current subinterval is processed, and its sieving primes are put into the lists of
the next subintervals, according to the value of their next multiple. The computation stops after
the last subinterval is processed.

The amount of time spent in the initialization is of the form *C pi(sqrt(N))* plus
less significant terms. The amount of time required to process each subinterval is of the form
*D (Q-P) log(log(Q))* plus less significant terms. The amount of memory used in our new
algorithm is proportional to the number of sieving primes. The size of each subinterval should be
small, so that all important data can be kept in the processor' data caches.

## Implementation

In a practical implementation of these ideas, the number of lists is a fixed power
of *2*, and the lists are used in a circular fashion (using modular arithmetic for the
list number). Furthermore, each list is implemented using a linked list, each node of which
having a size that is a multiple of the line size of the processor's level 1 data cache
and having addresses aligned on multiples of that size; each node of the linked list will thus
hold information about several sieving primes. This is necessary to both reduce the storage
overheads of the linked list and to use the processor' data caches more efficiently.

Since the currently freely available implementations (that I am aware of) of the segmented
sieve of Eratosthenes are not very efficient when processing intervals near *10^16* or
higher, I have decided to offer two simple implementations of the algorithm.
For comparison purposes in each case two algorithms are offered: a "classical" segmented sieve
and the new segmented sieve.

## Performance

To show the effectiveness of the algorithm, in the following table I present the time, in
seconds, required to count the number of primes in an interval of *10^9* integers, starting
at *10^n*. Three algorithms are considered:

- A1: the implementation of the "classical" segmented sieve mentioned above (version 1),
- A2: the implementation of the new segmented sieve, also mentioned above (version 1), and
- A3: an optimized implementation of the new segmented sieve, using a
*mod 30*wheel and assembly language, used in my Goldbach conjecture verification project.

I have not taken into consideration the initialization time, which is approximately equal for
the three algorithms (less than one minute for an interval near *10^18*). In this way, to
estimate the execution time for an interval of, say, *10^12* integers one just has to
multiply the numbers given in the table by *1000*.

n | A1 | A2 | A3 | |
---|---|---|---|---|

12 | 13.31 | 14.08 | 2.63 | |

14 | 25.79 | 20.62 | 4.05 | |

16 | 97.39 | 26.26 | 5.58 | |

18 | 208.13 | 32.36 | 9.61 |

The machine used was a 900MHz Athlon computer with 512Mbytes of CAS3 PC133 memory running GNU/Linux. For each sieving prime, A1 uses about 8 bytes, A2 uses about 8.1 bytes, and A3 uses about 6.5 bytes. For the first two algorithms, more details are given in the source code. More timing results of algorithm A3 are available elsewhere [23KiB, PDF].

## Software (version 1) (last update made on May 3, 2003)

Simple single-threaded implementation [12KiB, gziped tar archive] of the segmented sieve of Eratosthenes, released under the version 2 (or any later version) of the GNU general public license. Please do not ask for my help to fit this source code (and the one described below) to your own purposes. I will almost surely not help you. Please note that this is proof-of-concept code; production code, using a mod 30 wheel, would be at least twice as fast.

## New software (version 2) (last update made on September 29, 2010)

A more refined single-threaded implementation [20KiB, gziped tar archive]
of the segmented sieve of Eratosthenes, released under the version 2 (or any later version) of
the GNU general
public license. On a 2.83GHz Intel Core2 Quad Q9550 processor, the new single-threaded
implementation can generate all primes in an interval of *2^30* near *2^64* in
*3* seconds using 1.6G bytes of memory. Please note that this is proof-of-concept code;
production code, using a mod 30 wheel, would be at least twice as fast.