Fast implementation of thesegmented sieve of Eratosthenes

Introduction

In order to verify as quickly as possible the Goldbach conjecture in the vicinity of 10^17 (and higher) I implemented a cache-friendly segmented sieve of Eratosthenes. My implementation appears to be quite fast, specially when testing intervals near 10^18. In this page I report the main ideas behind this implementation, present some speed measurements, and provide C source code of a proof-of-concept implementation of my ideas. In these (brief) explanations it is assumed that the reader has some experience with this subject.

Description of the algorithm

In the classical sieve of Eratosthenes one starts with the numbers 2, 3, 4, ..., N, and successively discard those that are multiples of the primes up to sqrt(N); these primes will be called sieving primes. On a computer, the character of each integer (either prime or composite), or better yet, the character of each integer that is not a multiple of the very first primes (say, 2, 3 and 5), can be represented by one bit of information. Because each computer has a finite amount of physical memory, if N is too large one is forced to subdivide the interval [2,N] in manageable subintervals. This subdivision gives rise to the so-called segmented sieve of Eratosthenes.

In order to identify all prime numbers in the subinterval [P,Q] it is necessary to discard all multiples of the sieving primes that belong to this subinterval. Assuming a constant memory access time, the amount of time required to do this is of the form A (Q-P) log(log(Q))+B pi(sqrt(Q)) plus some less significant terms, where pi(sqrt(Q)) is the number of sieving primes and A and B are positive constants. The first term is related to the number of expected multiples of the sieving primes belonging to [P,Q], and the second is related to the overhead incurred in the computation of the first multiple, not smaller than P, of each sieving prime.

If Q is large when compared to Q-P, the overhead term B pi(sqrt(Q)) will dominate the execution time. To reduce its effect the popular approach is to increase the subinterval size, i.e., increase Q-P, so that each sieving prime will have a few multiples in the subinterval. When Q is quite large, say Q=10^18, this is currently not feasible, and the B pi(sqrt(Q)) term will dominate the execution time. Furthermore, a large interval requires a large amount of memory, which will be accessed in an essentially random way. The cache of modern processors will thus not be used efficiently, making the A constant rather large.

In 2001 I devised an apparently new (and fast) algorithm to generate primes, which is based on the observation that usually many consecutive subintervals will be sieved. The main idea is to keep, for each subinterval, a list of the sieving primes that have multiples in that subinterval. In an initialization phase, performed only once (at the start of the computation), the sieving primes are distributed among the lists. To do this, the first multiple (belonging to the interval being sieved) of each sieving prime is computed, the subinterval where this multiple lies is identified, and the prime and respective multiple are placed in the appropriate list. After the initialization, each subinterval is considered in turn. The list of primes of the current subinterval is processed, and its sieving primes are put into the lists of the next subintervals, according to the value of their next multiple. The computation stops after the last subinterval is processed.

The amount of time spent in the initialization is of the form C pi(sqrt(N)) plus less significant terms. The amount of time required to process each subinterval is of the form D (Q-P) log(log(Q)) plus less significant terms. The amount of memory used in our new algorithm is proportional to the number of sieving primes. The size of each subinterval should be small, so that all important data can be kept in the processor' data caches.

Implementation

In a practical implementation of these ideas, the number of lists is a fixed power of 2, and the lists are used in a circular fashion (using modular arithmetic for the list number). Furthermore, each list is implemented using a linked list, each node of which having a size that is a multiple of the line size of the processor's level 1 data cache and having addresses aligned on multiples of that size; each node of the linked list will thus hold information about several sieving primes. This is necessary to both reduce the storage overheads of the linked list and to use the processor' data caches more efficiently.

Since the currently freely available implementations (that I am aware of) of the segmented sieve of Eratosthenes are not very efficient when processing intervals near 10^16 or higher, I have decided to offer two simple implementations of the algorithm. For comparison purposes in each case two algorithms are offered: a "classical" segmented sieve and the new segmented sieve.

Performance

To show the effectiveness of the algorithm, in the following table I present the time, in seconds, required to count the number of primes in an interval of 10^9 integers, starting at 10^n. Three algorithms are considered:

• A1: the implementation of the "classical" segmented sieve mentioned above (version 1),
• A2: the implementation of the new segmented sieve, also mentioned above (version 1), and
• A3: an optimized implementation of the new segmented sieve, using a mod 30 wheel and assembly language, used in my Goldbach conjecture verification project.

I have not taken into consideration the initialization time, which is approximately equal for the three algorithms (less than one minute for an interval near 10^18). In this way, to estimate the execution time for an interval of, say, 10^12 integers one just has to multiply the numbers given in the table by 1000.

n   A1   A2   A3
12   13.31   14.08   2.63
14   25.79   20.62   4.05
16   97.39   26.26   5.58
18   208.13   32.36   9.61

The machine used was a 900MHz Athlon computer with 512Mbytes of CAS3 PC133 memory running GNU/Linux. For each sieving prime, A1 uses about 8 bytes, A2 uses about 8.1 bytes, and A3 uses about 6.5 bytes. For the first two algorithms, more details are given in the source code. More timing results of algorithm A3 are available elsewhere [23KiB, PDF].

Software (version 1) (last update made on May 3, 2003)

Simple single-threaded implementation [12KiB, gziped tar archive] of the segmented sieve of Eratosthenes, released under the version 2 (or any later version) of the GNU general public license. Please do not ask for my help to fit this source code (and the one described below) to your own purposes. I will almost surely not help you. Please note that this is proof-of-concept code; production code, using a mod 30 wheel, would be at least twice as fast.

New software (version 2) (last update made on September 29, 2010)

A more refined single-threaded implementation [20KiB, gziped tar archive] of the segmented sieve of Eratosthenes, released under the version 2 (or any later version) of the GNU general public license. On a 2.83GHz Intel Core2 Quad Q9550 processor, the new single-threaded implementation can generate all primes in an interval of 2^30 near 2^64 in 3 seconds using 1.6G bytes of memory. Please note that this is proof-of-concept code; production code, using a mod 30 wheel, would be at least twice as fast.