commandini’s tech blog

GCD Performance Analysis

In this article, I explain several methods for finding the greatest common divisor (GCD) of two integers and provide their implementations in Java. A performance comparison follows.

Euclidean algorithm

Recursive version:

    int euclideanGcdRecursive(int a, int b) {
        if (b == 0) {
            return a;
        }
        return euclideanGcdRecursive(b, a % b);
    }

Iterative version:

    int euclideanGcdIterative(int a, int b) {
        while (b != 0) {
            int temp = b;
            b = a % b;
            a = temp;
        }
        return a;
    }

Binary GCD algorithm (Stein’s algorithm)

The following pseudocode is based on the algorithm described on this NIST page.

1. if a is 0 then return b
2. if b is 0 then return a 
3. g = 1
4. while a is even and b is even
        a = a/2
        b = b/2
        g = 2*g
   (after this loop, at least one of a and b is odd.)
5. while a > 0
        if a is even, a = a/2
        else if b is even, b = b/2
        else
            c = |a-b|/2
            if a < b, then b = c else a = c 
6. return b*g 

Here’s a straightforward Java implementation of the algorithm described above:

    int binaryGcd(int a, int b) {
        if (a == 0) {
            return b;
        }
        if (b == 0) {
            return a;
        }
        int g = 1;
        while (a % 2 == 0 && b % 2 == 0) {
            a /= 2;
            b /= 2;
            g *= 2;
        }
        while (a > 0) {
            if (a % 2 == 0) {
                a /= 2;
            } else if (b % 2 == 0) {
                b /= 2;
            } else {
                int c = Math.abs(a - b) / 2;
                if (a < b) {
                    b = c;
                } else {
                    a = c;
                }
            }
        }
        return b * g;
    }

What can be optimized in this implementation? Division/multiplication by 2 operations can be replaced with binary shifts. Modulo operations involve division under the hood, so replacing the parts checking if a number is even (e.g. a % 2 == 0) with bitwise AND operation (e.g. (a & 1) == 0) sounds good. Applying those ideas, we get the following implementation:

    int binaryGcdOptimized(int a, int b) {
        if (a == 0) {
            return b;
        }
        if (b == 0) {
            return a;
        }
        int g = 1;
        while ((a & 1) == 0 && (b & 1) == 0) {
            a >>= 1;
            b >>= 1;
            g <<= 1;
        }
        while (a > 0) {
            if ((a & 1) == 0) {
                a >>= 1;
            } else if ((b & 1) == 0) {
                b >>= 1;
            } else {
                int c = Math.abs(a - b) >> 1;
                if (a < b) {
                    b = c;
                } else {
                    a = c;
                }
            }
        }
        return b * g;
    }

And here’s a Java implementation of binary GCD algorithm based on the C++ implementation and optimization ideas from Daniel Lemire’s blog post.

    int binaryGcdLemire(int a, int b) {
        if (a == 0) {
            return b;
        }
        if (b == 0) {
            return a;
        }
        int trailingZerosA = Integer.numberOfTrailingZeros(a);
        int trailingZerosB = Integer.numberOfTrailingZeros(b);
        int shift = Math.min(trailingZerosA, trailingZerosB);
        a >>= trailingZerosA;
        do {
            b >>= trailingZerosB;
            int diff = b - a;
            trailingZerosB = Integer.numberOfTrailingZeros(diff);
            if (diff == 0) {
                break;
            }
            if (b < a) {
                a = b;
            }
            b = Math.abs(diff);
        } while (true);
        return a << shift;
    }

Denote that, Integer.numberOfTrailingZeros(int) in Java is functionally equivalent to __builtin_ctz(unsigned int) in C++. They return the number of trailing zero bits in the binary representation of a given integer.

Guava’s implementation of binary GCD algorithm is as follows.

  int gcd(int a, int b) {
    /*
     * The reason we require both arguments to be >= 0 is because otherwise, what do you return on
     * gcd(0, Integer.MIN_VALUE)? BigInteger.gcd would return positive 2^31, but positive 2^31 isn't
     * an int.
     */
    checkNonNegative("a", a);
    checkNonNegative("b", b);
    if (a == 0) {
      // 0 % b == 0, so b divides a, but the converse doesn't hold.
      // BigInteger.gcd is consistent with this decision.
      return b;
    } else if (b == 0) {
      return a; // similar logic
    }
    /*
     * Uses the binary GCD algorithm; see http://en.wikipedia.org/wiki/Binary_GCD_algorithm. This is
     * >40% faster than the Euclidean algorithm in benchmarks.
     */
    int aTwos = Integer.numberOfTrailingZeros(a);
    a >>= aTwos; // divide out all 2s
    int bTwos = Integer.numberOfTrailingZeros(b);
    b >>= bTwos; // divide out all 2s
    while (a != b) { // both a, b are odd
      // The key to the binary GCD algorithm is as follows:
      // Both a and b are odd. Assume a > b; then gcd(a - b, b) = gcd(a, b).
      // But in gcd(a - b, b), a - b is even and b is odd, so we can divide out powers of two.

      // We bend over backwards to avoid branching, adapting a technique from
      // http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

      int delta = a - b; // can't overflow, since a and b are nonnegative

      int minDeltaOrZero = delta & (delta >> (Integer.SIZE - 1));
      // equivalent to Math.min(delta, 0)

      a = delta - minDeltaOrZero - minDeltaOrZero; // sets a to Math.abs(a - b)
      // a is now nonnegative and even

      b += minDeltaOrZero; // sets b to min(old a, b)
      a >>= Integer.numberOfTrailingZeros(a); // divide out all 2s, since 2 doesn't divide b
    }
    return a << min(aTwos, bTwos);
  }

Correctness of the GCD implementations

Before jumping into the performance analysis, let’s ensure that these six GCD calculation methods produce correct and consistent results. Unit tests will help achieve that.

import com.google.common.math.IntMath;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
import org.junit.jupiter.params.provider.ValueSource;

import java.util.Random;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

class GcdTest {
    @ParameterizedTest
    @CsvSource({
            "48, 18, 6",
            "100, 25, 25",
            "17, 13, 1",
            "1024, 512, 512",
            "97, 89, 1",
            "252, 105, 21",
            "1071, 462, 21"
    })
    @DisplayName("Known GCD values")
    void knownValues(int a, int b, int expectedGcd) {
        assertEquals(expectedGcd, Gcd.euclideanGcdRecursive(a, b));
        assertEquals(expectedGcd, Gcd.euclideanGcdIterative(a, b));
        assertEquals(expectedGcd, Gcd.binaryGcd(a, b));
        assertEquals(expectedGcd, Gcd.binaryGcdOptimized(a, b));
        assertEquals(expectedGcd, Gcd.binaryGcdLemire(a, b));
    }

    @Test
    @DisplayName("Edge cases")
    void edgeCases() {
        assertAllMethodsReturnSameResults(0, 5);
        assertAllMethodsReturnSameResults(7, 0);
        assertAllMethodsReturnSameResults(1, 100);
        assertAllMethodsReturnSameResults(50, 1);
        assertAllMethodsReturnSameResults(42, 42);
    }

    @Test
    @DisplayName("Random values")
    void randomValues() {
        Random random = new Random(12345); // Fixed seed for reproducibility
        for (int i = 0; i < 1000; i++) {
            int a = random.nextInt(10000) + 1;
            int b = random.nextInt(10000) + 1;
            assertAllMethodsReturnSameResults(a, b);
        }
    }

    @ParameterizedTest
    @ValueSource(ints = {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024})
    @DisplayName("Powers of 2")
    void powersOfTwo(int power) {
        assertAllMethodsReturnSameResults(power, power / 2);
        assertAllMethodsReturnSameResults(power, power * 3);
        assertAllMethodsReturnSameResults(power, 7);
    }

    @Test
    @DisplayName("Large numbers")
    void largeNumbers() {
        assertAllMethodsReturnSameResults(1000000, 999999);
        assertAllMethodsReturnSameResults(524288, 262144);
        assertAllMethodsReturnSameResults(999983, 999979);
    }

    @Test
    @DisplayName("Performance sanity check")
    void performanceSanityCheck() {
        long startTime = System.currentTimeMillis();
        assertAllMethodsReturnSameResults(Integer.MAX_VALUE, Integer.MAX_VALUE - 1);
        assertAllMethodsReturnSameResults((int) Math.pow(2, 30), (int) Math.pow(2, 29));
        long endTime = System.currentTimeMillis();
        long duration = endTime - startTime;
        assertTrue(duration < 1000, "GCD calculations took too long: " + duration + "ms");
    }

    private void assertAllMethodsReturnSameResults(int a, int b) {
        int euclideanGcdRecursive = Gcd.euclideanGcdRecursive(a, b);
        int euclideanGcdIterative = Gcd.euclideanGcdIterative(a, b);
        int binaryGcd = Gcd.binaryGcd(a, b);
        int binaryGcdOptimized = Gcd.binaryGcdOptimized(a, b);
        int binaryGcdLemire = Gcd.binaryGcdLemire(a, b);
        int guavaGcd = IntMath.gcd(a, b);
        String message = String.format("GCD(%d, %d) - Methods disagree: euclideanGcdRecursive=%d, euclideanGcdIterative=%d,"
                        + " binaryGcd=%d, binaryGcdOptimized=%d, binaryGcdLemire=%d, guavaGcd=%d",
                a, b, euclideanGcdRecursive, euclideanGcdIterative, binaryGcd, binaryGcdOptimized, binaryGcdLemire, guavaGcd);
        assertEquals(euclideanGcdRecursive, euclideanGcdIterative, message);
        assertEquals(euclideanGcdRecursive, binaryGcd, message);
        assertEquals(euclideanGcdRecursive, binaryGcdOptimized, message);
        assertEquals(euclideanGcdRecursive, binaryGcdLemire, message);
        assertEquals(euclideanGcdRecursive, guavaGcd, message);
    }
}

Benchmarking with JMH

Benchmarking environment

OS                   : Linux 6.9.3-76060903-generic (amd64)
CPU                  : 12th Gen Intel Core i7-12700H (14 cores / 20 threads)
Max Frequency (MHz)  : 4700.0000
JVM version          : JDK 21.0.7, OpenJDK 64-Bit Server VM, 21.0.7+6-Ubuntu-0ubuntu122.04
JVM options          : -Xms1g -Xmx1g
JMH version          : 1.37
Warmup               : 3 iterations, 2 s each
Measurement          : 5 iterations, 3 s each
Timeout              : 10 min per iteration
Threads              : 1 thread, will synchronize iterations
Benchmark mode       : Throughput, ops/time

Benchmarking code setup

I use various types of test data for the performance benchmarking of those 6 GCD functions described so far. Small/medium/large numbers, mixed numbers i.e. combination of small, medium and large numbers, co-prime numbers, and numbers which are divisible by some power of 2.

import com.google.common.math.IntMath;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@BenchmarkMode({Mode.Throughput})
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgs = {"-Xms1g", "-Xmx1g"})
@Threads(1)
public class GcdBenchmark {
    @Param({"SMALL", "MEDIUM", "LARGE", "MIXED", "COPRIME", "MULTIPLES_OF_POWERS_OF_TWO"})
    private TestDataCategory testDataCategory;
    private int[] aValues;
    private int[] bValues;
    private static final int TEST_PAIRS_COUNT = 1024;
    private static final int[] PRIMES = generateFirstNPrimes(4_000);
    private static final Random RANDOM = new Random(79);

    @State(Scope.Thread)
    public static class ThreadLocalIndex {
        int index = 0;
    }

    @Setup(Level.Trial)
    public void setup() {
        aValues = new int[TEST_PAIRS_COUNT];
        bValues = new int[TEST_PAIRS_COUNT];
        System.out.printf("Setting up test data: %s (%s)%n", testDataCategory.name(), testDataCategory.getDescription());
        switch (testDataCategory) {
            case SMALL:
                generateSmallNumbers();
                break;
            case MEDIUM:
                generateMediumNumbers();
                break;
            case LARGE:
                generateLargeNumbers();
                break;
            case MIXED:
                generateMixedNumbers();
                break;
            case COPRIME:
                generateCoprimeNumbers();
                break;
            case MULTIPLES_OF_POWERS_OF_TWO:
                generateMultiplesOfPowersOfTwo();
                break;
        }
    }

    @Benchmark
    public int euclideanRecursive(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return Gcd.euclideanGcdRecursive(aValues[index], bValues[index]);
    }

    @Benchmark
    public int euclideanIterative(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return Gcd.euclideanGcdIterative(aValues[index], bValues[index]);
    }

    @Benchmark
    public int binaryGcd(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return Gcd.binaryGcd(aValues[index], bValues[index]);
    }

    @Benchmark
    public int binaryGcdOptimized(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return Gcd.binaryGcdOptimized(aValues[index], bValues[index]);
    }

    @Benchmark
    public int binaryGcdLemire(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return Gcd.binaryGcdLemire(aValues[index], bValues[index]);
    }

    @Benchmark
    public int guavaGcd(ThreadLocalIndex threadLocalIndex) {
        int index = (threadLocalIndex.index++) & (TEST_PAIRS_COUNT - 1);
        return IntMath.gcd(aValues[index], bValues[index]);
    }

    private void generateSmallNumbers() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            aValues[i] = RANDOM.nextInt(100) + 1;
            bValues[i] = RANDOM.nextInt(100) + 1;
        }
    }

    private void generateMediumNumbers() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            aValues[i] = RANDOM.nextInt(10000) + 1;
            bValues[i] = RANDOM.nextInt(10000) + 1;
        }
    }

    private void generateLargeNumbers() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            aValues[i] = RANDOM.nextInt(1000000) + 1;
            bValues[i] = RANDOM.nextInt(1000000) + 1;
        }
    }

    private void generateMixedNumbers() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            switch (RANDOM.nextInt(3)) {
                case 0 -> {
                    aValues[i] = RANDOM.nextInt(100) + 1;
                    bValues[i] = RANDOM.nextInt(100) + 1;
                }
                case 1 -> {
                    aValues[i] = RANDOM.nextInt(10_000) + 1;
                    bValues[i] = RANDOM.nextInt(10_000) + 1;
                }
                case 2 -> {
                    aValues[i] = RANDOM.nextInt(1_000_000) + 1;
                    bValues[i] = RANDOM.nextInt(1_000_000) + 1;
                }
            }
        }
    }

    private void generateCoprimeNumbers() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            int indexOfPrime = RANDOM.nextInt(PRIMES.length - 3);
            aValues[i] = PRIMES[indexOfPrime] * PRIMES[indexOfPrime + 1];
            bValues[i] = PRIMES[indexOfPrime + 2] * PRIMES[indexOfPrime + 3];
        }
    }

    private void generateMultiplesOfPowersOfTwo() {
        for (int i = 0; i < TEST_PAIRS_COUNT; i++) {
            int powerA = RANDOM.nextInt(20) + 1;
            int powerB = RANDOM.nextInt(20) + 1;
            int multiplierA = (RANDOM.nextInt(100) + 1);
            int multiplierB = (RANDOM.nextInt(100) + 1);
            aValues[i] = (1 << powerA) * multiplierA;
            bValues[i] = (1 << powerB) * multiplierB;
        }
    }

    private static int[] generateFirstNPrimes(int n) {
        int[] result = new int[n];
        int number = 2;
        int index = 0;
        while (index < n) {
            if (IntMath.isPrime(number)) {
                result[index++] = number;
            }
            number++;
        }
        return result;
    }

    public enum TestDataCategory {
        SMALL("Numbers 1-100"),
        MEDIUM("Numbers 1-10,000"),
        LARGE("Numbers 1-1,000,000"),
        MIXED("Mixed small/medium/large"),
        COPRIME("Coprime pairs (GCD=1)"),
        MULTIPLES_OF_POWERS_OF_TWO("Multiples of powers of 2");

        private final String description;

        TestDataCategory(String description) {
            this.description = description;
        }

        public String getDescription() {
            return description;
        }
    }
}

Benchmarking results

Runtime: 38 minutes 9 seconds

Benchmark                   (testDataCategory)   Mode  Cnt          Score          Error  Units
binaryGcd                                SMALL  thrpt   15   80093873.958 ± 21598842.550  ops/s
binaryGcd                               MEDIUM  thrpt   15   17819613.416 ±  4413404.244  ops/s
binaryGcd                                LARGE  thrpt   15    9567475.056 ±   725753.987  ops/s
binaryGcd                                MIXED  thrpt   15   20162041.959 ±  4193874.599  ops/s
binaryGcd                              COPRIME  thrpt   15    8079315.863 ±   263211.444  ops/s
binaryGcd           MULTIPLES_OF_POWERS_OF_TWO  thrpt   15   35847878.254 ±  7404436.967  ops/s
binaryGcdLemire                          SMALL  thrpt   15  250932146.168 ±  2115387.421  ops/s
binaryGcdLemire                         MEDIUM  thrpt   15  108345918.881 ± 10938982.071  ops/s
binaryGcdLemire                          LARGE  thrpt   15   28970832.010 ±  1344796.463  ops/s
binaryGcdLemire                          MIXED  thrpt   15  113334606.186 ± 14236010.646  ops/s
binaryGcdLemire                        COPRIME  thrpt   15   23301390.698 ±   239699.338  ops/s
binaryGcdLemire     MULTIPLES_OF_POWERS_OF_TWO  thrpt   15  249430736.491 ±  5927818.357  ops/s
binaryGcdOptimized                       SMALL  thrpt   15  124634675.892 ±  5527667.667  ops/s
binaryGcdOptimized                      MEDIUM  thrpt   15   25069070.741 ±   858523.595  ops/s
binaryGcdOptimized                       LARGE  thrpt   15   12314533.497 ±   319010.967  ops/s
binaryGcdOptimized                       MIXED  thrpt   15   24328557.052 ±  2182081.430  ops/s
binaryGcdOptimized                     COPRIME  thrpt   15    9891866.846 ±   316097.230  ops/s
binaryGcdOptimized  MULTIPLES_OF_POWERS_OF_TWO  thrpt   15   31905618.950 ±   909187.425  ops/s
euclideanIterative                       SMALL  thrpt   15  184124406.817 ±   619064.913  ops/s
euclideanIterative                      MEDIUM  thrpt   15   67391615.575 ±   894162.038  ops/s
euclideanIterative                       LARGE  thrpt   15   33709946.202 ±   211044.561  ops/s
euclideanIterative                       MIXED  thrpt   15   74621091.918 ±  1309003.818  ops/s
euclideanIterative                     COPRIME  thrpt   15   27762826.762 ±    75320.470  ops/s
euclideanIterative  MULTIPLES_OF_POWERS_OF_TWO  thrpt   15  177085933.434 ±   579951.675  ops/s
euclideanRecursive                       SMALL  thrpt   15  171817766.717 ±  3601958.600  ops/s
euclideanRecursive                      MEDIUM  thrpt   15   85117396.819 ±  2294969.999  ops/s
euclideanRecursive                       LARGE  thrpt   15   43682280.831 ±  1568800.268  ops/s
euclideanRecursive                       MIXED  thrpt   15   81862065.203 ±   711692.397  ops/s
euclideanRecursive                     COPRIME  thrpt   15   36241379.984 ±   570170.584  ops/s
euclideanRecursive  MULTIPLES_OF_POWERS_OF_TWO  thrpt   15  168124256.670 ±  6360404.297  ops/s
guavaGcd                                 SMALL  thrpt   15  267479968.138 ±   435308.620  ops/s
guavaGcd                                MEDIUM  thrpt   15  117407617.637 ±   487596.182  ops/s
guavaGcd                                 LARGE  thrpt   15   65493190.379 ±  2083317.443  ops/s
guavaGcd                                 MIXED  thrpt   15  107093681.790 ±   183118.261  ops/s
guavaGcd                               COPRIME  thrpt   15   33155612.258 ±  3383485.098  ops/s
guavaGcd            MULTIPLES_OF_POWERS_OF_TWO  thrpt   15  277656366.472 ±  6692938.959  ops/s

The visualization below is generated via JMH Visualizer. GCD Benchmark Visualization

Conclusion

Highly optimized binary GCD algorithms, i.e. Guava’s GCD and Lemire’s GCD achieve the highest throughput compared to the other GCD implementations. In general, Guava’s GCD is faster because it uses branchless techniques that improve CPU pipeline efficiency and branch prediction. It avoids some method calls and uses bitwise arithmetic that JIT compiles into fewer, faster instructions.
Next, Euclidian GCD versions come. Notice that iterative and recursive versions of Euclidian GCD achieved close results. However, iterative version performs better. This can be explained with limited tail-call optimization (TCO) in JVM. The JVM doesn’t reliably perform tail-call optimization, which means recursive calls aren’t transformed into jumps internally. So recursive calls still accumulate stack frames and overhead.
Plain binary GCD and its slightly optimized version achieved the lowest throughput. The optimized binary GCD performed better than the plain version due to reduced division and multiplication operations; nevertheless, both are nowhere near the performance of the other four GCD implementations.
One final caveat: All of the GCD implementations above — except for Guava’s — do not check whether the given integer values are non-negative. I’ve omitted that check for simplicity, but ideally, the following lines can be added at the beginning of those methods:
```
a = Math.abs(a);
b = Math.abs(b);
```

commandini's tech blog

{mathematics, algorithms and programming puzzles}