In December 2025, the DoltHub team found that their Go database was producing different query plans on ARM Macs and x86 Windows machines. The trace led to a single expression in the query planner’s cost model:

return lBest*seqIOCostFactor + selfJoinCard*(randIOCostFactor+seqIOCostFactor), nil

On ARM, the Go compiler emitted a fused multiply-add (FMA) instruction for this expression. On x86, it did not. The FMA rounded once instead of twice, producing a result that differed by one unit in the last place (ULP). That was enough to flip a less-than comparison between two nearly identical plan costs and select a different join order.

The values: 3.09928472e+06 on one platform, 3.0992847200000007e+06 on the other. Both IEEE 754 compliant. Both correct. Different bits.

The DoltHub team documented the discovery and the fix on their blog. The fix is interesting because it reveals a deliberate design choice in the Go compiler, and that design choice has implications for any Go program whose correctness depends on identical floating-point output across platforms.

FMA at the Hardware Level

A fused multiply-add computes a * b + c as a single operation with a single rounding step. Without FMA, the CPU computes a * b, rounds the result to the destination precision, adds c, and rounds again. Two operations, two roundings. FMA eliminates the intermediate rounding. The result is generally more accurate, closer to the true mathematical value, because only one rounding error is introduced instead of two. But when the intermediate rounding would have rounded in a different direction than the final rounding, the FMA and non-FMA paths produce results that differ by one ULP.
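The one-rounding-versus-two difference is easy to reproduce in pure Go using math.FMA (available since Go 1.14), which is defined as a correctly rounded fused result on every platform, emulated in software where the hardware lacks the instruction. A minimal sketch; the inputs are arbitrary values whose product is inexact:

```go
package main

import (
	"fmt"
	"math"
)

// twoRoundings forces the product to be rounded to float64 before the
// addition; the explicit float64 conversion is Go's fusion barrier.
func twoRoundings(a, b, c float64) float64 {
	return float64(a*b) + c
}

// oneRounding feeds the exact (unrounded) product into the addition
// and rounds only once, via the math.FMA intrinsic.
func oneRounding(a, b, c float64) float64 {
	return math.FMA(a, b, c)
}

func main() {
	a, b, c := 0.1, 0.1, -0.01
	// For these inputs the two paths land on different float64 values.
	fmt.Println(twoRoundings(a, b, c) == oneRounding(a, b, c)) // false
}
```

This is the same mechanism as the DoltHub divergence, except both paths are chosen explicitly here instead of being left to the compiler's per-platform decision.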

IEEE 754 Section 5.4.1 defines fusedMultiplyAdd as a sanctioned operation. As Doug Priest noted in his appendix to Goldberg’s “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” the standard requires correct rounding to the destination precision but does not require that intermediate precision be determined by the programmer’s source code.

The hardware landscape: ARMv8 chips universally support FMA (FMADD/FMSUB). On x86, FMA requires the FMA3 extension, available since Haswell (Intel, 2013) and Piledriver (AMD, 2012), but absent on older and some low-power chips. A compiler targeting ARM can emit FMA for any multiply-add expression. The same compiler targeting older x86 cannot. The same source code, the same compiler version, the same optimization level. Different instructions, different rounding, different bits.

Go’s Compiler Policy

Go addressed FMA in issue #17895, accepted for Go 1.9. The consensus:

A float64 conversion should be an explicit signal that a rounded float64 should be materialized.

The Go compiler is free to fuse a*b + c into a single FMA instruction whenever the target hardware supports it. The opt-out mechanism is an explicit float64() cast, which forces the compiler to materialize a rounded intermediate result:

// FMA allowed -- compiler may fuse on ARM
result := a*b + c

// FMA prevented -- explicit rounding after multiply
result := float64(a*b) + c

The DoltHub fix applied this directly:

return float64(lBest*seqIOCostFactor) +
    float64(selfJoinCard*(randIOCostFactor+seqIOCostFactor)), nil

This is a clean solution for expressions where you can identify and annotate the vulnerable multiply-add patterns. For a query planner, that may be a handful of cost-calculation expressions. For a digit-generation algorithm that involves extended sequences of floating-point arithmetic, the question becomes harder: how do you ensure that every expression in the pipeline is either not fusible or explicitly guarded?

Standard Library Formatters as Implementation Details

In January 2019, Anders Rundgren, the author of RFC 8785, filed Go issue #29491 against strconv.FormatFloat. His reference JCS implementation was producing incorrect rounding on Windows/amd64 for specific values:

IEEE 754 Hex        FormatFloat Returned        Correct Result
439babe4b56e8a39    498484681984085560          498484681984085570
c4dee27bdef22651    -5.8339553793802236e+23     -5.8339553793802237e+23

The root cause was a bug in strconv’s roundShortest function. The fix shipped in Go 1.13. The RFC author’s own reference implementation, broken by the standard library it delegated to.

This is not an indictment of strconv.FormatFloat. It is a high-quality implementation. But it is a general-purpose formatting function whose underlying algorithm has changed over Go’s history: Grisu3 with exact-arithmetic fallback through Go 1.16, Ryu from Go 1.17, and Dragonbox from Go 1.26. Each transition preserved round-trip correctness. None contractually guaranteed digit-sequence stability. The strconv documentation guarantees a shortest round-trip representation. It does not guarantee which valid shortest representation it will choose when two are equally correct, and it does not guarantee that the choice will remain stable across releases.

For a general-purpose formatting function, this is fine. Any shortest round-trip string is equally useful. For a canonicalization scheme, “valid but different” is a conformance failure, because the specification requires one specific digit sequence for each IEEE 754 bit pattern.

A Common Pattern in JCS Implementations

gowebpki/jcs, the maintained Go fork of Anders Rundgren’s RFC 8785 reference implementation, takes the obvious approach to number formatting: delegate to the standard library.

// invalidPattern is defined elsewhere in the package as the exponent
// mask 0x7ff0000000000000; a full match means NaN or Infinity.
func NumberToJSON(ieeeF64 float64) (res string, err error) {
    ieeeU64 := math.Float64bits(ieeeF64)

    if (ieeeU64 & invalidPattern) == invalidPattern {
        return "null", errors.New("Invalid JSON number: " +
            strconv.FormatUint(ieeeU64, 16))
    }
    if ieeeF64 == 0 {
        return "0", nil
    }

    var sign string = ""
    if ieeeF64 < 0 {
        ieeeF64 = -ieeeF64
        sign = "-"
    }

    var format byte = 'e'
    if ieeeF64 < 1e+21 && ieeeF64 >= 1e-6 {
        format = 'f'
    }

    // The following should (in "theory") do the trick:
    es6Formatted := strconv.FormatFloat(ieeeF64, format, -1, 64)

    exponent := strings.IndexByte(es6Formatted, 'e')
    if exponent > 0 {
        if es6Formatted[exponent+2] == '0' {
            es6Formatted = es6Formatted[:exponent+2] +
                es6Formatted[exponent+3:]
        }
    }
    return sign + es6Formatted, nil
}

This is a reasonable design. The function handles the special cases (NaN, infinity, negative zero, sign extraction, ECMA-262 format selection, exponent normalization) and delegates the hard part, digit generation, to strconv.FormatFloat. This is the pattern most JCS implementations follow, in Go and in other languages, because digit generation is genuinely difficult and the standard library already does it.

The coupling this creates is straightforward: if strconv.FormatFloat changes its output for a given input, the canonical output changes. Whether that matters depends on requirements. For applications where JCS output is compared within a single Go version on a single architecture, it may not matter at all. For applications where canonical output must be identical across Go versions, across platforms, or across language implementations, the coupling is the mechanism by which platform differences propagate into the canonical output.
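One mitigation that works regardless of which formatter sits underneath is to pin golden outputs in CI, so that a strconv algorithm change or a platform difference fails a test instead of silently changing canonical bytes. A hedged sketch; the canonical helper mirrors the ES6 format-selection rule shown above, and the case list is illustrative, not taken from the jcs test suite:

```go
package main

import (
	"log"
	"strconv"
)

// canonical applies the ES6 format-selection rule: fixed notation in
// [1e-6, 1e21), exponent notation outside it.
func canonical(v float64) string {
	format := byte('e')
	if v < 1e21 && v >= 1e-6 {
		format = 'f'
	}
	return strconv.FormatFloat(v, format, -1, 64)
}

func main() {
	// Pinned digit sequences: any drift in the underlying algorithm
	// surfaces here as a hard failure.
	golden := map[float64]string{
		0.1:    "0.1",
		1e21:   "1e+21",
		5e-324: "5e-324", // smallest subnormal
	}
	for in, want := range golden {
		if got := canonical(in); got != want {
			log.Fatalf("canonical drift for %v: got %q, want %q", in, got, want)
		}
	}
}
```

A golden-file test does not remove the coupling, but it converts a silent output change into a visible build break, which is often enough for single-implementation deployments.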

Approaches to Platform-Independent Digit Generation

Two approaches eliminate platform dependence in the digit-generation pipeline. They make different tradeoffs.

Arbitrary-Precision Integer Arithmetic

The Burger-Dybvig algorithm represents the float value and its rounding boundaries as ratios of arbitrary-precision integers (math/big.Int in Go). After the initial math.Float64bits bit-cast extracts the raw IEEE 754 pattern, every subsequent operation is exact integer arithmetic:

// Digit extraction: multiply R by 10, divide by S
state.r.Mul(state.r, bigTen)
d := new(big.Int)
d.DivMod(state.r, state.s, state.r)

big.Int.Mul, big.Int.DivMod, big.Int.Cmp: these are integer operations executed on the CPU’s integer ALU. The Go compiler cannot emit FMA instructions for them because FMA is a floating-point instruction. The guarantee holds not because the code is carefully written to avoid fusible expressions, but because the types involved make fusion inapplicable.
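The Mul/DivMod pattern is easiest to see on a toy input. Extracting the first six decimal digits of 1/7 uses exactly the loop shape above and touches only the integer ALU (digitsOf is our illustrative wrapper, not a function from the library):

```go
package main

import (
	"fmt"
	"math/big"
)

// digitsOf peels n decimal digits off the fraction num/den using only
// exact integer arithmetic: R *= 10, digit = R / S, R = R mod S.
func digitsOf(num, den int64, n int) string {
	r := big.NewInt(num)
	s := big.NewInt(den)
	ten := big.NewInt(10)
	out := make([]byte, 0, n)
	for i := 0; i < n; i++ {
		r.Mul(r, ten)
		d := new(big.Int)
		d.DivMod(r, s, r) // d = R/S, R = R mod S, both exact
		out = append(out, byte('0'+d.Int64()))
	}
	return string(out)
}

func main() {
	fmt.Println(digitsOf(1, 7, 6)) // 142857
}
```

In Burger-Dybvig, R and S also carry the rounding-boundary information, but the digit-extraction mechanics are the same: no float64 value exists for the compiler to fuse.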

The single floating-point operation in the pipeline is a log10 estimate used for initial decimal scaling. This estimate is allowed to be wrong. Two integer fixup passes correct it using exact big.Int.Cmp comparisons. An FMA-affected log10 estimate that is off by one is corrected the same way as any other off-by-one estimate.

The tradeoff is performance. math/big operations allocate heap memory and are substantially slower than fixed-width integer arithmetic. For a canonicalization library where the output contract is more important than throughput, this is an acceptable cost. For a high-throughput formatter, it may not be.

Shortest Round-Trip: Implementing IEEE 754 to Decimal Conversion in Go covers the full Burger-Dybvig implementation. The determinism claim is backed by an offline replay harness that runs 60 independent executions across 12 Linux environments on both x86_64 and arm64, comparing SHA-256 digests of canonical output.

Fixed-Width Integer Algorithms

Ryu and Schubfach (the algorithm underlying Dragonbox) avoid arbitrary-precision arithmetic entirely. They use fixed-width 64-bit and 128-bit integer operations with precomputed lookup tables. Their core paths are substantially faster than Burger-Dybvig.

These core paths are mostly integer math. A Ryu implementation’s critical loop looks roughly like this:

// Ryu core: fixed-width 128-bit multiply against precomputed table
vr := mulShift64(m2, table[q], j)  // uint64 multiply + shift
vp := mulShift64(m2+1, table[q], j)
vm := mulShift64(m2-1, table[q], j)
// ... digit extraction from vr, vp, vm using integer division

These are integer multiplications and shifts, not floating-point multiply-adds. A careful Go implementation of Ryu or Schubfach would likely be immune to FMA for the same structural reason as Burger-Dybvig: the critical operations use integer types that the compiler cannot fuse.
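math/bits makes the fixed-width flavor concrete. Below is a sketch of a 64x64-to-128-bit multiply-shift, the kind of primitive the loop above leans on; mulShift and its parameters are stand-ins for illustration, not Ryu's actual table layout:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulShift returns bits [shift, shift+64) of the 128-bit product
// m * factor. Pure integer work: there is nothing for the compiler
// to fuse. Assumes 0 < shift < 64.
func mulShift(m, factor uint64, shift uint) uint64 {
	hi, lo := bits.Mul64(m, factor) // full 128-bit product
	return hi<<(64-shift) | lo>>shift
}

func main() {
	// (10 * 2^62) >> 60 == 10 * 2^2 == 40
	fmt.Println(mulShift(10, 1<<62, 60)) // 40
}
```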

The risk is at the edges. A Ryu port might use a floating-point log10 estimator, or a helper function that computes an exponent approximation through float64 arithmetic. These expressions could be fusible. Where Burger-Dybvig with math/big makes FMA inapplicable by type across the entire pipeline, a fixed-width implementation needs verification that:

  • No floating-point helper functions (log10 estimators, exponent approximations) use expressions the compiler could fuse.
  • No intermediate value is stored in a float64 where FMA could change the rounding.
  • The lookup table generation does not depend on platform-specific floating-point behavior.
  • These properties hold across Go compiler versions, as the compiler’s fusion heuristics evolve.

This is not a fundamental obstacle. It is ongoing maintenance work, analogous to the float64() cast discipline that the DoltHub team applied to their cost model. For a project where the performance difference between fixed-width and arbitrary-precision arithmetic matters, the audit cost is well worth paying.

The Broader Observation

IEEE 754 compliance guarantees that each result is correctly rounded to the destination precision. It does not guarantee that two IEEE 754 compliant implementations produce identical results for the same source expression, because the standard permits operations like FMA that change how many roundings occur.

For most software, this does not matter. A query planner that picks a different join order on ARM versus x86 is a correctness problem only because it affects deterministic testing. The alternative plan may be equally efficient. A numerical simulation that differs by one ULP across platforms is within tolerance for almost all applications. FMA is, in most contexts, a net benefit: faster and more accurate.

For software whose correctness property is byte-identical output across platforms (canonicalization schemes, content-addressed storage, reproducible builds), the platform independence of the digit-generation pipeline is an architectural requirement, not a detail. Delegating to strconv.FormatFloat binds the output to the standard library’s current algorithm, its FMA exposure on the current platform, and its stability guarantees, which are round-trip correctness for a shortest representation, not digit-sequence stability. The approaches to owning the pipeline differ in how they achieve independence and what they trade for it, but the requirement itself is a consequence of IEEE 754’s design: compliance governs accuracy, not identity.

The Burger-Dybvig approach discussed above is the one used in json-canon, an RFC 8785 JSON canonicalization library written in Go.

Revision History

Date         Change
2026-03-04   Restored Dragonbox/Go 1.26 claim after verification against source tree at go1.26.0 tag
2026-03-04   Tightened FMA hardware section; restructured stdlib section to lead with bug report; corrected Ryu version (Go 1.17, not 1.14); removed Dragonbox/Go 1.26 claim (later found to be incorrect, see above); added Ryu code sketch to balance approaches section; added cross-architecture evidence link
2026-03-04   Shortened title; removed series references (standalone article); corrected x86 FMA/AVX2 distinction; consolidated redundant links
2026-03-04   Complete rewrite: restructured as subject-oriented technical article