From Multiprecision to Fixed-Width: Replacing Burger-Dybvig with Schubfach
Part 1 of this series implemented the Burger-Dybvig algorithm for IEEE 754-to-decimal conversion. The implementation uses big.Int from math/big for exact multiprecision arithmetic, produces the shortest round-trip decimal string, and applies ECMA-262 even-digit tie-breaking. It is correct. It allocates on every call. This article describes what happens when you replace it with the Schubfach algorithm (Giulietti, 2022), implemented from scratch in Go with the same ECMA-262 conformance contract, validated against the same 286,362 oracle test vectors.
The result: 31.3% geometric mean throughput improvement across all API workloads (n=6, p=0.002 for every number-dense comparison), with zero conformance regressions. The performance gain concentrates where the algorithm change predicts it should (number-dense payloads), and disappears where it should (string-dominant payloads).
What Burger-Dybvig Actually Costs
The Burger-Dybvig implementation from Part 1 does four things per number. It initializes four big.Int values (R, S, M+, M-). It scales them by a power of 10 from a 700-entry precomputed cache. It extracts digits one at a time via big.Int division. It terminates when boundary conditions indicate the shortest representation. Every operation in that pipeline touches the heap: big.Int arithmetic allocates on every multiply, and on every division whose result exceeds one machine word.
The CPU profile for the Burger-Dybvig path shows where math/big time goes:
```
2.70s  1.32%  math/big.mulAddVWW
2.33s  1.14%  math/big.nat.mulAddWW
2.28s  1.12%  math/big.(*Int).mul
1.80s  0.88%  math/big.nat.norm
1.71s  0.84%  math/big.nat.mul
```
That is 10.82s of CPU time in math/big during a benchmark run covering 19 workload categories (total sample time 204.18s). The memory profile adds detail. jcsfloat.FormatDouble accounts for 2.91% of cumulative heap allocation. That breaks down through the call chain: generateDigits (2.01% cumulative), which calls scaleByPower10 → pow10Big (1.29% flat). The pow10Big function returns a defensive new(big.Int).Set(cached) copy on every call to prevent callers from corrupting the cache. That defensive copy is correct. It also means every FormatDouble invocation allocates at least one big.Int just to look up a power of 10, before any digit extraction begins.
The per-call allocation profile from the standalone jcsfloat benchmarks in Part 1:
| Workload | ns/op | B/op | allocs/op |
|---|---|---|---|
| integer (42) | 803 | 64 | 5 |
| fraction (3.14159…) | 2,898 | 96 | 6 |
| subnormal (5e-324) | 1,406 | 224 | 4 |
| max safe integer | 2,938 | 88 | 5 |
Four to six allocations per number. For a JSON document with 2,048 numbers in an array, that is 8,000 to 12,000 allocations spent solely on converting numbers to strings.
Schubfach: The Same Problem, Different Arithmetic
Schubfach solves the same problem as Burger-Dybvig: given an IEEE 754 binary64 value, find the shortest decimal string that round-trips. The difference is how it computes the answer.
Where Burger-Dybvig represents the value and its boundaries as ratios of arbitrary-precision integers and iterates to extract digits, Schubfach works entirely in fixed-width integer arithmetic. The core operation is a 64x64-to-128-bit multiplication of the float’s significand against a precomputed power-of-10 constant, followed by a shift and comparison against interval bounds that are themselves fixed-width. No heap allocation in the digit generation path. No math/big. No defensive copies of cached values.
Go does not have a native uint128 type. The 128-bit multiplication uses math/bits.Mul64, which returns a (hi, lo) pair:
```go
hi, lo := bits.Mul64(significand, pow10TableHi[index])
```
The upper 64 bits contain the information needed to determine the digit string. The lower 64 bits determine rounding. Carry propagation through the 128-bit product must be exact. A single bit error in the table or the multiplication produces wrong output for specific IEEE 754 values that may not surface during random testing. The oracle vectors catch these errors; random fuzzing alone does not reliably reach the affected bit patterns.
The precomputed table replaces Burger-Dybvig’s 700-entry *big.Int cache with a fixed-size array of [2]uint64 pairs (696 entries, covering exponents -348 to 347). Where pow10Big returns a heap-allocated defensive copy on every call, the Schubfach table is read-only after initialization and accessed by index with no allocation.
The ECMA-262 formatting stage (formatECMA and its four branch helpers from Part 1) is shared between implementations. That code takes (negative bool, digits string, n int) and produces the final string. It is algorithm-independent and unchanged.
The Tie-Breaking Problem
Part 1 described ECMA-262 even-digit tie-breaking as “the most subtle part of the algorithm.” That description applies to any shortest-round-trip algorithm, not just Burger-Dybvig.
In Burger-Dybvig, tie-breaking is explicit. The midpointDigit function compares 2R against S using exact big.Int arithmetic. When 2R == S, the digit d is checked: if even, keep it; if odd, round up. The invariants are visible in the code because the arithmetic is exact and the comparison is direct.
```go
// Burger-Dybvig: exact midpoint detection
func midpointDigit(d int, state *digitState) byte {
	state.scratch3.Lsh(&state.r, 1)
	cmp := state.scratch3.Cmp(&state.s)
	if cmp < 0 {
		return byte('0' + d)
	}
	if cmp > 0 {
		return byte('0' + d + 1)
	}
	if d%2 == 0 {
		return byte('0' + d)
	}
	return byte('0' + d + 1)
}
```
In Schubfach, the midpoint condition is encoded in the interval bounds. The algorithm computes whether the value falls strictly inside, at the boundary of, or outside the interval of valid shortest representations. Modifying the boundary comparison to prefer even digits requires understanding where in the interval arithmetic the rounding decision is made, and adjusting the comparison predicates accordingly.
The adjustment is small in code. A few comparison operators change from strict to inclusive (or vice versa) depending on the parity of the significand’s least significant bit. But the reasoning behind which operators to change requires following the proof in Giulietti’s paper to understand which invariant each comparison maintains.
The even-digit preference also influences the boundary comparisons themselves. This is the same pattern described in Part 1 for Burger-Dybvig’s cmpRoundDown and cmpHigh functions, where isEven switches between inclusive and strict comparisons:
```go
// Burger-Dybvig boundary comparison (from Part 1)
func cmpRoundDown(lhs, rhs *big.Int, isEven bool) bool {
	if isEven {
		return lhs.Cmp(rhs) <= 0 // Inclusive at boundary
	}
	return lhs.Cmp(rhs) < 0 // Strict
}
```
Schubfach has an analogous structure, but the comparison operates on 64-bit values derived from the 128-bit product rather than on big.Int ratios. The oracle vectors serve as the correctness proof. Any tie-breaking error produces a different digit string for at least one value in the 286,362-vector dataset. The oracle test fails with the exact bit pattern that triggered the divergence.
Benchmark Methodology
Comparing two implementations of the same specification requires more than go test -bench. Both binaries must produce byte-identical output for every input. The performance comparison must separate the number formatting cost from everything else in the pipeline: parsing, string escaping, UTF-16 key sorting, memory allocation for the value tree.
The benchmark lab runs both implementations against identical workloads at three levels:
- API benchmarks via Go's testing.B with -count=6, measuring Canonicalize() and Verify() calls directly. benchstat requires a minimum of 6 samples per comparison to compute confidence intervals at the 0.95 level.
- CLI end-to-end timings with process startup, file I/O, and argument parsing included (9 runs per workload per implementation, 2 warmup runs discarded)
- Conformance verification with SHA-256 digest comparison on every output, differential fuzzing (1,000 random cases), and determinism checks (identical output across repeated runs)
The 19 workload categories span the space intentionally. Some are dominated by number formatting (number-heavy, numeric-boundary). Some have zero numbers (long-string, control-escapes, surrogate-pair). Some mix both (small, medium, mixed-prod, nested-mixed). The zero-number workloads serve as a control group: any performance change on those workloads would indicate the algorithm swap affected code paths it should not have touched.
The environment: Linux 6.8, Go 1.25.6, 12th Gen Intel i7-12700K (20 logical cores), CPU governor performance. Both implementations run under the same conditions on the same machine in the same benchmark invocation.
Conformance Evidence
Before any performance discussion, the gates:
| Suite | Cases | Passed | Failed |
|---|---|---|---|
| cyberphone (official) | 36 | 36 | 0 |
| lab workloads | 116 | 116 | 0 |
| RFC 8785 (official) | 2 | 2 | 0 |
| Total | 154 | 154 | 0 |
Additional evidence:
- SHA-256 digests of canonical output are byte-identical between implementations for every valid workload. Not equivalent. Identical.
- 1,000 differential fuzz cases with randomized seeds. Zero divergences.
- Zero oracle mismatches. Zero determinism failures. Zero invalid-input parity issues (both implementations reject the same inputs with the same failure classes).
The Schubfach implementation produces the same bytes as Burger-Dybvig for every input tested. The remainder of this article is about cost, not correctness.
API Benchmarks: Number-Dense Workloads
These are workloads where FormatDouble is invoked frequently enough to dominate the profile. All results are means of 6 samples:
| Workload | Schubfach (ns/op) | Burger-Dybvig (ns/op) | Ratio | p | allocs/op (S) | allocs/op (BD) |
|---|---|---|---|---|---|---|
| number-heavy | 13,530 | 20,855 | 1.54x | 0.002 | 60 | 80 |
| numeric-boundary | 12,396 | 18,294 | 1.48x | 0.002 | 38 | 52 |
| nested-mixed | 2,182 | 3,066 | 1.41x | 0.002 | 45 | 49 |
| small | 1,243 | 1,667 | 1.34x | 0.002 | 22 | 24 |
| medium | 340,270 | 581,178 | 1.71x | 0.002 | 6,380 | 7,394 |
| mixed-prod | 136,311 | 178,287 | 1.31x | 0.002 | 2,340 | 2,595 |
| array-256 | 303,348 | 474,369 | 1.56x | 0.002 | 6,131 | 7,151 |
| array-2048 | 2,549,000 | 4,048,000 | 1.59x | 0.002 | 49,135 | 57,341 |
Every comparison reaches statistical significance at p=0.002. The speedups range from 1.31x (mixed-prod) to 1.71x (medium).
The allocation column is the structural signature. number-heavy drops from 80 to 60 allocations (25% reduction). numeric-boundary drops from 52 to 38 (27% reduction). array-2048 drops from 57,341 to 49,135 (14% reduction). Each eliminated allocation corresponds to a math/big operation that Schubfach replaces with fixed-width arithmetic.
API Benchmarks: String-Dominant Workloads (Control Group)
These workloads contain few or no numbers. The number formatter is not the bottleneck:
| Workload | Schubfach (ns/op) | Burger-Dybvig (ns/op) | Ratio | p | allocs/op (S) | allocs/op (BD) |
|---|---|---|---|---|---|---|
| control-escapes | 1,335 | 1,251 | 0.94x | 0.002 | 21 | 21 |
| long-string | 74,802 | 73,588 | 0.98x | 0.026 | 14 | 14 |
| rfc-key-sorting | 3,018 | 2,968 | 0.98x | 0.009 | 47 | 47 |
| surrogate-pair | 647 | 640 | 0.99x | 0.002 | 14 | 14 |
| unicode | 1,549 | 1,559 | 1.01x | 0.108 | 19 | 19 |
| deep | 36,651 | 36,689 | 1.00x | 0.784 | 843 | 843 |
| deep-64 | 16,237 | 16,240 | 1.00x | 0.589 | 387 | 387 |
Allocation counts are identical across all seven workloads. The number formatter was not invoked (or invoked for a trivial count of values). Burger-Dybvig wins on control-escapes (6%), long-string (2%), rfc-key-sorting (2%), and surrogate-pair (1%). These are small effects. deep, deep-64, and unicode show no statistically significant difference.
This is the control group. Identical allocation counts confirm that the Schubfach change is isolated to the number formatting path. The small Burger-Dybvig advantages on string workloads are consistent with binary layout, code alignment, or instruction cache effects from the different compilation units.
API Benchmarks: Large Payloads
The large workload (approximately 460KB of JSON) shows the largest absolute time savings:
| Workload | Schubfach (ns/op) | Burger-Dybvig (ns/op) | Ratio | p | allocs/op (S) | allocs/op (BD) |
|---|---|---|---|---|---|---|
| canonicalize/large | 11,928,619 | 20,545,684 | 1.72x | 0.002 | 204,776 | 237,568 |
| verify/canonical/large | 11,506,131 | 20,018,040 | 1.74x | 0.002 | 204,775 | 237,559 |
| verify/noncanonical/large | 11,908,081 | 20,607,667 | 1.73x | 0.002 | 204,776 | 237,570 |
Schubfach is 72-74% faster across all three modes. The allocation reduction (204,776 vs 237,568, a 14% drop) understates the effect: the eliminated allocations are math/big operations that carry CPU cost beyond their memory footprint. At this payload size the math/big overhead accumulates to approximately 8.6ms of additional wall-clock time per canonicalization.
Memory Profile Comparison
The memory profile is where the implementation difference is most directly visible.
Burger-Dybvig, jcsfloat allocation footprint (cumulative, showing call chain):
```
jcsfloat.FormatDouble         2.91% cum
└─ jcsfloat.generateDigits    2.01% cum
   └─ jcsfloat.pow10Big       1.29% flat
```
The cumulative values overlap: FormatDouble’s 2.91% includes generateDigits’s 2.01%, which includes pow10Big’s 1.29%. The flat allocation in pow10Big represents the defensive new(big.Int).Set() copies.
Schubfach, jcsfloat allocation footprint:
```
jcsfloat.FormatDouble             1.16% cum
├─ jcsfloat.appendIntegerFixed    0.57% flat
└─ jcsfloat.formatECMA            0.20% flat
```
The FormatDouble cumulative drops from 2.91% (Burger-Dybvig) to 1.16% (Schubfach). The remaining allocations are in appendIntegerFixed (byte slice growth for trailing zeros) and formatECMA (string building). Both are shared formatting concerns, not digit generation.
The math/big entries are gone entirely. No pow10Big. No nat.make. No nat.set. No generateDigits (which in Burger-Dybvig performs big.Int division in a loop). The 700-entry *big.Int power-of-10 cache is eliminated. The digitState pool with its nine big.Int fields (r, s, mPlus, mMinus, scratch1, scratch2, scratch3, quot, rem) is eliminated.
The CPU profile shows the same pattern from the time axis. The math/big functions that appear in the Burger-Dybvig profile do not appear anywhere in the Schubfach profile:
| Function | Burger-Dybvig CPU | Schubfach CPU |
|---|---|---|
| math/big.mulAddVWW | 2.70s (1.32%) | absent |
| math/big.nat.mulAddWW | 2.33s (1.14%) | absent |
| math/big.(*Int).mul | 2.28s (1.12%) | absent |
| math/big.nat.norm | 1.80s (0.88%) | absent |
| math/big.nat.mul | 1.71s (0.84%) | absent |
They are replaced by math/bits.Mul64, which the Go compiler inlines. It does not appear as a named function in the profile because it executes as a single MULQ instruction on x86-64.
What the Profiles Reveal About the Pipeline
The top application-level functions in both CPU profiles are structurally identical:
Schubfach:
```
jcs.serializeString             7.60s  3.73%
jcstoken.(*parser).parseString  6.30s  3.09%
jcs.validateValueTree           2.73s  1.34%
jcs.serializeObject             2.60s  1.28%
```
Burger-Dybvig:
```
jcs.serializeString             6.97s  3.41%
jcstoken.(*parser).parseString  5.25s  2.57%
jcs.validateValueTree           2.64s  1.29%
jcs.serializeObject             2.48s  1.21%
```
(Both profiles are dominated by runtime functions above these: strconv.leftShift, runtime.nextFreeFast, runtime.memclrNoHeapPointers, and aeshashbody occupy the top positions in both builds. The application-level functions listed here appear below them.)
String serialization, string parsing, value tree validation, and object serialization (which includes UTF-16 key sorting via sort.Slice) dominate the application-level profile in both builds. These are the same code in both builds. Replacing the number formatter moved it off the critical path for number-dense payloads, but the pipeline bottleneck for general workloads was never number formatting. It was always parsing and serialization.
The Schubfach profile shows higher absolute times for these functions (7.60s vs 6.97s for serializeString) because the benchmark ran more iterations in the same wall-clock budget. Schubfach is faster per-call, so Go’s benchmark framework runs more iterations, and the cumulative CPU time in the shared code paths increases accordingly.
The next meaningful performance improvement for json-canon is not in number formatting. It is in the parser (which builds a full value tree with heap-allocated strings for every key and value) and the serializer (which sorts keys via sort.Slice with a closure). Those are architectural decisions that would require API changes to address, not algorithm swaps within an existing interface.
Geometric Means
benchstat reports the following geometric means across all 51 API benchmark comparisons (n=6):
| Metric | Schubfach | Burger-Dybvig | Change |
|---|---|---|---|
| sec/op | 12.32us | 16.18us | +31.33% faster |
| B/s | 31.20 MiB/s | 23.76 MiB/s | +31.33% higher throughput |
| B/op | 15.70 KiB | 16.30 KiB | -3.81% less memory |
| allocs/op | 157.7 | 174.3 | -10.54% fewer allocations |
The B/op reduction (3.81%) is smaller than the allocs/op reduction (10.54%) because the eliminated allocations are small (big.Int internal slices, typically 8-64 bytes each). The throughput improvement exceeds the allocation reduction because the saved allocations are not just memory operations. They are also CPU operations: big.Int multiply involves digit-by-digit multiplication with carry propagation, normalization, and bounds checking. Eliminating the allocations eliminates the arithmetic they supported.
Statistical Evidence
The API benchmarks were run with -count=6. benchstat computes valid confidence intervals for all 51 comparisons.
Of the 51 comparisons, 40 reach statistical significance at p < 0.05. The 11 that do not are workloads where the number formatter is irrelevant: deep (p=0.784), deep-64 (p=0.589), unicode (p=0.108 to 0.781 across modes), surrogate-pair/noncanonical (p=0.058), long-string/canonical (p=0.818), and rfc-key-sorting/canonical (p=0.240). These are the exact workloads where no performance difference is expected.
Every number-dense workload reaches p=0.002 (the minimum achievable p-value with n=6 using a permutation test). The Burger-Dybvig wins on string-dominant workloads also reach significance (control-escapes at p=0.002, 6% slower for Schubfach), confirming these are real effects from binary layout or instruction cache differences, not measurement noise.
The CLI end-to-end results (9 runs per comparison, permutation test with 3,000 resamples) provide independent corroboration. Five of 28 CLI comparisons reach significance at p < 0.05: canonicalize/array-2048 at p=0.0007 (1.32x, d=-3.14), canonicalize/nested-mixed at p=0.0017 (1.36x, d=-2.01), canonicalize/numeric-boundary at p=0.036 (1.22x, d=-1.08), verify/array-2048 at p=0.0007 (1.30x, d=-3.97), and verify/array-256 at p=0.038 (1.22x, d=-1.09). All five favor Schubfach. The remaining 23 comparisons do not reach significance because process startup noise dominates at the 1-7ms timescales of CLI execution.
The Decision
The conformance gate is clean. The performance improvement is statistically significant across every number-dense workload. The implementation is a drop-in replacement: same FormatDouble signature, same error types, same output bytes for every tested input.
Burger-Dybvig was the right choice for the initial implementation. It was straightforward to verify against the original paper (Burger and Dybvig, PLDI 1996). It was straightforward to debug when the oracle vectors caught errors during initial development. It was straightforward to reason about when modifying the tie-breaking policy for ECMA-262 compliance. It served its purpose: it proved the approach was viable and produced correct output while the rest of the pipeline was built around it.
Schubfach replaces Burger-Dybvig in json-canon. The change ships as a minor version. The API does not change. The output does not change. The canonical bytes for every input remain identical. What changes is the cost of producing them: fewer allocations, less memory, higher throughput on number-dense workloads, and the elimination of math/big from the formatting path’s runtime dependency graph.
Revision History
| Date | Change |
|---|---|
| 2026-03-05 | Initial publication. |
The implementation lives in the jcsfloat package of json-canon, an RFC 8785 JSON canonicalization library written in Go.