Topics in Computational Number Theory
Inspired by Peter L. Montgomery
Peter L. Montgomery has made significant contributions to computational number
theory, introducing many basic tools such as Montgomery multiplication, Montgomery
simultaneous inversion, Montgomery curves, and the Montgomery ladder. This book
features state-of-the-art research in computational number theory related to
Montgomery’s work and its impact on computational efficiency and cryptography. It
covers a wide range of topics such as Montgomery multiplication for both hardware
and software implementations; Montgomery curves and twisted Edwards curves as
proposed in the latest standards for elliptic curve cryptography; and cryptographic
pairings. It also provides a comprehensive overview of integer factorization
techniques, including dedicated chapters on polynomial selection, the block Lanczos
method, and the FFT extension for algebraic-group factorization algorithms. Graduate
students and researchers in applied number theory and cryptography will benefit from
this survey of Montgomery’s work.
Joppe W. Bos is a cryptographic researcher at the Innovation Center for Cryptography
& Security at NXP Semiconductors. He also currently serves as the Secretary of the
International Association for Cryptologic Research (IACR). His research focuses on
computational number theory and high-performance arithmetic as used in public-key
cryptography.
Arjen K. Lenstra is Professor of Computer Science at École Polytechnique Fédérale de
Lausanne. His research focuses on cryptography and computational number theory,
especially in areas such as integer factorization. He was closely involved in the
development of the number field sieve method for integer factorization as well as
several other cryptologic results. He is the recipient of the RSA Conference 2008
Award for Excellence in the Field of Mathematics and a Fellow of the International
Association for Cryptologic Research (IACR).

Downloaded from https://www.cambridge.org/core. University of Texas Libraries, on 08 Jan 2020 at 20:33:58, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575


Topics in Computational Number Theory
Inspired by Peter L. Montgomery
Edited by

JOPPE W. BOS
NXP Semiconductors, Leuven, Belgium
ARJEN K. LENSTRA
EPFL, Lausanne, Switzerland


University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi - 110002, India
79 Anson Road, #06-04/06, Singapore 079906
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107109353
DOI: 10.1017/9781316271575
© Joppe W. Bos and Arjen K. Lenstra 2017
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2017
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library
Library of Congress Cataloging-in-Publication data
Names: Bos, Joppe W., editor. | Lenstra, A. K. (Arjen K.), 1956– editor.
Title: Topics in computational number theory inspired by Peter L. Montgomery / edited by
Joppe W. Bos, NXP Semiconductors, Belgium; Arjen K. Lenstra, EPFL, Lausanne,
Switzerland.
Description: Cambridge : Cambridge University Press, 2017. | Series: London
Mathematical Society lecture note series | Includes bibliographical references and index.
Identifiers: LCCN 2017023049 | ISBN 9781107109353 (pbk. : alk. paper)
Subjects: LCSH: Number theory. | Cryptography – Mathematics. | Montgomery, Peter L.,
1947–
Classification: LCC QA241 .T657 2017 | DDC 512.7 – dc23 LC record available at
https://lccn.loc.gov/2017023049
ISBN 978-1-107-10935-3 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party Internet Web sites referred to in this publication
and does not guarantee that any content on such Web sites is, or will remain,
accurate or appropriate.

Contents

List of Contributors  page xi
Preface  xiii

1 Introduction  1
  1.1 Outline  1
  1.2 Biographical Sketch  1
  1.3 Overview  5
  1.4 Simultaneous Inversion  8

2 Montgomery Arithmetic from a Software Perspective  10
  2.1 Introduction  10
  2.2 Montgomery Multiplication  12
    2.2.1 Interleaved Montgomery Multiplication  15
    2.2.2 Using Montgomery Arithmetic in Practice  16
    2.2.3 Computing the Montgomery Constants μ and R^2  18
    2.2.4 On the Final Conditional Subtraction  19
    2.2.5 Montgomery Multiplication in F_(2^k)  21
  2.3 Using Primes of a Special Form  22
    2.3.1 Faster Modular Reduction with Primes of a Special Form  23
    2.3.2 Faster Montgomery Reduction with Primes of a Special Form  24
  2.4 Concurrent Computing of Montgomery Multiplication  26
    2.4.1 Related Work on Concurrent Computing of Montgomery Multiplication  27
    2.4.2 Montgomery Multiplication Using SIMD Extensions  27
    2.4.3 A Column-Wise SIMD Approach  31
    2.4.4 Montgomery Multiplication Using the Residue Number System Representation  36

3 Hardware Aspects of Montgomery Modular Multiplication  40
  3.1 Introduction and Summary  40
  3.2 Historical Remarks  42
  3.3 Montgomery’s Novel Modular Multiplication Algorithm  42
  3.4 Standard Acceleration Techniques  43
  3.5 Shifting the Modulus N  44
    3.5.1 The Classical Algorithm  44
    3.5.2 Montgomery  45
  3.6 Interleaving Multiplication Steps with Modular Reduction  46
  3.7 Accepting Inaccuracy in Quotient Digits  48
    3.7.1 Traditional  48
    3.7.2 Bounding the Partial Product  50
    3.7.3 Montgomery  52
    3.7.4 Summary  52
  3.8 Using Redundant Representations  53
    3.8.1 Traditional  54
    3.8.2 Montgomery  54
  3.9 Changing the Size of the Hardware Multiplier  55
  3.10 Shifting an Operand  57
    3.10.1 Traditional  57
    3.10.2 Montgomery  60
  3.11 Precomputing Multiples of B and N  61
  3.12 Propagating Carries and Carry-Save Inputs  62
  3.13 Scaling the Modulus  65
  3.14 Systolic Arrays  67
    3.14.1 A Systolic Array for A×B  68
    3.14.2 Scalability  70
    3.14.3 A Linear Systolic Array  72
    3.14.4 A Systolic Array for Modular Multiplication  73
  3.15 Side-Channel Concerns and Solutions  76
  3.16 Logic Gate Technology  80
  3.17 Conclusion  81

4 Montgomery Curves and the Montgomery Ladder  82
  4.1 Introduction  82
  4.2 Fast Scalar Multiplication on the Clock  83
    4.2.1 The Lucas Ladder  85
    4.2.2 Differential Addition Chains  85
  4.3 Montgomery Curves  87
    4.3.1 Montgomery Curves as Weierstrass Curves  87
    4.3.2 The Group Law for Weierstrass Curves  88
    4.3.3 Other Views of the Group Law  89
    4.3.4 Edwards Curves and Their Group Law  90
    4.3.5 Montgomery Curves as Edwards Curves  92
    4.3.6 Elliptic-Curve Cryptography (ECC)  93
    4.3.7 Examples of Noteworthy Montgomery Curves  94
  4.4 Doubling Formulas without y  95
    4.4.1 Doubling: The Weierstrass View  95
    4.4.2 Optimized Doublings  96
    4.4.3 A Word of Warning: Projective Coordinates  97
    4.4.4 Completeness of Generic Doubling Formulas  97
    4.4.5 Doubling: The Edwards View  98
  4.5 Differential-Addition Formulas  99
    4.5.1 Differential Addition: The Weierstrass View  99
    4.5.2 Optimized Differential Addition  101
    4.5.3 Quasi-Completeness  101
    4.5.4 Differential Addition: The Edwards View  103
  4.6 The Montgomery Ladder  104
    4.6.1 The Montgomery Ladder Step  104
    4.6.2 Constant-Time Ladders  105
    4.6.3 Completeness of the Ladder  106
  4.7 A Two-Dimensional Ladder  107
    4.7.1 Introduction to the Two-Dimensional Ladder  108
    4.7.2 Recursive Definition of the Two-Dimensional Ladder  109
    4.7.3 The Odd-Odd Pair in Each Line: First Addition  110
    4.7.4 The Even-Even Pair in Each Line: Doubling  110
    4.7.5 The Other Pair in Each Line: Second Addition  111
  4.8 Larger Differences  111
    4.8.1 Examples of Large-Difference Chains  112
    4.8.2 CFRC, PRAC, etc.  114
    4.8.3 Allowing d to Vary  114

5 General Purpose Integer Factoring  116
  5.1 Introduction  116
  5.2 General Purpose Factoring  117
    5.2.1 Two-Step Approach  117
    5.2.2 Smoothness and L-notation  119
    5.2.3 Generic Analysis  120
    5.2.4 Smoothness Testing  121
    5.2.5 Finding Dependencies  123
    5.2.6 Filtering  123
    5.2.7 Overall Effort  126
  5.3 Presieving General Purpose Factoring  126
    5.3.1 Dixon’s Random Squares Method  126
    5.3.2 Continued Fraction Method  127
  5.4 Linear and Quadratic Sieve  129
    5.4.1 Linear Sieve  129
    5.4.2 Quadratic Sieving: Plain  132
    5.4.3 Quadratic Sieving: Fancy  133
    5.4.4 Multiple Polynomial Quadratic Sieve  134
  5.5 Number Field Sieve  137
    5.5.1 Earlier Methods to Compute Discrete Logarithms  139
    5.5.2 Special Number Field Sieve  145
    5.5.3 General Number Field Sieve  152
    5.5.4 Coppersmith’s Modifications  158
  5.6 Provable Methods  160

6 Polynomial Selection for the Number Field Sieve  161
  6.1 The Problem  161
  6.2 Early Methods  161
  6.3 General Remarks  164
  6.4 A Lattice Based Method  166
  6.5 Skewness  168
  6.6 Base m Method and Skewness  170
  6.7 Root Sieve  171
  6.8 Later Developments  173

7 The Block Lanczos Algorithm  175
  7.1 Linear Systems for Integer Factoring  175
  7.2 The Standard Lanczos Algorithm  176
  7.3 The Case of Characteristic Two  179
  7.4 Orthogonalizing a Sequence of Subspaces  180
  7.5 Construction of the Next Iterate  181
  7.6 Simplifying the Recurrence Equation  182
  7.7 Termination  184
  7.8 Implementation in Parallel  186
  7.9 Recent Developments  187

8 FFT Extension for Algebraic-Group Factorization Algorithms  189
  8.1 Introduction  189
  8.2 FFT Extension for the Elliptic Curve Method  191
    8.2.1 The Product Tree Algorithm  192
    8.2.2 The POLYEVAL Algorithm  192
    8.2.3 The POLYGCD Algorithm  195
    8.2.4 Choice of Points of Evaluation  197
    8.2.5 A Numerical Example  199
  8.3 FFT Extension for the p − 1 and p + 1 Methods  199
    8.3.1 Constructing F(X) by Scaling and Multiplying  201
    8.3.2 Evaluation of a Polynomial Along a Geometric Progression  202

9 Cryptographic Pairings  206
  9.1 Preliminaries  207
    9.1.1 Elliptic Curves  208
    9.1.2 Pairings  209
    9.1.3 Pairing-Friendly Elliptic Curves  213
  9.2 Finite Field Arithmetic for Pairings  213
    9.2.1 Montgomery Multiplication  214
    9.2.2 Multiplication in Extension Fields  215
    9.2.3 Finite Field Inversions  215
    9.2.4 Simultaneous Inversions  218
  9.3 Affine Coordinates for Pairing Computation  219
    9.3.1 Costs for Doubling and Addition Steps  219
    9.3.2 Working over Extension Fields  223
    9.3.3 Simultaneous Inversions in Pairing Computation  225
  9.4 The Double-Add Operation and Parabolas  227
    9.4.1 Description of the Algorithm  228
    9.4.2 Application to Scalar Multiplication  229
    9.4.3 Application to Pairings  229
  9.5 Squared Pairings  231
    9.5.1 The Squared Weil Pairing  232
    9.5.2 The Squared Tate Pairing  233

Bibliography  235
Subject Index  261

Contributors

Joppe W. Bos, NXP Semiconductors, Leuven, Belgium
Arjen K. Lenstra, EPFL, Lausanne, Switzerland
Herman te Riele, CWI, Amsterdam, Netherlands
Daniel Shumow, Microsoft Research, Redmond, USA
Peter L. Montgomery, Self
Colin D. Walter, Royal Holloway, University of London, Egham, United
Kingdom
Daniel J. Bernstein, University of Illinois at Chicago, Chicago, USA and
Technische Universiteit Eindhoven, Eindhoven, The Netherlands
Tanja Lange, Technische Universiteit Eindhoven, Eindhoven, The
Netherlands
Thorsten Kleinjung, University Leipzig, Leipzig, Germany and EPFL,
Lausanne, Switzerland
Emmanuel Thomé, Inria, Nancy, France
Richard P. Brent, Australian National University, Canberra, Australia
Alexander Kruppa, Technische Universität München, München, Germany
Paul Zimmermann, Inria/LORIA, Nancy, France
Kristin Lauter, Microsoft Research, Redmond, USA
Michael Naehrig, Microsoft Research, Redmond, USA


Preface

This book was written in honor of Peter L. Montgomery and his inspirational
contributions to computational number theory. The editors would like to extend
their sincerest thanks to all authors for their enthusiastic response to the invitation to contribute, and to Nicole Verna for the cover design.


1 Introduction
Joppe W. Bos, Arjen K. Lenstra, Herman te Riele,
and Daniel Shumow

1.1 Outline
This introductory chapter collects some personal background on Peter L. Montgomery, both provided by himself and based on stories from his colleagues and
collaborators. This is followed by a brief introduction to the other chapters, and
by a description of Peter’s simultaneous inversion method.
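As a preview of Section 1.4, simultaneous inversion replaces many modular inversions by a single inversion plus roughly three multiplications per element. A minimal illustrative sketch in Python (our own rendering and function name, not Peter’s code):

```python
def simultaneous_inverse(xs, p):
    """Invert all elements of xs modulo an odd prime p using a single
    modular inversion plus about 3*(len(xs)-1) multiplications."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):            # prefix[i+1] = x_0 * ... * x_i mod p
        prefix[i + 1] = prefix[i] * x % p
    inv = pow(prefix[n], p - 2, p)        # the one and only inversion
    result = [0] * n
    for i in range(n - 1, -1, -1):        # peel off one factor at a time
        result[i] = inv * prefix[i] % p   # (x_0..x_i)^-1 * (x_0..x_(i-1)) = x_i^-1
        inv = inv * xs[i] % p             # now inv = (x_0..x_(i-1))^-1
    return result
```

For the elliptic curve method this is what lets many curves, run at once, share the cost of the inversion needed by each affine point addition.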

1.2 Biographical Sketch
Peter was born in San Francisco, California, on September 25, 1947. He
described his activities since his 1965 graduation from San Rafael High School
(SRHS) in a letter for a 1995 high-school reunion; this letter is copied in its
entirety in Figure 1.1, and further commented on throughout this chapter.
At Berkeley, Peter’s undergraduate advisor was Derrick H. Lehmer, an excellent match given Peter’s research interests since high school. Ron Graham, one
of Peter’s coauthors on his first three papers (all on Ramsey theory and written shortly after his time in Berkeley), recalls [164] that Peter had the “breakthrough idea” for the first of these papers, with coauthor Bruce Rothschild
remembering Peter “as an extremely clever guy who came up with surprising
ideas” [295]. The papers in question ([144], [145] and [146]) were further coauthored by Paul Erdős, Joel Spencer, and Ernst Straus. After this auspicious start,
Peter’s next paper describes an algorithm that he developed at SDC to evaluate
arbitrary Fortran boolean expressions in such a way that they result in a small
number of instructions when compiled on a Control Data 7600 computer [250].
The paper is illustrative of the type of tinkering in which Peter excels.

Figure 1.1 Peter’s letter for his 1995 high-school reunion.

About twenty years after his time in Berkeley, and by the time that Peter
was already “internationally famous” (cf. Figure 1.1), David G. Cantor supervised
Peter’s PhD dissertation An FFT extension of the elliptic curve method
of factorization [254]. Milos Ercegovac, who served on the PhD committee,
related [143] that Cantor referred to Peter as “a genius UCLA never saw
before,” how Peter was brilliant in solving engineering problems related to
computer arithmetic, and that, as a result of his incisive reviews, authors on
several occasions asked if they could include this “marvelous” reviewer as a
coauthor because Peter’s comments invariably provided better solutions than
those given in the submissions.

Figure 1.2 Peter L. Montgomery: (a) at age 15; (b) in 2009 (from Peter’s collection and Wikipedia).

Another member of the committee, Murray
Schacher, recalled [299] how Peter “regularly swamped the system” in what
may have been his first venture into large-scale computing: verifying reducibility of X^(8k) + X^m + 1 ∈ F_2[X] for large k and 0 < m < 8k. For this project
Peter taught himself the Cantor–Zassenhaus method because the non-sparse
8k × 8k matrix required by the usual Berlekamp algorithm surpassed the system’s capacity [34, 87].
Peter did not accept the offer for a “research position in a New Jersey community lacking sidewalks” (cf. Figure 1.1). The offer was made after Peter’s
successful visits to the Bellcore research facility in Morristown, NJ, during
the summers of 1994 and 1995. During his stays Peter worked with the second author’s summer internship students Scott Contini and Phong Nguyen on
implementing Peter’s new methods that were useful for the number field sieve
integer factorization algorithm (cf. Section 1.3 below): block Lanczos with
Scott in 1994 and the next year the final square root method with Phong (who
for his project had to learn about integer lattices and lattice basis reduction,
and did so only very reluctantly – after a scolding by Ramarathnam “Venkie”
Venkatesan). A memorable event was a lab outing where Peter beat Phong at
bowling and afterwards bought ice cream for the whole party.

In 1998 Peter accepted a position with the Security and Cryptography group
at Microsoft Research (MSR) in Redmond, WA. He retired in 2014. His work
at MSR was largely focused on pure research (cf. [227], [99], [259], [137],
[138], and [73]). However, Peter began implementing libraries for asymmetric
encryption for use in products as well. Peter’s first contribution was to write
Diffie–Hellman key exchange and the Digital Signature Algorithm in the static
bignum library. This was included in the default Microsoft Windows CAPI 1.0
library dssenh.dll, which shipped with Windows 2000. A later version of this library,
msbignum, also included RSA as well as elliptic curve cryptography. In 2006,
with Windows Vista, msbignum became the basis of the Cryptography API: Next Generation (CNG) and in the following decade underlay all cryptography
running on Windows at Microsoft. Peter’s role may have been unique at MSR
at that time: over the course of the organization’s existence there have been very
few teams at MSR, let alone individual researchers, delivering entire functional
software libraries.
During his time at MSR, Peter continued collaborations with researchers
world-wide and was a regular long-term guest at various universities and
research institutes. The first three authors of this chapter have fond memories
of Peter’s visits (and are not the only ones who have lugged around Peter’s
considerable suitcase when regular cabs were not up to the task [164]) – as
Milos Ercegovac put it: a modest man with a sense of humor, a gentle chuckle
when people make silly technical comments [143]. Numerous papers were the
result of his visits: several record factorizations (140 digits [91], 512 bits [92],
768 bits [209, 211], and Mersenne numbers [71]); security estimates [68];
elliptic curve discrete logarithm cryptanalysis using Playstation 3 game consoles [70, 69]; improved stage two for Pollard’s p ± 1 method [261]; better
elliptic curve factoring parameter selection [26]; and results on the period of
Bell numbers [262].
While discussing his work at SDC in the early software industry Peter liked
to proudly show a copy of The System Builders: The Story of SDC [32] with
his name displayed on the front cover, about an inch underneath the title (cf.
Figure 1.3). Peter never showed a similar pride in his more celebrated algorithmic and mathematical results – possibly because his work as a software
engineer greatly motivated his mathematical research and he invented new
approaches just to make his implementations more efficient. This is illustrated
by Peter’s explanation how he got the inspiration for his paper “Modular multiplication without trial division” [251], arguably one of his most widely used
results: when implementing textbook multiprecision multiplication in assembler on a PDP series computer, he noticed there were unused registers and
wanted to find a way to exploit them. He managed to do so by interleaving
the multiplication with the modular reduction.

Figure 1.3 Front cover of [32] (© System Development Corporation, 1981).
When Peter began working at SDC, he programmed using punch cards on
time-shared mainframe computers. By the time he retired from MSR, he
had implemented and optimized arithmetic code running on the mobile
processors in smartphones. Peter worked as a programmer from the earliest
stages of software engineering, when computers needed buildings of their
own, until the era of ubiquitous computing, where people carry computing
devices in their pockets. Both his algorithmic and mathematical contributions
and his work as a programmer were instrumental to these massive advances
that spanned his career – a career of more than forty years, impressive by the
standards of the software industry.

1.3 Overview
The subsequent chapters of this book are on active research areas that were initiated by Peter or to which he has made substantial contributions. This section
contains an overview of the subjects that are treated.
As mentioned above, Peter’s 1985 method for modular multiplication without trial division [251] was invented almost serendipitously when Peter tried
to put otherwise idle registers to good use. Because it allows simple implementation both in software and in hardware, Montgomery multiplication (as
the method is now commonly referred to) has found broad application. Its software and hardware aspects are covered in Chapters 2 and 3, respectively.
Peter’s contributions as described and expanded upon in Chapters 4 through 8
were all inspired by integer factorization, his main research interest since high
school. In [252, section 10.3.1] Peter describes how his original implementation
of the elliptic curve method of integer factorization [239] uses the Weierstrass
equation with affine coordinates, and how the modular inversion required for
the point addition can be made faster if several curves are run at once (cf.
Section 1.4 on page 8). He then writes “The author later discovered an alternative parametrization that requires no inversions” and continues with the description of what is now known as Montgomery curves and the Montgomery ladder.
The former is an elliptic curve parametrization that allows fast computation of
the sum of two points if their difference is known and that does not require
modular inversion. The latter refers to the addition chain that can then be used
to compute any scalar multiple of a point. Initially these results were relevant
solely to those implementing the elliptic curve integer factorization method,
but the growing importance of elliptic curve cryptography later increased the
interest in Peter’s and other curve parametrizations and addition chains.
Chapter 4 describes the developments in this field that have taken place since
the publication of [252].
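
To make the idea concrete, here is a compact x-coordinate-only Montgomery ladder in Python, instantiated with the Curve25519 parameters. It is a sketch for illustration only, not a constant-time or production implementation (RFC 7748 specifies the real protocol):

```python
P = 2**255 - 19            # field prime for Curve25519
A24 = (486662 + 2) // 4    # (A + 2)/4 for the curve coefficient A = 486662

def ladder(k, x1):
    """x-only Montgomery ladder: the x-coordinate of k*P1, given only the
    x-coordinate x1 of P1.  The pair (x2:z2), (x3:z3) always represents
    m*P1 and (m+1)*P1, whose difference is the fixed point P1."""
    x1 %= P
    x2, z2 = 1, 0          # (X:Z) for the point at infinity
    x3, z3 = x1, 1         # (X:Z) for P1
    for i in reversed(range(k.bit_length())):
        bit = (k >> i) & 1
        if bit:            # keep the accumulator to be doubled in (x2,z2)
            x2, x3, z2, z3 = x3, x2, z3, z2
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3, z3 = (da + cb) ** 2 % P, x1 * (da - cb) ** 2 % P   # differential addition
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        x2, z2 = aa * bb % P, e * (bb + A24 * e) % P           # doubling
        if bit:
            x2, x3, z2, z3 = x3, x2, z3, z2
    return x2 * pow(z2, P - 2, P) % P                          # back to affine
```

Each ladder step performs exactly one differential addition and one doubling, never needing a y-coordinate or, thanks to the projective (X : Z) coordinates, a modular inversion until the very end.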
In parallel with his work on special-purpose integer factoring methods in
[252], Peter also heavily influenced the two most important methods of the
more generic integer factorization algorithms that are described in Chapter 5:
the quadratic sieve and the number field sieve, two methods that were initially plagued by severe performance problems [282, 234]. Although Peter was
not the first to propose a practical version of the quadratic sieve, his multiple
polynomial quadratic sieve (as published by Robert Silverman in [315] and
described in Chapter 5) represented the state of the art in integer factorization from the mid-1980s until the early 1990s. It was later enhanced by self-initialization [287], but by that time the number field sieve had taken over as the
record-breaking factoring method, a position it holds to the present day. Peter
played a prominent role in that development, contributing important ideas to all
steps of the number field sieve: polynomial selection, (early) sieving, building
a matrix, processing the matrix, and the final square root calculation. A brief
summary follows, with Chapter 5 describing in more detail how the steps fit
together.
In the first step of the number field sieve two or more polynomials must be
found that have a root in common modulo the number to be factored. Though
theoretically such polynomials are easily found, small modifications can reduce the resulting factoring effort by orders of magnitude. Finding
good polynomials has become a subject in its own right since Peter’s early
contributions (many of which were first published in Brian Murphy’s 1999 PhD
dissertation [266]). Chapter 6 is entirely devoted to the current state of the art
of polynomial selection for the number field sieve.

The number field sieve (and many other sieving-based integer factorization
methods) relies on the ability to quickly recognize which values in an arithmetic progression have many small factors. Chapter 5 briefly mentions some
of Peter’s many tricks which are now part of the standard repertoire to speed up
this so-called sieving step. His improvements to the earliest line sieving software for the number field sieve by the second author were used for numerous
integer factorizations in the 1990s [78, 235]. For most composites, however,
line sieving has since then been surpassed by lattice sieving [281].
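
The kernel of the sieving step can be sketched as follows: a toy Python line sieve, with our own simplifications (prime powers ignored, the smooth part measured with floating-point logarithms):

```python
import math

def sieve_report(a, b, count, factor_base, threshold):
    """Flag indices i in [0, count) for which a*i + b appears to have a
    large smooth part, without factoring each value individually.
    For each prime p, the i with p | a*i + b form an arithmetic
    progression modulo p, so log p can be added with stride p."""
    logs = [0.0] * count
    for p in factor_base:
        if a % p == 0:
            continue
        start = (-b * pow(a, -1, p)) % p   # solve a*i + b ≡ 0 (mod p)
        for i in range(start, count, p):
            logs[i] += math.log(p)
    # Candidates whose accumulated logarithm clears the threshold;
    # a real sieve would confirm these by trial division.
    return [i for i in range(count) if logs[i] >= threshold]
```

The point is that the inner loop touches only every p-th position, so the cost per sieved value is proportional to the sum of 1/p over the factor base, far cheaper than trial-dividing every value.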
The sieving step results in a large, very sparse matrix. In the matrix step
dependencies among the columns of this matrix must be found. Given its sparsity, it is advantageous to first transform the matrix into an equivalent matrix
that is smaller but less sparse. Originally, when regular Gaussian elimination
was still used for the matrix step, a set of ad hoc methods referred to as structured Gaussian elimination [226, 286] was used to find a transformed matrix
with the smallest possible number of rows, but without paying attention to its
density. The more sophisticated tools that became available for the matrix step
in the mid-1990s required an adaptation of the matrix transformation method that
would minimize the product of the number of rows of the matrix and its number
of non-zero entries. These days this is referred to as filtering. The details of the
various filtering steps proposed by Peter (and based on the earlier work in [226,
286]) can be found in [89, 90] and are sketched in Section 5.2.6 on page 123.
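
On a toy scale, finding a column dependency over GF(2) looks as follows (illustrative Python with columns stored as integer bitmasks; the matrices produced by sieving are far too large for this dense elimination, which is exactly why block Lanczos and block Wiedemann matter):

```python
def gf2_column_dependency(columns):
    """Return indices of a nonempty subset of columns whose XOR is zero,
    i.e., a dependency over GF(2), or None if the columns are linearly
    independent.  Each column is an integer bitmask of its rows."""
    basis = {}                            # pivot row -> (reduced column, subset)
    for j, col in enumerate(columns):
        subset = 1 << j                   # tracks which columns were combined
        while col:
            pivot = col.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (col, subset)
                break
            bcol, bsubset = basis[pivot]  # cancel the leading row
            col ^= bcol
            subset ^= bsubset
        else:                             # column reduced to zero: dependency
            return [i for i in range(len(columns)) if (subset >> i) & 1]
    return None
```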
The above “more sophisticated tools” are the block Wiedemann and block
Lanczos methods, based on the classical Wiedemann and Lanczos methods.
They were almost simultaneously and independently developed by Don Coppersmith and Peter, respectively, and have comparable performance [106, 257].
Though both methods allow efficient distributed implementation, for block
Lanczos all processors must be able to communicate quickly (i.e., a single
cluster with a fast interconnection network), whereas block Wiedemann can be
made to run efficiently on a modest number of disjoint clusters. Almost from its inception in the mid-1990s, block Lanczos was the method of choice for all academic factoring projects. Around 2005 it became more convenient
for these applications to use a number of separate clusters for the matrix step,
which implied a switch to block Wiedemann [209, 211]. Chapter 7 describes
the block Lanczos method in detail.
For the number field sieve each column dependency produced by the matrix
step requires a square root calculation in the number fields involved before a
factorization of the targeted composite may be derived. In [255] Peter presented
an iterative approach that cleverly exploits the special form of the squared algebraic numbers, thereby replacing Jean-Marc Couveignes' much slower and

Downloaded from https://www.cambridge.org/core. Teachers College Library - Columbia University, on 02 Nov 2017 at 06:38:23, subject to
the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002


more restrictive method [113]. Peter's method is used, essentially unmodified, to the present day; it is sketched in Section 5.5.3 on page 157.
Returning to special-purpose integer factorization, in the last paragraph of
his p − 1 paper [278] John Pollard mentions “we would theoretically use the
fast Fourier transform on Step 2,” concluding his paper with “I cannot say at
what point this becomes a practical proposition.” In [252, section 4] Peter refers
to Pollard’s remark, and further pursues the idea for Pollard’s p − 1 method,
first with Silverman in [263] and later with Alexander Kruppa in [261]. For
the elliptic curve factoring method it is the subject of Peter’s PhD dissertation [254]. Peter’s work on this subject is presented in Chapter 8.
In Chapter 9 Peter’s contributions to cryptographic pairings are outlined.
Given Peter’s simultaneous inversion method (described below in Section 1.4)
and his fast modular inversion it became attractive, contrary to conventional wisdom, to replace projective coordinates by affine ones for applications involving multiple or parallelized pairings, products of pairings, or pairings at high security levels [137, 138, 227]. Moreover, Peter coauthored a
paper introducing algorithms to compute the squared pairing, which has the
advantage, compared to Victor Miller’s algorithm, that its computation cannot
fail [248].

1.4 Simultaneous Inversion
For completeness, this section describes Peter Montgomery’s simultaneous
inversion method, a popular method that did not lead to any follow-up work
and which is therefore not treated in the later chapters.
At a computational number theory conference held in August 1985 at Humboldt State University in Arcata, CA, Peter gave a presentation about his experience using the elliptic curve integer factorization method ([239], informally
announced in February 1985). He proposed to maximize the overall throughput
of the method as opposed to minimizing the latency per curve, thus exploiting
the method’s inherent parallelism. The idea is simple: at a modest overhead the
cost of the modular inversion required for each elliptic curve group operation
in the Weierstrass model can be shared among any number of simultaneously
processed curves, assuming all curves target the same number to be factored. It
has become known as simultaneous inversion, and was later briefly mentioned
in [252, section 10.3.1]:
When working over several curves, the program used a scheme similar to (1.2) to
do all the inversions at once, since (1/x) = y(1/xy) and (1/y) = x(1/xy). This
reduces the asymptotic cost of an inversion to that of 3 multiplications.



The “scheme similar to (1.2)” suggests that Peter took his inspiration from
John Pollard’s observation that the cost of a number of greatest common divisor calculations with the same to-be-factored modulus can be reduced to that
same number of multiplications modulo the same modulus, plus a single gcd
calculation. In the inversion context, there may have been a historical precedent for Peter’s trick, with Richard Schroeppel recalling he may have seen a
similar method to avoid floating divides in pre-1970s Fortran programs. But
even if it is not entirely original, Peter’s insight cleverly adapts an old idea to a
regime where numerics are exact [303]. These days, simultaneous inversion is,
for instance, used in record computations of elliptic curve discrete logarithms
(e.g., [69] and the record attempt [20]; it was not used in [344]).
Let z_1, z_2, ..., z_k ∈ Z/nZ for some modulus n be the elements that must be inverted; in the rare case that at least one of the inverses does not exist, simultaneous inversion fails, and an appropriate alternative approach is used. Working in Z/nZ and with w_0 = 1, first for i = 1, 2, ..., k in succession calculate and store w_i = w_{i−1} z_i. Next, calculate w = w_k^{-1}. Finally, for i = k, k − 1, ..., 1 in succession first compute z_i^{-1} as w·w_{i−1} and next replace w by w·z_i (so that w = w_{i−1}^{-1}). The overall cost is 3(k − 1) multiplications, a single inversion, and
storage for k temporary values, all in Z/nZ. Given that, for relevant modulus sizes and software implementations of arithmetic in Z/nZ, inversion is at least five times costlier than multiplication, the resulting speed-up can be significant. One should be aware, though, that in hardware the difference can be made smaller [200].
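In code, simultaneous inversion is only a few lines. The sketch below (the function name is ours) assumes every z_i is invertible modulo n; a real implementation would detect a failing gcd and fall back to individual inversions:

```python
def simultaneous_inversion(zs, n):
    # Prefix products w_i = z_1 * ... * z_i (mod n), with w_0 = 1.
    k = len(zs)
    w = [1] * (k + 1)
    for i in range(1, k + 1):
        w[i] = (w[i - 1] * zs[i - 1]) % n
    inv = pow(w[k], -1, n)            # the single inversion: w = w_k^{-1}
    result = [0] * k
    for i in range(k, 0, -1):         # peel off one factor per step
        result[i - 1] = (inv * w[i - 1]) % n   # z_i^{-1} = w * w_{i-1}
        inv = (inv * zs[i - 1]) % n            # now inv = w_{i-1}^{-1}
    return result
```

For n = 97 and zs = [42, 17, 5], each returned value satisfies result[i]·zs[i] ≡ 1 (mod 97), at the cost of a single pow(·, −1, n) call regardless of k.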


2
Montgomery Arithmetic from a Software Perspective

Joppe W. Bos and Peter L. Montgomery

We propose a representation of residue classes so as to speed modular
multiplication without affecting the modular addition and subtraction
algorithms.
– Peter L. Montgomery [251]

Abstract
This chapter describes Peter L. Montgomery’s modular multiplication method
and the various improvements to reduce the latency for software implementations on devices that have access to many computational units.

2.1 Introduction
Modular multiplication is a fundamental arithmetical operation, for instance
when computing in a finite field or a finite ring, and forms one of the basic
operations underlying almost all currently deployed public-key cryptographic
protocols. The efficient computation of modular multiplication is an important research area since optimizations, even ones resulting in a constant-factor speedup, have a direct impact on everyday information security. In
this chapter we review the computational aspects of Peter L. Montgomery’s
modular multiplication method [251] (known as Montgomery multiplication)
from a software perspective (while the next chapter highlights the hardware
perspective).
Throughout this chapter we use the terms digit and word interchangeably. To be more precise, we typically assume that a b-bit non-negative multi-precision integer X is represented by an array of n = ⌈b/w⌉ computer words as

    X = Σ_{i=0}^{n−1} x_i r^i
Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003


(the so-called radix-r representation), where r = 2^w for the word size w of the target computer architecture and 0 ≤ x_i < r. Here x_i is the i-th word of the integer X.
Let N be the positive modulus consisting of n digits while the input values to the modular multiplication method (A and B) are non-negative integers
smaller than N and consist of up to n digits. When computing modular multiplication C = AB mod N, the definitional approach is first to compute the product
P = AB. Next, a division is computed to obtain P = NQ + C such that both C
and Q are non-negative integers less than N. Knuth studies such algorithms for multi-precision non-negative integers [215, Alg. 4.3.1.D]. Counting word-by-word instructions, the method described by Knuth requires O(n^2) multiplications and O(n) divisions when implemented on a computer platform. However,
multi-precision non-negative integers [215, Alg. 4.3.1.D]. Counting word-byword instructions, the method described by Knuth requires O(n2 ) multiplications and O(n) divisions when implemented on a computer platform. However,
on almost all computer platforms divisions are expensive (relative to the cost
of a multiplication). Is it possible to perform modular multiplication without
using any division instructions?
If one is allowed to perform some precomputation which only depends on the
modulus N, then this question can be answered affirmatively. When computing
the division step, the idea is to use only “cheap” divisions and “cheap” modular
reductions when computing the modular multiplication in combination with a
precomputed constant (the computation of which may require “expensive” divisions). These “cheap” operations are computations which either come for free
or at a low cost on computer platforms. Virtually all modern computer architectures internally store and compute on data in binary format using some fixed
word size r = 2^w as above. In practice, this means that all arithmetic operations are implicitly computed modulo 2^w (i.e., for free) and divisions or multiplications by (powers of) 2 can be computed by simply shifting the content of the
register that holds this value.
Barrett introduced a modular multiplication approach (known as Barrett
multiplication [31]) using this idea. This approach can be seen as a Newton
method which uses a precomputed scaled variant of the modulus’ reciprocal in order to use only such “cheap” divisions when computing (estimating
and adjusting) the division step. After precomputing a single (multi-precision)
value, an implementation of Barrett multiplication does not use any division
instructions and requires O(n^2) multiplication instructions.
Another and earlier approach based on precomputation is the main topic
of this chapter: Montgomery multiplication. This method is the preferred
choice in cryptographic applications when the modulus has no “special” form
(besides being an odd positive integer) that would allow more efficient modular reduction techniques. See Section 2.3 on page 22 for applications of
Montgomery multiplication in the “special” setting. In practice, Montgomery



multiplication is the most efficient method when a generic modulus is used
(see e.g., the comparison performed by Bosselaers, Govaerts, and Vandewalle [75]) and has a very regular structure which speeds up the implementation. Moreover, the structure of the algorithm (especially if its single branch,
the notorious conditional “subtraction step,” can be avoided, cf. page 20 in
Section 2.2.4) has advantages when guarding against certain types of cryptographic attacks (for more information on differential power analysis attacks
see the seminal paper by Kocher, Jaffe, and Jun [221]). In the next chapter,
Montgomery’s method is compared with a version of Barrett multiplication
in order to be more precise about the computational advantages of the former
technique.
As observed by Shand and Vuillemin in [310], Montgomery multiplication
can be seen as a generalization of Hensel’s odd division [184] for 2-adic numbers. In this chapter we explain the motivation behind Montgomery arithmetic.
More specifically, we show how a change of the residue class representatives
used improves the performance of modular multiplication. Next, we summarize some of the proposed modifications to Montgomery multiplication, which
can further speed up the algorithm in certain settings. Finally, we show how
to implement Montgomery arithmetic in software. We especially study how to
compute a single Montgomery multiplication concurrently using either vector
instructions or when many computational units are available that can compute
in parallel.

2.2 Montgomery Multiplication
Let N be an odd b-bit integer and P a 2b-bit integer such that 0 ≤ P < N^2. The idea behind Montgomery multiplication is to change the representatives of the residue classes and change the modular multiplication accordingly. More precisely, we are not going to compute P mod N but 2^{-b} P mod N instead. This explains the requirement that N needs to be odd (otherwise 2^{-b} mod N does
not exist). It turns out that this approach is more efficient (by a constant factor)
on computer platforms.
Let us start with a basic example to illustrate the strategy used. A first idea
is to reduce the value P one bit at a time and repeat this for b bits such that the
result has been reduced from 2b to b bits (as required). This can be achieved
without computing any expensive modular divisions by noting that

    2^{-1} P mod N = P/2 if P is even, and (P + N)/2 if P is odd.



When P is even, the division by two can be computed with a basic operation
on computer architectures: shift the number one position to the right. When P
is odd one cannot simply compute this division by shifting. A computationally
efficient approach to compute this division by two is to make this number P even
by adding the odd modulus N, since obviously modulo N this is the same. This
allows one to compute 2−1 P mod N at the cost of (at most) a single addition
and a single shift.
Let us compute D < 2N and Q < 2^b such that P = 2^b D − NQ, since then D ≡ 2^{-b} P mod N. Initially set D equal to P and Q equal to zero. We denote by q_i the i-th digit when Q is written in binary (radix-2), i.e., Q = Σ_{i=0}^{b−1} q_i 2^i. Next perform the following two steps b times, starting at i = 0 until the last time when i = b − 1:

    (Step 1) q_i = D mod 2,    (Step 2) D = (D + q_i N)/2.

This procedure gradually builds the desired Q, and at the start of every iteration

    P = 2^i D − NQ
remains invariant. The process is illustrated in the example below.
Example 2.1 (Radix-2 Montgomery reduction)
For N = 7 (3 bits) and P = 20 < 7^2 we compute D ≡ 2^{-3} P mod N. At the start of the algorithm, set D = P = 20 and Q = 0.

i = 0: 20 = 2^0 · 20 − 7 · 0 ⇒ 2^{-0} · 20 ≡ 20 mod 7
    (Step 1) q_0 = 20 mod 2 = 0,    (Step 2) D = (20 + 0 · 7)/2 = 10
i = 1: 20 = 2^1 · 10 − 7 · 0 ⇒ 2^{-1} · 20 ≡ 10 mod 7
    (Step 1) q_1 = 10 mod 2 = 0,    (Step 2) D = (10 + 0 · 7)/2 = 5
i = 2: 20 = 2^2 · 5 − 7 · 0 ⇒ 2^{-2} · 20 ≡ 5 mod 7
    (Step 1) q_2 = 5 mod 2 = 1,    (Step 2) D = (5 + 1 · 7)/2 = 6

Since Q = q_0 2^0 + q_1 2^1 + q_2 2^2 = 4 and P = 20 = 2^b D − NQ = 2^3 · 6 − 7 · 4, we have computed 2^{-3} · 20 ≡ 6 mod 7.
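The bit-by-bit procedure of the example can be sketched in code (a toy illustration of radix-2 Montgomery reduction, with the final conditional subtraction included; the function name is ours):

```python
def montgomery_reduce_radix2(P, N, b):
    # Compute 2^{-b} * P mod N for odd b-bit N and 0 <= P < N^2.
    D, Q = P, 0
    for i in range(b):
        qi = D & 1                # Step 1: q_i = D mod 2
        D = (D + qi * N) // 2     # Step 2: make D even, then halve
        Q += qi << i
    # Here P = 2^b * D - N * Q with D < 2N.
    return D if D < N else D - N
```

For instance, montgomery_reduce_radix2(20, 7, 3) returns 6, matching the example.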
The approach behind Montgomery multiplication [251] generalizes this idea.
Instead of dividing by two at every iteration the idea is to divide by a Montgomery radix R which needs to be coprime to, and should be larger than, N. By
precomputing the value

    μ = −N^{-1} mod R,



Algorithm 2.1 The Montgomery reduction algorithm. Compute PR^{-1} modulo the odd modulus N given the Montgomery radix R > N and using the precomputed Montgomery constant μ = −N^{-1} mod R.
Input: N, P, such that 0 ≤ P < N^2.
Output: C ≡ PR^{-1} mod N such that 0 ≤ C < N.
1: q ← μ(P mod R) mod R
2: C ← (P + Nq)/R
3: if C ≥ N then
4:     C ← C − N
5: end if
6: return C
adding a specific multiple of the modulus N to the current value P ensures that

    P + N(μP mod R) ≡ P − N(N^{-1}P mod R) ≡ P − P ≡ 0 (mod R).    (2.1)

Hence, P + N(μP mod R) is divisible by the Montgomery radix R while P does not change modulo N. Let P be the product of two non-negative integers that are both less than N. After applying Equation (2.1) and dividing by R, the value P (bounded by N^2) has been reduced to at most 2N since

    0 ≤ (P + N(μP mod R))/R < (N^2 + NR)/R < 2N    (2.2)

(since R was chosen larger than N). This approach is summarized in
Algorithm 2.1: given an integer P bounded by N^2, it computes PR^{-1} mod N,
bounded by N, without using any “expensive” division instructions when
assuming the reductions modulo R and divisions by R can be computed (almost)
for free. On most computer platforms, where one chooses R as a power of two,
this assumption is indeed true.
Equation (2.2) guarantees that the output is bounded by 2N. Hence, a conditional subtraction needs to be computed at the end of Algorithm 2.1 to ensure
the output is less than N. The process is illustrated in the example below, where
P is the product of integers A, B with 0 ≤ A, B < N.
Example 2.2 (Montgomery multiplication)
Exact divisions by 10^2 = 100 are visually convenient when using a decimal
system: just shift the number two places to the right (or “erase” the two least
significant digits). Assume the following modular reduction approach: use the
Montgomery radix R = 100 when computing modulo N = 97. This example



computes the Montgomery product of A = 42 with B = 17. First, precompute
the Montgomery constant
μ = −N^{-1} mod R = −97^{-1} mod 100 = 67.
After computing the product P = AB = 42 · 17 = 714, compute the first two
steps of Algorithm 2.1 omitting the division by R:
P + N(μP mod R) = 714 + 97(67 · 714 mod 100)
= 714 + 97(67 · 14 mod 100)
= 714 + 97(938 mod 100)
= 714 + 97 · 38
= 4400.
Indeed, 4400 is divisible by R = 100 and we have computed
(AB)R^{-1} ≡ 42 · 17 · 100^{-1} ≡ 44

(mod 97)

without using any “difficult” modular divisions.
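Algorithm 2.1 translates directly into code. The sketch below (an illustrative Python model, not a production routine) reproduces the numbers of Example 2.2:

```python
def montgomery_reduce(P, N, R, mu):
    # Algorithm 2.1: compute P * R^{-1} mod N for 0 <= P < N^2,
    # with mu = -N^{-1} mod R precomputed once per modulus.
    q = (mu * (P % R)) % R         # line 1
    C = (P + N * q) // R           # line 2: the division is exact
    return C - N if C >= N else C  # lines 3-5: conditional subtraction

N, R = 97, 100                     # as in Example 2.2
mu = (-pow(N, -1, R)) % R          # = 67
C = montgomery_reduce(42 * 17, N, R, mu)   # = 44 = 42*17*100^{-1} mod 97
```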

2.2.1 Interleaved Montgomery Multiplication
When working with multi-precision integers, integers consisting of n digits of w
bits each, it is common to write the Montgomery radix R as
R = r^n = 2^{wn},
where w is the word-size of the architecture where Montgomery multiplication is implemented. The Montgomery multiplication method (as outlined in
Algorithm 2.1) assumes the multiplication is computed before performing the
Montgomery reduction. This has the advantage that one can use asymptotically
fast multiplication methods (like e.g., Karatsuba [202], Toom–Cook [327, 102],
Schönhage–Strassen [301], or Fürer [152] multiplication). However, this has
the disadvantage that the intermediate results can be as large as r^{2n+1}. Or, stated
differently, when using a machine word size of w bits the intermediate results
are represented using at most 2n + 1 computer words.
The multi-precision setting was already handled in Montgomery’s original
paper [251, section 2] and the reduction and multiplication were meant to be
interleaved by design (see also the remark in the second to last paragraph of



Algorithm 2.2 The radix-r interleaved Montgomery multiplication algorithm. Compute (AB)R^{-1} modulo the odd modulus N given the Montgomery radix R = r^n and using the precomputed Montgomery constant μ = −N^{-1} mod r. The modulus N is such that r^{n−1} ≤ N < r^n and r and N are coprime.
Input: A = Σ_{i=0}^{n−1} a_i r^i, B, N such that 0 ≤ a_i < r, 0 ≤ A, B < R.
Output: C ≡ (AB)R^{-1} mod N such that 0 ≤ C < N.
1: C ← 0
2: for i = 0 to n − 1 do
3:     C ← C + a_i B
4:     q ← μC mod r
5:     C ← (C + Nq)/r
6: end for
7: if C ≥ N then
8:     C ← C − N
9: end if
10: return C
Section 1.2 on page 5). When representing the integers in radix-r representation

    A = Σ_{i=0}^{n−1} a_i r^i,    such that 0 ≤ a_i < r,

then the radix-r interleaved Montgomery multiplication (see also the work by Dussé and Kaliski Jr. in [134]) ensures the intermediate results never exceed n + 2 computer words. This approach is presented in Algorithm 2.2. Note
that this interleaves the naive schoolbook multiplication algorithm with the
Montgomery reduction and therefore does not make use of any asymptotically
faster multiplication algorithm. The idea is that every iteration divides by the
value r (instead of dividing once by R = rn in the “non-interleaved” Montgomery multiplication algorithm). Hence, the value for μ is adjusted accordingly. In [93], Koç, Acar, and Kaliski Jr. compare different approaches to implementing multi-precision Montgomery multiplication. According to this analysis, the interleaved radix-r approach, referred to as coarsely integrated operand
scanning in [93], performs best in practice.
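Algorithm 2.2 can likewise be sketched word by word. The model below is illustrative (it extracts the digits a_i arithmetically and uses a small radix rather than a machine word; the function name is ours):

```python
def interleaved_montgomery(A, B, N, r, n):
    # Radix-r interleaved Montgomery multiplication (Algorithm 2.2):
    # returns A * B * r^{-n} mod N for odd N with gcd(r, N) = 1.
    mu = (-pow(N, -1, r)) % r      # mu = -N^{-1} mod r
    C = 0
    for i in range(n):
        ai = (A // r**i) % r       # i-th radix-r digit of A
        C = C + ai * B             # line 3
        q = (mu * C) % r           # line 4
        C = (C + N * q) // r       # line 5: exact division by r
    return C - N if C >= N else C
```

With, say, r = 2^8, n = 4 and N = 2^31 − 1, the result agrees with (A·B·r^{-n}) mod N computed directly, while the intermediate C never grows much beyond n words.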

2.2.2 Using Montgomery Arithmetic in Practice
As we have seen earlier in this section and in Algorithm 2.1 on page 14, Montgomery multiplication computes C ≡ PR^{-1} mod N. It follows that, in order to
use Montgomery multiplication in practice, one should transform the input



operands A and B to Ã = AR mod N and B̃ = BR mod N: this is called the
Montgomery representation. The transformed inputs (converted to the Montgomery representation) are used in the Montgomery multiplication algorithm.
At the end of a series of modular multiplications the result, in Montgomery
representation, is transformed back. This works correctly since Montgomery
multiplication M(Ã, B̃, N) computes (Ã B̃)R^{-1} mod N and it is indeed the case
that the Montgomery representation C̃ of C = AB mod N is computed from the
Montgomery representations of A and B since
    C̃ ≡ M(Ã, B̃, N) ≡ (Ã B̃)R^{-1} ≡ (AR)(BR)R^{-1} ≡ (AB)R ≡ CR (mod N).

Converting an integer A to its Montgomery representation Ã = AR mod N can be performed using Montgomery multiplication with the help of the precomputed constant R^2 mod N since

    M(A, R^2, N) ≡ (AR^2)R^{-1} ≡ AR ≡ Ã (mod N).

Converting (the result) back from the Montgomery representation to the regular
representation is the same as computing a Montgomery multiplication with the
integer value one since
    M(Ã, 1, N) ≡ (Ã · 1)R^{-1} ≡ (AR)R^{-1} ≡ A (mod N).

As mentioned earlier, due to the overhead of changing representations, Montgomery arithmetic is best when used to replace a sequence of modular multiplications, since this overhead is amortized. A typical use-case scenario is
when computing a modular exponentiation as required in the RSA cryptosystem [294].
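Putting the conversions together, a complete multiplication in Montgomery form looks as follows (an illustrative sketch with small parameters; the helper name is ours):

```python
def mont_mul(a, b, N, R, mu):
    # M(a, b, N) = a * b * R^{-1} mod N (Algorithm 2.1 applied to P = a*b).
    P = a * b
    q = (mu * (P % R)) % R
    C = (P + N * q) // R
    return C - N if C >= N else C

N, R = 101, 1 << 7                # R = 128 > N, gcd(R, N) = 1
mu = (-pow(N, -1, R)) % R
R2 = (R * R) % N                  # precomputed constant R^2 mod N

A, B = 42, 17
At = mont_mul(A, R2, N, R, mu)    # to Montgomery form: A*R mod N
Bt = mont_mul(B, R2, N, R, mu)    # to Montgomery form: B*R mod N
Ct = mont_mul(At, Bt, N, R, mu)   # product, still in Montgomery form
C = mont_mul(Ct, 1, N, R, mu)     # back to regular form: A*B mod N
```

The two conversions are the fixed overhead that gets amortized over a long chain of multiplications, e.g., in a modular exponentiation.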
As noted in the original paper [251] (see the quote at the start of this chapter) computing with numbers in Montgomery representation does not affect the
modular addition and subtraction algorithms. This can be seen from

    Ã ± B̃ ≡ AR ± BR ≡ (A ± B)R (mod N),

which is the Montgomery representation of A ± B.

Computing the Montgomery inverse is, however, affected. The Montgomery inverse of a value Ã in Montgomery representation is the Montgomery representation of A^{-1}, i.e., A^{-1}R mod N. This is different from computing the inverse of Ã modulo N since Ã^{-1} ≡ (AR)^{-1} ≡ A^{-1}R^{-1} (mod N) is the Montgomery representation of the value A^{-1}R^{-2}. One of the correct ways of computing the Montgomery inverse is to invert the number



in its Montgomery representation and Montgomery multiply this result by R^3 since

    M(Ã^{-1}, R^3, N) ≡ ((AR)^{-1} R^3)R^{-1} ≡ A^{-1}R (mod N),

which is the Montgomery representation of A^{-1}.

Another approach, which does not require any precomputed constant, is to compute the Montgomery reduction of a Montgomery residue Ã twice before inverting, since

    M(M(Ã, 1, N), 1, N)^{-1} ≡ M((AR)R^{-1}, 1, N)^{-1} ≡ M(A, 1, N)^{-1} ≡ (AR^{-1})^{-1} ≡ A^{-1}R (mod N),

which is again the Montgomery representation of A^{-1}.
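Both routes to the Montgomery inverse can be checked numerically (a sketch; the Montgomery multiplication routine is repeated inline to keep the snippet self-contained, and the names are ours):

```python
def mont_mul(a, b, N, R, mu):
    # M(a, b, N) = a * b * R^{-1} mod N with a conditional final subtraction.
    P = a * b
    q = (mu * (P % R)) % R
    C = (P + N * q) // R
    return C - N if C >= N else C

N, R = 97, 1 << 7
mu = (-pow(N, -1, R)) % R
A = 42
At = (A * R) % N                         # Montgomery representation of A
target = (pow(A, -1, N) * R) % N         # Montgomery representation of A^{-1}

# Route 1: invert modulo N, then Montgomery multiply by R^3 mod N.
inv1 = mont_mul(pow(At, -1, N), pow(R, 3, N), N, R, mu)

# Route 2: two Montgomery reductions by 1, then a plain inversion.
inv2 = pow(mont_mul(mont_mul(At, 1, N, R, mu), 1, N, R, mu), -1, N)
```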

2.2.3 Computing the Montgomery Constants μ and R^2
In order to use Montgomery multiplication one has to precompute the Montgomery constant μ = −N^{-1} mod r. This can be computed with, for instance,
the extended Euclidean algorithm. A particularly efficient algorithm to compute μ when r is a power of two and N is odd, the typical setting used in cryptology, is given by Dussé and Kaliski Jr. and presented in [134]. This approach
is recalled in Algorithm 2.3.
To show that this approach is correct, it suffices to show that at the start of
Algorithm 2.3 and at the end of every iteration we have N y_i ≡ 1 (mod 2^i). This
can be shown by induction as follows. At the start of the algorithm we set y to
Algorithm 2.3 Compute the Montgomery constant μ = −N^{-1} mod r for odd values N and r = 2^w as presented by Dussé and Kaliski Jr. in [134].
Input: Odd integer N and r = 2^w for w ≥ 1.
Output: μ = −N^{-1} mod r
y ← 1
for i = 2 to w do
    if (Ny mod 2^i) ≠ 1 then
        y ← y + 2^{i−1}
    end if
end for
return μ ← r − y



one, denote this start setting with y_1, and the condition holds since N is odd by assumption. Denote with y_i, for 2 ≤ i ≤ w, the value of y in Algorithm 2.3 at the end of iteration i. When i > 1, our induction hypothesis is that N y_{i−1} = 1 + 2^{i−1} m for some positive integer m, at the end of iteration i − 1. We consider two cases.

• (m is even) Since N y_{i−1} = 1 + (m/2) 2^i ≡ 1 (mod 2^i), we can simply update y_i to y_{i−1} and the condition holds.
• (m is odd) Since N y_{i−1} = 1 + 2^{i−1} + ((m−1)/2) 2^i ≡ 1 + 2^{i−1} (mod 2^i), we update y_i with y_{i−1} + 2^{i−1}. We obtain N(y_{i−1} + 2^{i−1}) = 1 + 2^{i−1}(1 + N) + ((m−1)/2) 2^i ≡ 1 (mod 2^i) since N is odd.

Hence, after the for loop y_w is such that N y_w ≡ 1 mod 2^w and the returned value μ = r − y_w ≡ 2^w − N^{-1} ≡ −N^{-1} (mod 2^w) has been computed correctly.
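A direct transcription of Algorithm 2.3 (an illustrative sketch; the function name is ours, and Python's pow(·, −1, ·) serves as the reference):

```python
def mont_mu(N, w):
    # Dusse-Kaliski bit lifting: compute mu = -N^{-1} mod 2^w for odd N.
    y = 1                              # N*y = 1 (mod 2) since N is odd
    for i in range(2, w + 1):
        if (N * y) % (1 << i) != 1:    # lift the inverse from 2^{i-1} to 2^i
            y += 1 << (i - 1)
    return (1 << w) - y                # mu = r - y = -N^{-1} mod r
```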
The precomputed constant R^2 mod N is required when converting a residue modulo N from its regular to its Montgomery representation (see Section 2.2.2 on page 16). When R = r^n is a power of two, which in practice is typically the case since r = 2^w, then this precomputed value R^2 mod N can also be computed efficiently. For convenience, assume R = r^n = 2^{wn} and 2^{wn−1} ≤ N < 2^{wn} (but this approach can easily be adapted when N is smaller than 2^{wn−1}). Commence by setting the initial c_0 = 2^{wn−1} < N. Next, start at i = 1 and compute

    c_i ≡ c_{i−1} + c_{i−1} mod N

and increase i until i = wn + 1. The final value

    c_{wn+1} ≡ 2^{wn+1} c_0 ≡ 2^{wn+1} 2^{wn−1} ≡ 2^{2wn} ≡ (2^{wn})^2 ≡ R^2 mod N

as required and can be computed with wn + 1 modular additions.
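The doubling chain above can be sketched as follows (illustrative; the % N stands in for the addition-with-conditional-subtraction one would use in practice, and the function name is ours):

```python
def r_squared_mod(N, wn):
    # Compute R^2 mod N for R = 2^{wn}, assuming 2^{wn-1} <= N < 2^{wn},
    # using wn + 1 modular doublings (each a single modular addition).
    c = 1 << (wn - 1)          # c_0 = 2^{wn-1}, already reduced modulo N
    for _ in range(wn + 1):
        c = (c + c) % N        # c_i = c_{i-1} + c_{i-1} mod N
    return c                   # = 2^{wn+1} * 2^{wn-1} = 2^{2wn} = R^2 mod N
```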

2.2.4 On the Final Conditional Subtraction
It is possible to alter or even completely remove the conditional subtraction from lines 3–4 in Algorithm 2.1 on page 14. This is often motivated by
either performance considerations or turning the (software) implementation
into straight-line code that requires no conditional branches. This is one of
the basic requirements for cryptographic implementations that need to protect
themselves against a variety of (simple) side-channel attacks as introduced by
Kocher, Jaffe, and Jun [221] (those attacks that use physical information, such
as elapsed time, obtained when executing an implementation, to deduce information about the secret key material used). Ensuring constant running time is a
first step in achieving this goal. In order to change or remove this final conditional subtraction the general idea is to bound the input and output of the



Montgomery multiplication in such a way that they can be re-used in a subsequent Montgomery multiplication computation. This means using a redundant
representation, in which the representation of the residues used is not unique
and can be larger than N.
2.2.4.1 Subtraction-less Montgomery Multiplication
The conditional subtraction can be omitted when the size of the modulus N
is appropriately selected with respect to the Montgomery radix R. (This is a
result presented by Shand and Vuillemin in [310] but see also the sequence of
papers by Walter, Hachez, and Quisquater [337, 177, 340].) The idea is to select
the modulus N such that 4N < R and to use a redundant representation for the
input and output values of the algorithm. More specifically, we assume A, B ∈
Z/2NZ (residues modulo 2N), where 0 ≤ A, B < 2N, since then the outputs of
Algorithm 2.1 on page 14 and Algorithm 2.2 on page 16 are bounded by
    0 ≤ (AB + N(μAB mod R))/R < ((2N)^2 + NR)/R < (NR + NR)/R = 2N.    (2.3)

Hence, the result can be reused as input to the same Montgomery multiplication algorithm. This avoids the need for the conditional subtraction except in a
final correction step (after having computed a sequence of Montgomery multiplications) where one reduces the value to a unique representation with a single
conditional subtraction.
In practice, this might reduce the number of arithmetic operations whenever the modulus can be chosen beforehand and, moreover, simplifies the code.
However, in the popular use-cases in cryptography, e.g., in the setting of computing modular exponentiations when using schemes based on RSA [294]
where the bit-length of the modulus must be a power of two due to compliance
with cryptographic standards, the condition 4N < R results in a Montgomery radix R which is represented using one additional computer word (compared
to the number of words needed to represent the modulus N). Hence, in this setting, such a multi-precision implementation without a conditional subtraction
needs one more iteration (when using the interleaved Montgomery multiplication algorithm) to compute the result compared to a version that computes the
conditional subtraction.
2.2.4.2 Montgomery Multiplication with a Simpler Final Comparison
Another approach is not to remove the subtraction but to make this operation computationally cheaper. See the analysis by Walter and Thompson in [342, section 2.2], which was presented again by Pu and Zhao in [292]. In practice, the



Montgomery radix R = r^n is often chosen as a multiple of the word size of the computer architecture used (e.g., r = 2^w for w ∈ {8, 16, 32, 64}). The idea is
to reduce the output of the Montgomery multiplication to {0, 1, ..., R − 1} instead of to the smaller range {0, 1, ..., N − 1}. Just as above, this is a redundant representation, but working with residues from Z/RZ. This representation
does not need more computer words to represent the result and therefore does
not increase the number of iterations one needs to compute; something which
might be the case when the Montgomery radix is increased to remove the con
ditional subtraction. Computing the comparison if an integer x = ni=0 xi ri is
at least R = rn can be done efficiently by verifying if the most significant word
xn is non-zero. This is significantly more efficient compared with computing a
full multi-precision comparison.
This approach is correct since, if the input values A and B are bounded by R, then the output of the Montgomery multiplication, before the conditional subtraction, is bounded by

0 ≤ (AB + N(μAB mod R))/R < (R^2 + NR)/R = R + N.    (2.4)
Subtracting N whenever the result is at least R ensures that the output is also less than R. Hence, one still has to evaluate the condition for the subtraction in every Montgomery multiplication. However, the greater-or-equal comparison becomes significantly cheaper, and the number of iterations required to compute the interleaved Montgomery multiplication algorithm remains unchanged.
In the setting where a constant running time is required this approach does not
seem to bring a significant advantage (see the security analysis by Walter and
Thompson in [342, section 2.2] for more details). A simple constant running
time solution is to compute the subtraction and select this result if no borrow
occurred. However, when constant running time is not an issue this approach
(using a cheaper comparison) can speed up the Montgomery multiplication
algorithm.
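To make this trade-off concrete, here is a minimal Python sketch (our illustration, not code from the chapter; the function name is ours) of interleaved Montgomery multiplication keeping all values in the redundant range {0, . . . , R − 1}, where the conditional subtraction is decided by inspecting only the top word:

```python
# Hypothetical sketch of interleaved Montgomery multiplication with outputs
# kept in the redundant range {0, ..., R - 1}, R = r^n with r = 2^w.  The
# "greater-or-equal" test only inspects the word at position n (a shift),
# not a full multi-precision comparison.

def mont_mul_redundant(a, b, N, n, w=64):
    """Return a*b*2^(-w*n) mod N, represented in [0, 2^(w*n))."""
    r = 1 << w
    mu = (-pow(N, -1, r)) % r          # mu = -N^(-1) mod r
    c = 0
    for i in range(n):
        a_i = (a >> (w * i)) & (r - 1)
        c += a_i * b                   # C <- C + a_i * B
        q = (mu * c) & (r - 1)         # q <- mu * C mod r
        c = (c + q * N) >> w           # C <- (C + q*N) / r
    # Here 0 <= c < R + N (cf. Equation (2.4)); subtract N when c >= R,
    # which amounts to testing whether word n of c is non-zero.
    if c >> (w * n):
        c -= N
    return c
```

Feeding the redundant output of one multiplication directly into the next works because only the bound A, B < R is required.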

2.2.5 Montgomery Multiplication in F_{2^k}
The idea behind Montgomery multiplication carries over to finite fields of cardinality 2^k as well. Such finite fields are known as binary fields or characteristic-two finite fields. The application of Montgomery multiplication to this setting is outlined by Koç and Acar in [219]. Let n(x) be an irreducible polynomial of degree k. Then an element a(x) ∈ F_{2^k} ≅ F_2[x]/(n(x)) can be represented in the

Montgomery Arithmetic from a Software Perspective

polynomial-basis representation by a polynomial of degree at most k − 1:

a(x) = Σ_{i=0}^{k−1} a_i x^i,   where a_i ∈ F_2.
The equivalent of the Montgomery radix is the polynomial r(x) ∈ F_2[x]/(n(x)), which in practice is chosen as r(x) = x^k. Since n(x) is irreducible, this ensures that the inverse r(x)^{−1} mod n(x) exists and the Montgomery multiplication

a(x)b(x)r(x)^{−1} mod n(x)

is well defined.
Let a(x), b(x) ∈ F_2[x]/(n(x)) and let p(x) = a(x)b(x) be their product, of degree at most 2(k − 1). Computing the Montgomery reduction p(x)r(x)^{−1} mod n(x) of p(x) can be done using the same steps as presented in Algorithm 2.1 on page 14, given the precomputed Montgomery constant μ(x) = −n(x)^{−1} mod r(x). Hence, one computes

q(x) = p(x)μ(x) mod r(x),
c(x) = (p(x) + q(x)n(x))/r(x).

Note that the final conditional subtraction step is not required since

deg(c(x)) ≤ max(2(k − 1), k − 1 + k) − k = k − 1

(because r(x) is a polynomial of degree k). The interleaved Montgomery multiplication for finite fields of large prime characteristic from Section 2.2.1 on page 15 carries over to this setting as well.
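As a concrete (hypothetical) illustration, the following Python sketch encodes polynomials over F_2 as integers (bit i holds the coefficient of x^i) and performs the reduction steps above with r(x) = x^k; since −1 = 1 in F_2, the constant μ(x) is simply n(x)^{−1} mod x^k. All function names are ours.

```python
# Sketch of Montgomery reduction in F_2[x]/(n(x)); polynomials are encoded
# as Python ints with bit i = coefficient of x^i, and r(x) = x^k.

def clmul(a, b):
    """Carry-less ("XOR") multiplication of two F_2[x] polynomials."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        b >>= 1
    return res

def inv_mod_xk(n_poly, k):
    """mu(x) = n(x)^(-1) mod x^k, built coefficient by coefficient
    (requires constant term n(0) = 1, true for any irreducible n)."""
    inv, acc = 0, 0                  # invariant: acc = n_poly * inv
    for i in range(k):
        if ((acc ^ 1) >> i) & 1:     # coefficient of x^i still wrong?
            inv |= 1 << i
            acc ^= n_poly << i
    return inv

def mont_reduce(p, n_poly, k):
    """Compute p(x) * x^(-k) mod n(x) for deg(p) <= 2(k - 1)."""
    mask = (1 << k) - 1
    mu = inv_mod_xk(n_poly, k)
    q = clmul(p & mask, mu) & mask        # q = p * mu mod x^k
    return (p ^ clmul(q, n_poly)) >> k    # (p + q*n)/x^k; no subtraction needed

def poly_mod(a, m):
    """Plain polynomial remainder, used here only for checking."""
    d = m.bit_length() - 1
    while a and a.bit_length() - 1 >= d:
        a ^= m << (a.bit_length() - 1 - d)
    return a
```

As in the text, the result has degree at most k − 1, so the final conditional subtraction of the integer case disappears entirely.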

2.3 Using Primes of a Special Form
In some settings in cryptography, most notably in elliptic curve cryptography (introduced independently by Miller and Koblitz in [247, 217]), the (prime) modulus can be chosen freely and is fixed for a large number of modular arithmetic operations. In order to gain a constant-factor speedup when computing the modular multiplication, Solinas suggested [316] a specific set of special primes which were subsequently included in the FIPS 186 standard [330] used in public-key cryptography. More recently, prime moduli of the form 2^s ± δ have gained popularity, where s, δ ∈ Z_{>0} and δ < 2^s is a (very) small integer. More precisely, the constant δ is small compared to the typical word size of the computer architectures used (e.g., less than 2^32) and is often chosen as the smallest integer such that one of 2^s ± δ is prime. One should

be aware that the use of such primes of a special form not only accelerates cryptographic implementations, but benefits cryptanalytic methods as well. See, for instance, the work by this chapter’s authors, Kleinjung, and Lenstra related to efficient arithmetic to factor Mersenne numbers (numbers of the form 2^M − 1) in [71]. An example of one of the primes suggested by Solinas is 2^256 − 2^224 + 2^192 + 2^96 − 1, where the exponents are selected to be multiples of 32 to speed up implementations on 32-bit platforms (but see, for instance, the work by Käsper [204] on how to implement such primes efficiently on 64-bit platforms). A more recent example, proposed by Bernstein [35], is the prime 2^255 − 19, used to implement efficient modular arithmetic in the setting of elliptic curve cryptography.

2.3.1 Faster Modular Reduction with Primes of
a Special Form
We use the Mersenne prime 2^127 − 1 as an example to illustrate the various modular reduction techniques in this section. Given two integers a and b such that 0 ≤ a, b < 2^127 − 1, the modular product ab mod (2^127 − 1) can be computed efficiently as follows. (We follow the description given by the first author of this chapter, Costello, Hisil, and Lauter in [64].) First compute the product with one’s preferred multiplication algorithm and write the result in radix-2^128 representation

c = ab = c_1·2^128 + c_0,   where 0 ≤ c_0 < 2^128 and 0 ≤ c_1 < 2^126.

This product can be almost fully reduced by subtracting 2c_1 times the modulus, since

c ≡ c_1·2^128 + c_0 − 2c_1(2^127 − 1) ≡ c_0 + 2c_1   (mod 2^127 − 1).

Moreover, we can subtract 2^127 − 1 one more time if the bit ⌊c_0/2^127⌋ (the most significant bit of c_0) is set. Combining these two observations, a first reduction step can be computed as

c′ = (c_0 mod 2^127) + 2c_1 + ⌊c_0/2^127⌋ ≡ c   (mod 2^127 − 1).    (2.5)

This already ensures that the result satisfies 0 ≤ c′ < 2^128 since

c′ ≤ 2^127 − 1 + 2(2^126 − 1) + 1 < 2^128.

One can then reduce c′ further using conditional subtractions. Reduction modulo 2^127 − 1 can therefore be computed without using any multiplications or expensive divisions by taking advantage of the form of the modulus.
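A short Python sketch (ours, not the authors’ code) of this reduction chain:

```python
# Sketch of reduction modulo the Mersenne prime p = 2^127 - 1 following
# Equation (2.5): one shift-and-add step brings the 254-bit product below
# 2^128, after which conditional subtractions finish the job.

P127 = (1 << 127) - 1

def mul_mod_m127(a, b):
    """a*b mod 2^127 - 1 for 0 <= a, b < 2^127 - 1, without any division."""
    c = a * b
    c1, c0 = c >> 128, c & ((1 << 128) - 1)   # c = c1*2^128 + c0
    c = (c0 & P127) + 2 * c1 + (c0 >> 127)    # Equation (2.5): now c < 2^128
    while c >= P127:                          # at most two subtractions
        c -= P127
    return c
```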

2.3.2 Faster Montgomery Reduction with Primes of
a Special Form
Along the same lines one can select moduli that speed up the operations when
computing Montgomery reduction. Such special moduli have been proposed
many times in the literature to reduce the number of arithmetic operations (see
for instance the work by Lenstra [231], Acar and Shumow [1], Knezevic, Vercauteren, and Verbauwhede [214], Hamburg [178], and the first author of this
chapter, Costello, Hisil, and Lauter [64, 65]). They are sometimes referred to
as Montgomery-friendly moduli or Montgomery-friendly primes. Techniques are also known that scale an existing modulus so that the scaled modulus has a special shape which reduces the number of arithmetic operations (using the same techniques as for Montgomery-friendly moduli); this is called Montgomery tail tailoring by Hars in [183]. Following the description in the book by Brent and Zimmermann [80], this can be seen as a form of preconditioning as suggested by Svoboda in [320] in the setting of division.
When one is free to select the modulus N beforehand, the number of arithmetic operations can be reduced if the modulus is selected such that

μ = −N^{−1} mod r = ±1

in the setting of interleaved Montgomery multiplication (as also used by Dixon and Lenstra in [127]). This ensures that the multiplication by μ can be avoided (since μ = ±1) in every iteration of the interleaved Montgomery multiplication algorithm. This puts a first requirement on the modulus, namely N ≡ ∓1 (mod r). In practice, r = 2^w where w is the word size of the computer architecture. Hence, this requirement puts a restriction on the least significant word of the modulus (which equals either 1 or −1 ≡ 2^w − 1 (mod 2^w)).
Combining lines 4 and 5 from the interleaved Montgomery multiplication (Algorithm 2.2 on page 16), we see that one has to compute (C + N(μC mod r))/r. Besides the multiplication by μ, one has to compute a multi-word multiplication with the (fixed) modulus N. In the same vein as the techniques from Section 2.3.1 above, one can require N to have a special shape such that this multiplication can be computed faster in practice. This can be achieved, for instance, when one of the computer words of the modulus is small or has some special shape while the remaining digits are zero except for the most significant word (e.g., when μ = 1). Along these lines, the first author of this chapter, Costello, Longa, and Naehrig select primes for usage in elliptic curve cryptography where

N = 2^α(2^β − γ) ± 1    (2.6)

where α, β, and γ are integers such that γ < 2^β ≤ r.
The final requirement on the modulus is to ensure that 4N < R = r^n, since this avoids the final conditional subtraction (as shown on page 20 in Section 2.2.4). Examples of such Montgomery-friendly moduli include

2^252 − 2^232 − 1 = 2^192(2^60 − 2^40) − 1 = 2^224(2^28 − 2^8) − 1

(written in different forms to show the usage on different architectures that can compute with β-bit integers) proposed by Hamburg in [178], and the modulus

2^240(2^14 − 127) − 1 = 2^222(2^32 − 2^18·127) − 1

proposed by the first author of this chapter, Costello, Longa, and Naehrig in [66]. The approach is illustrated in the example below. Other examples of Montgomery-friendly moduli are given in [166, table 4], based on Chung–Hasan arithmetic [98].
Example 2.3 (Montgomery-friendly reduction modulo 2^127 − 1)
Let us consider Montgomery reduction modulo 2^127 − 1 on a 64-bit computer architecture (w = 64). This means we have α = 64, β = 63, and γ = 0 in Equation (2.6) on page 24 (2^127 − 1 = 2^64(2^63 − 0) − 1). Multiplication by μ can be avoided since μ = −(2^127 − 1)^{−1} mod 2^64 = 1. Furthermore, due to the special form of the modulus, the multiplication by 2^127 − 1 can be simplified. The computation of (C + N(μC mod r))/r, which needs to be done for each computer word (twice in this setting), can be simplified when using the interleaved Montgomery multiplication algorithm.
Write C = c_2·2^128 + c_1·2^64 + c_0 (see line 3 in Algorithm 2.2 on page 16) with 0 ≤ c_2, c_1, c_0 < 2^64; then

(C + N(μC mod r))/r = (C + (2^127 − 1)(C mod 2^64))/2^64
                    = ((c_2·2^128 + c_1·2^64 + c_0) + (2^127 − 1)c_0)/2^64
                    = (c_2·2^128 + c_1·2^64 + c_0·2^127)/2^64
                    = c_2·2^64 + c_1 + c_0·2^63.

Hence, only two additions and two shift operations are needed in this computation.
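The simplification in Example 2.3 can be checked with a few lines of Python (our sketch; the function names are ours):

```python
# Sketch checking Example 2.3: for N = 2^127 - 1 and r = 2^64 (so mu = 1),
# the interleaved step (C + N*(C mod r))/r collapses to two additions and
# two shifts: c2*2^64 + c1 + c0*2^63.

N127 = (1 << 127) - 1
R64 = 1 << 64

def step_generic(c):
    """Lines 4 and 5 of the interleaved algorithm, written generically."""
    return (c + N127 * (c % R64)) >> 64

def step_friendly(c):
    """The Montgomery-friendly shortcut: two adds and two shifts."""
    c0 = c & (R64 - 1)
    c1 = (c >> 64) & (R64 - 1)
    c2 = c >> 128
    return (c2 << 64) + c1 + (c0 << 63)
```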
2.4 Concurrent Computing of Montgomery Multiplication
Since the seminal paper by the second author introducing modular multiplication without trial division, people have studied ways to obtain better performance on different computer architectures. Many of these techniques are specifically tailored towards a (family of) platforms, motivated by the desire to enhance the practical performance of public-key cryptosystems.
One approach focuses on reducing the latency of the Montgomery multiplication operation. This might be achieved by computing the Montgomery product using many computational units in parallel. One example is to use the single instruction, multiple data (SIMD) programming paradigm. In this setting
a single vector instruction applies to multiple data elements in parallel. Many
modern computer architectures have access to vector instruction set extensions
to perform SIMD operations. Example platforms include the popular high-end
x86 architecture as well as the embedded ARM platform which can be found
in the majority of modern smartphones and tablets. To highlight the potential,
Gueron and Krasnov were the first to show in [175] that the computation of
Montgomery multiplication on the 256-bit wide vector instruction set AVX2 is
faster than the same computation on the classical arithmetic logic unit (ALU)
on the x86_64 platform.
In Section 2.4.2 below we outline the approach by the authors of this chapter,
Shumow, and Zaverucha from [73] for computing a single Montgomery multiplication using vector instruction set extensions which support 2-way SIMD
operations (i.e., perform the same operation on two data points simultaneously).
This approach allows one to split the computation of the interleaved Montgomery multiplication into two parts which can be computed in parallel. Note
that in a follow-up work [308] by Seo, Liu, Großschädl, and Choi it is shown
how to improve the performance on 2-way SIMD architectures even further.
Instead of computing the two multiplications concurrently, as is presented in
Section 2.4.2, they compute every multiplication using 2-way SIMD instructions. By careful scheduling of the instructions they manage to significantly
reduce the read-after-write dependencies which reduces the number of bubbles
(execution delays in the instruction pipeline). This results in a software implementation which outperforms the one presented in [73]. It would be interesting
to see if these two approaches (from [73] and [308]) can be merged on platforms
which support efficient 4-way SIMD instructions.
In Section 2.4.4 on page 36 we show how to compute Montgomery multiplication when integers are represented in a residue number system. This approach
can be used to compute Montgomery arithmetic efficiently on highly parallel
computer architectures which have hundreds of computational cores or more
and when large moduli are used (such as in the RSA cryptosystem).
2.4.1 Related Work on Concurrent Computing of
Montgomery Multiplication
A parallel software approach describing systolic Montgomery multiplication is described by Dixon and Lenstra in [127], by Iwamura, Matsumoto, and Imai in [192], and by Walter in [336]. See Section 3.14 on page 67 for more information about systolic Montgomery multiplication. Another approach is to use the
SIMD vector instructions to compute multiple Montgomery multiplications in
parallel. This can be useful in applications where many computations need to
be processed in parallel such as batch-RSA or cryptanalysis. This approach is
studied by Page and Smart in [274] using the SSE2 vector instructions on a
Pentium 4 and by the first author of this chapter in [63] on the Cell Broadband
Engine (see Section 2.4.3 on page 31 for more details about this platform).
An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can then be processed in parallel, is called bipartite modular multiplication and was introduced by Kaihara and Takagi in [199]. The idea is to use a Montgomery radix R = r^{αn}, where α is a rational number such that αn is an integer and 0 < αn < n: hence, the radix R is smaller than the modulus N. For example, one can choose α such that αn = n/2. In order to compute xyr^{−αn} mod N (where 0 ≤ x, y < N), write y = y_1 r^{αn} + y_0 and compute

x y_1 mod N   and   x y_0 r^{−αn} mod N

in parallel using a regular interleaved modular multiplication algorithm (see, e.g., the work by Brickell [81]) and the interleaved Montgomery multiplication algorithm, respectively. The sum of the two products gives the correct Montgomery product of x and y since

(x y_1 mod N) + (x y_0 r^{−αn} mod N) ≡ x(y_1 r^{αn} + y_0) r^{−αn} ≡ x y r^{−αn}   (mod N).
2.4.2 Montgomery Multiplication Using SIMD Extensions
This section is an extended version of the description of the idea presented
by this chapter’s authors, Shumow, and Zaverucha in [73] where an algorithm
is presented to compute the interleaved Montgomery multiplication using two
threads running in parallel performing identical arithmetic steps. Hence, this
algorithm runs efficiently when using 2-way SIMD vector instructions as frequently found on modern computer architectures. For illustrative purposes we
assume a radix-232 system, but this can be adjusted accordingly to any other
radix system. Note that for efficiency considerations the choice of the radix
system depends on the vector instructions available.
Algorithm 2.2 on page 16 outlines the interleaved Montgomery multiplication algorithm; every iteration computes two 1 × n → (n + 1) computer-word multiplications, namely a_i B and qN, and a single 1 × 1 → 1 computer-word multiplication (μC mod 2^32). Unfortunately, these three multiplications depend on each other and therefore cannot be computed in parallel. Every iteration computes (see Algorithm 2.2)

1. C ← C + a_i B
2. q ← μC mod 2^32
3. C ← (C + qN)/2^32
In order to reduce latency, we would like to compute the two 1 × n → (n + 1) word multiplications in parallel using vector instructions. This can be achieved by removing the dependency between these two multi-word multiplications: compute the value of q first. The first word c_0 of C = Σ_{i=0}^{n−1} c_i 2^{32i} is then computed twice: once for the computation of q (in μC mod 2^32) and then again in the parallel computation of C + a_i B. This is a relatively minor penalty of one additional one-word multiplication and addition per iteration to make these two multi-word multiplications independent of each other. This means an iteration i can be computed as

1. q ← μ(c_0 + a_i b_0) mod 2^32
2. C ← C + a_i B
3. C ← (C + qN)/2^32

and the 1 × n → (n + 1) word multiplications in steps 2 and 3 (a_i B and qN) can be computed in parallel using, for instance, 2-way SIMD vector instructions.
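The reordering can be checked in a few lines of Python (our sketch): both orderings produce the same q, because only the least significant words of C and a_i B influence μC mod 2^32.

```python
# Sketch contrasting one iteration of interleaved Montgomery multiplication
# in its dependent form (q computed after C + a_i*B) and in the reordered
# form where q is computed first from the low words only, making the two
# multi-word products a_i*B and q*N independent of each other.

W = 32
MASK = (1 << W) - 1

def step_dependent(c, a_i, B, N, mu):
    c = c + a_i * B                 # 1. C <- C + a_i*B
    q = (mu * c) & MASK             # 2. q <- mu*C mod 2^32 (depends on 1)
    return (c + q * N) >> W         # 3. C <- (C + q*N)/2^32

def step_q_first(c, a_i, B, N, mu):
    q = (mu * ((c & MASK) + a_i * (B & MASK))) & MASK   # q from low words
    c = c + a_i * B                 # now independent of ...
    return (c + q * N) >> W         # ... the q*N product
```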
In order to rewrite the remaining operations besides the multiplication, the authors of [73] suggest inverting the sign of the Montgomery constant μ, i.e., instead of using −N^{−1} mod 2^32, use μ = N^{−1} mod 2^32. When computing the Montgomery product C = AB·2^{−32n} mod N, one can compute D (which contains the sum of the products a_i B) and E (which contains the sum of the products qN) separately and in parallel using the same arithmetic operations. Due to the modified choice of the Montgomery constant μ we have C = D − E ≡ AB·2^{−32n} (mod N), where 0 ≤ D, E < N: the maximum values of both D and E fit in an n-word integer. This approach is presented in Algorithm 2.4 on page 29.
At the start of every iteration in the for-loop iterating with j, the two separate
computational streams running in parallel need to communicate information to
compute the value of q. More precisely, this requires the knowledge of both d0
and e0 , the least significant words of D and E respectively. Once the values
of both d0 and e0 are known to one of the computational streams, the updated

Algorithm 2.4 A parallel radix-2^32 interleaved Montgomery multiplication algorithm. Except for the computation of q, the arithmetic steps in the outer for-loop, performed by computation 1 and computation 2, are identical. This approach is suitable for 32-bit 2-way SIMD vector instruction units.

Input: A, B, M, μ such that A = Σ_{i=0}^{n−1} a_i 2^{32i}, B = Σ_{i=0}^{n−1} b_i 2^{32i}, M = Σ_{i=0}^{n−1} m_i 2^{32i}, 0 ≤ a_i, b_i < 2^32, 0 ≤ A, B < M, 2^{32(n−1)} ≤ M < 2^{32n}, 2 ∤ M, μ = M^{−1} mod 2^32.
Output: C ≡ AB·2^{−32n} mod M such that 0 ≤ C < M.

Computation 1 (run in parallel with computation 2):
  d_i = 0 for 0 ≤ i < n
  for j = 0 to n − 1 do
    t_0 ← a_j b_0 + d_0
    t_0 ← ⌊t_0/2^32⌋
    for i = 1 to n − 1 do
      p_0 ← a_j b_i + t_0 + d_i
      t_0 ← ⌊p_0/2^32⌋
      d_{i−1} ← p_0 mod 2^32
    end for
    d_{n−1} ← t_0
  end for

Computation 2 (run in parallel with computation 1):
  e_i = 0 for 0 ≤ i < n
  for j = 0 to n − 1 do
    q ← ((μb_0)a_j + μ(d_0 − e_0)) mod 2^32
    t_1 ← q m_0 + e_0        // where t_0 ≡ t_1 (mod 2^32)
    t_1 ← ⌊t_1/2^32⌋
    for i = 1 to n − 1 do
      p_1 ← q m_i + t_1 + e_i
      t_1 ← ⌊p_1/2^32⌋
      e_{i−1} ← p_1 mod 2^32
    end for
    e_{n−1} ← t_1
  end for

Finally:
  C ← D − E     // where D = Σ_{i=0}^{n−1} d_i 2^{32i}, E = Σ_{i=0}^{n−1} e_i 2^{32i}
  if C < 0 then C ← C + M end if
  return C

value of q can be computed as

q = ((μb_0)a_j + μ(d_0 − e_0)) mod 2^32 = μ(c_0 + a_j b_0) mod 2^32,

since c_0 ≡ d_0 − e_0 (mod 2^32).
Except for this communication cost between the two streams, to compute the value of q, all arithmetic computations performed by computation 1 and computation 2 in the outer for-loop are identical but work on different data.
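The word-level algorithm can be condensed into the following Python sketch (ours), which keeps the two accumulators D and E as integers; the low words discarded by the two shifts are congruent modulo 2^32 (the comment t_0 ≡ t_1 in Algorithm 2.4), so the difference D − E behaves exactly like the single accumulator of the interleaved algorithm.

```python
# Sketch of the two-stream idea of Algorithm 2.4 with the inverted constant
# mu = +M^(-1) mod 2^32: stream D accumulates the a_j*B products, stream E
# the q*M products, and C = D - E (with a conditional *addition* of M).

W = 32
MASK = (1 << W) - 1

def mont_mul_two_streams(A, B, M, n):
    """Return A*B*2^(-32n) mod M in [0, M); M odd, at most n 32-bit words."""
    mu = pow(M, -1, 1 << W)               # note the sign: +M^(-1)
    D = E = 0
    for j in range(n):
        a_j = (A >> (W * j)) & MASK
        q = (mu * (a_j * (B & MASK) + ((D - E) & MASK))) & MASK
        # identical arithmetic on different data -> 2-way SIMD friendly;
        # both shifts drop the same value modulo 2^32
        D = (D + a_j * B) >> W
        E = (E + q * M) >> W
    C = D - E
    if C < 0:                             # conditional addition, not subtraction
        C += M
    return C
```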

Table 2.1 A simplified comparison, only stating the number of word-level instructions required, to compute the Montgomery multiplication when using a 32n-bit modulus for a positive even integer n. Two approaches are shown: a sequential one on the classical ALU (Algorithm 2.2 on page 16) and a parallel one using 2-way SIMD instructions (performing two operations in parallel, cf. Algorithm 2.4 on page 29).

                         Classical              2-way SIMD
  Instruction            32-bit     64-bit      32-bit
  add                    –          –           n
  sub                    –          –           n
  shortmul               n          n/2         2n
  muladd                 2n         n           –
  muladdadd              2n(n − 1)  n(n/2 − 1)  –
  SIMD muladd            –          –           n
  SIMD muladdadd         –          –           n(n − 1)
This makes this approach suitable for computation using 2-way 32-bit SIMD
vector instructions. This technique benefits from 2-way SIMD 32 × 32 →
64-bit multiplication and matches exactly the 128-bit wide vector instructions
as present in many modern computer architectures. By changing the radix used
in Algorithm 2.4, larger or smaller vector instructions can be supported.
Note that as opposed to a conditional subtraction in Algorithm 2.1 on page 14
and Algorithm 2.2 on page 16, Algorithm 2.4 computes a conditional addition
because of the inverted sign of the precomputed Montgomery constant μ. This
condition is based on the fact that if D − E is negative (produces a borrow),
then the modulus is added to make the result positive.
2.4.2.1 Expected Performance
We follow the analysis of the expected performance from [73], which just
considers execution time. The idea is to perform an analysis of the number
of arithmetic instructions as an indicator of the expected performance when
using a 2-way SIMD implementation instead of a regular (non-SIMD) implementation for the classical ALU. We assume the 2-way SIMD implementation
works on pairs of 32-bit words in parallel and has access to a 2-way SIMD
32 × 32 → 64-bit multiplication instruction. A comparison to a regular implementation is not straightforward since the word-size can be different, the platform might be able to compute multiple instructions in parallel (on different

ALUs) and the number of instructions per arithmetic operation might differ.
This is why we present a simplified comparison based on the number of arithmetic operations when computing Montgomery multiplication using a 32n-bit
modulus for a positive even integer n. We denote by muladd_w(e, a, b, c) and muladdadd_w(e, a, b, c, d) the computation of e = ab + c and e = ab + c + d, respectively, for 0 ≤ a, b, c, d < 2^w (and thus 0 ≤ e < 2^{2w}). These are basic operations on a computer architecture which works on w-bit words. Some platforms have these operations as a single instruction (e.g., on some ARM architectures), while others implement them using separate multiplication and addition instructions (as on the x86 platform). Furthermore, let shortmul_w(e, a, b) denote e = ab mod 2^w: this w × w → w-bit multiplication only computes the least significant w bits of the result and is faster than computing a full double-word product on most modern computer platforms.
Table 2.1 summarizes the expected performance of Algorithm 2.2 on page 16
and Algorithm 2.4 on page 29 in terms of arithmetic operations only. The shifting and masking operations are omitted for simplicity as well as the operations required to compute the final conditional subtraction or addition. When
just taking the muladd and muladdadd instructions into account it becomes
clear from Table 2.1 that the SIMD approach uses exactly half the number of
instructions compared to the 32-bit classical implementation and almost twice
as many operations compared to the classical 64-bit implementation. However, the SIMD approach requires more operations to compute the value of q every iteration and has various other overheads (e.g., inserting and extracting values from the large vector registers). Hence, when assuming that all the characteristics of the SIMD and classical (non-SIMD) instructions are identical, which is most likely not the case on most platforms, we expect Algorithm 2.4 running on a 2-way 32-bit SIMD unit to outperform a classical 32-bit implementation using Algorithm 2.2 by at most a factor of two, while being roughly half as fast as a classical 64-bit implementation.

2.4.3 A Column-Wise SIMD Approach
A different approach, suitable for computing Montgomery multiplication on
architectures supporting 4-way SIMD instructions is outlined by the first chapter author and Kaihara in [67]. This approach is particularly efficient on the
Cell Broadband Engine (see a brief introduction to this architecture below),
since it was designed for usage on this platform, but can be used on any platform supporting the SIMD instructions used in this approach. This approach
differs from the one described in the previous section in that it uses the SIMD
instructions to compute the multi-precision arithmetic in parallel, so it works

on a lower level, while the approach from Section 2.4.2 above computes the
arithmetic operations itself sequentially but divides the work into two steps,
which can be computed concurrently.
2.4.3.1 The Cell Broadband Engine
The Cell Broadband Engine (cf. the introductions given by Hofstee [187] and
Gschwind [173]), denoted by “Cell” and jointly developed by Sony, Toshiba,
and IBM, is a powerful heterogeneous multiprocessor which was released in
2005. The Cell contains a Power Processing Element, a dual-threaded Power
architecture-based 64-bit processor with access to a 128-bit AltiVec/VMX single instruction, multiple data (SIMD) unit (which is not considered in this chapter). Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). For an introduction to the circuit design see the work
by Takahashi et al. [323]. Each SPE consists of a Synergistic Processing Unit
(SPU), 256 KB of private memory called Local Store (LS), and a Memory Flow
Controller (MFC). To avoid the complexity of sending explicit direct memory
access requests to the MFC, all code and data must fit within the LS.
Each SPU runs independently from the others at 3.192 GHz and is equipped
with a large register file containing 128 registers of 128 bits each. Most SPU
instructions work on 128-bit operands denoted as quadwords. The instruction
set is partitioned into two sets: one set consists of (mainly) 4- and 8-way SIMD
arithmetic instructions on 32-bit and 16-bit operands respectively, while the
other set consists of instructions operating on the whole quadword (including
the load and store instructions) in a single instruction, single data (SISD) manner. The SPU is an asymmetric processor; each of these two sets of instructions
is executed in a separate pipeline, denoted by the even and odd pipeline for the
SIMD and SISD instructions, respectively. For instance, the {4, 8}-way SIMD
left-rotate instruction is an even instruction, while the instruction left-rotating
the full quadword is dispatched into the odd pipeline. When dependencies are
avoided, a single pair consisting of one odd and one even instruction can be
dispatched every clock cycle.
One of the first applications of the Cell processor was to serve as the brain of
Sony’s PlayStation 3 game console. Due to the widespread availability of this
game console and the fact that one could install and run one’s own software
this platform has been used to accelerate cryptographic operations [112, 111,
95, 272, 74, 63] as well as cryptanalytic algorithms [318, 72, 69].
2.4.3.2 Montgomery Multiplication on the Cell Broadband Engine
In this section we outline the approach presented by the first author of this
chapter and Kaihara tailored towards the instruction set of the Cell Broadband Engine. Most notably, the presented techniques rely on an efficient 4-way
Figure 2.1 The 16-bit words x_i of a 16(n + 1)-bit positive integer X = Σ_{i=0}^{n} x_i 2^{16i} < 2N are stored column-wise using s = ⌈(n + 1)/4⌉ 128-bit vectors X_j on the SPE architecture.

SIMD instruction to multiply two 16-bit integers and add another 16-bit integer to the 32-bit result, and on a large register file. Therefore, the approach described here uses a radix r = 2^16 to divide the large numbers into words that match the input sizes of the 4-way SIMD multipliers of the Cell. This can easily be adapted to any other radix size for different platforms with different SIMD instructions.

The idea is to represent integers X in a radix-2^16 system, i.e., X = Σ_{i=0}^{n} x_i · 2^{16i}
where 0 ≤ x_i < 2^16. However, in order to use the 4-way SIMD instructions of
this platform efficiently, these 16-bit digits x_i are stored in a 32-bit datatype.
The usage of this 32-bit space is to ensure that intermediate values of the
form ab + c do not produce any carries, since when 0 ≤ a, b, c < 2^16 then
0 ≤ ab + c < 2^32. Hence, given an odd 16n-bit modulus N, a Montgomery
residue X, such that 0 ≤ X < 2N < 2^{16(n+1)}, is represented using s = ⌈(n+1)/4⌉
vectors of 128 bits. Note that this representation uses roughly twice the number of bits when compared to storing it in a “normal” radix representation. The
single additional 16-bit word is required because the intermediate accumulating
result of Montgomery multiplication can be almost as large as 2N (see page 20
in Section 2.2.4, above).
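The column-wise placement x_i → X_{i mod s}[⌊i/s⌋] illustrated in Figure 2.1 can be sketched as follows (a plain Python model rather than SPE code; a 128-bit vector is modeled as a list of four 32-bit words, and the helper names are my own):

```python
def pack(x, n):
    """Split x = sum x_i 2^(16 i), 0 <= i <= n, into s column-wise vectors,
    each digit stored in the high 16 bits of a 32-bit word."""
    s = (n + 1 + 3) // 4                   # s = ceil((n+1)/4)
    vecs = [[0, 0, 0, 0] for _ in range(s)]
    for i in range(n + 1):
        digit = (x >> (16 * i)) & 0xFFFF
        vecs[i % s][i // s] = digit << 16  # most significant 16 bits of X_{i mod s}[i // s]
    return vecs

def unpack(vecs, n):
    """Inverse of pack: recover the integer from the column-wise vectors."""
    s = len(vecs)
    return sum(((vecs[i % s][i // s] >> 16) << (16 * i)) for i in range(n + 1))
```

For n = 7 this yields s = 2 vectors, holding the even-indexed digits in the first vector and the odd-indexed digits in the second, exactly the layout that makes the division by 2^16 a cheap relabeling.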
The 16-bit digits x_i are placed column-wise in the four 32-bit datatypes of the
128-bit vectors. This representation is illustrated in Figure 2.1. The four 32-bit
parts of the j-th 128-bit vector X_j are denoted by
X_j = {X_j[3], X_j[2], X_j[1], X_j[0]}.
Each of the (n+1) 16-bit words x_i of X is stored in the most significant 16
bits of X_{i mod s}[⌊i/s⌋]. The motivation for using this column-wise representation
is that a division by 2^16 can be computed efficiently: simply move the digits
in vector X_0 “one position to the right,” which in practice means a logical 32-bit right shift, and relabel the indices such that X_j becomes X_{j−1}, for
1 ≤ j ≤ s − 1, while the modified vector X_0 becomes the new X_{s−1}. Algorithm 2.5
computes Montgomery multiplication using such a 4-way column-wise SIMD representation.
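In the Python vector model, this shift-and-relabel division might look as follows (a sketch under the same list-of-four-words representation; the function name is illustrative, and the digit x_0 that is dropped is zero whenever a Montgomery reduction step precedes the division):

```python
def div_by_2_16(vecs):
    """Divide the column-wise represented integer by 2^16: logically shift
    the words of X_0 one position to the right, then relabel so X_j becomes
    X_{j-1} and the shifted X_0 becomes the new X_{s-1}."""
    x0 = vecs[0]
    shifted = x0[1:] + [0]   # move the digits of X_0 "one position to the right"
    return vecs[1:] + [shifted]
```

No digit is moved between vectors except inside X_0, which is why a single 32-bit vector shift plus pointer relabeling suffices.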
In each iteration, the indices of the vectors that contain the accumulating partial product U change cyclically among the s registers. In Algorithm 2.5, each
16-bit word of the inputs X, Y and N and the output Z is stored in the upper part
(the most significant 16 bits) of each of the four 32-bit words in a 128-bit vector. The vector μ contains the replicated values of −N^{−1} mod 2^16 in the lower
16-bit positions of the four 32-bit words. In its most significant 16-bit positions, the temporary vector K stores the replicated values of y_i, i.e., each of the
parsed coefficients of the multiplier Y corresponding to the i-th iteration of the
main loop. The operation A ← muladd(B, c, D), which is a single instruction
on the SPE, represents the operation of multiplying the vector B (where data are
stored in the higher 16-bit positions of 32-bit words) by a vector with replicated
16-bit values of c across all higher positions of the 32-bit words. This product
is added to D (in 4-way SIMD manner) and the overall result is placed into A.
The temporary vector V stores the replicated values of u_0 in the least significant 16-bit words. This u_0 refers to the least significant 16-bit word of the
updated value of U, where U = Σ_{j=0}^{n} u_j · 2^{16j} and is stored as s 128-bit
vectors U_{i mod s}, U_{(i+1) mod s}, . . . , U_{(i+n) mod s} (where i refers to the index of the main
loop). The vector Q is computed as an element-wise logical left shift by 16 bits
of the 4-SIMD product of vectors V and μ.
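The muladd operation can be modeled as follows (a Python sketch, not the SPE intrinsic; the vector is a list of four 32-bit words, and for simplicity c is passed as a plain 16-bit value rather than a replicated vector):

```python
MASK32 = 0xFFFFFFFF

def muladd(B, c, D):
    """A = muladd(B, c, D): multiply the 16-bit value stored in the high
    half of each 32-bit word of B by the 16-bit value c and add the
    corresponding 32-bit word of D, element-wise across the vector."""
    return [(((b >> 16) * c) + d) & MASK32 for b, d in zip(B, D)]
```

Since both factors are below 2^16, each product fits in 32 bits; the interleaved carry propagations of Algorithm 2.5 keep the accumulated words of D small enough that the addition does not overflow either.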
The propagation of the higher 16-bit carries of U_{(i+j) mod s} as stated in lines 10
and 18 of Algorithm 2.5 consists of extracting the higher 16-bit words of these
vectors and placing them into the lower 16-bit positions of temporary vectors.
These vectors are then added to the “next” vector U_{(i+j+1) mod s} correspondingly.
The operation is carried out for the vectors with indices j ∈ {0, 1, . . . , s − 2}.
For j = s − 1, the last index, the temporary vector that contains the words is
logically shifted 32 bits to the left and added to the vector U_{i mod s}. Similarly, the
carry propagation of the higher words of U_{(i+j) mod s} in line 22 of Algorithm 2.5
is performed with 16-bit word extraction and addition, but requires a sequential
parsing over the (n + 1) 16-bit words.
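As a flat-array analogue (deliberately ignoring the cyclic vector indexing and the j = s − 1 wrap described above), one such carry pass might be sketched as follows; a single pass moves each high half into the following word without fully normalizing, which is all lines 10 and 18 need:

```python
def carry_pass(u):
    """One parallel-style carry pass over radix-2^16 digits stored in
    32-bit words: extract the high 16 bits of every word and add them
    into the next word. The represented integer value is unchanged."""
    carries = [w >> 16 for w in u]           # the high 16-bit "carry" parts
    out = [w & 0xFFFF for w in u] + [0]      # keep the low 16-bit digits
    for j, c in enumerate(carries):
        out[j + 1] += c                      # add each carry into the next word
    return out
```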
Hence, the approach outlined in Algorithm 2.5 computes Montgomery multiplication by computing the multi-word multiplications using SIMD instructions
and representing the integers using a column-wise approach (see Figure 2.1).
This approach comes at the cost that a single 16n-bit integer is represented by 128⌈(n+1)/4⌉ bits: requiring slightly over twice the amount of storage.
Algorithm 2.5 Montgomery multiplication algorithm for the Cell

Input: N represented by s 128-bit vectors N_{s−1}, . . . , N_0, such that
       2^{16(n−1)} ≤ N < 2^{16n}, 2 ∤ N;
       X, Y each represented by s 128-bit vectors X_{s−1}, . . . , X_0, and
       Y_{s−1}, . . . , Y_0, such that 0 ≤ X, Y < 2N;
       μ: a 128-bit vector containing (−N)^{−1} (mod 2^16)
       replicated in all 4 elements.

Output: Z represented by s 128-bit vectors Z_{s−1}, . . . , Z_0, such that
        Z ≡ XY · 2^{−16(n+1)} mod N, 0 ≤ Z < 2N.

1:  for j = 0 to s − 1 do
2:      U_j = 0
3:  end for
4:  for i = 0 to n do
5:      /* lines 6–9 compute U = y_i X + U */
6:      K = {y_i, y_i, y_i, y_i}
7:      for j = 0 to s − 1 do
8:          U_{(i+j) mod s} = muladd(X_j, K, U_{(i+j) mod s})
9:      end for
10:     Carry propagation on U_{(i+j) mod s} for j = 0, . . . , s − 1 (see text)
11:     /* lines 12–13 compute Q = μV mod 2^16 */
12:     V = {u_0, u_0, u_0, u_0}
13:     Q = shiftleft(mul(V, μ), 16)
14:     /* lines 15–17 compute U = NQ + U */
15:     for j = 0 to s − 1 do
16:         U_{(i+j) mod s} = muladd(N_j, Q, U_{(i+j) mod s})
17:     end for
18:     Carry propagation on U_{(i+j) mod s} for j = 0, . . . , s − 1 (see text)
19:     /* line 20 computes the division by 2^16 */
20:     U_{i mod s} = vshiftright(U_{i mod s}, 32)
21: end for
22: Carry propagation on U_{i mod s} for i = n + 1, . . . , 2n + 1 (see text)
23: for j = 0 to s − 1 do
24:     Z_j = U_{(n+j+1) mod s}
25: end for

Note, however, that an implementation of this technique outperforms the native
multi-precision big-number library on the Cell processor by a factor of about
2.5, as summarized in [67].
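At the integer level, ignoring the vector packing, the computation carried out by Algorithm 2.5 is word-serial Montgomery multiplication with radix 2^16 and R = 2^{16(n+1)}. A minimal Python sketch of that digit arithmetic (function name is illustrative; uses the three-argument pow of Python 3.8+ for the modular inverse):

```python
def monmul_16(X, Y, N, n):
    """Radix-2^16 Montgomery multiplication: returns
    Z = X * Y * 2^(-16(n+1)) mod N with 0 <= Z < 2N, for odd N with
    2^(16(n-1)) <= N < 2^(16n) and 0 <= X, Y < 2N."""
    mu = (-pow(N, -1, 1 << 16)) % (1 << 16)  # mu = -N^(-1) mod 2^16
    U = 0
    for i in range(n + 1):                   # one iteration per digit of Y
        y_i = (Y >> (16 * i)) & 0xFFFF
        U += y_i * X                         # lines 6-9:   U = y_i X + U
        q = ((U & 0xFFFF) * mu) & 0xFFFF     # lines 12-13: q = mu u_0 mod 2^16
        U += q * N                           # lines 15-17: U = N q + U
        U >>= 16                             # line 20:     exact division by 2^16
    return U
```

The SPE version performs exactly this digit arithmetic, but four digits at a time on the column-wise vectors, with the carry propagations keeping the 32-bit words from overflowing.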



2.4.4 Montgomery Multiplication Using the Residue Number System Representation
The residue number system (RNS), as introduced by Garner [156] and Merrill [244], is an approach, based on the Chinese remainder theorem, to represent an integer as a number of residues modulo smaller (coprime) integers.
The advantage of RNS is that additions, subtractions, and multiplications can
be performed independently and concurrently on these smaller residues. Given
an RNS basis β_n = {r_1, r_2, . . . , r_n}, where gcd(r_i, r_j) = 1 for i ≠ j, the RNS
modulus is defined as R = ∏_{i=1}^{n} r_i. Given an integer x ∈ Z/RZ and the RNS
basis β_n, this integer x is represented as an n-tuple x = (x_1, x_2, . . . , x_n), where
x_i = x mod r_i for 1 ≤ i ≤ n. In order to convert an n-tuple back to its integer
value one can apply the Chinese remainder theorem (CRT)

x = ( Σ_{i=1}^{n} x_i · ((R/r_i)^{−1} mod r_i) · (R/r_i) ) mod R.    (2.7)
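Equation (2.7) and the componentwise arithmetic can be illustrated with a small Python example (the basis below is illustrative; real implementations use machine-word-sized moduli):

```python
from math import gcd, prod

beta = [13, 17, 19, 23]                  # pairwise coprime RNS basis (illustrative)
R = prod(beta)                           # RNS modulus R = r_1 * ... * r_n
assert all(gcd(a, b) == 1 for a in beta for b in beta if a != b)

def to_rns(x):
    """Represent x in Z/RZ as the n-tuple of residues x_i = x mod r_i."""
    return [x % r for r in beta]

def from_rns(xs):
    """CRT reconstruction, Equation (2.7):
    x = sum x_i ((R/r_i)^(-1) mod r_i) (R/r_i) mod R."""
    return sum(x_i * pow(R // r, -1, r) * (R // r)
               for x_i, r in zip(xs, beta)) % R

# Multiplication acts independently and concurrently on the residues:
x, y = 1234, 5678
prod_r = [(a * b) % r for a, b, r in zip(to_rns(x), to_rns(y), beta)]
```

Reconstructing prod_r with from_rns yields (x · y) mod R, which is the property that makes RNS attractive for parallel modular arithmetic.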
Modular multiplication using Montgomery multiplication in the RNS setting
has been studied, for instance, by Posch and Posch in [290] and by Bajard,
Didier, and Kornerup in [23] and subsequent work. In this section we outline
how to achieve this. First note that, for the application in which we are interested, we
cannot use the modulus N as the RNS modulus, since N is either prime (in the
setting of elliptic curve cryptography) or a product of two large primes (when
using RSA). When computing the Montgomery reduction one has to perform
arithmeti