Main
Topics in Computational Number Theory Inspired by Peter L. Montgomery
Topics in Computational Number Theory Inspired by Peter L. Montgomery
Joppe W. Bos, Arjen K. Lenstra
Peter L. Montgomery has made significant contributions to computational number theory, introducing many basic tools such as Montgomery multiplication, Montgomery simultaneous inversion, Montgomery curves, and the Montgomery ladder. This book features stateoftheart research in computational number theory related to Montgomery's work and its impact on computational efficiency and cryptography. Topics cover a wide range of topics such as Montgomery multiplication for both hardware and software implementations; Montgomery curves and twisted Edwards curves as proposed in the latest standards for elliptic curve cryptography; and cryptographic pairings. This book provides a comprehensive overview of integer factorization techniques, including dedicated chapters on polynomial selection, the block Lanczos method, and the FFT extension for algebraicgroup factorization algorithms. Graduate students and researchers in applied number theory and cryptography will benefit from this survey of Montgomery's work.
Categories:
Mathematics\\Computer Algebra
Year:
2017
Publisher:
Cambridge University Press
Language:
english
Pages:
281
ISBN 13:
9781316271575
File:
PDF, 4.19 MB
Download (pdf, 4.19 MB)
Preview
 Open in Browser
 Checking other formats...
 Convert to EPUB
 Convert to FB2
 Convert to MOBI
 Convert to TXT
 Convert to RTF
 Converted file can differ from the original. If possible, download the file in its original format.
 Please login to your account first

Need help? Please read our short guide how to send a book to Kindle.
The file will be sent to your email address. It may take up to 15 minutes before you receive it.
The file will be sent to your Kindle account. It may takes up to 15 minutes before you received it.
Please note you need to add our NEW email km@bookmail.org to approved email addresses. Read more.
Please note you need to add our NEW email km@bookmail.org to approved email addresses. Read more.
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
1

2

Topics in Computational Number Theory Inspired by Peter L. Montgomery Peter L. Montgomery has made significant contributions to computational number theory, introducing many basic tools such as Montgomery multiplication, Montgomery simultaneous inversion, Montgomery curves, and the Montgomery ladder. This book features stateoftheart research in computational number theory related to Montgomery’s work and its impact on computational efficiency and cryptography. It covers a wide range of topics such as Montgomery multiplication for both hardware and software implementations; Montgomery curves and twisted Edwards curves as proposed in the latest standards for elliptic curve cryptography; and cryptographic pairings. This book provides a comprehensive overview of integer factorization techniques, including dedicated chapters on polynomial selection, the block Lanczos method, and the FFT extension for algebraicgroup factorization algorithms. Graduate students and researchers in applied number theory and cryptography will benefit from this survey of Montgomery’s work. Joppe W. Bos is a cryptographic researcher at the Innovation Center for Cryptography & Security at NXP Semiconductors. He also currently serves as the Secretary of the International Association for Cryptologic Research (IACR). His research focuses on computational number theory and highperformance arithmetic as used in publickey cryptography. Arjen K. Lenstra is Professor of Computer Science at École Polytechnique Fédérale de Lausanne. His research focuses on cryptography and computational number theory, especially in areas such as integer factorization. He was closely involved in the development of the number field sieve method for integer factorization as well as several other cryptologic results. He is the recipient of the Excellence in the Field of Mathematics RSA Conference 2008 Award and a Fellow of the International Association for Cryptologic Research (IACR). Downloaded from https://www.cambridge.org/core. University of Texas Libraries, on 08 Jan 2020 at 20:33:58, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Downloaded from https://www.cambridge.org/core. University of Texas Libraries, on 08 Jan 2020 at 20:33:58, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Topics in Computational Number Theory Inspired by Peter L. Montgomery Edited by J O P P E W. B O S NXP Semiconductors, Leuven, Belgium ARJEN K. LENSTRA EPFL, Lausanne, Switzerland Downloaded from https://www.cambridge.org/core. University of Texas Libraries, on 08 Jan 2020 at 20:33:58, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 University Printing House, Cambridge CB2 8BS, United Kingdom One Liberty Plaza, 20th Floor, New York, NY 10006, USA 477 Williamstown Road, Port Melbourne, VIC 3207, Australia 4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi  110002, India 79 Anson Road, #0604/06, Singapore 079906 Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence. www.cambridge.org Information on this title: www.cambridge.org/9781107109353 DOI: 10.1017/9781316271575 © Joppe W. Bos and Arjen K. Lenstra 2017 This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2017 Printed in the United States of America by Sheridan Books, Inc. A catalogue record for this publication is available from the British Library Library of Congress CataloginginPublication data Names: Bos, Joppe W., editor.  Lenstra, A. K. (Arjen K.), 1956– editor. Title: Topics in computational number theory inspired by Peter L. Montgomery / edited by Joppe W. Bos, NXP Semiconductors, Belgium; Arjen K. Lenstra, EPFL, Lausanne, Switzerland. Description: Cambridge : Cambridge University Press, 2017.  Series: London Mathematical Society lecture note series  Includes bibliographical references and index. Identifiers: LCCN 2017023049  ISBN 9781107109353 (pbk. : alk. paper) Subjects: LCSH: Number theory.  Cryptography – Mathematics.  Montgomery, Peter L., 1947– Classification: LCC QA241 .T657 2017  DDC 512.7 – dc23 LC record available at https://lccn.loc.gov/2017023049 ISBN 9781107109353 Paperback Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or thirdparty Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate. Downloaded from https://www.cambridge.org/core. University of Texas Libraries, on 08 Jan 2020 at 20:33:58, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Contents List of Contributors Preface page xi xiii 1 1 1 5 8 1 1.1 1.2 1.3 1.4 Introduction Outline Biographical Sketch Overview Simultaneous Inversion 2 2.1 2.2 Montgomery Arithmetic from a Software Perspective Introduction Montgomery Multiplication 2.2.1 Interleaved Montgomery Multiplication 2.2.2 Using Montgomery Arithmetic in Practice 2.2.3 Computing the Montgomery Constants μ and R2 2.2.4 On the Final Conditional Subtraction 2.2.5 Montgomery Multiplication in F2k Using Primes of a Special Form 2.3.1 Faster Modular Reduction with Primes of a Special Form 2.3.2 Faster Montgomery Reduction with Primes of a Special Form Concurrent Computing of Montgomery Multiplication 2.4.1 Related Work on Concurrent Computing of Montgomery Multiplication 2.4.2 Montgomery Multiplication Using SIMD Extensions 2.4.3 A ColumnWise SIMD Approach 2.3 2.4 10 10 12 15 16 18 19 21 22 23 24 26 27 27 31 v Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 vi Contents 2.4.4 3 Montgomery Multiplication Using the Residue Number System Representation 36 3.15 3.16 3.17 Hardware Aspects of Montgomery Modular Multiplication Introduction and Summary Historical Remarks Montgomery’s Novel Modular Multiplication Algorithm Standard Acceleration Techniques Shifting the Modulus N 3.5.1 The Classical Algorithm 3.5.2 Montgomery Interleaving Multiplication Steps with Modular Reduction Accepting Inaccuracy in Quotient Digits 3.7.1 Traditional 3.7.2 Bounding the Partial Product 3.7.3 Montgomery 3.7.4 Summary Using Redundant Representations 3.8.1 Traditional 3.8.2 Montgomery Changing the Size of the Hardware Multiplier Shifting an Operand 3.10.1 Traditional 3.10.2 Montgomery Precomputing Multiples of B and N Propagating Carries and CarrySave Inputs Scaling the Modulus Systolic Arrays 3.14.1 A Systolic Array for A×B 3.14.2 Scalability 3.14.3 A Linear Systolic Array 3.14.4 A Systolic Array for Modular Multiplication SideChannel Concerns and Solutions Logic Gate Technology Conclusion 40 40 42 42 43 44 44 45 46 48 48 50 52 52 53 54 54 55 57 57 60 61 62 65 67 68 70 72 73 76 80 81 4 4.1 4.2 Montgomery Curves and the Montgomery Ladder Introduction Fast Scalar Multiplication on the Clock 82 82 83 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13 3.14 Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Contents 4.3 4.4 4.5 4.6 4.7 4.8 5 5.1 5.2 vii 4.2.1 The Lucas Ladder 4.2.2 Differential Addition Chains Montgomery Curves 4.3.1 Montgomery Curves as Weierstrass Curves 4.3.2 The Group Law for Weierstrass Curves 4.3.3 Other Views of the Group Law 4.3.4 Edwards Curves and Their Group Law 4.3.5 Montgomery Curves as Edwards Curves 4.3.6 EllipticCurve Cryptography (ECC) 4.3.7 Examples of Noteworthy Montgomery Curves Doubling Formulas without y 4.4.1 Doubling: The Weierstrass View 4.4.2 Optimized Doublings 4.4.3 A Word of Warning: Projective Coordinates 4.4.4 Completeness of Generic Doubling Formulas 4.4.5 Doubling: The Edwards View DifferentialAddition Formulas 4.5.1 Differential Addition: The Weierstrass View 4.5.2 Optimized Differential Addition 4.5.3 QuasiCompleteness 4.5.4 Differential Addition: The Edwards View The Montgomery Ladder 4.6.1 The Montgomery Ladder Step 4.6.2 ConstantTime Ladders 4.6.3 Completeness of the Ladder A TwoDimensional Ladder 4.7.1 Introduction to the TwoDimensional Ladder 4.7.2 Recursive Definition of the TwoDimensional Ladder 4.7.3 The OddOdd Pair in Each Line: First Addition 4.7.4 The EvenEven Pair in Each Line: Doubling 4.7.5 The Other Pair in Each Line: Second Addition Larger Differences 4.8.1 Examples of LargeDifference Chains 4.8.2 CFRC, PRAC, etc. 4.8.3 Allowing d to Vary 85 85 87 87 88 89 90 92 93 94 95 95 96 97 97 98 99 99 101 101 103 104 104 105 106 107 108 109 110 110 111 111 112 114 114 General Purpose Integer Factoring Introduction General Purpose Factoring 116 116 117 Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 viii Contents 5.6 5.2.1 TwoStep Approach 5.2.2 Smoothness and Lnotation 5.2.3 Generic Analysis 5.2.4 Smoothness Testing 5.2.5 Finding Dependencies 5.2.6 Filtering 5.2.7 Overall Effort Presieving General Purpose Factoring 5.3.1 Dixon’s Random Squares Method 5.3.2 Continued Fraction Method Linear and Quadratic Sieve 5.4.1 Linear Sieve 5.4.2 Quadratic Sieving: Plain 5.4.3 Quadratic Sieving: Fancy 5.4.4 Multiple Polynomial Quadratic Sieve Number Field Sieve 5.5.1 Earlier Methods to Compute Discrete Logarithms 5.5.2 Special Number Field Sieve 5.5.3 General Number Field Sieve 5.5.4 Coppersmith’s Modifications Provable Methods 117 119 120 121 123 123 126 126 126 127 129 129 132 133 134 137 139 145 152 158 160 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 Polynomial Selection for the Number Field Sieve The Problem Early Methods General Remarks A Lattice Based Method Skewness Base m Method and Skewness Root Sieve Later Developments 161 161 161 164 166 168 170 171 173 7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 The Block Lanczos Algorithm Linear Systems for Integer Factoring The Standard Lanczos Algorithm The Case of Characteristic Two Orthogonalizing a Sequence of Subspaces Construction of the Next Iterate Simplifying the Recurrence Equation Termination Implementation in Parallel Recent Developments 175 175 176 179 180 181 182 184 186 187 5.3 5.4 5.5 Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Contents 8 8.1 8.2 8.3 9 9.1 9.2 9.3 9.4 9.5 FFT Extension for AlgebraicGroup Factorization Algorithms Introduction FFT Extension for the Elliptic Curve Method 8.2.1 The Product Tree Algorithm 8.2.2 The POLYEVAL Algorithm 8.2.3 The POLYGCD Algorithm 8.2.4 Choice of Points of Evaluation 8.2.5 A Numerical Example FFT Extension for the p − 1 and p + 1 Methods 8.3.1 Constructing F (X ) by Scaling and Multiplying 8.3.2 Evaluation of a Polynomial Along a Geometric Progression ix 189 189 191 192 192 195 197 199 199 201 202 Cryptographic Pairings Preliminaries 9.1.1 Elliptic Curves 9.1.2 Pairings 9.1.3 PairingFriendly Elliptic Curves Finite Field Arithmetic for Pairings 9.2.1 Montgomery Multiplication 9.2.2 Multiplication in Extension Fields 9.2.3 Finite Field Inversions 9.2.4 Simultaneous Inversions Affine Coordinates for Pairing Computation 9.3.1 Costs for Doubling and Addition Steps 9.3.2 Working over Extension Fields 9.3.3 Simultaneous Inversions in Pairing Computation The DoubleAdd Operation and Parabolas 9.4.1 Description of the Algorithm 9.4.2 Application to Scalar Multiplication 9.4.3 Application to Pairings Squared Pairings 9.5.1 The Squared Weil Pairing 9.5.2 The Squared Tate Pairing 206 207 208 209 213 213 214 215 215 218 219 219 223 225 227 228 229 229 231 232 233 Bibliography Subject Index 235 261 Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Downloaded from https://www.cambridge.org/core. City, University of London, on 09 Jan 2020 at 19:53:13, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Contributors Joppe W. Bos, NXP Semiconductors, Leuven, Belgium Arjen K. Lenstra, EPFL, Lausanne, Switzerland Herman te Riele, CWI, Amsterdam, Netherlands Daniel Shumow, Microsoft Research, Redmond, USA Peter L. Montgomery, Self Colin D. Walter, Royal Holloway, University of London, Egham, United Kingdom Daniel J. Bernstein, University of Illinois at Chicago, Chicago, USA and Technische Universiteit Eindhoven, Eindhoven, The Netherlands Tanja Lange, Technische Universiteit Eindhoven, Eindhoven, The Netherlands Thorsten Kleinjung, University Leipzig, Leipzig, Germany and EPFL, Lausanne, Switzerland Emmanuel Thomé, Inria, Nancy, France Richard P. Brent, Australian National University, Canberra, Australia Alexander Kruppa, Technische Universität München, München, Germany Paul Zimmermann, Inria/LORIA, Nancy, France Kristin Lauter, Microsoft Research, Redmond, USA Michael Naehrig, Microsoft Research, Redmond, USA xi Downloaded from https://www.cambridge.org/core. City, University of London, on 08 Jan 2020 at 20:34:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Downloaded from https://www.cambridge.org/core. City, University of London, on 08 Jan 2020 at 20:34:16, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575 Preface This book was written in honor of Peter L. Montgomery and his inspirational contributions to computational number theory. The editors would like to extend their sincerest thanks to all authors for their enthusiastic response to our invitation to contribute, and to Nicole Verna for the cover design. xiii Downloaded from https://www.cambridge.org/core. Access paid by the UCSB Libraries, on 13 Sep 2018 at 19:36:36, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.001 Downloaded from https://www.cambridge.org/core. Access paid by the UCSB Libraries, on 13 Sep 2018 at 19:36:36, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.001 1 Introduction Joppe W. Bos, Arjen K. Lenstra, Herman te Riele, and Daniel Shumow 1.1 Outline This introductory chapter collects some personal background on Peter L. Montgomery, both provided by himself and based on stories from his colleagues and collaborators. This is followed by a brief introduction to the other chapters, and by a description of Peter’s simultaneous inversion method. 1.2 Biographical Sketch Peter was born in San Francisco, California, on September 25, 1947. He described his activities since his 1965 graduation from San Rafael High School (SRHS) in a letter for a 1995 highschool reunion; this letter is copied in its entirety in Figure 1.1, and further commented on throughout this chapter. At Berkeley, Peter’s undergraduate advisor was Derrick H. Lehmer, an excellent match given Peter’s research interests since high school. Ron Graham, one of Peter’s coauthors on his first three papers (all on Ramsey theory and written shortly after his time in Berkeley), recalls [164] that Peter had the “breakthrough idea” for the first of these papers, with coauthor Bruce Rothschild remembering Peter “as an extremely clever guy who came up with surprising ideas” [295]. The papers in question ([144], [145] and [146]) were further coauthored by Paul Erdös, Joel Spencer, and Ernst Straus. After this auspicious start, Peter’s next paper describes an algorithm that he developed at SDC to evaluate arbitrary Fortran boolean expressions in such a way that they result in a small number of instructions when compiled on a Control Data 7600 computer [250]. The paper is illustrative of the type of tinkering in which Peter excels. About twenty years after his time in Berkeley, and by the time that Peter was already “internationally famous” (cf. Figure 1.1), David G. Cantor supervised Peter’s PhD dissertation An FFT extension of the elliptic curve method 1 Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 2 Introduction Figure 1.1 Peter’s letter for his 1995 highschool reunion. of factorization [254]. Milos Ercegovac, who served on the PhD committee, related [143] that Cantor referred to Peter as “a genius UCLA never saw before,” how Peter was brilliant in solving engineering problems related to computer arithmetic, and that, as a result of his incisive reviews, authors on several occasions asked if they could include this “marvelous” reviewer as a coauthor because Peter’s comments invariably provided better solutions than Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 1.2 Biographical Sketch (a) At age 15 3 (b) In 2009 Figure 1.2 Peter L. Montgomery (from Peter’s collection and Wikipedia). those given in the submissions. Another member of the committee, Murray Schacher, recalled [299] how Peter “regularly swamped the system” in what may have been his first venture into large scale computing: verifying reducibility of X 8k + X m + 1 ∈ F2 [X] for large k and 0 < m < 8k. For this project Peter taught himself the Cantor–Zassenhaus method because the nonsparse 8k × 8k matrix required by the usual Berlekamp algorithm surpassed the system’s capacity [34, 87]. Peter did not accept the offer for a “research position in a New Jersey community lacking sidewalks” (cf. Figure 1.1). The offer was made after Peter’s successful visits at the Bellcore research facility in Morristown, NJ, during the summers of 1994 and 1995. During his stays Peter worked with the second author’s summer internship students Scott Contini and Phong Nguyen on implementing Peter’s new methods that were useful for the number field sieve integer factorization algorithm (cf. Section 1.3 below): block Lanczos with Scott in 1994 and the next year the final square root method with Phong (who for his project had to learn about integer lattices and lattice basis reduction, and did so only very reluctantly – after a scolding by Ramarathnam “Venkie” Venkatesan). A memorable event was a lab outing where Peter beat Phong at bowling and afterwards bought ice cream for the whole party. Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 4 Introduction In 1998 Peter accepted a position with the Security and Cryptography group at Microsoft Research (MSR) in Redmond, WA. He retired in 2014. His work at MSR was largely focused on pure research (cf. [227], [99], [259], [137], [138], and [73]). However, Peter began implementing libraries for asymmetric encryption for use in products as well. Peter’s first contribution was to write Diffie–Hellman key exchange and the Digital Signature Algorithm in the static bignum library. This was included in the default Microsoft Windows CAPI 1.0 library dssenh.dll, that shipped in Windows 2000. A later version of this library, msbignum, also included RSA as well as elliptic curve cryptography. In 2006, with Windows Vista, msbignum became the basis of the API for the Cryptographic Next Generation and underlay in the following decade all cryptography running on Windows at Microsoft. Peter’s role may have been unique at MSR at that time: over the course of the organization’s existence there have been very few teams at MSR, let alone individual researchers, delivering entire functional software libraries. During his time at MSR, Peter continued collaborations with researchers worldwide and was a regular longterm guest at various universities and research institutes. The first three authors of this chapter have fond memories of Peter’s visits (and are not the only ones who have lugged around Peter’s considerable suitcase when regular cabs were not up to the task [164]) – as Milos Ercegovac put it: a modest man with a sense of humor, a gentle chuckle when people make silly technical comments [143]. Numerous papers were the result of his visits: several record factorizations (140 digits [91], 512 bits [92], 768 bits [209, 211], and Mersenne numbers [71]); security estimates [68]; elliptic curve discrete logarithm cryptanalysis using Playstation 3 game consoles [70, 69]; improved stage two for Pollard’s p ± 1 method [261]; better elliptic curve factoring parameter selection [26]; and results on the period of Bell numbers [262]. While discussing his work at SDC in the early software industry Peter liked to proudly show a copy of The System Builders: The Story of SDC [32] with his name displayed on the front cover, about an inch underneath the title (cf. Figure 1.3). Peter never showed a similar pride in his more celebrated algorithmic and mathematical results – possibly because his work as a software engineer greatly motivated his mathematical research and he invented new approaches just to make his implementations more efficient. This is illustrated by Peter’s explanation how he got the inspiration for his paper “Modular multiplication without trial division” [251], arguably one of his most widely used results: when implementing textbook multiprecision multiplication in assembler on a PDP series computer, he noticed there were unused registers and Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 1.3 Overview 5 Figure 1.3 Front cover of [32] (© System Development Corporation, 1981). wanted to find a way to exploit them. He managed to do so by interleaving the multiplication with the modular reduction. When Peter began working, at SDC, he programmed using punch cards on timeshared main frame computers. By the time he had retired from MSR, he had implemented and optimized arithmetic code that was running on the mobile processors in smartphones. Peter worked as a programmer from the earliest stages of software engineering when computers needed to be held in their own buildings until the era of ubiquitous computing where people carry computing devices in their pockets. Both his algorithmic and mathematical contributions and his work as a programmer were instrumental to these massive advances that spanned his career – a career of more than forty years, impressive by the standards of the software industry. 1.3 Overview The subsequent chapters of this book are on active research areas that were initiated by Peter or to which he has made substantial contributions. This section contains an overview of the subjects that are treated. As mentioned above, Peter’s 1985 method for modular multiplication without trial division [251] was invented almost serendipitously when Peter tried to put otherwise idle registers to good use. Because it allows simple implementation both in software and in hardware, Montgomery multiplication (as the method is now commonly referred to) has found broad application. Its software and hardware aspects are covered in Chapters 2 and 3, respectively. Peter’s contributions as described and expanded upon in Chapters 4 through 8 were all inspired by integer factorization, his main research interest since high school. In [252, section 10.3.1] Peter describes how his original implementation Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 6 Introduction of the elliptic curve method of integer factorization [239] uses the Weierstrass equation with affine coordinates, and how the modular inversion required for the point addition can be made faster if several curves are run at once (cf. Section 1.4 on page 8). He then writes “The author later discovered an alternative parametrization that requires no inversions” and continues with the description of what is now known as Montgomery curves and the Montgomery ladder. The former is an elliptic curve parametrization that allows fast computation of the sum of two points if their difference is known and that does not require modular inversion. The latter refers to the addition chain that can then be used to compute any scalar multiple of a point. Initially these results were relevant solely to those implementing the elliptic curve integer factorization method, but the later growing importance of elliptic curve cryptography increased the interest in Peter’s and yet other curve parametrizations and addition chains. Chapter 4 describes the developments in this field that have taken place since the publication of [252]. In parallel with his work on specialpurpose integer factoring methods in [252], Peter also heavily influenced the two most important methods of the more generic integer factorization algorithms that are described in Chapter 5: the quadratic sieve and the number field sieve, two methods that were initially plagued by severe performance problems [282, 234]. Although Peter was not the first to propose a practical version of the quadratic sieve, his multiple polynomial quadratic sieve (as published by Robert Silverman in [315] and described in Chapter 5), represented the state of the art in integer factorization from the mid 1980s until the early 1990s. It was later enhanced by selfinitialization [287], but by that time the number field sieve had taken over as the recordbreaking factoring method, a position it holds to the present day. Peter played a prominent role in that development, contributing important ideas to all steps of the number field sieve: polynomial selection, (early) sieving, building a matrix, processing the matrix, and the final square root calculation. A brief summary follows, with Chapter 5 describing in more detail how the steps fit together. In the first step of the number field sieve two or more polynomials must be found that have a root in common modulo the number to be factored. Though theoretically such polynomials are easily found, small modifications may trigger orders of magnitude reductions of the resulting factoring effort. Finding good polynomials has become a subject in its own right since Peter’s early contributions (many of which were first published in Brian Murphy’s 1999 PhD dissertation [266]). Chapter 6 is entirely devoted to the current state of the art of polynomial selection for the number field sieve. Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 1.3 Overview 7 The number field sieve (and many other sievingbased integer factorization methods) relies on the ability to quickly recognize which values in an arithmetic progression have many small factors. Chapter 5 briefly mentions some of Peter’s many tricks which are now part of the standard repertoire to speed up this socalled sieving step. His improvements to the earliest line sieving software for the number field sieve by the second author were used for numerous integer factorizations in the 1990s [78, 235]. For most composites, however, line sieving has since then been surpassed by lattice sieving [281]. The sieving step results in a large, very sparse matrix. In the matrix step dependencies among the columns of this matrix must be found. Given its sparsity, it is advantageous to first transform the matrix into an equivalent matrix that is smaller but less sparse. Originally, when regular Gaussian elimination was still used for the matrix step, a set of ad hoc methods referred to as structured Gaussian elimination [226, 286] was used to find a transformed matrix with a smallest possible number of rows, but without paying attention to its density. The more sophisticated tools that became available for the matrix step in the mid 1990s required an adaption of the matrix transformation method that would minimize the product of the number of rows of the matrix and its number of nonzero entries. These days this is referred to as filtering. The details of the various filtering steps proposed by Peter (and based on the earlier work in [226, 286]) can be found in [89, 90] and are sketched in Section 5.2.6 on page 123. The above “more sophisticated tools” are the block Wiedemann and block Lanczos methods, based on the classical Wiedemann and Lanczos methods. They were almost simultaneously and independently developed by Don Coppersmith and Peter, respectively, and have comparable performance [106, 257]. Though both methods allow efficient distributed implementation, for block Lanczos all processors must be able to communicate quickly (i.e., a single cluster with a fast interconnection network), whereas block Wiedemann can be made to run efficiently on a modest number of disjoint clusters. Almost since its inception in the mid 1990s, block Lanczos became the method of choice for all academic factoring projects. Around 2005 it became more convenient for these applications to use a number of separate clusters for the matrix step, which implied a switch to block Wiedemann [209, 211]. Chapter 7 describes the block Lanczos method in detail. For the number field sieve each column dependency produced by the matrix step requires a square root calculation in the number fields involved before a factorization of the targeted composite may be derived. In [255] Peter presented an iterative approach that cleverly exploits the special form of the squared algebraic numbers, thereby replacing Jean–Marc Couveignes’ much slower and Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 8 Introduction more restrictive method [113]. Peter’s method is used, essentially unmodified, until the present day, and sketched in Section 5.5.3 on page 157. Returning to specialpurpose integer factorization, in the last paragraph of his p − 1 paper [278] John Pollard mentions “we would theoretically use the fast Fourier transform on Step 2,” concluding his paper with “I cannot say at what point this becomes a practical proposition.” In [252, section 4] Peter refers to Pollard’s remark, and further pursues the idea for Pollard’s p − 1 method, first with Silverman in [263] and later with Alexander Kruppa in [261]. For the elliptic curve factoring method it is the subject of Peter’s PhD dissertation [254]. Peter’s work on this subject is presented in Chapter 8. In Chapter 9 Peter’s contributions to cryptographic pairings are outlined. Given Peter’s simultaneous inversion method (described below in Section 1.4) and his fast modular inversion it became attractive, contrary to conventional wisdom, to replace projective coordinates by affine ones for applications involving multiple or parallelized pairings, products of pairings, or pairings at high security levels [137, 138, 227]. Moreover, Peter coauthored a paper introducing algorithms to compute the squared pairing, which has the advantage, compared to Victor Miller’s algorithm, that its computation cannot fail [248]. 1.4 Simultaneous Inversion For completeness, this section describes Peter Montgomery’s simultaneous inversion method, a popular method that did not lead to any followup work and which is therefore not treated in the later chapters. At a computational number theory conference held in August 1985 at Humboldt State University in Arcata, CA, Peter gave a presentation about his experience using the elliptic curve integer factorization method ([239], informally announced in February 1985). He proposed to maximize the overall throughput of the method as opposed to minimizing the latency per curve, thus exploiting the method’s inherent parallelism. The idea is simple: at a modest overhead the cost of the modular inversion required for each elliptic curve group operation in the Weierstrass model can be shared among any number of simultaneously processed curves, assuming all curves target the same number to be factored. It has become known as simultaneous inversion, and was later briefly mentioned in [252, section 10.3.1]: When working over several curves, the program used a scheme similar to (1.2) to do all the inversions at once, since (1/x) = y(1/xy) and (1/y) = x(1/xy). This reduces the asymptotic cost of an inversion to that of 3 multiplications. Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 1.4 Simultaneous Inversion 9 The “scheme similar to (1.2)” suggests that Peter took his inspiration from John Pollard’s observation that the cost of a number of greatest common divisor calculations with the same tobefactored modulus can be reduced to that same number of multiplications modulo the same modulus, plus a single gcd calculation. In the inversion context, there may have been a historical precedent for Peter’s trick, with Richard Schroeppel recalling he may have seen a similar method to avoid floating divides in pre1970s Fortran programs. But even if it is not entirely original, Peter’s insight cleverly adapts an old idea to a regime where numerics are exact [303]. These days, simultaneous inversion is, for instance, used in record computations of elliptic curve discrete logarithms (e.g., [69] and the record attempt [20]; it was not used in [344]). Let z1 , z2 , . . . , zk ∈ Z/nZ for some modulus n be the elements that must be inverted; in the rare case that at least one of the inverses does not exist, simultaneous inversion fails, and an appropriate alternative approach is used. Working in Z/nZ and with w0 = 1, first for i = 1, 2, . . . , k in succession calculate and store wi = wi−1 zi . Next, calculate w = wk−1 . Finally, for i = k, k − 1, . . . , 1 as wwi−1 and next replace w by wzi (so that in succession first compute z−1 i −1 ). The overall cost is 3(k − 1) multiplications, a single inversion, and w = wi−1 storage for k temporary values, all in Z/nZ. Given that for relevant modulus sizes and software implementation of arithmetic in Z/nZ inversion is at least five times costlier than multiplication, the resulting speed up can be significant. One should be aware, though, that in hardware the difference can be made smaller [200]. Downloaded from https://www.cambridge.org/core. Teachers College Library  Columbia University, on 02 Nov 2017 at 06:38:23, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.002 2 Montgomery Arithmetic from a Software Perspective Joppe W. Bos and Peter L. Montgomery We propose a representation of residue classes so as to speed modular multiplication without affecting the modular addition and subtraction algorithms. – Peter L. Montgomery [251] Abstract This chapter describes Peter L. Montgomery’s modular multiplication method and the various improvements to reduce the latency for software implementations on devices that have access to many computational units. 2.1 Introduction Modular multiplication is a fundamental arithmetical operation, for instance when computing in a finite field or a finite ring, and forms one of the basic operations underlying almost all currently deployed publickey cryptographic protocols. The efficient computation of modular multiplication is an important research area since optimizations, even ones resulting in a constantfactor speedup, have a direct impact on our daytoday information security life. In this chapter we review the computational aspects of Peter L. Montgomery’s modular multiplication method [251] (known as Montgomery multiplication) from a software perspective (while the next chapter highlights the hardware perspective). Throughout this chapter we use the terms digit and word interchangeably. To be more precise, we typically assume that a bbit nonnegative multiprecision integer X is represented by an array of n = b/r computer words as X= n−1 xi ri i=0 10 Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.1 Introduction 11 (the socalled radixr representation), where r = 2w for the word size w of the target computer architecture and 0 ≤ xi < r. Here xi is the ith word of the integer X. Let N be the positive modulus consisting of n digits while the input values to the modular multiplication method (A and B) are nonnegative integers smaller than N and consist of up to n digits. When computing modular multiplication C = AB mod N, the definitional approach is first to compute the product P = AB. Next, a division is computed to obtain P = NQ + C such that both C and Q are nonnegative integers less than N. Knuth studies such algorithms for multiprecision nonnegative integers [215, Alg. 4.3.1.D]. Counting wordbyword instructions, the method described by Knuth requires O(n2 ) multiplications and O(n) divisions when implemented on a computer platform. However, on almost all computer platforms divisions are expensive (relative to the cost of a multiplication). Is it possible to perform modular multiplication without using any division instructions? If one is allowed to perform some precomputation which only depends on the modulus N, then this question can be answered affirmatively. When computing the division step, the idea is to use only “cheap” divisions and “cheap” modular reductions when computing the modular multiplication in combination with a precomputed constant (the computation of which may require “expensive” divisions). These “cheap” operations are computations which either come for free or at a low cost on computer platforms. Virtually all modern computer architectures internally store and compute on data in binary format using some fixed wordsize r = 2w as above. In practice, this means that all arithmetic operations are implicitly computed modulo 2w (i.e., for free) and divisions or multiplications by (powers of) 2 can be computed by simply shifting the content of the register that holds this value. Barrett introduced a modular multiplication approach (known as Barrett multiplication [31]) using this idea. This approach can be seen as a Newton method which uses a precomputed scaled variant of the modulus’ reciprocal in order to use only such “cheap” divisions when computing (estimating and adjusting) the division step. After precomputing a single (multiprecision) value, an implementation of Barrett multiplication does not use any division instructions and requires O(n2 ) multiplication instructions. Another and earlier approach based on precomputation is the main topic of this chapter: Montgomery multiplication. This method is the preferred choice in cryptographic applications when the modulus has no “special” form (besides being an odd positive integer) that would allow more efficient modular reduction techniques. See Section 2.3 on page 22 for applications of Montgomery multiplication in the “special” setting. In practice, Montgomery Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 12 Montgomery Arithmetic from a Software Perspective multiplication is the most efficient method when a generic modulus is used (see e.g., the comparison performed by Bosselaers, Govaerts, and Vandewalle [75]) and has a very regular structure which speeds up the implementation. Moreover, the structure of the algorithm (especially if its single branch, the notorious conditional “subtraction step,” can be avoided, cf. page 20 in Section 2.2.4) has advantages when guarding against certain types of cryptographic attacks (for more information on differential power analysis attacks see the seminal paper by Kocher, Jaffe, and Jun [221]). In the next chapter, Montgomery’s method is compared with a version of Barrett multiplication in order to be more precise about the computational advantages of the former technique. As observed by Shand and Vuillemin in [310], Montgomery multiplication can be seen as a generalization of Hensel’s odd division [184] for 2adic numbers. In this chapter we explain the motivation behind Montgomery arithmetic. More specifically, we show how a change of the residue class representatives used improves the performance of modular multiplication. Next, we summarize some of the proposed modifications to Montgomery multiplication, which can further speed up the algorithm in certain settings. Finally, we show how to implement Montgomery arithmetic in software. We especially study how to compute a single Montgomery multiplication concurrently using either vector instructions or when many computational units are available that can compute in parallel. 2.2 Montgomery Multiplication Let N be an odd bbit integer and P a 2bbit integer such that 0 ≤ P < N 2 . The idea behind Montgomery multiplication is to change the representatives of the residue classes and change the modular multiplication accordingly. More precisely, we are not going to compute P mod N but 2−b P mod N instead. This explains the requirement that N needs to be odd (otherwise 2−b mod N does not exist). It turns out that this approach is more efficient (by a constant factor) on computer platforms. Let us start with a basic example to illustrate the strategy used. A first idea is to reduce the value P one bit at a time and repeat this for b bits such that the result has been reduced from 2b to b bits (as required). This can be achieved without computing any expensive modular divisions by noting that P/2 if P is even, −1 2 P mod N = (P + N)/2 if P is odd. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.2 Montgomery Multiplication 13 When P is even, the division by two can be computed with a basic operation on computer architectures: shift the number one position to the right. When P is odd one can not simply compute this division by shifting. A computationally efficient approach to compute this division by two is to make this number P even by adding the odd modulus N, since obviously modulo N this is the same. This allows one to compute 2−1 P mod N at the cost of (at most) a single addition and a single shift. Let us compute D < 2N and Q < 2b such that P = 2b D − NQ since then D ≡ 2−b P mod N. Initially set D equal to P and Q equal to zero. We denote i by qi the ith digit when Q is written in binary (radix2), i.e., Q = b−1 i=0 qi 2 . Next perform the following two steps b times starting at i = 0 until the last time when i = b − 1: (Step 1) qi = D mod 2, (Step 2) D = (D + qi N)/2. This procedure gradually builds the desired Q and at the start of every iteration P = 2i D − NQ remains invariant. The process is illustrated in the example below. Example 2.1 (Radix2 Montgomery reduction) For N = 7 (3 bits) and P = 20 < 72 we compute D ≡ 2−3 P mod N. At the start of the algorithm, set D = P = 20 and Q = 0. i = 0, 20 = 20 · 20 − 7 · 0 (Step 1) q0 = 20 mod 2 = 0, i = 1, 20 = 21 · 10 − 7 · 0 (Step 1) q1 = 10 mod 2 = 0, i = 2, 20 = 22 · 5 − 7 · 0 (Step 1) q2 = 5 mod 2 = 1, ⇒ 2−0 · 20 ≡ 20 mod 7 (Step 2) D = (20 + 0 · 7)/2 = 10 ⇒ 2−1 · 20 ≡ 10 mod 7 (Step 2) D = (10 + 0 · 7)/2 = 5 ⇒ 2−2 · 20 ≡ 5 mod 7 (Step 2) D = (5 + 1 · 7)/2 = 6 Since Q = q0 20 + q1 21 + q2 22 = 4 and P = 20 = 2k D − NQ = 23 · 6 − 7 · 4, we have computed 2−3 · 20 ≡ 6 mod 7. The approach behind Montgomery multiplication [251] generalizes this idea. Instead of dividing by two at every iteration the idea is to divide by a Montgomery radix R which needs to be coprime to, and should be larger than, N. By precomputing the value μ = −N −1 mod R, Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 14 Montgomery Arithmetic from a Software Perspective Algorithm 2.1 The Montgomery reduction algorithm. Compute PR−1 modulo the odd modulus N given the Montgomery radix R > N and using the precomputed Montgomery constant μ = −N −1 mod R. Input: N, P, such that 0 ≤ P < N 2 . Output: C ≡ PR−1 mod N such that 0 ≤ C < N. 1: q ← μ(P mod R) mod R 2: C ← (P + Nq)/R 3: if C ≥ N then 4: C ←C−N 5: end if 6: return C adding a specific multiple of the modulus N to the current value P ensures that (2.1) P + N (μP mod R) ≡ P − N N −1 P mod R ≡P−P≡0 (mod R). Hence, P + N (Pμ mod R) is divisible by the Montgomery radix R while P does not change modulo N. Let P be the product of two nonnegative integers that are both less than N. After applying Equation (2.1) and dividing by R, the value P (bounded by N 2 ) has been reduced to at most 2N since 0≤ P + N(μP mod R) N 2 + NR < < 2N R R (2.2) (since R was chosen larger than N). This approach is summarized in Algorithm 2.1: given an integer P bounded by N 2 , it computes PR−1 mod N, bounded by N, without using any “expensive” division instructions when assuming the reductions modulo R and divisions by R can be computed (almost) for free. On most computer platforms, where one chooses R as a power of two, this assumption is indeed true. Equation (2.2) guarantees that the output is bounded by 2N. Hence, a conditional subtraction needs to be computed at the end of Algorithm 2.1 to ensure the output is less than N. The process is illustrated in the example below, where P is the product of integers A, B with 0 ≤ A, B < N. Example 2.2 (Montgomery multiplication) Exact divisions by 102 = 100 are visually convenient when using a decimal system: just shift the number two places to the right (or “erase” the two least significant digits). Assume the following modular reduction approach: use the Montgomery radix R = 100 when computing modulo N = 97. This example Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.2 Montgomery Multiplication 15 computes the Montgomery product of A = 42 with B = 17. First, precompute the Montgomery constant μ = −N −1 mod R = −97−1 mod 100 = 67. After computing the product P = AB = 42 · 17 = 714, compute the first two steps of Algorithm 2.1 omitting the division by R: P + N(μP mod R) = 714 + 97(67 · 714 mod 100) = 714 + 97(67 · 14 mod 100) = 714 + 97(938 mod 100) = 714 + 97 · 38 = 4400. Indeed, 4400 is divisible by R = 100 and we have computed (AB)R−1 ≡ 42 · 17 · 100−1 ≡ 44 (mod 97) without using any “difficult” modular divisions. 2.2.1 Interleaved Montgomery Multiplication When working with multiprecision integers, integers consisting of n digits of w bits each, it is common to write the Montgomery radix R as R = rn = 2wn , where w is the wordsize of the architecture where Montgomery multiplication is implemented. The Montgomery multiplication method (as outlined in Algorithm 2.1) assumes the multiplication is computed before performing the Montgomery reduction. This has the advantage that one can use asymptotically fast multiplication methods (like e.g., Karatsuba [202], Toom–Cook [327, 102], Schönhage–Strassen [301], or Fürer [152] multiplication). However, this has the disadvantage that the intermediate results can be as large as r2n+1 . Or, stated differently, when using a machine word size of w bits the intermediate results are represented using at most 2n + 1 computer words. The multiprecision setting was already handled in Montgomery’s original paper [251, section 2] and the reduction and multiplication were meant to be interleaved by design (see also the remark in the second to last paragraph of Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 16 Montgomery Arithmetic from a Software Perspective Algorithm 2.2 The radixr interleaved Montgomery multiplication algorithm. Compute (AB)R−1 modulo the odd modulus N given the Montgomery radix R = rn and using the precomputed Montgomery constant μ = −N −1 mod r. The modulus N is such that rn−1 ≤ N < rn and r and N are coprime. i Input: A = n−1 i=0 ai r , B, N such that 0 ≤ ai < r, 0 ≤ A, B < R. Output: C ≡ (AB)R−1 mod N such that 0 ≤ C < N. 1: C ← 0 2: for i = 0 to n − 1 do 3: C ← C + ai B 4: q ← μC mod r 5: C ← (C + Nq)/r 6: end for 7: if C ≥ N then 8: C ←C−N 9: end if 10: return C Section 1.2 on page 5). When representing the integers in radixr representation A= n−1 ai ri , such that 0 ≤ ai < r i=0 then the radixr interleaved Montgomery multiplication (see also the work by Dussé and Kaliski Jr. in [134]) ensures the intermediate results never exceed r + 2 computer words. This approach is presented in Algorithm 2.2. Note that this interleaves the naive schoolbook multiplication algorithm with the Montgomery reduction and therefore does not make use of any asymptotically faster multiplication algorithm. The idea is that every iteration divides by the value r (instead of dividing once by R = rn in the “noninterleaved” Montgomery multiplication algorithm). Hence, the value for μ is adjusted accordingly. In [93], Koç, Acar, and Kaliski Jr. compare different approaches to implementing multiprecision Montgomery multiplication. According to this analysis, the interleaved radixr approach, referred to as coarsely integrated operand scanning in [93], performs best in practice. 2.2.2 Using Montgomery Arithmetic in Practice As we have seen earlier in this section and in Algorithm 2.1 on page 14 Montgomery multiplication computes C ≡ PR−1 mod N. It follows that, in order to use Montgomery multiplication in practice, one should transform the input Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.2 Montgomery Multiplication 17 operands A and B to Ã = AR mod N and B̃ = BR mod N: this is called the Montgomery representation. The transformed inputs (converted to the Montgomery representation) are used in the Montgomery multiplication algorithm. At the end of a series of modular multiplications the result, in Montgomery representation, is transformed back. This works correctly since Montgomery multiplication M(Ã, B̃, N ) computes (ÃB̃)R−1 mod N and it is indeed the case that the Montgomery representation C̃ of C = AB mod N is computed from the Montgomery representations of A and B since C̃ ≡ M(Ã, B̃, N ) ≡ (ÃB̃)R−1 ≡ (AR)(BR)R−1 ≡ (AB)R ≡ CR (mod N). Converting an integer A to its Montgomery representation Ã = AR mod N can be performed using Montgomery multiplication with the help of the precomputed constant R2 mod N since M(A, R2 , N ) ≡ (AR2 )R−1 ≡ AR ≡ Ã (mod N). Converting (the result) back from the Montgomery representation to the regular representation is the same as computing a Montgomery multiplication with the integer value one since M(Ã, 1, N ) ≡ (Ã · 1)R−1 ≡ (AR)R−1 ≡ A (mod N). As mentioned earlier, due to the overhead of changing representations, Montgomery arithmetic is best when used to replace a sequence of modular multiplications, since this overhead is amortized. A typical usecase scenario is when computing a modular exponentiation as required in the RSA cryptosystem [294]. As noted in the original paper [251] (see the quote at the start of this chapter) computing with numbers in Montgomery representation does not affect the modular addition and subtraction algorithms. This can be seen from Ã ± B̃ ≡ AR ± BR ≡ (A ± B)R ≡ A ±B (mod N). Computing the Montgomery inverse is, however, affected. The Montgomery −1 . This is different inverse of a value Ã in Montgomery representation is A −1 from computing the inverse of Ã modulo N since Ã ≡ (AR)−1 ≡ A−1 R−1 (mod N) is the Montgomery representation of the value A−1 R−2 . One of the correct ways of computing the Montgomery inverse is to invert the number Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 18 Montgomery Arithmetic from a Software Perspective in its Montgomery representation and Montgomery multiply this result by R3 since −1 M(Ã−1 , R3 , N ) ≡ ((AR)−1 R3 )R−1 ≡ A−1 R ≡ A (mod N). Another approach, which does not require any precomputed constant, is to compute the Montgomery reduction of a Montgomery residue Ã twice before inverting since M(M(Ã, 1, N ), 1, N )−1 ≡ M((AR)R−1 , 1, N )−1 ≡ M(A, 1, N )−1 ≡ (AR−1 )−1 ≡ A−1 R −1 ≡A (mod N). 2.2.3 Computing the Montgomery Constants μ and R2 In order to use Montgomery multiplication one has to precompute the Montgomery constant μ = −N −1 mod r. This can be computed with, for instance, the extended Euclidean algorithm. A particularly efficient algorithm to compute μ when r is a power of two and N is odd, the typical setting used in cryptology, is given by Dussé and Kaliski Jr. and presented in [134]. This approach is recalled in Algorithm 2.3. To show that this approach is correct, it suffices to show that at the start of Algorithm 2.3 and at the end of every iteration we have Nyi ≡ 1 (mod 2i ). This can be shown by induction as follows. At the start of the algorithm we set y to Algorithm 2.3 Compute the Montgomery constant μ = −N −1 mod r for odd values N and r = 2w as presented by Dussé and Kaliski Jr. in [134]. Input: Odd integer N and r = 2w for w ≥ 1. Output: μ = −N −1 mod r y←1 for i = 2 to w do if (Ny mod 2i ) = 1 then y ← y + 2i−1 end if end for return μ ← r − y Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.2 Montgomery Multiplication 19 one, denote this start setting with y1 , and the condition holds since N is odd by assumption. Denote with yi , for 2 ≤ i ≤ w, the value of y in Algorithm 2.3 at the end of iteration i. When i > 1, our induction hypothesis is that Nyi−1 = 1 + 2i−1 m for some positive integer m, at the end of iteration i − 1. We consider two cases r (m is even) Since Nyi−1 = 1 + m 2i ≡ 1 (mod 2i ) we can simply update yi 2 to yi−1 and the condition holds. r (m is odd) Since Nyi−1 = 1 + 2i−1 + m−1 2i ≡ 1 + 2i−1 (mod 2i ), we 2 update yi with yi−1 + 2i−1 . We obtain N(yi−1 + 2i−1 ) = 1 + 2i−1 (1 + N) + m−1 i 2 ≡ 1 (mod 2i ) since N is odd. 2 Hence, after the for loop yw is such that Nyw ≡ 1 mod 2w and the returned value μ = r − yw ≡ 2w − N −1 ≡ −N −1 (mod 2w ) has been computed correctly. The precomputed constant R2 mod N is required when converting a residue modulo N from its regular to its Montgomery representation (see Section 2.2.2 on page 16). When R = rn is a power of two, which in practice is typically the case since r = 2w , then this precomputed value R2 mod N can also be computed efficiently. For convenience, assume R = rn = 2wn and 2wn−1 ≤ N < 2wn (but this approach can easily be adapted when N is smaller than 2wn−1 ). Commence by setting the initial c0 = 2wn−1 < N. Next, start at i = 1 and compute ci ≡ ci−1 + ci−1 mod N and increase i until i = wn + 1. The final value cwn+1 ≡ 2wn+1 c0 ≡ 2wn+1 2wn−1 ≡ 22wn ≡ (2wn )2 ≡ R2 mod N as required and can be computed with wn + 1 modular additions. 2.2.4 On the Final Conditional Subtraction It is possible to alter or even completely remove the conditional subtraction from lines 3–4 in Algorithm 2.1 on page 14. This is often motivated by either performance considerations or turning the (software) implementation into straightline code that requires no conditional branches. This is one of the basic requirements for cryptographic implementations that need to protect themselves against a variety of (simple) sidechannel attacks as introduced by Kocher, Jaffe, and Jun [221] (those attacks that use physical information, such as elapsed time, obtained when executing an implementation, to deduce information about the secret key material used). Ensuring constant running time is a first step in achieving this goal. In order to change or remove this final conditional subtraction the general idea is to bound the input and output of the Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 20 Montgomery Arithmetic from a Software Perspective Montgomery multiplication in such a way that they can be reused in a subsequent Montgomery multiplication computation. This means using a redundant representation, in which the representation of the residues used is not unique and can be larger than N. 2.2.4.1 Subtractionless Montgomery Multiplication The conditional subtraction can be omitted when the size of the modulus N is appropriately selected with respect to the Montgomery radix R. (This is a result presented by Shand and Vuillemin in [310] but see also the sequence of papers by Walter, Hachez, and Quisquater [337, 177, 340].) The idea is to select the modulus N such that 4N < R and to use a redundant representation for the input and output values of the algorithm. More specifically, we assume A, B ∈ Z/2NZ (residues modulo 2N), where 0 ≤ A, B < 2N, since then the outputs of Algorithm 2.1 on page 14 and Algorithm 2.2 on page 16 are bounded by 0≤ (2N)2 + NR NR + NR AB + N(μAB mod R) < < = 2N. (2.3) R R R Hence, the result can be reused as input to the same Montgomery multiplication algorithm. This avoids the need for the conditional subtraction except in a final correction step (after having computed a sequence of Montgomery multiplications) where one reduces the value to a unique representation with a single conditional subtraction. In practice, this might reduce the number of arithmetic operations whenever the modulus can be chosen beforehand and, moreover, simplifies the code. However, in the popular usecases in cryptography, e.g., in the setting of computing modular exponentiations when using schemes based on RSA [294] where the bitlength of the modulus must be a power of two due to compliance with cryptographic standards, the condition 4N < R results in a Montgomeryradix R, which is represented using one additional computer word (compared to the number of words needed to represent the modulus N). Hence, in this setting, such a multiprecision implementation without a conditional subtraction needs one more iteration (when using the interleaved Montgomery multiplication algorithm) to compute the result compared to a version that computes the conditional subtraction. 2.2.4.2 Montgomery Multiplication with a Simpler Final Comparison Another approach is not to remove the subtraction but make this operation computationally cheaper. See the analysis by Walter and Thompson in [342, section 2.2] which is introduced again by Pu and Zhao in [292]. In practice, the Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.2 Montgomery Multiplication 21 Montgomeryradix R = rn is often chosen as a multiple of the wordsize of the computer architecture used (e.g., r = 2w for w ∈ {8, 16, 32, 64}). The idea is to reduce the output of the Montgomery multiplication to {0, 1, . . . , R − 1}. instead of to the smaller range {0, 1, . . . , N − 1}. Just as above, this is a redundant representation but working with residues from Z/RZ. This representation does not need more computer words to represent the result and therefore does not increase the number of iterations one needs to compute; something which might be the case when the Montgomery radix is increased to remove the con ditional subtraction. Computing the comparison if an integer x = ni=0 xi ri is at least R = rn can be done efficiently by verifying if the most significant word xn is nonzero. This is significantly more efficient compared with computing a full multiprecision comparison. This approach is correct since if the input values A and B are bounded by R then the output of the Montgomery multiplication, before the conditional subtraction, is bounded by 0≤ R2 + NR B + N(μAB mod R) < = R + N. R R (2.4) Subtracting N whenever the result is at least R ensures that the output is also less than R. Hence, one still requires to evaluate the condition for subtraction in every Montgomery multiplication. However, the greaterorequal comparison becomes significantly cheaper and the number of iterations required to compute the interleaved Montgomery multiplication algorithm remains unchanged. In the setting where a constant running time is required this approach does not seem to bring a significant advantage (see the security analysis by Walter and Thompson in [342, section 2.2] for more details). A simple constant running time solution is to compute the subtraction and select this result if no borrow occurred. However, when constant running time is not an issue this approach (using a cheaper comparison) can speed up the Montgomery multiplication algorithm. 2.2.5 Montgomery Multiplication in F2k The idea behind Montgomery multiplication carries over to finite fields of cardinality 2k as well. Such finite fields are known as binary fields or characteristictwo finite fields. The application of Montgomery multiplication to this setting is outlined by Koç and Acar in [219]. Let n(x) be an irreducible polynomial of degree k. Then an element a(x) ∈ F2k ∼ = F2 [x]/(n(x)) can be represented in the Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 22 Montgomery Arithmetic from a Software Perspective polynomialbasis representation by a polynomial of degree at most k − 1 a(x) = k−1 ai xi where ai ∈ F2 . i=0 The equivalent of the Montgomeryradix is the polynomial r(x) ∈ F2 [x]/(n(x)), which in practice is chosen as r(x) = xk . Since n(x) is irreducible this ensures that the inverse r−1 (x) mod n(x) exists and the Montgomery multiplication a(x)b(x)r−1 (x) mod n(x) is well defined. Let a(x), b(x) ∈ F2 [x]/(n(x)) and their product p(x) = a(x)b(x) of degree at most 2(k − 1). Computing the Montgomery reduction p(x)r−1 (x) mod n(x) of p(x) can be done using the same steps as presented in Algorithm 2.1 on page 14 given the precomputed Montgomery constant μ(x) = −n(x)−1 mod r(x). Hence, one computes q(x) = p(x)μ(x) mod r(x) c(x) = (p(x) + q(x)n(x))/r(x). Note that the final conditional subtraction step is not required since deg(c(x)) ≤ max(2(k − 1), k − 1 + k) − k = k − 1, (because r(x) is a degree k polynomial). A large characteristic version of this approach using the interleaved Montgomery multiplication for finite fields of large prime characteristic from Section 2.2.1 on page 15, works here as well. 2.3 Using Primes of a Special Form In some settings in cryptography, most notably in elliptic curve cryptography (introduced independently by Miller and Koblitz in [247, 217]), the (prime) modulus can be chosen freely and is fixed for a large number of modular arithmetic operations. In order to gain a constant factor speedup when computing the modular multiplication, Solinas suggested [316] a specific set of special primes which were subsequently included in the FIPS 186 standard [330] used in publickey cryptography. More recently, prime moduli of the form 2s ± δ have gained popularity where s, δ ∈ Z>0 and δ < 2s such that δ is a (very) small integer. More precisely, the constant δ is small compared to the typical wordsize of computer architectures used (e.g., less than 232 ) and often is chosen as the smallest integer such that one of 2s ± δ is prime. One should Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.3 Using Primes of a Special Form 23 be aware that the usage of such primes of a special form not only accelerates the cryptographic implementations, but the cryptanalytic methods benefit as well. See, for instance, the work by this chapter’s authors, Kleinjung, and Lenstra related to efficient arithmetic to factor Mersenne numbers (numbers of the form 2M − 1) in [71]. An example of one of the primes, suggested by Solinas, is 2256 − 2224 + 2192 + 296 − 1 where the exponents are selected to be a multiple of 32 to speed up implementations on 32bit platforms (but see for instance the work by Käsper how to implement such primes efficiently on 64bit platforms [204]). A more recent example proposed by Bernstein [35] is to use the prime 2255 − 19 to implement efficient modular arithmetic in the setting of elliptic curve cryptography. 2.3.1 Faster Modular Reduction with Primes of a Special Form We use the Mersenne prime 2127 − 1 as an example to illustrate the various modular reduction techniques in this section. Given two integers a and b, such that 0 ≤ a, b < 2127 − 1, the modular product ab mod 2127 − 1 can be computed efficiently as follows. (We follow the description given by the first author of this chapter, Costello, Hisil, and Lauter from [64].) First compute the product with one’s preferred multiplication algorithm and write the result in radix2128 representation c = ab = c1 2128 + c0 , where 0 ≤ c0 < 2128 and 0 ≤ c1 < 2126 . This product can be almost fully reduced by subtracting 2c1 times the modulus since c ≡ c1 2128 + c0 − 2c1 (2127 − 1) ≡ c0 + 2c1 (mod 2127 − 1). Moreover, we can subtract 2127 − 1 one more time if the bit c0 /2127 (the most significant bit of c0 ) is set. When combining these two observations a first reduction step can be computed as c = (c0 mod 2127 ) + 2c1 + c0 /2127 ≡ c (mod 2127 − 1) (2.5) This already ensures that the result 0 ≤ c < 2128 since c ≤ 2127 − 1 + 2(2126 − 1) + 1 < 2128 . One can then reduce c further, using conditional subtractions. Reduction modulo 2127 − 1 can therefore be computed without using any multiplications or expensive divisions by taking advantage of the form of the modulus. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 24 Montgomery Arithmetic from a Software Perspective 2.3.2 Faster Montgomery Reduction with Primes of a Special Form Along the same lines one can select moduli that speed up the operations when computing Montgomery reduction. Such special moduli have been proposed many times in the literature to reduce the number of arithmetic operations (see for instance the work by Lenstra [231], Acar and Shumow [1], Knezevic, Vercauteren, and Verbauwhede [214], Hamburg [178], and the first author of this chapter, Costello, Hisil, and Lauter [64, 65]). They are sometimes referred to as Montgomeryfriendly moduli or Montgomeryfriendly primes. Techniques to scale an existing modulus such that this scaled modulus has a special shape which reduces the number of arithmetic operations, using the same techniques as for the Montgomeryfriendly moduli, are also known and called Montgomery tail tailoring by Hars in [183]. Following the description in the book by Brent and Zimmermann [80], this can be seen as a form of preconditioning as suggested by Svoboda in [320] in the setting of division. When one is free to select the modulus N beforehand, then the number of arithmetic operations can be reduced if the modulus is selected such that μ = −N −1 mod r = ±1 in the setting of interleaved Montgomery multiplication (as also used by Dixon and Lenstra in [127]). This ensures that the multiplication by μ can be avoided (since μ = ±1) in every iteration of the interleaved Montgomery multiplication algorithm. This puts a first requirement on the modulus, namely N≡ ∓ 1 mod r. In practice, r = 2w where w is the wordsize of the computer architecture. Hence, this requirement puts a restriction on the least significant word of the modulus (which equals either 1 or −1 ≡ 2w − 1 mod 2w ). Combining lines 4 and 5 from the interleaved Montgomery multiplication (Algorithm 2.2 on page 16) we see that one has to compute C+N(μCr mod r) . Besides the multiplication by μ one has to compute a multiword multiplication with the (fixed) modulus N. In the same vein as the techniques from Section 2.3.1 above, one can require N to have a special shape such that this multiplication can be computed faster in practice. This can be achieved, for instance, when one of the computer words of the modulus is small or has some special shape while the remainder of the digits are zero except for the most significant word (e.g., when μ = 1). Along these lines the first author of this chapter, Costello, Longa, and Naehrig select primes for usage in elliptic curve cryptography where N = 2α (2β − γ ) ± 1 (2.6) where α, β, and γ are integers such that γ < 2β ≤ r. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.3 Using Primes of a Special Form 25 The final requirement on the modulus is to ensure that 4N < R = rn since this avoids the final conditional subtraction (as shown on page 20 in Section 2.2.4). Examples of such Montgomeryfriendly moduli include 2252 − 2232 − 1 = 2192 (260 − 240 ) − 1 = 2224 (228 − 28 ) − 1 (written in different form to show the usage on different architectures that can compute with βbit integers) proposed by Hamburg in [178] and the modulus 2240 (214 − 127) − 1 = 2222 (232 − 218 · 127) − 1 proposed by the first author of this chapter, Costello, Longa, and Naehrig in [66]. The approach is illustrated in the example below. Other examples of Montgomeryfriendly moduli are given in [166, table 4] based on Chung– Hasan arithmetic [98]. Example 2.3 (Montgomeryfriendly reduction modulo 2127 − 1) Let us consider Montgomery reduction modulo 2127 − 1 on a 64bit computer architecture (w = 64). This means we have α = 64, β = 63, and γ = 0 in Equation (2.6) on page 24 (2127 − 1 = 264 (263 − 0) − 1). Multiplication by μ can be avoided since μ = −(2127 − 1)−1 mod 264 = 1. Furthermore, due to the special form of the modulus the multiplication by 2127 − 1 can be simplified. The computation of C+N(μCr mod r) , which needs to be done for each computer word (twice in this setting), can be simplified when using the Montgomery interleaved multiplication algorithm. Write C = c2 2128 + c1 264 + c0 (see line 3 in Algorithm 2.2 on page 16) with 0 ≤ c2 , c1 , c0 < 264 , then C + (2127 − 1)(C mod 264 ) C + N(μC mod r) = r 264 = (c2 2128 + c1 264 + c0 ) + (2127 − 1)c0 264 = c2 2128 + c1 264 + c0 2127 264 = c2 264 + c1 + c0 263 . Hence, only two addition and two shift operations are needed in this computation. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 26 Montgomery Arithmetic from a Software Perspective 2.4 Concurrent Computing of Montgomery Multiplication Since the seminal paper by the second author introducing modular multiplication without trial division, people have studied ways to obtain better performance on different computer architectures. Many of these techniques are specifically tailored towards a (family) of platforms motivated by the desire to enhance the practical performance of publickey cryptosystems. One approach focuses on reducing the latency of the Montgomery multiplication operation. This might be achieved by computing the Montgomery product using many computational units in parallel. One example is to use the single instruction, multiple data (SIMD) programming paradigm. In this setting a single vector instruction applies to multiple data elements in parallel. Many modern computer architectures have access to vector instruction set extensions to perform SIMD operations. Example platforms include the popular highend x86 architecture as well as the embedded ARM platform which can be found in the majority of modern smartphones and tablets. To highlight the potential, Gueron and Krasnov were the first to show in [175] that the computation of Montgomery multiplication on the 256bit wide vector instruction set AVX2 is faster than the same computation on the classical arithmetic logic unit (ALU) on the x86_64 platform. In Section 2.4.2 below we outline the approach by the authors of this chapter, Shumow, and Zaverucha from [73] for computing a single Montgomery multiplication using vector instruction set extensions which support 2way SIMD operations (i.e., perform the same operation on two data points simultaneously). This approach allows one to split the computation of the interleaved Montgomery multiplication into two parts which can be computed in parallel. Note that in a followup work [308] by Seo, Liu, Großschädl, and Choi it is shown how to improve the performance on 2way SIMD architectures even further. Instead of computing the two multiplications concurrently, as is presented in Section 2.4.2, they compute every multiplication using 2way SIMD instructions. By careful scheduling of the instructions they manage to significantly reduce the readafterwrite dependencies which reduces the number of bubbles (execution delays in the instruction pipeline). This results in a software implementation which outperforms the one presented in [73]. It would be interesting to see if these two approaches (from [73] and [308]) can be merged on platforms which support efficient 4way SIMD instructions. In Section 2.4.4 on page 36 we show how to compute Montgomery multiplication when integers are represented in a residue number system. This approach can be used to compute Montgomery arithmetic efficiently on highly parallel computer architectures which have hundreds of computational cores or more and when large moduli are used (such as in the RSA cryptosystem). Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.4 Concurrent Computing of Montgomery Multiplication 27 2.4.1 Related Work on Concurrent Computing of Montgomery Multiplication A parallel software approach describing systolic Montgomery multiplication is described by Dixon and Lenstra in [127], by Iwamura, Matsumoto, and Imai in [192], and Walter in [336]. See Chapter 3.14 on page 67 for more information about systolic Montgomery multiplication. Another approach is to use the SIMD vector instructions to compute multiple Montgomery multiplications in parallel. This can be useful in applications where many computations need to be processed in parallel such as batchRSA or cryptanalysis. This approach is studied by Page and Smart in [274] using the SSE2 vector instructions on a Pentium 4 and by the first author of this chapter in [63] on the Cell Broadband Engine (see Section 2.4.3 on page 31 for more details about this platform). An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can then be processed in parallel, is called bipartite modular multiplication and was introduced by Kaihara and Takagi in [199]. The idea is to use a Montgomery radix R = rαn where α is a rational number such that αn is an integer and 0 < αn < n: hence, the radix R is smaller than the modulus N. For example, one can choose α such that αn = n2 . In order to compute xyr−αn mod N (where 0 ≤ x, y < N) write y = y1 rαn + y0 and compute xy1 mod N and xy0 r−αn mod N in parallel using a regular interleaved modular multiplication algorithm (see, e.g., the work by Brickell [81]) and the interleaved Montgomery multiplication algorithm, respectively. The sum of the two products gives the correct Montgomery product of x and y since (xy1 mod N) + (xy0 r−αn mod N) ≡ x(y1 rαn + y0 )r−αn ≡ xyr−αn (mod N). 2.4.2 Montgomery Multiplication Using SIMD Extensions This section is an extended version of the description of the idea presented by this chapter’s authors, Shumow, and Zaverucha in [73] where an algorithm is presented to compute the interleaved Montgomery multiplication using two threads running in parallel performing identical arithmetic steps. Hence, this algorithm runs efficiently when using 2way SIMD vector instructions as frequently found on modern computer architectures. For illustrative purposes we assume a radix232 system, but this can be adjusted accordingly to any other radix system. Note that for efficiency considerations the choice of the radix system depends on the vector instructions available. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 28 Montgomery Arithmetic from a Software Perspective Algorithm 2.2 on page 16 outlines the interleaved Montgomery multiplication algorithm and computes two 1 × n → (n + 1) computer word multiplications, namely ai B and qN, and a single 1 × 1 → 1 computer word multiplication (μC mod 232 ) every iteration. Unfortunately, these three multiplications depend on each other and therefore can not be computed in parallel. Every iteration computes (see Algorithm 2.2) 1. C ← C + ai B 2. q ← μC mod 232 3. C ← (C + qN)/232 In order to reduce latency, we would like to compute the two 1 × n → (n + 1) word multiplications in parallel using vector instructions. This can be achieved by removing the dependency between these two multiword multiplications by 32i is comcomputing the value of q first. The first word c0 of C = n−1 i=0 ci 2 puted twice: once for the computation of q (in μC mod 232 ) and then again in the parallel computation of C + ai B. This is a relatively minor penalty of one additional oneword multiplication and addition per iteration to make these two multiword multiplications independent of each other. This means an iteration i can be computed as 1. q ← μ(c0 + ai b0 ) mod 232 2. C ← C + ai B 3. C ← (C + qN)/232 and the 1 × n → (n + 1) word multiplications in steps 2 and 3 (ai B and qN) can be computed in parallel using, for instance, 2way SIMD vector instructions. In order to rewrite the remaining operations, besides the multiplication, the authors of [73] suggest inverting the sign of the Montgomery constant μ, i.e., instead of using −N −1 mod 232 use μ = N −1 mod 232 . When computing the Montgomery product C = AB2−32n mod N, one can compute D (which contains the sum of the products ai B) and E (which contains the sum of the products qN), separately and in parallel using the same arithmetic operations. Due to the modified choice of the Montgomery constant μ we have C = D − E ≡ AB2−32n (mod M), where 0 ≤ D, E < N: the maximum values of both D and E fit in an nword integer. This approach is presented in Algorithm 2.4 on page 29. At the start of every iteration in the forloop iterating with j, the two separate computational streams running in parallel need to communicate information to compute the value of q. More precisely, this requires the knowledge of both d0 and e0 , the least significant words of D and E respectively. Once the values of both d0 and e0 are known to one of the computational streams, the updated Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.4 Concurrent Computing of Montgomery Multiplication 29 Algorithm 2.4 A parallel radix232 interleaved Montgomery multiplication algorithm. Except for the computation of q, the arithmetic steps in the outer forloop, performed by computation 1 and computation 2, are identical. This approach is suitable for 32bit 2way SIMD vector instruction units. ⎧ A, B, M, μ such that ⎪ ⎪ n−1 32i n−1 ⎪ ⎨ 32i 32i A = n−1 i=0 ai 2 , B = i=0 bi 2 , M = i=0 mi 2 , Input: 32 32(n−1) ⎪ 0 ≤ ai , bi < 2 , 0 ≤ A, B < M, 2 ≤ M < 232n , ⎪ ⎪ ⎩ 2 M, μ = M −1 mod 232 , −32n mod M such that 0 ≤ C < M Output: C ≡ AB2 Computation 1 Computation 2 ei = 0 for 0 ≤ i < n di = 0 for 0 ≤ i < n for j = 0 to n − 1 do t0 ← a j b0 + d0 t0 t0 ← 32 2 for i = 1 to n − 1 do p0 ← a j bi + t0 + di p0 t0 ← 32 2 di−1 ← p0 mod 232 end for dn−1 ← t0 end for for j = 0 to n − 1 do q ← ((μb0 )a j + μ(d0 − e0 )) mod 232 t1 ← qm0 + e0 // where t0 ≡ t1 (mod 232 ) t1 t1 ← 32 2 for i = 1 to n − 1 do p1 ← qmi + t1 + ei p1 t1 ← 32 2 ei−1 ← p1 mod 232 end for en−1 ← t1 end for C ←D−E // where D = if C < 0 do C ← C + M end if return C n−1 di 232i , E = i=0 n−1 ei 232i i=0 value of q can be computed as q = ((μa0 )b j + μ(d0 − e0 )) mod 232 = μ(c0 + b j a0 ) mod 232 since c0 = d0 − e0 . Except for this communication cost between the two streams, to compute the value of q, all arithmetic computations performed by computation 1 and computation 2 in the outer forloop are identical but work on different data. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 30 Montgomery Arithmetic from a Software Perspective Table 2.1 A simplified comparison, only stating the number of wordlevel instructions required, to compute the Montgomery multiplication when using a 32nbit modulus for a positive even integer n. Two approaches are shown, a sequential one on the classical ALU (Algorithm 2.2 on page 16) and a parallel one using 2way SIMD instructions (performing two operations in parallel, cf. Algorithm 2.4 on page 29). Classical Instruction add sub shortmul muladd muladdadd SIMD muladd SIMD muladdadd 32bit 64bit – – n 2n 2n(n − 1) – – – – n 2 n n( n2 − 1) – – 2way SIMD 32bit n n 2n – – n n(n − 1) This makes this approach suitable for computation using 2way 32bit SIMD vector instructions. This technique benefits from 2way SIMD 32 × 32 → 64bit multiplication and matches exactly the 128bit wide vector instructions as present in many modern computer architectures. By changing the radix used in Algorithm 2.4, larger or smaller vector instructions can be supported. Note that as opposed to a conditional subtraction in Algorithm 2.1 on page 14 and Algorithm 2.2 on page 16, Algorithm 2.4 computes a conditional addition because of the inverted sign of the precomputed Montgomery constant μ. This condition is based on the fact that if D − E is negative (produces a borrow), then the modulus is added to make the result positive. 2.4.2.1 Expected Performance We follow the analysis of the expected performance from [73], which just considers execution time. The idea is to perform an analysis of the number of arithmetic instructions as an indicator of the expected performance when using a 2way SIMD implementation instead of a regular (nonSIMD) implementation for the classical ALU. We assume the 2way SIMD implementation works on pairs of 32bit words in parallel and has access to a 2way SIMD 32 × 32 → 64bit multiplication instruction. A comparison to a regular implementation is not straightforward since the wordsize can be different, the platform might be able to compute multiple instructions in parallel (on different Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.4 Concurrent Computing of Montgomery Multiplication 31 ALUs) and the number of instructions per arithmetic operation might differ. This is why we present a simplified comparison based on the number of arithmetic operations when computing Montgomery multiplication using a 32nbit modulus for a positive even integer n. We denote by muladdw (e, a, b, c) and muladdaddw (e, a, b, c, d) the computation of e = ab + c and e = ab + c + d, respectively, for 0 ≤ a, b, c, d < 2w (and thus 0 ≤ e < 22w ). These are basic operations on a computer architecture which works on wbit words. Some platforms have these operations as a single instruction (e.g., on some ARM architectures) while others implement this using separate multiplication and addition instructions (as on the x86 platform). Furthermore, let shortmulw (e, a, b) denote e = ab mod 2w : this w × w → wbit multiplication only computes the least significant w bits of the result and is faster than computing a full doubleword product on most modern computer platforms. Table 2.1 summarizes the expected performance of Algorithm 2.2 on page 16 and Algorithm 2.4 on page 29 in terms of arithmetic operations only. The shifting and masking operations are omitted for simplicity as well as the operations required to compute the final conditional subtraction or addition. When just taking the muladd and muladdadd instructions into account it becomes clear from Table 2.1 that the SIMD approach uses exactly half the number of instructions compared to the 32bit classical implementation and almost twice as many operations compared to the classical 64bit implementations. However, the SIMD approach requires more operations to compute the value of q every iteration and has various other overheads (e.g., inserting and extracting values from the large vector registers). Hence, when assuming that all the characteristics of the SIMD and classical (nonSIMD) instructions are identical, which is most likely not the case on most platforms, then we expect Algorithm 2.4 running on a 2way 32bit SIMD unit to outperform a classical 32bit implementation using Algorithm 2.4 by at most a factor of two while being roughly half as fast as a classical 64bit implementation. 2.4.3 A ColumnWise SIMD Approach A different approach, suitable for computing Montgomery multiplication on architectures supporting 4way SIMD instructions is outlined by the first chapter author and Kaihara in [67]. This approach is particularly efficient on the Cell Broadband Engine (see a brief introduction to this architecture below), since it was designed for usage on this platform, but can be used on any platform supporting the SIMD instructions used in this approach. This approach differs from the one described in the previous section in that it uses the SIMD instructions to compute the multiprecision arithmetic in parallel, so it works Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 32 Montgomery Arithmetic from a Software Perspective on a lower level, while the approach from Section 2.4.2 above computes the arithmetic operations itself sequentially but divides the work into two steps, which can be computed concurrently. 2.4.3.1 The Cell Broadband Engine The Cell Broadband Engine (cf. the introductions given by Hofstee [187] and Gschwind [173]), denoted by “Cell” and jointly developed by Sony, Toshiba, and IBM, is a powerful heterogeneous multiprocessor which was released in 2005. The Cell contains a Power Processing Element, a dualthreaded Power architecturebased 64bit processor with access to a 128bit AltiVec/VMX single instruction, multiple data (SIMD) unit (which is not considered in this chapter). Its main processing power, however, comes from eight Synergistic Processing Elements (SPEs). For an introduction to the circuit design see the work by Takahashi et al. [323]. Each SPE consists of a Synergistic Processing Unit (SPU), 256 KB of private memory called Local Store (LS), and a Memory Flow Controller (MFC). To avoid the complexity of sending explicit direct memory access requests to the MFC, all code and data must fit within the LS. Each SPU runs independently from the others at 3.192 GHz and is equipped with a large register file containing 128 registers of 128 bits each. Most SPU instructions work on 128bit operands denoted as quadwords. The instruction set is partitioned into two sets: one set consists of (mainly) 4 and 8way SIMD arithmetic instructions on 32bit and 16bit operands respectively, while the other set consists of instructions operating on the whole quadword (including the load and store instructions) in a single instruction, single data (SISD) manner. The SPU is an asymmetric processor; each of these two sets of instructions is executed in a separate pipeline, denoted by the even and odd pipeline for the SIMD and SISD instructions, respectively. For instance, the {4, 8}way SIMD leftrotate instruction is an even instruction, while the instruction leftrotating the full quadword is dispatched into the odd pipeline. When dependencies are avoided, a single pair consisting of one odd and one even instruction can be dispatched every clock cycle. One of the first applications of the Cell processor was to serve as the brain of Sony’s PlayStation 3 game console. Due to the widespread availability of this game console and the fact that one could install and run one’s own software this platform has been used to accelerate cryptographic operations [112, 111, 95, 272, 74, 63] as well as cryptanalytic algorithms [318, 72, 69]. 2.4.3.2 Montgomery Multiplication on the Cell Broadband Engine In this section we outline the approach presented by the first author of this chapter and Kaihara tailored towards the instruction set of the Cell Broadband Engine. Most notably, the presented techniques rely on an efficient 4way Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.4 Concurrent Computing of Montgomery Multiplication 33 Figure 2.1 The 16bit words xi of a 16(n + 1)bit n+1 positive integer X = n 16i x 2 < 2N are stored columnwise using s = 128bit vectors X j on i i=0 4 the SPE architecture. SIMD instruction to multiply two 16bit integers and add another 16bit integer to the 32bit result, and a large register file. Therefore, the approach described here uses a radix r = 216 to divide the large numbers into words that match the input sizes of the 4SIMD multipliers of the Cell. This can easily be adapted to any other radix size for different platforms with different SIMD instructions. The idea is to represent integers X in a radix216 system, i.e., X = ni=0 xi 216i where 0 ≤ xi < 216 . However, in order to use the 4way SIMD instructions of this platform efficiently these 16bit digits xi are stored in a 32bit datatype. The usage of this 32bit space is to ensure that intermediate values of the form ab + c do not produce any carries since when 0 ≤ a, b, c < 216 then 0 ≤ ab + c < 232 . Hence, given an odd 16nbit modulus N, then a Montgomery residue X, such that 0 ≤ X < 2N < 216(n+1) , is represented using s = n+1 4 vectors of 128 bits. Note that this representation uses roughly twice the number of bits when compared to storing it in a “normal” radixrepresentation. The single additional 16bit word is required because the intermediate accumulating result of Montgomery multiplication can be almost as large as 2N (see page 20 in Section 2.2.4, above). The 16bit digits xi are placed columnwise in the four 32bit datatypes of the 128bit vectors. This representation is illustrated in Figure 2.1. The four 32bit parts of the jth 128bit vector X j are denoted by X j = {X j [3], X j [2], X j [1], X j [0]}. Each of the (n+1) 16bit words xi of X is stored in the most significant 16 bits of Xi mod s [ si ]. The motivation for using this columnwise representation Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 34 Montgomery Arithmetic from a Software Perspective is that a division by 216 can be computed efficiently: simply move the digits in vector X0 “one position to the right,” which in practice means a logical 32bit right shift, and relabeling of the indices such that X j becomes X j−1 , for 1 ≤ j < s − 1 and the modified vector X0 becomes the new Xs−1 . Algorithm 2.5 on page 35 computes Montgomery multiplication using such a 4way columnwise SIMD representation. In each iteration, the indices of the vectors that contain the accumulating partial product U change cyclically among the s registers. In Algorithm 2.5, each 16bit word of the inputs X, Y and N and the output Z is stored in the upper part (the most significant 16 bits) of each of the four 32bit words in a 128bit vector. The vector μ contains the replicated values of −N −1 mod 216 in the lower 16bit positions of the four 32bit words. In its most significant 16bit positions, the temporary vector K stores the replicated values of yi , i.e., each of the parsed coefficients of the multiplier Y corresponding to the ith iteration of the main loop. The operation A ← muladd(B, c, D), which is a single instruction on the SPE, represents the operation of multiplying the vector B (where data are stored in the higher 16bit positions of 32 bit words) by a vector with replicated 16bit values of c across all higher positions of the 32bit words. This product is added to D (in 4way SIMD manner) and the overall result is placed into A. The temporary vector V stores the replicated values of u0 in the least significant 16bit words. This u0 refers to the least significant 16bit word of the updated value of U, where U = nj=0 u j 216 j and is stored as s vectors of 128bit Ui mod s , Ui+1 mod s , . . . , Ui+n mod s (where i refers to the index of the main loop). The vector Q is computed as an elementwise logical left shift by 16 bits of the 4SIMD product of vectors V and μ. The propagation of the higher 16bit carries of U(i+ j) mod s as stated in lines 10 and 18 of Algorithm 2.5 consists of extracting the higher 16bit words of these vectors and placing them into the lower 16bit positions of temporary vectors. These vectors are then added to the “next” vector U(i+ j+1) mod s correspondingly. The operation is carried out for the vectors with indices j ∈ {0, 1, . . . , s − 2}. For j = s − 1, the last index, the temporary vector that contains the words is logically shifted 32 bits to the left and added to the vector Ui mod s . Similarly, the carry propagation of the higher words of U(i+ j) mod s in line 22 of Algorithm 2.5 is performed with 16bit word extraction and addition, but requires a sequential parsing over the (n + 1) 16bit words. Hence, the approach outlined in Algorithm 2.5 computes Montgomery multiplication by computing the multiword multiplications using SIMD instructions and representing the integers using a columnwise approach (see Figure 2.1 on page 33). This approach comes at the cost that a single 16nbit integer is rep bits: requiring slightly over twice the amount of storage. resented by 128 n+1 4 Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 2.4 Concurrent Computing of Montgomery Multiplication 35 Algorithm 2.5 Montgomery multiplication algorithm for the Cell ⎧ ⎪ N represented by s 128bit vectors: Ns−1 , . . . , N0 , such that ⎪ ⎪ ⎪ 16(n−1) ⎪ ≤ N < 216n , 2 N, ⎪ ⎪ 2 ⎨ X, Y each represented by s 128bit vectors: Xs−1 , . . . , X0 , and Input: ⎪ Ys−1 , . . . , Y0 , such that 0 ≤ X, Y < 2N, ⎪ ⎪ ⎪ ⎪ μ : a 128bit vector containing (−N)−1 (mod 216 ) ⎪ ⎪ ⎩ replicated in all 4 elements. Z represented by s 128bit vectors: Zs−1 , . . . , Z0 , such that Output: Z ≡ XY 2−16(n+1) mod N, 0 ≤ Z < 2N. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: for j = 0 to s − 1 do Uj = 0 end for for i = 0 to n do /* lines 6–9 compute U = yi X + U */ K = {yi , yi , yi , yi } for j = 0 to s − 1 do U(i+ j) mod s = muladd(X j , K, U(i+ j) mod s ) end for Carry propagation on U(i+ j) mod s for j = 0, . . . , s − 1 (see text) /* lines 12–13 compute Q = μV mod 216 */ V = {u0 , u0 , u0 , u0 } Q = shiftleft(mul(V , μ), 16) /* lines 15–17 compute U = NQ + U */ for j = 0 to s − 1 do U(i+ j) mod s = muladd(N j , Q, U(i+ j) mod s ) end for Carry propagation on U(i+ j) mod s for j = 0, . . . , s − 1 (see text) /* line 20 computes the division by 216 */ Ui mod s = vshiftright(Ui mod s , 32) end for Carry propagation on Ui mod s for i = n + 1, . . . , 2n + 1 (see text) for j = 0 to s − 1 do Z j = U(n+ j+1) mod s end for Note, however, that an implementation of this technique outperforms the native multiprecision bignumber library on the Cell processor by a factor of about 2.5, as summarized in [67]. Downloaded from https://www.cambridge.org/core. Cambridge University Main, on 18 Dec 2019 at 22:01:47, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781316271575.003 36 Montgomery Arithmetic from a Software Perspective 2.4.4 Montgomery Multiplication Using the Residue Number System Representation The residue number system (RNS) as introduced by Garner [156] and Merrill [244] is an approach, based on the Chinese remainder theorem, to represent an integer as a number of residues modulo smaller (coprime) integers. The advantage of RNS is that additions, subtractions, and multiplication can be performed independently and concurrently on these smaller residues. Given an RNS basis βn = {r1 , r2 , . . . , rn }, where gcd(ri , r j ) = 1 for i = j, the RNS modulus is defined as R = ni=1 ri . Given an integer x ∈ Z/RZ and the RNS basis βn , this integer x is represented as an ntuple x = (x1 , x2 , . . . , xn ), where xi = x mod ri for 1 ≤ i ≤ n. In order to convert an ntuple back to its integer value one can apply the Chinese remainder theorem (CRT) n R −1 R xi mod ri mod R. (2.7) x= r r i i i=1 Modular multiplication using Montgomery multiplication in the RNS setting has been studied, for instance, by Posch and Posch in [290] and by Bajard, Didier, and Kornerup in [23] and subsequent work. In this section we outline how to achieve this. First note for the application in which we are interested, we can not use the modulus N as the RNS modulus since N is either prime (in the setting of elliptic curve cryptography) or a product of two large primes (when using RSA). When computing the Montgomery reduction one has to perform arithmeti