## Multiprecision arithmetic

Peter Schwabe



October 20, 2014
SPACE 2014, Pune, India

Multiprecision arithmetic (from primary school to Asiacrypt)

Peter Schwabe



October 20, 2014
SPACE 2014, Pune, India

## PART I

Joint work with Michael Hutter

## The first year of primary school

Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8,9

## The first year of primary school

Available numbers (digits): (0), $1,2,3,4,5,6,7,8,9$

Addition
$3+5=$ ?
$2+7=$ ?
$4+3=$ ?

## The first year of primary school

Available numbers (digits): (0), $1,2,3,4,5,6,7,8,9$

Addition
$3+5=$
$2+7=$
$4+3=$

## The first year of primary school

Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8,9

## Addition <br> $3+5=$ ? <br> $2+7=$ ? <br> $4+3=$ ?

- All results are in the set of available numbers
- No confusion for first-year school kids


## Programming today

Available numbers: $0,1, \ldots, 255$

## Programming today

## Available numbers: $0,1, \ldots, 255$

## Addition

```
uint8_t a = 42;
uint8_t b = 89;
uint8_t r = a + b;
```


## Programming today

Available numbers: $0,1, \ldots, 255$

## Addition

$$
\begin{aligned}
& \text { uint8_t } a=42 ; \\
& \text { uint8_t } b=89 ; \\
& \text { uint8_t } r=a+b ;
\end{aligned}
$$

## Subtraction

$$
\begin{aligned}
& \text { uint8_t } a=157 ; \\
& \text { uint8_t } b=23 ; \\
& \text { uint8_t } r=a-b ;
\end{aligned}
$$

## Programming today

Available numbers: $0,1, \ldots, 255$

## Addition

$$
\begin{aligned}
& \text { uint8_t } a=42 ; \\
& \text { uint8_t } b=89 ; \\
& \text { uint8_t } r=a+b ;
\end{aligned}
$$

## Subtraction

$$
\begin{aligned}
& \text { uint8_t } a=157 ; \\
& \text { uint8_t } b=23 ; \\
& \text { uint8_t } r=a-b ;
\end{aligned}
$$

- All results are in the set of available numbers
- Larger set of available numbers: uint16_t, uint32_t, uint64_t
- Basic principle is the same; for the moment stick with uint8_t

Still in the first year of primary school
Crossing the ten barrier

$$
\begin{aligned}
& 6+5=? \\
& 9+7=? \\
& 4+8=?
\end{aligned}
$$

## Still in the first year of primary school

## Crossing the ten barrier

$6+5=?$
$9+7=?$
$4+8=?$

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9


## Still in the first year of primary school

## Crossing the ten barrier

$$
\begin{aligned}
& 6+5=? \\
& 9+7=? \\
& 4+8=?
\end{aligned}
$$

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9
- Addition is allowed to produce a carry


## Still in the first year of primary school

## Crossing the ten barrier

```
6+5= ?
9+7= ?
4+8=?
```

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9
- Addition is allowed to produce a carry

What happens with the carry?

- Introduce the decimal positional system
- Write an integer $A$ in two digits $a_{1} a_{0}$ with

$$
A=10 \cdot a_{1}+\cdot a_{0}
$$

- Note that at the moment $a_{1} \in\{0,1\}$


## ... back to programming

```
uint8_t a = 184;
uint8_t b = 203;
uint8_t r = a + b;
```


## ... back to programming

```
uint8_t a = 184;
uint8_t b = 203;
uint8_t r = a + b;
```

- The result r now has the value of 131
- The carry is lost, what do we do?


## ... back to programming

```
uint8_t a = 184;
uint8_t b = 203;
uint8_t r = a + b;
```

- The result r now has the value of 131
- The carry is lost, what do we do?
- Could cast to uint16_t, uint32_t etc., but that solves the problem only for this uint8_t example
- We really want to obtain the carry, and put it into another uint8_t


## The AVR ATmega

- 8-bit RISC architecture
- 32 registers R0...R31, some of those are "special":
- (R26,R27) aliased as X
- $(\mathrm{R} 28, \mathrm{R} 29)$ aliased as Y
- (R30,R31) aliased as Z
- $\mathrm{X}, \mathrm{Y}, \mathrm{Z}$ are used for addressing
- 2-byte output of a multiplication always in R0,R1
- Most arithmetic instructions cost 1 cycle
- Multiplication and memory access takes 2 cycles


## $184+203$

LDI R5, 184
LDI R6, 203
ADD R5, R6 ; result in R5, sets carry flag
CLR R6 ; set R6 to zero
ADC R6,R6 ; add with carry, R6 now holds the carry

## Later in primary school

> Addition
> $42+78=?$
> $789+543=?$
> $7862+5275=?$

## Later in primary school

## Addition

$$
\begin{aligned}
& 42+78=\quad ? \\
& 789+543=? \\
& 7862+5275=?
\end{aligned}
$$

$$
7862
$$

$$
\begin{array}{r}
+\quad 5275 \\
\hline+\quad 7
\end{array}
$$

## Later in primary school

## Addition

$$
\begin{aligned}
& 42+78=\quad ? \\
& 789+543=? \\
& 7862+5275=?
\end{aligned}
$$

$$
7862
$$

$$
\begin{array}{r}
+\quad 5275 \\
\hline+\quad 37
\end{array}
$$

## Later in primary school

## Addition

$$
\begin{aligned}
& 42+78=\quad ? \\
& 789+543=? \\
& 7862+5275=?
\end{aligned}
$$

| 7862 |
| ---: |
| $+\quad 5275$ |
| $+\quad 137$ |

## Later in primary school

## Addition

$$
\begin{aligned}
& 42+78=\quad ? \\
& 789+543=? \\
& 7862+5275=?
\end{aligned}
$$

$$
7862
$$

$$
\begin{array}{r}
+\quad 5275 \\
\hline+\quad 13137
\end{array}
$$

## Later in primary school

## Addition

$$
\begin{aligned}
& 42+78=\quad ? \\
& 789+543=? \\
& 7862+5275=?
\end{aligned}
$$

$$
7862
$$

$$
\begin{array}{r}
+\quad 5275 \\
\hline+\quad 13137
\end{array}
$$

- Once school kids can add beyond 1000, they can add arbitrary numbers


## Multiprecision addition is old

"Oh Līāvatī, intelligent girl, if you understand addition and subtraction, tell me the sum of the amounts 2, 5, 32, 193, 18, 10, and 100, as well as [the remainder of] those when subtracted from 10000."
—"Lilāvatī" by Bhāskara (1150)

## Why do we do all that again?

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers


## Why do we do all that again?

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
- Elliptic curves defined over finite fields
- Typically use EC over large-characteristic prime fields
- Typical field sizes: ( 160 bits, 192 bits), 256 bits, 414 bits ...


## Why do we do all that again?

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
- Elliptic curves defined over finite fields
- Typically use EC over large-characteristic prime fields
- Typical field sizes: (160 bits, 192 bits), 256 bits, 414 bits ...
- Example 3:
- Genus-2 hyperelliptic curves defined over finite fields
- Typically use HEC over large-characteristic prime fields
- Field size for 128 -bit security: 128 bits


## Why do we do all that again?

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
- Elliptic curves defined over finite fields
- Typically use EC over large-characteristic prime fields
- Typical field sizes: (160 bits, 192 bits), 256 bits, 414 bits ...
- Example 3:
- Genus-2 hyperelliptic curves defined over finite fields
- Typically use HEC over large-characteristic prime fields
- Field size for 128 -bit security: 128 bits
- For now mainly interested in 160 -bit and 256 -bit arithmetic


## AVR multiprecision addition...

- Add two $n$-byte numbers, returning an $n+1$ byte result:
- Input pointers X,Y, output pointer Z
LD R5, $\mathrm{X}+$
LD R6, $\mathrm{Y}+$
ADD R5, R6
ST Z+, R5
LD R5, $\mathrm{X}+$
LD R6, $\mathrm{Y}+$
ADC R5, R6
ST Z+,R5

LD R5, X+
LD R6, $\mathrm{Y}+$
ADC R5,R6
ST Z+,R5
LD R5, X+
LD R6, $\mathrm{Y}+$
ADC R5,R6
ST Z+,R5

CLR R5
ADC R5,R5
ST Z+,R5

## ... and subtraction

- Add two $n$-byte numbers, returning an $n+1$ byte result:
- Input pointers X,Y, output pointer Z
- Use highest byte $=-1$ to indicate negative result

LD R5, $\mathrm{X}+$
LD R6,Y+
SUB R5,R6
ST Z+,R5

LD R5, $\mathrm{X}+$
LD R6, Y+
SBC R5,R6
ST Z+,R5

LD R5, X+
LD R6,Y+
SBC R5,R6
ST Z+,R5

LD R5, $\mathrm{X}+$
LD R6, Y+
SBC R5,R6
ST Z+,R5

CLR R5
SBC R5,R5
ST Z+,R5

## How about multiplication?

- Consider multiplication of 1234 by 789


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
6


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
06


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
106


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
11106


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
11106
9872


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
11106
9872
8638


## How about multiplication?

- Consider multiplication of 1234 by 789

|  | $1234 \cdot 789$ |
| ---: | :---: |
|  | 11106 |
| + | 9872 |
| $+\quad 8638$ |  |
|  | 973626 |

## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
11106


## How about multiplication?

- Consider multiplication of 1234 by 789

| $1234 \cdot 789$ |
| ---: |
|  |
| $+\quad 9872$ |

## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
20978


## How about multiplication?

- Consider multiplication of 1234 by 789

| $1234 \cdot 789$ |
| ---: |
| $+\quad 20978$ |

## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
973626


## How about multiplication?

- Consider multiplication of 1234 by 789
$1234 \cdot 789$
973626
- This is also an old technique
- Earliest reference I could find is again the Līāvatī (1150)


## Let's do that on the AVR

```
LD R2, X+
LD R3, X+
LD R4, X+
LD R7, Y+
MUL R2,R7
ST Z+,RO
MOV R8,R1
MUL R3,R7
ADD R8,RO
CLR R9
ADC R9,R1
MUL R4,R7
ADD R9,R0
CLR R10
ADC R10,R1
```


## Let's do that on the AVR

| LD R2, $\mathrm{X}+$ | LD R7, Y+ |
| :---: | :---: |
| LD R3, $\mathrm{X}+$ |  |
| LD R4, X + | MUL R2,R7 |
|  | MOVW R12,R0 |
| LD R7, Y+ |  |
|  | MUL R3,R7 |
| MUL R2,R7 | ADD R13,R0 |
| ST Z+,R0 | CLR R14 |
| MOV R8,R1 | ADC R14,R1 |
| MUL R3,R7 | MUL R4,R7 |
| ADD R8, R0 | ADD R14,R0 |
| CLR R9 | CLR R15 |
| ADC R9,R1 | ADC R15,R1 |
| MUL R4,R7 | ADD R8,R12 |
| ADD R9,R0 | ST Z+,R8 |
| CLR R10 | ADC R9,R13 |
| ADC R10,R1 | ADC R10,R14 |
|  | CLR R11 |
|  | ADC R11,R15 |

## Let's do that on the AVR

| LD R2, X+ | LD R7, Y+ | LD R7, Y+ |
| :---: | :---: | :---: |
| LD R3, X+ |  |  |
| LD R4, X+ | MUL R2,R7 | MUL R2,R7 |
|  | MOVW R12,R0 | MOVW R12,R0 |
| LD R7, Y+ |  |  |
|  | MUL R3, R7 | MUL R3, R7 |
| MUL R2,R7 | ADD R13,R0 | ADD R13,R0 |
| ST Z+,R0 | CLR R14 | CLR R14 |
| MOV R8,R1 | ADC R14,R1 | ADC R14,R1 |
| MUL R3,R7 | MUL R4,R7 | MUL R4,R7 |
| ADD R8, R0 | ADD R14,R0 | ADD R14,R0 |
| CLR R9 | CLR R15 | CLR R15 |
| ADC R9,R1 | ADC R15,R1 | ADC R15,R1 |
| MUL R4,R7 | ADD R8, R12 | ADC R9,R12 |
| ADD R9, R0 | ST Z+,R8 | ST Z+,R9 |
| CLR R10 | ADC R9,R13 | ADC R10,R13 |
| ADC R10,R1 | ADC R10,R14 | ADC R11,R14 |
|  | CLR R11 | CLR R12 |
|  | ADC R11,R15 | ADC R12,R15 |

## Let's do that on the AVR

| LD R2, $\mathrm{X}+$ | LD R7, Y+ | LD R7, Y+ | ST Z+,R10 |
| :---: | :---: | :---: | :---: |
| LD R3, $\mathrm{X}+$ |  |  | ST $\mathrm{Z}+$,R11 |
| LD R4, $\mathrm{X}+$ | MUL R2,R7 | MUL R2,R7 | ST $\mathrm{Z}+$,R12 |
|  | MOVW R12,RO | MOVW R12,RO |  |
| LD R7, Y+ |  |  |  |
|  | MUL R3,R7 | MUL R3,R7 |  |
| MUL R2,R7 | ADD R13,R0 | ADD R13,R0 |  |
| ST Z+,RO | CLR R14 | CLR R14 |  |
| MOV R8,R1 | ADC R14,R1 | ADC R14,R1 |  |
| MUL R3,R7 | MUL R4,R7 | MUL R4,R7 |  |
| ADD R8,RO | ADD R14,R0 | ADD R14,R0 |  |
| CLR R9 | CLR R15 | CLR R15 |  |
| ADC R9,R1 | ADC R15,R1 | ADC R15,R1 |  |
| MUL R4,R7 | ADD R8,R12 | ADC R9,R12 |  |
| ADD R9,RO | ST $\mathrm{Z}+$, R8 | ST Z +, R9 |  |
| CLR R10 | ADC R9,R13 | ADC R10,R13 |  |
| ADC R10,R1 | ADC R10,R14 | ADC R11,R14 |  |
|  | CLR R11 | CLR R12 |  |
|  | ADC R11,R15 | ADC R12,R15 |  |

## Let's do that on the AVR

- Problem: Need $3 n+c$ registers for $n \times n$-byte multiplication


## Let's do that on the AVR

- Problem: Need $3 n+c$ registers for $n \times n$-byte multiplication
- Can add on the fly, get down to $2 n+c$, but more carry handling


## Can we do better?

"Again as the information is understood, the multiplication of 2345 by 6789 is proposed; therefore the numbers are written down; the 5 is multiplied by the 9, there will be 45; the 5 is put, the 4 is kept; and the 5 is multiplied by the 8 , and the 9 by the 4 and the products are added to the kept 4; there will be 80; the 0 is put and the 8 is kept; and the 5 is multiplied by the 7 and the 9 by the 2 and the 4 by the 8, and the products are added to the kept 8; there will be 102; the 2 is put and the 10 is kept in hand. .. "

From "Fibonacci's Liber Abaci", Chapter 2 (English translation by Sigler)

## Product scanning on the AVR

| LD R2, X+ | MUL R2, R9 |
| :--- | :--- |
| LD R3, X + | ADD R14, R0 |
| LD R4, X + | ADC R15, R1 |
| LD R7, Y+ | ADC R16, R5 |
| LD R8, Y+ | MUL R3, R8 |
| LD R9, Y+ | ADD R14, R0 |
|  | ADC R15, R1 |
| MUL R2, R7 | ADC R16, R5 |
| MOV R13, R1 | MUL R4, R7 |
| STD Z+0, R0 | ADD R14, R0 |
| CLR R14 | ADC R15, R1 |
| CLR R15 | ADC R16, R5 |

$$
\begin{aligned}
& \text { MUL R3, R9 } \\
& \text { ADD R15, R0 } \\
& \text { ADC R16, R1 } \\
& \text { ADC R17, R5 } \\
& \text { MUL R4, R8 } \\
& \text { ADD R15, R0 } \\
& \text { ADC R16, R1 } \\
& \text { ADC R17, R5 } \\
& \text { STD Z+3, R15 }
\end{aligned}
$$

MUL R4, R9
ADD R16, R0
ADC R17, R1
STD Z+4, R16
STD Z+5, R17

## Even better...?



From the Treviso Arithmetic, 1478
(http://www.republicaveneta.com/doc/abaco.pdf)

## Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
- Inner loop performs operand scanning
- Outer loop performs product scanning


## Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
- Inner loop performs operand scanning
- Outer loop performs product scanning
- Originally proposed by Gura, Patel, Wander, Eberle, Chang Shantz, 2004


## Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
- Inner loop performs operand scanning
- Outer loop performs product scanning
- Originally proposed by Gura, Patel, Wander, Eberle, Chang Shantz, 2004
- Various improvements, consider 160-bit multiplication:
- Originally: 3106 cycles
- Uhsadel, Poschmann, Paar (2007): 2881 cycles
- Scott, Szczechowiak (2007): 2651 cycles
- Kargl, Pyka, Seuschek (2008): 2593 cycles


## Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer


## Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer
- Performance:
- 2393 cycles for 160 -bit multiplication
- 6121 cycles for 256 -bit multiplication


## Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer
- Performance:
- 2393 cycles for 160 -bit multiplication
- 6121 cycles for 256-bit multiplication
- Followup-paper by Seo and Kim: "Consecutive operand caching":
- 2341 cycles for 160 -bit multiplication
- 6115 cycles for 256 -bit multiplication


## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity


## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- Proven wrong by 23 -year old student Karatsuba in 1960


## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- Proven wrong by 23 -year old student Karatsuba in 1960
- Idea: write $A \cdot B$ as $\left(A_{0}+2^{m} A_{1}\right)\left(B_{0}+2^{m} B_{1}\right)$ for half-size $A_{0}, B_{0}, A_{1}, B_{1}$


## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- Proven wrong by 23 -year old student Karatsuba in 1960
- Idea: write $A \cdot B$ as $\left(A_{0}+2^{m} A_{1}\right)\left(B_{0}+2^{m} B_{1}\right)$ for half-size $A_{0}, B_{0}, A_{1}, B_{1}$
- Compute

$$
A_{0} B_{0}+\quad 2^{m}\left(A_{0} B_{1}+B_{0} A_{1}\right) \quad+2^{2 m} A_{1} B_{1}
$$

## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- Proven wrong by 23 -year old student Karatsuba in 1960
- Idea: write $A \cdot B$ as $\left(A_{0}+2^{m} A_{1}\right)\left(B_{0}+2^{m} B_{1}\right)$ for half-size $A_{0}, B_{0}, A_{1}, B_{1}$
- Compute

$$
\begin{array}{cc} 
& A_{0} B_{0}+ \\
= & A_{0} B_{0}+2^{m}\left(\left(A_{0} B_{1}+B_{0} A_{1}\right)\right. \\
\left.\left.A_{1}\right)\left(B_{0}+B_{1}\right)-2_{0} B_{0}-A_{1} B_{1}\right)+2_{1} \\
2_{1} A_{1} B_{1}
\end{array}
$$

## Multiplication complexity

- So far, multiplication of $2 n$-byte numbers needs $n^{2}$ MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- Proven wrong by 23 -year old student Karatsuba in 1960
- Idea: write $A \cdot B$ as $\left(A_{0}+2^{m} A_{1}\right)\left(B_{0}+2^{m} B_{1}\right)$ for half-size $A_{0}, B_{0}, A_{1}, B_{1}$
- Compute

$$
\begin{array}{cc} 
& A_{0} B_{0}+r \\
= & A_{0} B_{0}+2^{m}\left(\left(A_{0} B_{1}+B_{0} A_{1}\right)\right. \\
\left.\left.A_{1}\right)\left(B_{0}+B_{1}\right)-A_{0} B_{0}-A_{1} B_{1}\right)+2_{1} B_{1} \\
{ }^{2 m} A_{1} B_{1}
\end{array}
$$

- Recursive application yields $\Theta\left(n^{\log _{2} 3}\right)$ runtime


## Does that help on the AVR?



## The straight-forward approach

Consider multiplication of $n$-byte numbers

$$
\begin{aligned}
& A \hat{=}\left(a_{0}, \ldots, a_{n-1}\right) \text { and } \\
& B \hat{=}\left(b_{0}, \ldots, b_{n-1}\right)
\end{aligned}
$$

## The straight-forward approach

Consider multiplication of $n$-byte numbers

$$
\begin{aligned}
& A \hat{=}\left(a_{0}, \ldots, a_{n-1}\right) \text { and } \\
& B \hat{=}\left(b_{0}, \ldots, b_{n-1}\right)
\end{aligned}
$$

- Write $A=A_{\ell}+2^{8 k} A_{h}$ and $B=B_{\ell}+2^{8 k} B_{h}$ for $k$-byte integers $A_{\ell}, A_{h}, B_{\ell}$, and $B_{h}$ and $k=n / 2$


## The straight-forward approach

Consider multiplication of $n$-byte numbers

$$
\begin{aligned}
& A \hat{=}\left(a_{0}, \ldots, a_{n-1}\right) \text { and } \\
& B \hat{=}\left(b_{0}, \ldots, b_{n-1}\right)
\end{aligned}
$$

- Write $A=A_{\ell}+2^{8 k} A_{h}$ and $B=B_{\ell}+2^{8 k} B_{h}$ for $k$-byte integers $A_{\ell}, A_{h}, B_{\ell}$, and $B_{h}$ and $k=n / 2$
- Compute $L=A_{\ell} \cdot B_{\ell} \hat{=}\left(\ell_{0}, \ldots, \ell_{n-1}\right)$
- Compute $H=A_{h} \cdot B_{h} \hat{=}\left(h_{0}, \ldots, h_{n-1}\right)$
- Compute $M=\left(A_{\ell}+A_{h}\right) \cdot\left(B_{\ell}+B_{h}\right) \hat{=}\left(m_{0}, \ldots, m_{n}\right)$


## The straight-forward approach

Consider multiplication of $n$-byte numbers

$$
\begin{aligned}
& A \hat{=}\left(a_{0}, \ldots, a_{n-1}\right) \text { and } \\
& B \hat{=}\left(b_{0}, \ldots, b_{n-1}\right)
\end{aligned}
$$

- Write $A=A_{\ell}+2^{8 k} A_{h}$ and $B=B_{\ell}+2^{8 k} B_{h}$ for $k$-byte integers $A_{\ell}, A_{h}, B_{\ell}$, and $B_{h}$ and $k=n / 2$
- Compute $L=A_{\ell} \cdot B_{\ell} \hat{=}\left(\ell_{0}, \ldots, \ell_{n-1}\right)$
- Compute $H=A_{h} \cdot B_{h} \hat{=}\left(h_{0}, \ldots, h_{n-1}\right)$
- Compute $M=\left(A_{\ell}+A_{h}\right) \cdot\left(B_{\ell}+B_{h}\right) \hat{=}\left(m_{0}, \ldots, m_{n}\right)$
- Obtain result as $A \cdot B=L+2^{8 k}(M-L-H)+2^{8 n} H$


## Multiplication by the carry in $M$

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication


## Multiplication by the carry in $M$

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication
- Does not help for recursive Karatsuba


## Multiplication by the carry in $M$

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication
- Does not help for recursive Karatsuba


## Subtractive Karatsuba

- Compute $L=A_{\ell} \cdot B_{\ell} \hat{=}\left(\ell_{0}, \ldots, \ell_{n-1}\right)$
- Compute $H=A_{h} \cdot B_{h} \hat{=}\left(h_{0}, \ldots, h_{n-1}\right)$
- Compute $M=\left|A_{\ell}-A_{h}\right| \cdot\left|B_{\ell}-B_{h}\right| \hat{=}\left(m_{0}, \ldots, m_{n-1}\right)$
- Set $t=0$, if $M=\left(A_{\ell}-A_{h}\right) \cdot\left(B_{\ell}-B_{h}\right) ; t=1$ otherwise
- Compute $\hat{M}=(-1)^{t} M=\left(A_{\ell}-A_{h}\right)\left(B_{\ell}-B_{h}\right)$ $\hat{=}\left(\hat{m}_{0}, \ldots, \hat{m}_{n-1}\right)$
- Obtain result as $A \cdot B=L+2^{8 k}(L+H-\hat{M})+2^{8 n} H$


## Conditional negation

The easy solution
if (b) a = -a

## Conditional negation

The easy solution
if(b) $\mathrm{a}=-\mathrm{a}$

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero


## Conditional negation

The easy solution
if(b) $\mathrm{a}=-\mathrm{a}$

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!


## Conditional negation

The easy solution
if (b) a = -a

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

The constant-time solution

- Produce condition bit as byte 0xff or $0 x 00$
- XOR all values with this condition byte


## Conditional negation

## The easy solution

if(b) $\mathrm{a}=-\mathrm{a}$

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!


## The constant-time solution

- Produce condition bit as byte 0xff or $0 x 00$
- XOR all values with this condition byte
- Negate the condition byte and obtain 0x01 or 0x00
- Add this value to the lowest byte
- Ripple through the carry (ADC with zero)


## Conditional negation

The easy solution
if(b) $a=-a$

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

The constant-time solution

- Produce condition bit as byte 0xff or 0x00
- XOR all values with this condition byte
- Don't negate the condition byte
- Subtract the condition byte (0xff or 0x00 from all bytes)
- Saves two NEG instructions


## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

| $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|  | - | $\hat{m}_{0}$ | $\hat{m}_{1}$ | $\hat{m}_{2}$ | $\hat{m}_{3}$ |  |  |
|  | + | $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ |  |  |
|  | + | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |  |  |

## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once


## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of $H$
- Compute $\mathbf{H} \hat{=}\left(\mathbf{h}_{\mathbf{0}}, \mathbf{h}_{\mathbf{1}}, \mathbf{h}_{\mathbf{2}}, \mathbf{h}_{\mathbf{3}}\right)=H+\left(l_{2}, l_{3}\right)$
- Note that H cannot "overflow"


## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

| $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|  | - | $\hat{m}_{0}$ | $\hat{m}_{1}$ | $\hat{m}_{2}$ | $\hat{m}_{3}$ |  |  |
|  | + | $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ |  |  |
|  | + | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |  |  |

- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of $H$
- Compute $\mathbf{H} \hat{=}\left(\mathbf{h}_{\mathbf{0}}, \mathbf{h}_{\mathbf{1}}, \mathbf{h}_{\mathbf{2}}, \mathbf{h}_{\mathbf{3}}\right)=H+\left(l_{2}, l_{3}\right)$
- Note that H cannot "overflow"



## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

| $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
|  | - | $\hat{m}_{0}$ | $\hat{m}_{1}$ | $\hat{m}_{2}$ | $\hat{m}_{3}$ |  |  |
|  | + | $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ |  |  |
|  | + | $h_{0}$ | $h_{1}$ | $h_{2}$ | $h_{3}$ |  |  |

- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of $H$
- Compute $\mathbf{H} \hat{=}\left(\mathbf{h}_{\mathbf{0}}, \mathbf{h}_{\mathbf{1}}, \mathbf{h}_{\mathbf{2}}, \mathbf{h}_{\mathbf{3}}\right)=H+\left(l_{2}, l_{3}\right)$
- Note that H cannot "overflow"

| $l_{0}$ | $l_{1}$ | $\mathbf{h}_{0}$ | $\mathbf{h}_{1}$ | $\mathbf{h}_{0}$ | $\mathbf{h}_{1}$ | $\mathbf{h}_{\mathbf{2}}$ | $\mathbf{h}_{\mathbf{3}}$ |
| :---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: |
|  | - | $\hat{m}_{0}$ | $\hat{m}_{1}$ | $\hat{m}_{2}$ | $\hat{m}_{3}$ |  |  |
|  | + | $l_{0}$ | $l_{1}$ | $\mathbf{h}_{\mathbf{2}}$ | $\mathbf{h}_{\mathbf{3}}$ |  |  |

## Refined Karatsuba

- Consider example of $4 \times 4$-byte Karatsuba multiplication:

| $l_{0}$ | $l_{1}$ | $l_{2}$ | $l_{3}$ | $h_{0}$ | $h_{1}$ | $h_{2}$ |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |$h_{3}$

- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of $H$
- Compute $\mathbf{H} \hat{=}\left(\mathbf{h}_{\mathbf{0}}, \mathbf{h}_{\mathbf{1}}, \mathbf{h}_{\mathbf{2}}, \mathbf{h}_{\mathbf{3}}\right)=H+\left(l_{2}, l_{3}\right)$
- Note that H cannot "overflow"

| $l_{0}$ | $l_{1}$ | $\mathbf{h}_{0}$ | $\mathbf{h}_{1}$ | $\mathbf{h}_{0}$ | $\mathbf{h}_{1}$ | $\mathbf{h}_{\mathbf{2}}$ | $\mathbf{h}_{\mathbf{3}}$ |
| :---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: |
|  | - | $\hat{m}_{0}$ | $\hat{m}_{1}$ | $\hat{m}_{2}$ | $\hat{m}_{3}$ |  |  |
|  | + | $l_{0}$ | $l_{1}$ | $\mathbf{h}_{\mathbf{2}}$ | $\mathbf{h}_{\mathbf{3}}$ |  |  |

- Consequence: fewer additions, easier register allocation


## Putting it together

Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$


## Putting it together

## Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$
- $4 k+2$ SUB/SBC, $2 k$ EOR for absolute differences


## Putting it together

## Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$
- $4 k+2$ SUB/SBC, $2 k$ EOR for absolute differences
- $n+1$ ADD/ADC to add $\left(l_{0}, \ldots, l_{k-1}, \mathbf{h}_{\mathbf{k}}, \ldots, \mathbf{h}_{\mathbf{n}-\mathbf{1}}\right)$


## Putting it together

## Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$
- $4 k+2$ SUB/SBC, $2 k$ EOR for absolute differences
- $n+1$ ADD/ADC to add $\left(l_{0}, \ldots, l_{k-1}, \mathbf{h}_{\mathbf{k}}, \ldots, \mathbf{h}_{\mathbf{n}-\mathbf{1}}\right)$
- One EOR to compute $t$
- A BRNE instruction to branch, then either


## Putting it together

## Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$
- $4 k+2$ SUB/SBC, $2 k$ EOR for absolute differences
- $n+1$ ADD/ADC to add $\left(l_{0}, \ldots, l_{k-1}, \mathbf{h}_{\mathbf{k}}, \ldots, \mathbf{h}_{\mathbf{n}-\mathbf{1}}\right)$
- One EOR to compute $t$
- A BRNE instruction to branch, then either
- $n+2$ SUB/SBC instructions and one RJMP, or
- $n+1 \mathrm{ADD} / \mathrm{ADC}$, one CLR, and one NOP


## Putting it together

## Arithmetic cost of $n$-byte Karatsuba on AVR

- Cost of computing $L, M$, and $\mathbf{H}$
- $4 k+2$ SUB/SBC, $2 k$ EOR for absolute differences
- $n+1$ ADD/ADC to add $\left(l_{0}, \ldots, l_{k-1}, \mathbf{h}_{\mathbf{k}}, \ldots, \mathbf{h}_{\mathbf{n}-\mathbf{1}}\right)$
- One EOR to compute $t$
- A BRNE instruction to branch, then either
- $n+2$ SUB/SBC instructions and one RJMP, or
- $n+1 \mathrm{ADD} / \mathrm{ADC}$, one CLR, and one NOP
- $k$ ADD/ADC instructions to ripple carry to the end


## 48-bit Karatsuba on AVR

CLR R22
CLR R23
MOVW R12, R22
MOVW R20, R22
LD R2, X+
LD R3, X+
LD R4, X+
LDD R5, Y+0
LDD R6, Y+1
LDD R7, Y+2
MUL R2, R7
MOVW R10, R0
MUL R2, R5
MOVW R8, RO
MUL R2, R6
ADD R9, R0
ADC R10, R1
ADC R11, R23

MUL R3, R7
MOVW R14, RO
MUL R3, R5
ADD R9, RO
ADC R10, R1
ADC R11, R14
ADC R15, R23
MUL R3, R6
ADD R10, R0
ADC R11, R1
ADC R12, R15
MUL R4, R7
MOVW R14, RO
MUL R4, R5
ADD R10, R0
ADC R11, R1
ADC R12, R14
ADC R15, R23
MUL R4, R6
ADD R11, R0
ADC R12, R1
ADC R13, R15
STD Z+0, R8
STD Z+1, R9
STD Z+2, R10

LD R14, X+
LD R15, X+
LD R16, X+
LDD R17, Y+3
LDD R18, Y+4
LDD R19, Y+5
SUB R2, R14
SBC R3, R15
SBC R4, R16
SBC R26, R26
SUB R5, R17
SBC R6, R18
SBC R7, R19
SBC R27, R27

EOR R2, R26
EOR R3, R26
EOR R4, R26
EOR R5, R27
EOR R6, R27
EOR R7, R27
SUB R2, R26
SBC R3, R26
SBC R4, R26
SUB R5, R27
SBC R6, R27
SBC R7, R27

## 48-bit Karatsuba on AVR

MUL R14, R19
MOVW R24, R0
MUL R14, R17
ADD R11, R0
ADC R12, R1
ADC R13, R24
ADC R25, R23
MUL R14, R18
ADD R12, R0
ADC R13, R1
ADC R20, R25
MUL R15, R19
MOVW R24, R0
MUL R15, R17
ADD R12, R0
ADC R13, R1
ADC R20, R24
ADC R25, R23
MUL R15, R18
ADD R13, R0
ADC R20, R1
ADC R21, R25

MUL R16, R19
MOVW R24, RO
MUL R16, R17
ADD R13, R0
ADC R20, R1
ADC R21, R24
ADC R25, R23
MUL R16, R18
MOVW R18,R22
ADD R20, RO
ADC R21, R1
ADC R22, R25
ADC R22,

```
MUL R2, R7
MOVW R16, RO
MUL R2, R5
MOVW R14, RO
MUL R2, R6
ADD R15, R0
ADC R16, R1
ADC R17, R23
MUL R3, R7
MOVW R24, RO
MUL R3, R5
ADD R15, R0
ADC R16, R1
ADC R17, R24
ADC R25, R23
MUL R3, R6
ADD R16, R0
ADC R17, R1
ADC R18, R25
```

MUL R4, R7 MOVW R24, RO MUL R4, R5 ADD R16, R0 ADC R17, R1 ADC R18, R24 ADC R25, R23 MUL R4, R6 ADD R17, R0 ADC R18, R1 ADC R19, R25

## 48-bit Karatsuba on AVR

| ADD R8, R11 | add_M: |
| :--- | :---: |
| ADC R9, R12 | ADD R8, R14 |
| ADC R10, R13 | ADC R9, R15 |
| ADC R11, R20 | ADC R10, R16 |
| ADC R12, R21 | ADC R11, R17 |
| ADC R13, R22 | ADC R12, R18 |
| ADC R23, R23 | ADC R13, R19 |
|  | CLR R24, |
| EOR R26, R27 | ADC R23, R24 |
| BRNE add_M | NOP |
|  |  |
| SUB R8, R14 | final: |
| SBC R9, R15 | STD Z+3, R8 |
| SBC R10, R16 | STD Z+4, R9 |
| SBC R11, R17 | STD Z+5, R10 |
| SBC R12, R18 | STD Z+6, R11 |
| SBC R13, R19 | STD Z+7, R12 |
| SBCI R23, 0 | STD Z+8, R13 |
| SBC R24, R24 |  |
| RJMP final | ADD R20, R23 |
|  | ADC R21, R24 |
|  | ADC R22, R24 |
|  |  |
|  | STD Z+9, R20 |
|  | STD Z+10, R21 |

## Larger Karatsuba multiplication

- 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores


## Larger Karatsuba multiplication

- 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores
- Karatsuba structure needs additional temporary storage
- Good performance needs careful scheduling and register allocation
- Very important is to compute $\mathbf{H}=H+\left(l_{k+1}, \ldots, l_{n-1}\right)$ on the fly


## Larger Karatsuba multiplication

- 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores
- Karatsuba structure needs additional temporary storage
- Good performance needs careful scheduling and register allocation
- Very important is to compute $\mathbf{H}=H+\left(l_{k+1}, \ldots, l_{n-1}\right)$ on the fly
- Use 1-level Karatsuba for 48-bit, 64-bit, 80-bit, 96-bit inputs
- Use 2-level Karatsuba for 128-bit, 160-bit, 192-bit inputs
- Use 3-level Karatsuba for 256-bit inputs


## Results

Cycle counts for $n$-bit multiplication

|  | Input size $n$ |  |  |  |  |  |  |  |  |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :---: |
| Approach | 48 | 64 | 80 | 96 | 128 | 160 | 192 | 256 |  |
| Product scanning: | 235 | 395 | 595 | 836 | - | - | - | - |  |
| Hutter, Wenger, 2011: | - | - | - | - | - | 2393 | 3467 | 6121 |  |
| Seo, Kim, 2012: | - | - | - | - | 1532 | 2356 | 3464 | 6180 |  |
| Seo, Kim, 2013: | - | - | - | - | 1523 | 2341 | 3437 | 6115 |  |
| Karatsuba: | 217 | 360 | 522 | 780 | 1325 | $\mathbf{1 9 7 6}$ | 2923 | $\mathbf{4 7 9 7}$ |  |
| - w/o branches: | 222 | 368 | 533 | 800 | 1369 | 2030 | 2987 | 4961 |  |

- 160-bit multiplication now $>18 \%$ faster
- 256-bit multiplication now $>23 \%$ faster


## Resources online

## Paper:

Michael Hutter, Peter Schwabe. "Multiprecision multiplication on AVR revisited".
http://cryptojedi.org/papers/\#avrmul
Software: http://cryptojedi.org/crypto/\#avrmul

## Resources online

## Paper:

Michael Hutter, Peter Schwabe. "Multiprecision multiplication on AVR revisited".
http://cryptojedi.org/papers/\#avrmul
Software: http://cryptojedi.org/crypto/\#avrmul

Code generator for operand-scanning, product-scanning, hybrid, and operand-caching by Hutter and Wenger:
http://mhutter.org/research/avr/index.htm\#mulopcache

## PART II

Joint work with Daniel J. Bernstein, Chitchanok Chuengsatiansup, and Tanja Lange

## From 8-bit to 64 -bit processors

Main differences (for us)

- Arithmetic on larger (64-bit) integers


## From 8-bit to 64 -bit processors

Main differences (for us)

- Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers


## From 8 -bit to 64 -bit processors

## Main differences (for us)

- Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers
- Arithmetic on vectors


## From 8 -bit to 64 -bit processors

## Main differences (for us)

- Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers
- Arithmetic on vectors
- Pipelined and superscalar execution


## Radix- $2^{64}$ representation

- Let's consider representing 255-bit integers
- Obvious choice: use 464 -bit integers $a_{0}, a_{1}, a_{2}, a_{3}$ with

$$
A=\sum_{i=0}^{3} a_{i} 2^{64 i}
$$

- Arithmetic works just as before (except with larger registers)


## Radix- $2^{51}$ representation

- Radix- $2^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries


## Radix- $2^{51}$ representation

- Radix-2 ${ }^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6 !)


## Radix- $2^{51}$ representation

- Radix- $2^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6 !)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)


## Radix- $2^{51}$ representation

- Radix- $2^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6 !)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent $A$ as $\left(a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\right)$ with

$$
A=\sum_{i=0}^{4} a_{i} 2^{51 \cdot i}
$$

- This is called radix- $2^{51}$ representation


## Radix- $2^{51}$ representation

- Radix- $2^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6 !)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent $A$ as $\left(a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\right)$ with

$$
A=\sum_{i=0}^{4} a_{i} 2^{51 \cdot i}
$$

- This is called radix- $2^{51}$ representation
- Multiple ways to write the same integer $A$, for example $A=2^{52}$ :
- $\left(2^{52}, 0,0,0,0\right)$
- $(0,2,0,0,0)$


## Radix- $2^{51}$ representation

- Radix- $2^{64}$ representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6 !)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent $A$ as $\left(a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\right)$ with

$$
A=\sum_{i=0}^{4} a_{i} 2^{51 \cdot i}
$$

- This is called radix- $2^{51}$ representation
- Multiple ways to write the same integer $A$, for example $A=2^{52}$ :
- $\left(2^{52}, 0,0,0,0\right)$
- $(0,2,0,0,0)$
- Let's call a representation $\left(a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\right)$ reduced, if all $a_{i} \in\left[0, \ldots, 2^{52}-1\right]$


## Addition of two bigint255

```
typedef struct{
    unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                const bigint255 *x,
                    const bigint255 *y)
{
    r->a[0] = x->a[0] + y->a[0];
    r->a[1] = x->a[1] + y->a[1];
    r->a[2] = x->a[2] + y->a[2];
    r->a[3] = x->a[3] + y->a[3];
    r->a[4] = x->a[4] + y->a[4];
}
```


## Addition of two bigint255

```
typedef struct{
    unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
{
    r->a[0] = x->a[0] + y->a[0];
    r->a[1] = x->a[1] + y->a[1];
    r->a[2] = x->a[2] + y->a[2];
    r->a[3] = x->a[3] + y->a[3];
    r->a[4] = x->a[4] + y->a[4];
}
```

                        const bigint255 *x,
                            const bigint255 *y)
    - This definitely works for reduced inputs


## Addition of two bigint255

```
typedef struct{
    unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                        const bigint255 *x,
                            const bigint255 *y)
{
    r->a[0] = x->a[0] + y->a[0];
    r->a[1] = x->a[1] + y->a[1];
    r->a[2] = x->a[2] + y->a[2];
    r->a[3] = x->a[3] + y->a[3];
    r->a[4] = x->a[4] + y->a[4];
}
```

- This definitely works for reduced inputs
- This actually works as long as all coefficients are in $\left[0, \ldots, 2^{63}-1\right]$


## Addition of two bigint255

```
typedef struct{
    unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                        const bigint255 *y)
{
    r->a[0] = x->a[0] + y->a[0];
    r->a[1] = x->a[1] + y->a[1];
    r->a[2] = x->a[2] + y->a[2];
    r->a[3] = x->a[3] + y->a[3];
    r->a[4] = x->a[4] + y->a[4];
}
```

                        const bigint255 *x,
    - This definitely works for reduced inputs
- This actually works as long as all coefficients are in $\left[0, \ldots, 2^{63}-1\right]$
- We can do quite a few additions before we have to carry (reduce)


## Subtraction of two bigint255

```
typedef struct{
    signed long long a[5];
} bigint255;
void bigint255_sub(bigint255 *r,
{
    r->a[0] = x->a[0] - y->a[0];
    r->a[1] = x->a[1] - y->a[1];
    r->a[2] = x->a[2] - y->a[2];
    r->a[3] = x->a[3] - y->a[3];
    r->a[4] = x->a[4] - y->a[4];
}
```

                const bigint255 *x,
                const bigint255 *y)
    - Slightly update our bigint255 definition to work with signed 64-bit integers


## Subtraction of two bigint255

```
typedef struct{
    signed long long a[5];
} bigint255;
void bigint255_sub(bigint255 *r,
{
    r->a[0] = x->a[0] - y->a[0];
    r->a[1] = x->a[1] - y->a[1];
    r->a[2] = x->a[2] - y->a[2];
    r->a[3] = x->a[3] - y->a[3];
    r->a[4] = x->a[4] - y->a[4];
}
```

                const bigint255 *x,
                const bigint255 *y)
    - Slightly update our bigint255 definition to work with signed 64-bit integers
- Reduced if coefficients are in $\left[-2^{52}+1,2^{52}-1\right]$


## Carrying in radix- $2^{51}$

- With many additions, coefficients may grow larger than 63 bits
- They grow even faster with multiplication


## Carrying in radix- $2^{51}$

- With many additions, coefficients may grow larger than 63 bits
- They grow even faster with multiplication
- Eventually we have to carry en bloc:

```
signed long long carry \(=\) r.a[0] >> 51;
r.a[1] += carry;
carry <<= 51;
r.a[0] -= carry;
```


## Big integers and polynomials

- Note: Addition code would look exactly the same for 5-coefficient polynomial addition


## Big integers and polynomials

- Note: Addition code would look exactly the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in $\mathbb{Z}[x]$
- Inputs to addition are 5-coefficient polynomials


## Big integers and polynomials

- Note: Addition code would look exactly the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in $\mathbb{Z}[x]$
- Inputs to addition are 5-coefficient polynomials
- Nice thing about arithmetic $\mathbb{Z}[x]$ : no carries!


## Big integers and polynomials

- Note: Addition code would look exactly the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in $\mathbb{Z}[x]$
- Inputs to addition are 5-coefficient polynomials
- Nice thing about arithmetic $\mathbb{Z}[x]$ : no carries!
- To go from $\mathbb{Z}[x]$ to $\mathbb{Z}$, evaluate at the radix (this is a ring homomorphism)
- Carrying means evaluating at the radix


## Big integers and polynomials

- Note: Addition code would look exactly the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in $\mathbb{Z}[x]$
- Inputs to addition are 5-coefficient polynomials
- Nice thing about arithmetic $\mathbb{Z}[x]$ : no carries!
- To go from $\mathbb{Z}[x]$ to $\mathbb{Z}$, evaluate at the radix (this is a ring homomorphism)
- Carrying means evaluating at the radix
- Thinking of multiprecision integers as polynomials is very powerful for efficient arithmetic


## Using floating-point limbs

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

$$
(-1)^{s} \cdot\left(1 . b_{m-1} b_{m-2} \ldots b_{0}\right) \cdot 2^{e-t} \text { with } b_{i} \in\{0,1\}
$$

## Using floating-point limbs

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

$$
(-1)^{s} \cdot\left(1 . b_{m-1} b_{m-2} \ldots b_{0}\right) \cdot 2^{e-t} \text { with } b_{i} \in\{0,1\}
$$

- For double-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=52$ "mantissa bits"
- $e \in\{1, \ldots, 2046\}$ "exponent"
- $t=1023$


## Using floating-point limbs

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

$$
(-1)^{s} \cdot\left(1 . b_{m-1} b_{m-2} \ldots b_{0}\right) \cdot 2^{e-t} \text { with } b_{i} \in\{0,1\}
$$

- For double-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=52$ "mantissa bits"
- $e \in\{1, \ldots, 2046\}$ "exponent"
- $t=1023$
- For single-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=23$ "mantissa bits"
- $e \in\{1, \ldots, 254\}$ "exponent"
- $t=127$


## Using floating-point limbs

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

$$
(-1)^{s} \cdot\left(1 . b_{m-1} b_{m-2} \ldots b_{0}\right) \cdot 2^{e-t} \text { with } b_{i} \in\{0,1\}
$$

- For double-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=52$ "mantissa bits"
- $e \in\{1, \ldots, 2046\}$ "exponent"
- $t=1023$
- For single-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=23$ "mantissa bits"
- $e \in\{1, \ldots, 254\}$ "exponent"
- $t=127$
- Exponent $=0$ used to represent 0


## Using floating-point limbs

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

$$
(-1)^{s} \cdot\left(1 . b_{m-1} b_{m-2} \ldots b_{0}\right) \cdot 2^{e-t} \text { with } b_{i} \in\{0,1\}
$$

- For double-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=52$ "mantissa bits"
- $e \in\{1, \ldots, 2046\}$ "exponent"
- $t=1023$
- For single-precision floats:
- $s \in\{0,1\}$ "sign bit"
- $m=23$ "mantissa bits"
- $e \in\{1, \ldots, 254\}$ "exponent"
- $t=127$
- Exponent $=0$ used to represent 0
- Any number that can be represented like this, will be precise
- Other numbers will be rounded, according to a rounding mode


## Addition and subtraction

```
typedef struct{
    double a[12];
} bigint255;
void bigint255_add(bigint255 *r,
                const bigint255 *x,
                const bigint255 *y)
{
    int i;
    for(i=0;i<12;i++)
        r->a[i] = x->a[i] + y->a[i];
}
void bigint255_sub(bigint255 *r,
                const bigint255 *x,
                const bigint255 *y)
{
    int i;
    for(i=0;i<12;i++)
        r->a[i] = x->a[i] - y->a[i];
}
```


## Carrying

- For carrying integers we used a right shift (discard lowest bits)


## Carrying

- For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix $2^{22}$, multiply by $2^{-22}$
- This does not cut off lowest bits, need to round


## Carrying

- For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix $2^{22}$, multiply by $2^{-22}$
- This does not cut off lowest bits, need to round
- Some processors have efficient rounding instructions, e.g., vroundpd


## Carrying

- For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix $2^{22}$, multiply by $2^{-22}$
- This does not cut off lowest bits, need to round
- Some processors have efficient rounding instructions, e.g., vroundpd
- Otherwise (for double-precision):
- add constant $2^{52}+2^{51}$
- subtract constant $2^{52}+2^{51}$
- This will round the number to an integer according to the rounding mode (to nearest, towards zero, away from zero, or truncate)


## Vector arithmetic

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:


## Vector arithmetic

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
- 32-bit load throughput: 1 per cycle
- 32-bit add throughput: 3 per cycle
- 32-bit store throughput: 1 per cycle


## Vector arithmetic

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
- 32-bit load throughput: 1 per cycle
- 32-bit add throughput: 3 per cycle
- 32-bit store throughput: 1 per cycle
- 128-bit load throughput: 1 per cycle
- $4 \times 32$-bit add throughput: 2 per cycle
- 128 -bit store throughput: 1 per cycle


## Vector arithmetic

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
- 32-bit load throughput: 1 per cycle
- 32-bit add throughput: 3 per cycle
- 32-bit store throughput: 1 per cycle
- 128-bit load throughput: 1 per cycle
- $4 \times 32$-bit add throughput: 2 per cycle
- 128-bit store throughput: 1 per cycle
- Vector instructions are almost as fast as scalar instructions but do $4 \times$ the work


## Vector arithmetic

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
- 32-bit load throughput: 1 per cycle
- 32-bit add throughput: 3 per cycle
- 32-bit store throughput: 1 per cycle
- 128-bit load throughput: 1 per cycle
- $4 \times 32$-bit add throughput: 2 per cycle
- 128 -bit store throughput: 1 per cycle
- Vector instructions are almost as fast as scalar instructions but do $4 \times$ the work
- Some things are cost extra:
- Variably indexed loads (lookups) into vectors
- Data-dependent branches are expensive in SIMD
- Shuffeling entries in vectors


## Vectorizing EC scalar multiplication

Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set


## Vectorizing EC scalar multiplication

## Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set

Parallelism inside multiprecision arithmetic

- Addition (in redundant representation) is trivially vectorized
- Vectorizing multiplication needs many shuffles
- Vectorization "eats up" instruction-level parallelism


## Vectorizing EC scalar multiplication

## Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set

Parallelism inside multiprecision arithmetic

- Addition (in redundant representation) is trivially vectorized
- Vectorizing multiplication needs many shuffles
- Vectorization "eats up" instruction-level parallelism


## Parallelism inside EC arithmetic

- Vectorize independent multiplications in EC addition
- May still need some shuffles (after each block of operations)
- Efficiency depends on EC formulas


## Example: Montgomery ladder

function ladderstep $\left(x_{Q-P}, X_{P}, Z_{P}, X_{Q}, Z_{Q}\right)$
$t_{1} \leftarrow X_{P}+Z_{P}$
$t_{6} \leftarrow t_{1}^{2}$
$t_{2} \leftarrow X_{P}-Z_{P}$
$t_{7} \leftarrow t_{2}^{2}$
$t_{5} \leftarrow t_{6}-t_{7}$
$t_{3} \leftarrow X_{Q}+Z_{Q}$
$t_{4} \leftarrow X_{Q}-Z_{Q}$
$t_{8} \leftarrow t_{4} \cdot t_{1}$
$t_{9} \leftarrow t_{3} \cdot t_{2}$
$X_{P+Q} \leftarrow\left(t_{8}+t_{9}\right)^{2}$
$Z_{P+Q} \leftarrow x_{Q-P} \cdot\left(t_{8}-t_{9}\right)^{2}$
$X_{[2] P} \leftarrow t_{6} \cdot t_{7}$
$Z_{[2] P} \leftarrow t_{5} \cdot\left(t_{7}+((A+2) / 4) \cdot t_{5}\right)$
return $\left(X_{[2] P}, Z_{[2] P}, X_{P+Q}, Z_{P+Q}\right)$
end function

## Example: Montgomery ladder

function ladderstep $\left(x_{Q-P}, X_{P}, Z_{P}, X_{Q}, Z_{Q}\right)$

$$
\begin{aligned}
& \qquad \begin{array}{l}
t_{1} \leftarrow X_{P}+Z_{P} ; t_{2} \leftarrow X_{P}-Z_{P} ; t_{3} \leftarrow X_{Q}+Z_{Q} ; t_{4} \leftarrow X_{Q}-Z_{Q} \\
t_{6} \leftarrow t_{1} \cdot t_{1} ; t_{7} \leftarrow t_{2} \cdot t_{2} ; t_{8} \leftarrow t_{4} \cdot t_{1} ; t_{9} \leftarrow t_{3} \cdot t_{2}
\end{array} \\
& \left.\begin{array}{l}
t_{10} \leftarrow((A+2) / 4) \cdot t_{6} \\
t_{11}
\end{array}\right) \\
& \quad t_{5} \leftarrow((A+2) / 4-1) \cdot t_{7} \\
& \quad Z_{[2] P} \leftarrow t_{6}-t_{7} ; t_{4} \leftarrow t_{10}-t_{4} ; t_{11} ; t_{1} \leftarrow t_{8}-t_{9} ; t_{0} \leftarrow t_{8}+t_{9}
\end{aligned} \quad \begin{aligned}
& Z_{P+Q} \leftarrow t_{0}^{2} ; X_{[2] P} \leftarrow t_{6} \cdot t_{7} ; t_{2} \leftarrow t_{1} \cdot t_{1} \\
& \text { return }\left(X_{[2] P}, Z_{[2] P}, X_{P+Q}, Z_{P+Q}\right) \\
& \text { end function }
\end{aligned}
$$

## A better candidate: Kummer surfaces

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation


## A better candidate: Kummer surfaces

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
- Group modulo negation
- Map from group to Kummer surface by rational map $X$
- Elements represented projectively as $(x: y: z: t)$
- $(x: y: z: t)=(r x: r y: r z: r t)$ for any $r \neq 0$
- Efficient doubling and efficient differential addition


## A better candidate: Kummer surfaces

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
- Group modulo negation
- Map from group to Kummer surface by rational map $X$
- Elements represented projectively as $(x: y: z: t)$
- $(x: y: z: t)=(r x: r y: r z: r t)$ for any $r \neq 0$
- Efficient doubling and efficient differential addition
- Ladderstep: gets as input $X(P)=\left(x_{2}: y_{2}: z_{2}: t_{2}\right)$, $X(Q)=\left(x_{3}: y_{3}: z_{3}: t_{3}\right)$, and $X(Q-P)=\left(x_{1}: y_{1}: z_{1}: t_{1}\right)$
- Computes $X(2 P)=\left(x_{4}: y_{4}: z_{4}: t_{4}\right)$
- Computes $X(P+Q)=\left(x_{5}: y_{5}: z_{5}: t_{5}\right)$


## A better candidate: Kummer surfaces

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
- Group modulo negation
- Map from group to Kummer surface by rational map $X$
- Elements represented projectively as $(x: y: z: t)$
- $(x: y: z: t)=(r x: r y: r z: r t)$ for any $r \neq 0$
- Efficient doubling and efficient differential addition
- Ladderstep: gets as input $X(P)=\left(x_{2}: y_{2}: z_{2}: t_{2}\right)$, $X(Q)=\left(x_{3}: y_{3}: z_{3}: t_{3}\right)$, and $X(Q-P)=\left(x_{1}: y_{1}: z_{1}: t_{1}\right)$
- Computes $X(2 P)=\left(x_{4}: y_{4}: z_{4}: t_{4}\right)$
- Computes $X(P+Q)=\left(x_{5}: y_{5}: z_{5}: t_{5}\right)$
- Coordinates are elements of a (large) finite field
- Efficient Kummer surface arithmetic wants efficient multiprecision arithmetic


## Arithmetic on the Kummer surface


$10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ ladder formulas

## Arithmetic on the Kummer surface


$10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ ladder formulas

$7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ ladder formulas

## The "squared Kummer surface"

- In fact, we use arithmetic on a different, "squared" surface
- Each point $(x: y: z: t)$ on the original surface corresponds to $\left(x^{2}: y^{2}: z^{2}: t^{2}\right)$ on the squared surface
- No operation-count advantages
- Easier to construct squared surface with small constants
- In the following rename $\left(x^{2}: y^{2}: z^{2}: t^{2}\right)$ to $(x: y: z: t)$


## Arithmetic on the squared Kummer surface


$10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ ladder formulas

## Arithmetic on the squared Kummer surface


$10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ ladder formulas

$7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ ladder formulas

## Arithmetic on the (original) Kummer surface


$10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ ladder formulas

$7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ ladder formulas

## A suitable Kummer surface

- Formulas for efficient Kummer surface arithmetic known for a while
- Originally proposed by Chudnovsky, Chudnovsky, 1986
- $10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ formulas by Gaudry, 2006
- $7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ formulas by Bernstein, 2006


## A suitable Kummer surface

- Formulas for efficient Kummer surface arithmetic known for a while
- Originally proposed by Chudnovsky, Chudnovsky, 1986
- $10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ formulas by Gaudry, 2006
- $7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ formulas by Bernstein, 2006
- Problem: find cryptographically secure surface with small constants


## A suitable Kummer surface

- Formulas for efficient Kummer surface arithmetic known for a while
- Originally proposed by Chudnovsky, Chudnovsky, 1986
- $10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ formulas by Gaudry, 2006
- $7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ formulas by Bernstein, 2006
- Problem: find cryptographically secure surface with small constants
- Gaudry, Schost, 2012: suitable (squared) surface:
- Defined over the field $\mathbb{F}_{2^{127}-1}$
- $\left(1: a^{2} / b^{2}: a^{2} / c^{2}: a^{2} / d^{2}\right)=(-114: 57: 66: 418)$
- $\left(1: A^{2} / B^{2}: A^{2} / C^{2}: A^{2} / D^{2}\right)=(-833: 2499: 1617: 561)$


## A suitable Kummer surface

- Formulas for efficient Kummer surface arithmetic known for a while
- Originally proposed by Chudnovsky, Chudnovsky, 1986
- $10 \mathbf{M}+9 \mathbf{S}+6 \mathbf{m}$ formulas by Gaudry, 2006
- $7 \mathbf{M}+12 \mathbf{S}+9 \mathbf{m}$ formulas by Bernstein, 2006
- Problem: find cryptographically secure surface with small constants
- Gaudry, Schost, 2012: suitable (squared) surface:
- Defined over the field $\mathbb{F}_{2^{127}-1}$
- $\left(1: a^{2} / b^{2}: a^{2} / c^{2}: a^{2} / d^{2}\right)=(-114: 57: 66: 418)$
- $\left(1: A^{2} / B^{2}: A^{2} / C^{2}: A^{2} / D^{2}\right)=(-833: 2499: 1617: 561)$
- Finding this surface cost 1000000 CPU hours
- The same surface has been used by Bos, Costello, Hisil, and Lauter (Eurocrypt 2013)


## Sandy Bridge/lvy Bridge

- Pre-latest microarchitectures by Intel
- Very powerful vector instructions: AVX
- Operations on 256 -bit vector registers


## Sandy Bridge/lvy Bridge

- Pre-latest microarchitectures by Intel
- Very powerful vector instructions: AVX
- Operations on 256 -bit vector registers
- Can only do vectors of (single- or double-precision) floats
- Performance:
- 1 vectorized double-precision multiplication per cycle, plus
- 1 vectorized double-precision addition per cycle
- A total of 8 double-precision operations per cycle!


## Representing elements of $\mathbb{F}_{2^{127}-1}$

- Represent an element $A$ in radix- $2^{127 / 6}$
- Write $A$ as $a_{0}, a_{1}, a_{2}, a_{3}, a_{4}, a_{5}$, where
- $a_{0}$ is a small multiple of $2^{0}$
- $a_{1}$ is a small multiple of $2^{22}$
- $a_{2}$ is a small multiple of $2^{43}$
- $a_{3}$ is a small multiple of $2^{64}$
- $a_{4}$ is a small multiple of $2^{85}$
- $a_{5}$ is a small multiple of $2^{106}$


## Multiplication

- Consider multiplication of $A$ and $B$ with reduction $\bmod 2^{127}-1$
- Make use of the fact that $2^{127} \equiv 1$
- With radix $2^{127 / 6}$ we obtain:

$$
\begin{array}{lllll}
r_{0}=a_{0} b_{0}+2^{-127} a_{1} b_{5}+2^{-127} a_{2} b_{4}+2^{-127} a_{3} b_{3}+2^{-127} a_{4} b_{2}+2^{-127} a_{5} b_{1} \\
r_{1}=a_{0} b_{1}+ & a_{1} b_{0}+2^{-127} a_{2} b_{5}+2^{-127} a_{3} b_{4}+2^{-127} a_{4} b_{3}+2^{-127} a_{5} b_{2} \\
r_{2} & =a_{0} b_{2}+ & a_{1} b_{1}+ & a_{2} b_{0}+2^{-127} a_{3} b_{5}+2^{-127} a_{4} b_{4}+2^{-127} a_{5} b_{3} \\
r_{3} & =a_{0} b_{3}+ & a_{1} b_{2}+ & a_{2} b_{1}+ & a_{3} b_{0}+2^{-127} a_{4} b_{5}+2^{-127} a_{5} b_{4} \\
r_{4} & =a_{0} b_{4}+ & a_{1} b_{3}+ & a_{2} b_{2}+ & a_{3} b_{1}+ \\
r_{5} & =a_{0} b_{5}+ & a_{1} b_{4}+ & a_{2} b_{3}+ & a_{3} b_{2}+ \\
a_{4} b_{1}+ & a_{5} b_{5} b_{0}
\end{array}
$$

## Multiplication

- Consider multiplication of $A$ and $B$ with reduction $\bmod 2^{127}-1$
- Make use of the fact that $2^{127} \equiv 1$
- With radix $2^{127 / 6}$ we obtain:

$$
\begin{array}{lllll}
r_{0}=a_{0} b_{0}+2^{-127} a_{1} b_{5}+2^{-127} a_{2} b_{4}+2^{-127} a_{3} b_{3}+2^{-127} a_{4} b_{2}+2^{-127} a_{5} b_{1} \\
r_{1}=a_{0} b_{1}+ & a_{1} b_{0}+2^{-127} a_{2} b_{5}+2^{-127} a_{3} b_{4}+2^{-127} a_{4} b_{3}+2^{-127} a_{5} b_{2} \\
r_{2} & =a_{0} b_{2}+ & a_{1} b_{1}+ & a_{2} b_{0}+2^{-127} a_{3} b_{5}+2^{-127} a_{4} b_{4}+2^{-127} a_{5} b_{3} \\
r_{3} & =a_{0} b_{3}+ & a_{1} b_{2}+ & a_{2} b_{1}+ & a_{3} b_{0}+2^{-127} a_{4} b_{5}+2^{-127} a_{5} b_{4} \\
r_{4} & =a_{0} b_{4}+ & a_{1} b_{3}+ & a_{2} b_{2}+ & a_{3} b_{1}+ \\
r_{5} & =a_{0} b_{5}+ & a_{1} b_{4}+ & a_{2} b_{3}+ & a_{3} b_{2}+ \\
a_{4} b_{1}+ & a_{5} a_{5} b_{0}
\end{array}
$$

- Obviously, we always perform this whole thing $4 \times$ in parallel


## Multiplication

- Consider multiplication of $A$ and $B$ with reduction $\bmod 2^{127}-1$
- Make use of the fact that $2^{127} \equiv 1$
- With radix $2^{127 / 6}$ we obtain:

$$
\begin{array}{lllll}
r_{0} & =a_{0} b_{0}+2^{-127} a_{1} b_{5}+2^{-127} a_{2} b_{4}+2^{-127} a_{3} b_{3}+2^{-127} a_{4} b_{2}+2^{-127} a_{5} b_{1} \\
r_{1} & =a_{0} b_{1}+ & a_{1} b_{0}+2^{-127} a_{2} b_{5}+2^{-127} a_{3} b_{4}+2^{-127} a_{4} b_{3}+2^{-127} a_{5} b_{2} \\
r_{2} & =a_{0} b_{2}+ & a_{1} b_{1}+ & a_{2} b_{0}+2^{-127} a_{3} b_{5}+2^{-127} a_{4} b_{4}+2^{-127} a_{5} b_{3} \\
r_{3} & =a_{0} b_{3}+ & a_{1} b_{2}+ & a_{2} b_{1}+ & a_{3} b_{0}+2^{-127} a_{4} b_{5}+2^{-127} a_{5} b_{4} \\
r_{4} & =a_{0} b_{4}+ & a_{1} b_{3}+ & a_{2} b_{2}+ & a_{3} b_{1}+ \\
r_{5} & =a_{0} b_{5}+ & a_{1} b_{4}+ & a_{2} b_{3}+ & a_{3} b_{2}+ \\
a_{4} b_{1}+ & a_{5} a_{5} b_{0}
\end{array}
$$

- Obviously, we always perform this whole thing $4 \times$ in parallel
- Obviously, we specialize squaring


## Multiplication

- Consider multiplication of $A$ and $B$ with reduction $\bmod 2^{127}-1$
- Make use of the fact that $2^{127} \equiv 1$
- With radix $2^{127 / 6}$ we obtain:

$$
\begin{array}{lllll}
r_{0} & =a_{0} b_{0}+2^{-127} a_{1} b_{5}+2^{-127} a_{2} b_{4}+2^{-127} a_{3} b_{3}+2^{-127} a_{4} b_{2}+2^{-127} a_{5} b_{1} \\
r_{1} & =a_{0} b_{1}+ & a_{1} b_{0}+2^{-127} a_{2} b_{5}+2^{-127} a_{3} b_{4}+2^{-127} a_{4} b_{3}+2^{-127} a_{5} b_{2} \\
r_{2} & =a_{0} b_{2}+ & a_{1} b_{1}+ & a_{2} b_{0}+2^{-127} a_{3} b_{5}+2^{-127} a_{4} b_{4}+2^{-127} a_{5} b_{3} \\
r_{3} & =a_{0} b_{3}+ & a_{1} b_{2}+ & a_{2} b_{1}+ & a_{3} b_{0}+2^{-127} a_{4} b_{5}+2^{-127} a_{5} b_{4} \\
r_{4} & =a_{0} b_{4}+ & a_{1} b_{3}+ & a_{2} b_{2}+ & a_{3} b_{1}+ \\
r_{5} & =a_{0} b_{5} b_{0}+ & a_{1} b_{4}+ & a_{2} b_{3}+ & a_{3} b_{2}+ \\
a_{4} b_{1}+ & a_{5} b_{5} b_{0}
\end{array}
$$

- Obviously, we always perform this whole thing $4 \times$ in parallel
- Obviously, we specialize squaring
- Obviously, we specialize multiplications by small constants


## The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive


## The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive


## Permuted and negated Hadamard

- Allow generalized Hadamard to output permuted vector
- Self-inverting permutation "cleans" after two generalized Hadamards


## The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive


## Permuted and negated Hadamard

- Allow generalized Hadamard to output permuted vector
- Self-inverting permutation "cleans" after two generalized Hadamards
- Allow generalized Hadamard to negate vector entries
- "Clean" negations by multiplication by negated constants


## Arithmetic on the squared Kummer surface



## Results

128-bit secure, constant-time scalar multiplication

| arch | cycles | open | $g$ | source of software |
| :---: | :---: | :---: | :---: | :---: |
| Sandy | 194036 | yes | 1 | Bernstein-Duif-Lange-Schwabe- Yang CHES 2011 |
| Sandy | 153000? | no | 1 | Hamburg |
| Sandy | 137000? | no | 1 | Longa-Sica Asiacrypt 2012 |
| Sandy | 122716 | yes | 2 | Bos-Costello-Hisil-Lauter Eurocrypt 2013 |
| Sandy | 119904 | yes | 1 | Oliveira-López-Aranha-Rodríguez- <br> Henríquez <br> CHES 2013 |
| Sandy | 96000? | no | 1 | Faz-Hernández-Longa-Sánchez CTRSA 2014 |
| Sandy | 92000? | no | 1 | Faz-Hernández-Longa-Sánchez July 2014 |
| Sandy | 88916 | yes | 2 | new (our results) |

## Results

128-bit secure, constant-time scalar multiplication

| arch | cycles | open | $g$ | source of software |
| :---: | :---: | :---: | :---: | :---: |
| Ivy | 182708 | yes | 1 | Bernstein-Duif-Lange-Schwabe-Yang CHES 2011 |
| Ivy | 145000? | yes | 1 | Costello-Hisil-Smith Eurocrypt 2014 |
| Ivy | 119032 | yes | 2 | Bos-Costello-Hisil-Lauter Eurocrypt 2013 |
| Ivy | 114036 | yes | 1 | Oliveira-López-Aranha-Rodríguez- <br> Henríquez CHES 2013 |
| Ivy | 92000? | no | 1 | Faz-Hernández-Longa-Sánchez CT- |
| Ivy | 89000? | no | 1 | Faz-Hernández-Longa-Sánchez July 2014 |
| Ivy | 88448 | yes | 2 | new (our results) |

## More results

Also optimized for Intel Haswell
$\left.\begin{array}{|l|l|l|l|l|}\hline \text { arch } & \text { cycles } & \text { open } & g & \text { source of software } \\ \hline \text { Haswell } & 145907 & \text { yes } & 1 & \begin{array}{l}\text { Bernstein-Duif-Lange- } \\ \text { Schwabe-Yang CHES 2011 } \\ \text { Haswell } \\ \text { Haswell } \\ 100895\end{array} \\ 55595 & \text { yes } & 2 & \text { no } & 1 \\ \text { Bos-Costello-Hisi-Lauter } \\ \text { Eurocrypt 2013 } \\ \text { Oliveira-López-Aranha- } \\ \text { Rodríguez-Henríquez } \\ \text { CHES 2013 }\end{array}\right\}$

## Even more results

Also optimized for ARM Cortex-A8

| arch | cycles | open | $g$ | source of software |
| :--- | :--- | :--- | :--- | :--- |
| A8-slow | 497389 | yes | 1 | Bernstein-Schwabe CHES 2012 |
| A8-slow | 305395 | yes | 2 | new (our result) |
| A8-fast | 460200 | yes | 1 | Bernstein-Schwabe CHES 2012 |
| A8-fast | 273349 | yes | 2 | new (our result) |

## Resources online

## Paper:

Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Peter Schwabe. "Kummer strikes back: new DH speed records".
To be presented at Asiacrypt 2014
http://cryptojedi.org/papers/\#kummer

## Software:

Included in SUPERCOP, subdirectory crypto_scalarmult/kummer/ http://bench.cr.yp.to/supercop.html

