### Multiprecision arithmetic

Peter Schwabe



October 20, 2014

SPACE 2014, Pune, India

Multiprecision arithmetic (from primary school to Asiacrypt)

Peter Schwabe



October 20, 2014

SPACE 2014, Pune, India

# PART I

Joint work with Michael Hutter

Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8, 9

#### Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8, 9

Addition 3+5 = ?2+7 = ?4+3 = ?

#### Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8, 9

| Addition  | Subtraction |
|-----------|-------------|
| 3 + 5 = ? | 7 - 5 = ?   |
| 2 + 7 = ? | 5 - 1 = ?   |
| 4 + 3 = ? | 9 - 3 = ?   |

#### Available numbers (digits): (0), 1, 2, 3, 4, 5, 6, 7, 8, 9

| Addition  | Subtraction |
|-----------|-------------|
| 3 + 5 = ? | 7 - 5 = ?   |
| 2 + 7 = ? | 5 - 1 = ?   |
| 4 + 3 = ? | 9 - 3 = ?   |

- All results are in the set of available numbers
- No confusion for first-year school kids

Available numbers:  $0, 1, \ldots, 255$ 

Available numbers:  $0, 1, \ldots, 255$ 

#### Addition

uint8\_t a = 42; uint8\_t b = 89; uint8\_t r = a + b;

Available numbers:  $0, 1, \ldots, 255$ 

#### Addition

| uint8_t | а | = | 42 | 2; |    |
|---------|---|---|----|----|----|
| uint8_t | b | = | 88 | Э; |    |
| uint8_t | r | = | a  | +  | b; |

#### Subtraction

| uint8_t | а | = | 157 | ;  |
|---------|---|---|-----|----|
| uint8_t | b | = | 23; |    |
| uint8_t | r | = | a - | b; |

Available numbers:  $0, 1, \ldots, 255$ 

| Addition             | Subtraction          |
|----------------------|----------------------|
| uint8_t a = 42;      | uint8_t a = 157;     |
| uint8_t b = 89;      | uint8_t b = $23;$    |
| uint8_t $r = a + b;$ | uint8_t $r = a - b;$ |

- All results are in the set of available numbers
- Larger set of available numbers: uint16\_t, uint32\_t, uint64\_t
- Basic principle is the same; for the moment stick with uint8\_t

Crossing the ten barrier

6+5=?9+7=?4+8=?

Crossing the ten barrier

6+5=?9+7=?4+8=?

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9

Crossing the ten barrier

6+5=?9+7=?4+8=?

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9
- Addition is allowed to produce a carry

Crossing the ten barrier

6+5 = ?9+7 = ?4+8 = ?

- Inputs to addition are still from the set of available numbers
- Results are allowed to be larger than 9
- Addition is allowed to produce a carry

#### What happens with the carry?

- Introduce the decimal positional system
- Write an integer A in two digits  $a_1a_0$  with

$$A = 10 \cdot a_1 + \cdot a_0$$

• Note that at the moment 
$$a_1 \in \{0, 1\}$$

### ... back to programming

uint8\_t a = 184; uint8\_t b = 203; uint8\_t r = a + b;

### ... back to programming

uint8\_t a = 184; uint8\_t b = 203; uint8\_t r = a + b;

- ▶ The result r now has the value of 131
- ▶ The carry is lost, what do we do?

### ... back to programming

```
uint8_t a = 184;
uint8_t b = 203;
uint8_t r = a + b;
```

- ▶ The result r now has the value of 131
- The carry is lost, what do we do?
- Could cast to uint16\_t, uint32\_t etc., but that solves the problem only for this uint8\_t example
- We really want to obtain the carry, and put it into another uint8\_t

# The AVR ATmega

- 8-bit RISC architecture
- ▶ 32 registers R0...R31, some of those are "special":
  - (R26,R27) aliased as X
  - (R28,R29) aliased as Y
  - (R30,R31) aliased as Z
  - X, Y, Z are used for addressing
  - 2-byte output of a multiplication always in R0,R1
- Most arithmetic instructions cost 1 cycle
- Multiplication and memory access takes 2 cycles

#### 184 + 203

LDI R5, 184 LDI R6, 203 ADD R5, R6 ; result in R5, sets carry flag CLR R6 ; set R6 to zero ADC R6,R6 ; add with carry, R6 now holds the carry

#### Addition

#### Addition

|   | 7862 |
|---|------|
| + | 5275 |
| + | 7    |

#### Addition

|   | 7862 |
|---|------|
| + | 5275 |
| + | 37   |

#### Addition

|   | 7862 |
|---|------|
| + | 5275 |
| + | 137  |

#### Addition

|   | 7862  |
|---|-------|
| + | 5275  |
| + | 13137 |

#### Addition

42 + 78 = ? 789 + 543 = ?7862 + 5275 = ?  Once school kids can add beyond 1000, they can add arbitrary numbers

|   | 7862  |
|---|-------|
| + | 5275  |
| + | 13137 |

#### Multiprecision addition is old

"Oh Līlāvatī, intelligent girl, if you understand addition and subtraction, tell me the sum of the amounts 2, 5, 32, 193, 18, 10, and 100, as well as [the remainder of] those when subtracted from 10000."

—"Līlāvatī" by Bhāskara (1150)

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
  - Elliptic curves defined over finite fields
  - Typically use EC over large-characteristic prime fields
  - ▶ Typical field sizes: (160 bits, 192 bits), 256 bits, 414 bits ...

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
  - Elliptic curves defined over finite fields
  - Typically use EC over large-characteristic prime fields
  - ▶ Typical field sizes: (160 bits, 192 bits), 256 bits, 414 bits ...
- Example 3:
  - Genus-2 hyperelliptic curves defined over finite fields
  - Typically use HEC over large-characteristic prime fields
  - ▶ Field size for 128-bit security: 128 bits

- Mainly asymmetric cryptography heavily relies on multiprecision arithmetic
- Example 1: RSA-2048 needs (modular) multiplication and squaring of 2048-bit numbers
- Example 2:
  - Elliptic curves defined over finite fields
  - Typically use EC over large-characteristic prime fields
  - ▶ Typical field sizes: (160 bits, 192 bits), 256 bits, 414 bits ...
- Example 3:
  - Genus-2 hyperelliptic curves defined over finite fields
  - Typically use HEC over large-characteristic prime fields
  - ▶ Field size for 128-bit security: 128 bits
- ▶ For now mainly interested in 160-bit and 256-bit arithmetic

#### AVR multiprecision addition...

- Add two *n*-byte numbers, returning an n+1 byte result:
- Input pointers X,Y, output pointer Z

| LD R5,X+<br>LD R6,Y+<br>ADD R5,R6<br>ST Z+,R5 | LD R5,X+<br>LD R6,Y+<br>ADC R5,R6<br>ST Z+,R5 | CLR R5<br>ADC R5,R5<br>ST Z+,R5 |
|-----------------------------------------------|-----------------------------------------------|---------------------------------|
| LD R5,X+<br>LD R6,Y+<br>ADC R5,R6<br>ST Z+,R5 | LD R5,X+<br>LD R6,Y+<br>ADC R5,R6<br>ST Z+,R5 |                                 |

. . .

#### ... and subtraction

- ▶ Add two *n*-byte numbers, returning an n + 1 byte result:
- Input pointers X,Y, output pointer Z
- Use highest byte = -1 to indicate negative result

| LD R5,X+  | CLR R5                                                                 |
|-----------|------------------------------------------------------------------------|
| LD R6,Y+  | SBC R5,R5                                                              |
| SBC R5,R6 | ST Z+,R5                                                               |
| ST Z+,R5  |                                                                        |
|           |                                                                        |
| LD R5,X+  |                                                                        |
| LD R6,Y+  |                                                                        |
| SBC R5,R6 |                                                                        |
| ST Z+,R5  |                                                                        |
|           | LD R6,Y+<br>SBC R5,R6<br>ST Z+,R5<br>LD R5,X+<br>LD R6,Y+<br>SBC R5,R6 |

. . .

How about multiplication?

 $\blacktriangleright$  Consider multiplication of 1234 by 789

How about multiplication?

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{6}$ 

How about multiplication?

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{06}$ 

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{106}$ 

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{11106}$ 

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{11106}$ 9872

 $\blacktriangleright$  Consider multiplication of 1234 by 789

| $1234\cdot 789$ |
|-----------------|
| 11106           |
| 9872            |
| 8638            |

 $\blacktriangleright$  Consider multiplication of 1234 by 789

|   | $1234\cdot789$ |
|---|----------------|
|   | 11106          |
| + | 9872           |
| + | 8638           |
|   | 973626         |

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{11106}$ 

#### $\blacktriangleright$ Consider multiplication of 1234 by 789

|   | $1234\cdot789$ |
|---|----------------|
|   | 11106          |
| + | 9872           |

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234 \cdot 789}{20978}$ 

#### $\blacktriangleright$ Consider multiplication of 1234 by 789

|   | $1234\cdot789$ |
|---|----------------|
|   | 20978          |
| + | 8638           |

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234\cdot789}{973626}$ 

 $\blacktriangleright$  Consider multiplication of 1234 by 789

 $\frac{1234\cdot789}{973626}$ 

- This is also an old technique
- Earliest reference I could find is again the Līlāvatī (1150)

#### Let's do that on the AVR

LD R2, X+ LD R3, X+ LD R4, X+ LD R7, Y+ MUL R2,R7 ST Z+,RO MOV R8,R1 MUL R3,R7 ADD R8,R0 CLR R9 ADC R9,R1 MUL R4,R7 ADD R9,R0 CLR R10 ADC R10,R1

# Let's do that on the $\ensuremath{\mathsf{AVR}}$

| LD R2, X+<br>LD R3, X+ | LD R7, Y+                |
|------------------------|--------------------------|
| LD R4, X+              | MUL R2,R7<br>MOVW R12,R0 |
| LD R7, Y+              | 1000 1012,100            |
| 22 101, 1              | MUL R3,R7                |
| MUL R2,R7              | ADD R13,R0               |
| ST Z+,RO               | CLR R14                  |
| MOV R8,R1              | ADC R14,R1               |
| MUL R3,R7              | MUL R4,R7                |
| ADD R8,R0              | ADD R14,R0               |
| CLR R9                 | CLR R15                  |
| ADC R9,R1              | ADC R15,R1               |
|                        |                          |
| MUL R4,R7              | ADD R8,R12               |
| ADD R9,R0              | ST Z+,R8                 |
| CLR R10                | ADC R9,R13               |
| ADC R10,R1             | ADC R10,R14              |
|                        | CLR R11                  |
|                        | ADC R11,R15              |

## Let's do that on the AVR

| LD R2, X+<br>LD R3, X+ | LD R7, Y+     | LD R7, Y+      |
|------------------------|---------------|----------------|
| LD R3, X+<br>LD R4, X+ | MUL R2,R7     | MUL R2,R7      |
| LD IGI, A.             | MOVW R12,R0   | MOVW R12,R0    |
| LD R7, Y+              | 1000 1012,100 | 11071 1112,110 |
| , _                    | MUL R3,R7     | MUL R3,R7      |
| MUL R2,R7              | ADD R13,R0    | ADD R13,R0     |
| ST Z+,RO               | CLR R14       | CLR R14        |
| MOV R8,R1              | ADC R14,R1    | ADC R14,R1     |
|                        |               |                |
| MUL R3,R7              | MUL R4,R7     | MUL R4,R7      |
| ADD R8,R0              | ADD R14,R0    | ADD R14,R0     |
| CLR R9                 | CLR R15       | CLR R15        |
| ADC R9,R1              | ADC R15,R1    | ADC R15,R1     |
|                        |               |                |
| MUL R4,R7              | ADD R8,R12    | ADC R9,R12     |
| ADD R9,R0              | ST Z+,R8      | ST Z+,R9       |
| CLR R10                | ADC R9,R13    | ADC R10,R13    |
| ADC R10,R1             | ADC R10,R14   | ADC R11,R14    |
|                        | CLR R11       | CLR R12        |
|                        | ADC R11,R15   | ADC R12,R15    |

## Let's do that on the $\ensuremath{\mathsf{AVR}}$

| LD R2, X+<br>LD R3, X+ | LD R7, Y+   | LD R7, Y+   | ST Z+,R10<br>ST Z+,R11 |
|------------------------|-------------|-------------|------------------------|
| LD R4, X+              | MUL R2,R7   | MUL R2,R7   | ST Z+,R12              |
|                        | MOVW R12,RO | MOVW R12,RO |                        |
| LD R7, Y+              | -           | -           |                        |
|                        | MUL R3,R7   | MUL R3,R7   |                        |
| MUL R2,R7              | ADD R13,R0  | ADD R13,R0  |                        |
| ST Z+,RO               | CLR R14     | CLR R14     |                        |
| MOV R8,R1              | ADC R14,R1  | ADC R14,R1  |                        |
|                        |             |             |                        |
| MUL R3,R7              | MUL R4,R7   | MUL R4,R7   |                        |
| ADD R8,R0              | ADD R14,R0  | ADD R14,RO  |                        |
| CLR R9                 | CLR R15     | CLR R15     |                        |
| ADC R9,R1              | ADC R15,R1  | ADC R15,R1  |                        |
|                        |             |             |                        |
| MUL R4,R7              | ADD R8,R12  | ADC R9,R12  |                        |
| ADD R9,R0              | ST Z+,R8    | ST Z+,R9    |                        |
| CLR R10                | ADC R9,R13  | ADC R10,R13 |                        |
| ADC R10,R1             | ADC R10,R14 | ADC R11,R14 |                        |
|                        | CLR R11     | CLR R12     |                        |
|                        | ADC R11,R15 | ADC R12,R15 |                        |

# Let's do that on the $\ensuremath{\mathsf{AVR}}$

▶ Problem: Need 3n + c registers for  $n \times n$ -byte multiplication

## Let's do that on the AVR

- ▶ Problem: Need 3n + c registers for  $n \times n$ -byte multiplication
- Can add on the fly, get down to 2n + c, but more carry handling

"Again as the information is understood, the multiplication of 2345 by 6789 is proposed; therefore the numbers are written down; the 5 is multiplied by the 9, there will be 45; the 5 is put, the 4 is kept; and the 5 is multiplied by the 8, and the 9 by the 4 and the products are added to the kept 4; there will be 80; the 0 is put and the 8 is kept; and the 5 is multiplied by the 7 and the 9 by the 2 and the 4 by the 8, and the products are added to the kept 8; there will be 102; the 2 is put and the 10 is kept in hand..."

From "Fibonacci's Liber Abaci", Chapter 2 (English translation by Sigler)

# $\label{eq:product scanning on the AVR} Product \ \text{scanning on the AVR}$

| LD R2, X+    | MUL R2, R9   | MUL R3, R9       |
|--------------|--------------|------------------|
| LD R3, X+    | ADD R14, RO  | ADD R15, RO      |
| LD R4, X+    | ADC R15, R1  | ADC R16, R1      |
| LD R7, Y+    | ADC R16, R5  | ADC R17, R5      |
| LD R8, Y+    | MUL R3, R8   | MUL R4, R8       |
| LD R9, Y+    | ADD R14, RO  | ADD R15, RO      |
|              | ADC R15, R1  | ADC R16, R1      |
|              | ADC R16, R5  | ADC R17, R5      |
| MUL R2, R7   | MUL R4, R7   | STD Z+3, R15     |
| MOV R13, R1  | ADD R14, RO  |                  |
| STD Z+O, RO  | ADC R15, R1  | MUL R4, R9       |
| CLR R14      | ADC R16, R5  | ADD R16, RO      |
| CLR R15      | STD Z+2, R14 | ADC R17, R1      |
|              | CLR R17      | STD Z+4, R16     |
| MUL R2, R8   |              |                  |
| ADD R13, RO  |              | STD Z+5, R17     |
| ADC R14, R1  |              |                  |
| MUL R3, R7   |              |                  |
| ADD R13, RO  |              |                  |
| ADC R14, R1  |              |                  |
| ADC R15, R5  |              |                  |
| STD Z+1, R13 |              |                  |
| CLR R16      |              | Multiprecision a |

Multiprecision arithmetic 17

#### Even better...?



From the Treviso Arithmetic, 1478
(http://www.republicaveneta.com/doc/abaco.pdf)

# Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
  - Inner loop performs operand scanning
  - Outer loop performs product scanning

# Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
  - Inner loop performs operand scanning
  - Outer loop performs product scanning
- Originally proposed by Gura, Patel, Wander, Eberle, Chang Shantz, 2004

## Hybrid multiplication

- Idea: Chop whole multiplication into smaller blocks
- Compute each of the smaller multiplications by schoolbook
- Later add up to the full result
- See it as two nested loops:
  - Inner loop performs operand scanning
  - Outer loop performs product scanning
- Originally proposed by Gura, Patel, Wander, Eberle, Chang Shantz, 2004
- ▶ Various improvements, consider 160-bit multiplication:
  - ▶ Originally: 3106 cycles
  - ▶ Uhsadel, Poschmann, Paar (2007): 2881 cycles
  - ▶ Scott, Szczechowiak (2007): 2651 cycles
  - ▶ Kargl, Pyka, Seuschek (2008): 2593 cycles

# Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer

# Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer
- Performance:
  - ▶ 2393 cycles for 160-bit multiplication
  - ▶ 6121 cycles for 256-bit multiplication

# Operand-caching multiplication

- Hutter, Wenger, 2011: More efficient way to decompose multiplication
- Inside separate chunks use product-scanning
- Main idea: re-use values in registers for longer
- Performance:
  - ▶ 2393 cycles for 160-bit multiplication
  - ▶ 6121 cycles for 256-bit multiplication
- ► Followup-paper by Seo and Kim: "Consecutive operand caching":
  - ▶ 2341 cycles for 160-bit multiplication
  - ▶ 6115 cycles for 256-bit multiplication

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- ▶ Proven wrong by 23-year old student Karatsuba in 1960

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- ▶ Proven wrong by 23-year old student Karatsuba in 1960
- ▶ Idea: write  $A \cdot B$  as  $(A_0 + 2^m A_1)(B_0 + 2^m B_1)$  for half-size  $A_0, B_0, A_1, B_1$

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- ▶ Proven wrong by 23-year old student Karatsuba in 1960
- ▶ Idea: write  $A \cdot B$  as  $(A_0 + 2^m A_1)(B_0 + 2^m B_1)$  for half-size  $A_0, B_0, A_1, B_1$
- Compute

$$A_0B_0 + 2^m(A_0B_1 + B_0A_1) + 2^{2m}A_1B_1$$

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- ▶ Proven wrong by 23-year old student Karatsuba in 1960
- ▶ Idea: write  $A \cdot B$  as  $(A_0 + 2^m A_1)(B_0 + 2^m B_1)$  for half-size  $A_0, B_0, A_1, B_1$
- Compute

 $A_0B_0 + 2^m(A_0B_1 + B_0A_1) + 2^{2m}A_1B_1$ =  $A_0B_0 + 2^m((A_0 + A_1)(B_0 + B_1) - A_0B_0 - A_1B_1) + 2^{2m}A_1B_1$ 

- ▶ So far, multiplication of 2 n-byte numbers needs  $n^2$  MULs
- Kolmogorov conjectured 1952: You can't do better, multiplication has quadratic complexity
- ▶ Proven wrong by 23-year old student Karatsuba in 1960
- ▶ Idea: write  $A \cdot B$  as  $(A_0 + 2^m A_1)(B_0 + 2^m B_1)$  for half-size  $A_0, B_0, A_1, B_1$
- Compute

 $A_0B_0 + 2^m(A_0B_1 + B_0A_1) + 2^{2m}A_1B_1$ =  $A_0B_0 + 2^m((A_0 + A_1)(B_0 + B_1) - A_0B_0 - A_1B_1) + 2^{2m}A_1B_1$ 

• Recursive application yields  $\Theta(n^{\log_2 3})$  runtime

# Does that help on the AVR?



#### The straight-forward approach

Consider multiplication of n-byte numbers

$$A \stackrel{\circ}{=} (a_0, \dots, a_{n-1})$$
 and  
 $B \stackrel{\circ}{=} (b_0, \dots, b_{n-1})$ 

#### The straight-forward approach

Consider multiplication of n-byte numbers

$$A \stackrel{\circ}{=} (a_0, \dots, a_{n-1})$$
 and  
 $B \stackrel{\circ}{=} (b_0, \dots, b_{n-1})$ 

#### The straight-forward approach

Consider multiplication of n-byte numbers

$$A \stackrel{\circ}{=} (a_0, \dots, a_{n-1})$$
 and  
 $B \stackrel{\circ}{=} (b_0, \dots, b_{n-1})$ 

• Compute 
$$L = A_{\ell} \cdot B_{\ell} \stackrel{\circ}{=} (\ell_0, \dots, \ell_{n-1})$$

• Compute 
$$H = A_h \cdot B_h = (h_0, \dots, h_{n-1})$$

• Compute 
$$M = (A_{\ell} + A_h) \cdot (B_{\ell} + B_h) \hat{=} (m_0, \dots, m_n)$$

### The straight-forward approach

Consider multiplication of *n*-byte numbers

$$A \stackrel{\circ}{=} (a_0, \dots, a_{n-1})$$
 and  
 $B \stackrel{\circ}{=} (b_0, \dots, b_{n-1})$ 

• Compute 
$$L = A_{\ell} \cdot B_{\ell} \stackrel{\circ}{=} (\ell_0, \dots, \ell_{n-1})$$

• Compute 
$$H = A_h \cdot B_h \stackrel{\circ}{=} (h_0, \dots, h_{n-1})$$

- ► Compute  $M = (A_{\ell} + A_h) \cdot (B_{\ell} + B_h) \hat{=} (m_0, \dots, m_n)$
- ▶ Obtain result as  $A \cdot B = L + 2^{8k}(M L H) + 2^{8n}H$

## Multiplication by the carry in ${\cal M}$

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication

### Multiplication by the carry in ${\cal M}$

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication
- Does not help for recursive Karatsuba

### Multiplication by the carry in M

- Can expand carry to 0xff or 0x00
- Use AND instruction for multiplication
- Does not help for recursive Karatsuba

#### Subtractive Karatsuba

- Compute  $L = A_{\ell} \cdot B_{\ell} \stackrel{\circ}{=} (\ell_0, \dots, \ell_{n-1})$
- Compute  $H = A_h \cdot B_h \stackrel{\circ}{=} (h_0, \dots, h_{n-1})$
- Compute  $M = |A_{\ell} A_h| \cdot |B_{\ell} B_h| = (m_0, \dots, m_{n-1})$
- ▶ Set t = 0, if  $M = (A_{\ell} A_h) \cdot (B_{\ell} B_h)$ ; t = 1 otherwise
- Compute  $\hat{M} = (-1)^t M = (A_\ell A_h)(B_\ell B_h)$ =  $(\hat{m}_0, \dots, \hat{m}_{n-1})$
- Obtain result as  $A \cdot B = L + 2^{8k}(L + H \hat{M}) + 2^{8n}H$

### The easy solution

if(b) a = -a

#### The easy solution

if(b) a = -a

- NEG instruction does not help for multiprecision
- ► Can subtract from zero, but subtraction would overwrite zero

#### The easy solution

if(b) a = -a

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

#### The easy solution

if(b) a = -a

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

#### The constant-time solution

- Produce condition bit as byte 0xff or 0x00
- XOR all values with this condition byte

### The easy solution

if(b) a = -a

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

#### The constant-time solution

- Produce condition bit as byte 0xff or 0x00
- XOR all values with this condition byte
- Negate the condition byte and obtain 0x01 or 0x00
- Add this value to the lowest byte
- Ripple through the carry (ADC with zero)

#### The easy solution

if(b) a = -a

- NEG instruction does not help for multiprecision
- Can subtract from zero, but subtraction would overwrite zero
- Even worse, the if would create a timing side-channel!

#### The constant-time solution

- Produce condition bit as byte 0xff or 0x00
- XOR all values with this condition byte
- Don't negate the condition byte
- Subtract the condition byte (0xff or 0x00 from all bytes)
- Saves two NEG instructions

 $\blacktriangleright$  Consider example of  $4{\times}4\text{-byte}$  Karatsuba multiplication:

| $l_0$ | $l_1$ | $l_2$       | $l_3$       | $h_0$       | $h_1$       | $h_2$ | $h_3$ |
|-------|-------|-------------|-------------|-------------|-------------|-------|-------|
|       | -     | $\hat{m}_0$ | $\hat{m}_1$ | $\hat{m}_2$ | $\hat{m}_3$ |       |       |
|       | +     | $l_0$       | $l_1$       | $l_2$       | $l_3$       |       |       |
|       | +     | $h_0$       | $h_1$       | $h_2$       | $h_3$       |       |       |

 $\blacktriangleright$  Consider example of  $4{\times}4\text{-byte}$  Karatsuba multiplication:



- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once

► Consider example of 4×4-byte Karatsuba multiplication:



- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of H
- Compute  $\mathbf{H} \stackrel{.}{=} (\mathbf{h_0}, \mathbf{h_1}, \mathbf{h_2}, \mathbf{h_3}) = H + (l_2, l_3)$
- ▶ Note that H cannot "overflow"

► Consider example of 4×4-byte Karatsuba multiplication:



- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of H
- Compute  $\mathbf{H} \stackrel{.}{=} (\mathbf{h_0}, \mathbf{h_1}, \mathbf{h_2}, \mathbf{h_3}) = H + (l_2, l_3)$

| Note  | that  | : H ca      | nnot '      | 'overfl     | ow"         |       |       |
|-------|-------|-------------|-------------|-------------|-------------|-------|-------|
| $l_0$ | $l_1$ |             |             | ho          | $h_1$       | $h_2$ | $h_3$ |
|       | -     | $\hat{m}_0$ | $\hat{m}_1$ | $\hat{m}_2$ | $\hat{m}_3$ |       |       |
|       | +     | $l_0$       | $l_1$       |             |             |       |       |
|       | +     | $h_0$       | $h_1$       | $h_2$       | $h_3$       |       |       |

 $\blacktriangleright$  Consider example of  $4 \times 4$ -byte Karatsuba multiplication:



- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of H
- Compute  $\mathbf{H} = (\mathbf{h_0}, \mathbf{h_1}, \mathbf{h_2}, \mathbf{h_3}) = H + (l_2, l_3)$

 $h_2$ 

| Note  | that  | H ca           | nnot '      | 'overfl        | ow"         |                |  |
|-------|-------|----------------|-------------|----------------|-------------|----------------|--|
| $l_0$ | $l_1$ | $\mathbf{h_0}$ | $h_1$       | $\mathbf{h_0}$ | $h_1$       | $\mathbf{h_2}$ |  |
|       | -     | $\hat{m}_0$    | $\hat{m}_1$ | $\hat{m}_2$    | $\hat{m}_3$ |                |  |

 $l_1$ 

+

 $l_0$ 

h<sub>3</sub>

 $h_3$ 

► Consider example of 4×4-byte Karatsuba multiplication:



- Karatsuba performs some additions twice
- Refined Karatsuba: do them only once
- Merge additions into computation of H
- Compute  $\mathbf{H} \stackrel{.}{=} (\mathbf{h_0}, \mathbf{h_1}, \mathbf{h_2}, \mathbf{h_3}) = H + (l_2, l_3)$

| Note  | that  | : H ca      | nnot '      | 'overfl     | ow"         |                |       |
|-------|-------|-------------|-------------|-------------|-------------|----------------|-------|
| $l_0$ | $l_1$ | $h_0$       | $h_1$       | $h_0$       | $h_1$       | $\mathbf{h_2}$ | $h_3$ |
|       | -     | $\hat{m}_0$ | $\hat{m}_1$ | $\hat{m}_2$ | $\hat{m}_3$ |                |       |
|       | +     | $l_0$       | $l_1$       | $h_2$       | $h_3$       |                |       |

► Consequence: fewer additions, easier register allocation

### Arithmetic cost of *n*-byte Karatsuba on AVR

• Cost of computing L, M, and  $\mathbf{H}$ 

- $\blacktriangleright$  Cost of computing L,~M, and  ${\bf H}$
- ▶ 4k + 2 SUB/SBC, 2k EOR for absolute differences

- Cost of computing L, M, and  $\mathbf{H}$
- ▶ 4k + 2 SUB/SBC, 2k EOR for absolute differences
- ▶ n+1 ADD/ADC to add  $(l_0, \ldots, l_{k-1}, \mathbf{h_k}, \ldots, \mathbf{h_{n-1}})$

- Cost of computing L, M, and  $\mathbf{H}$
- ▶ 4k + 2 SUB/SBC, 2k EOR for absolute differences
- ▶ n+1 ADD/ADC to add  $(l_0, \ldots, l_{k-1}, \mathbf{h_k}, \ldots, \mathbf{h_{n-1}})$
- One EOR to compute t
- A BRNE instruction to branch, then either

- Cost of computing L, M, and  $\mathbf{H}$
- ▶ 4k + 2 SUB/SBC, 2k EOR for absolute differences
- ▶ n+1 ADD/ADC to add  $(l_0, \ldots, l_{k-1}, \mathbf{h_k}, \ldots, \mathbf{h_{n-1}})$
- One EOR to compute t
- ▶ A BRNE instruction to branch, then either
  - ▶ n + 2 SUB/SBC instructions and one RJMP, or
  - ▶ n + 1 ADD/ADC, one CLR, and one NOP

- Cost of computing L, M, and  $\mathbf{H}$
- ▶ 4k + 2 SUB/SBC, 2k EOR for absolute differences
- ▶ n+1 ADD/ADC to add  $(l_0, \ldots, l_{k-1}, \mathbf{h_k}, \ldots, \mathbf{h_{n-1}})$
- One EOR to compute t
- A BRNE instruction to branch, then either
  - n+2 SUB/SBC instructions and one RJMP, or
  - ▶ n+1 ADD/ADC, one CLR, and one NOP
- $\blacktriangleright$  k ADD/ADC instructions to ripple carry to the end

### $48\text{-bit}\ \text{Karatsuba}\ \text{on}\ \text{AVR}$

| CLR R22       | MUL R3, R7   | LD R14, X+   | EOR R2, R26 |
|---------------|--------------|--------------|-------------|
| CLR R23       | MOVW R14, RO | LD R15, X+   | EOR R3, R26 |
| MOVW R12, R22 | MUL R3, R5   | LD R16, X+   | EOR R4, R26 |
| MOVW R20, R22 | ADD R9, RO   | LDD R17, Y+3 | EOR R5, R27 |
|               | ADC R10, R1  | LDD R18, Y+4 | EOR R6, R27 |
| LD R2, X+     | ADC R11, R14 | LDD R19, Y+5 | EOR R7, R27 |
| LD R3, X+     | ADC R15, R23 |              |             |
| LD R4, X+     | MUL R3, R6   | SUB R2, R14  | SUB R2, R26 |
| LDD R5, Y+O   | ADD R10, RO  | SBC R3, R15  | SBC R3, R26 |
| LDD R6, Y+1   | ADC R11, R1  | SBC R4, R16  | SBC R4, R26 |
| LDD R7, Y+2   | ADC R12, R15 | SBC R26, R26 | SUB R5, R27 |
|               |              |              | SBC R6, R27 |
| MUL R2, R7    | MUL R4, R7   | SUB R5, R17  | SBC R7, R27 |
| MOVW R10, RO  | MOVW R14, RO | SBC R6, R18  |             |
| MUL R2, R5    | MUL R4, R5   | SBC R7, R19  |             |
| MOVW R8, RO   | ADD R10, RO  | SBC R27, R27 |             |
| MUL R2, R6    | ADC R11, R1  |              |             |
| ADD R9, RO    | ADC R12, R14 |              |             |
| ADC R10, R1   | ADC R15, R23 |              |             |
| ADC R11, R23  | MUL R4, R6   |              |             |
|               | ADD R11, RO  |              |             |
|               | ADC R12, R1  |              |             |
|               | ADC R13, R15 |              |             |
|               | STD Z+0, R8  |              |             |
|               | STD Z+1, R9  |              |             |
|               | STD Z+2, R10 |              |             |

### $48\text{-bit}\ \text{Karatsuba}\ \text{on}\ \text{AVR}$

| MUL<br>MOVV<br>MUL<br>ADD<br>ADC<br>ADC<br>ADC<br>ADC<br>ADC<br>ADD<br>ADC<br>ADC | R12,<br>R13,<br>R25,<br>R14,<br>R12, | R19<br>R0<br>R17<br>R0<br>R1<br>R24<br>R23<br>R18<br>R0<br>R1<br>R25 |
|-----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------------------|
| MOVV<br>MUL                                                                       | R20,<br>R25,<br>R15,<br>R13,         | R19<br>R0<br>R17<br>R0<br>R1<br>R24<br>R23<br>R18<br>R0<br>R1<br>R25 |

| MUL R16, | R19  |
|----------|------|
| MOVW R24 | , RO |
| MUL R16, | R17  |
| ADD R13, | RO   |
| ADC R20, |      |
| ADC R21, | R24  |
| ADC R25, | R23  |
| MUL R16, | R18  |
| MOVW R18 | ,R22 |
| ADD R20, | RO   |
| ADC R21, | R1   |
| ADC R22, | R25  |

| MUL R2, R7<br>MOVW R16, R0<br>MUL R2, R5<br>MOVW R14, R0                                              |
|-------------------------------------------------------------------------------------------------------|
| MUL R2, R5                                                                                            |
| MOVW R14, RO                                                                                          |
| MUL R2, R6                                                                                            |
| ADD R15, RO                                                                                           |
| ADC R16, R1                                                                                           |
| ADC R17, R23                                                                                          |
|                                                                                                       |
| MUL R3, R7<br>MOVW R24, RO                                                                            |
|                                                                                                       |
| MOVW R24, RO                                                                                          |
| MOVW R24, RO<br>MUL R3, R5                                                                            |
| MOVW R24, RO<br>MUL R3, R5<br>ADD R15, RO                                                             |
| MUL R3, R5                                                                                            |
| MUL R3, R5<br>ADD R15, R0<br>ADC R16, R1                                                              |
| MUL R3, R5<br>ADD R15, R0<br>ADC R16, R1<br>ADC R17, R24                                              |
| MUL R3, R5<br>ADD R15, R0<br>ADC R16, R1<br>ADC R17, R24<br>ADC R25, R23<br>MUL R3, R6                |
| MUL R3, Ŕ5<br>ADD R15, R0<br>ADC R16, R1<br>ADC R17, R24<br>ADC R25, R23<br>MUL R3, R6<br>ADD R16, R0 |
| MUL R3, Ŕ5<br>ADD R15, R0<br>ADC R16, R1<br>ADC R17, R24<br>ADC R25, R23<br>MUL R3, R6<br>ADD R16, R0 |
| MUL R3, R5<br>ADD R15, R0<br>ADC R16, R1<br>ADC R17, R24<br>ADC R25, R23<br>MUL R3, R6                |

MUL R4, R7 MOVW R24, R0 MUL R4, R5 ADD R16, R0 ADC R17, R1 ADC R18, R24 ADC R25, R23 MUL R4, R6 ADD R17, R0 ADC R18, R1 ADC R19, R25

### $48\text{-bit}\ \text{Karatsuba}$ on AVR

| ADD R8, R11  | add_M:<br>ADD R8, R14<br>ADC R9, R15<br>ADC R10, R16<br>ADC R11, R17<br>ADC R12, R18<br>ADC R13, R19<br>CLR R24 |
|--------------|-----------------------------------------------------------------------------------------------------------------|
| ADC R9, R12  | ADD R8, R14                                                                                                     |
| ADC R10, R13 | ADC R9, R15                                                                                                     |
| ADC R11, R20 | ADC R10, R16                                                                                                    |
| ADC R12, R21 | ADC R11, R17                                                                                                    |
| ADC R13, R22 | ADC R12, R18                                                                                                    |
| ADC R23, R23 | ADC R13, R19                                                                                                    |
|              | CLR R24                                                                                                         |
| EOR R26, R27 | ADC R23, R24<br>NOP                                                                                             |
| BRNE add_M   | NOP                                                                                                             |
|              |                                                                                                                 |
| SUB R8, R14  | final:                                                                                                          |
| SBC R9, R15  | STD Z+3, R8                                                                                                     |
| SBC R10, R16 | STD Z+4, R9                                                                                                     |
| SBC R11, R17 | STD Z+5, R10                                                                                                    |
| SBC R12, R18 | STD Z+6, R11                                                                                                    |
| SBC R13, R19 | STD Z+7, R12                                                                                                    |
| SBCI R23, 0  | STD Z+3, R8<br>STD Z+4, R9<br>STD Z+5, R10<br>STD Z+6, R11<br>STD Z+7, R12<br>STD Z+8, R13                      |
| SBC R24, R24 |                                                                                                                 |
| RJMP final   | ADD R20, R23                                                                                                    |
|              | 1100 1021, 1021                                                                                                 |
|              | ADC R22, R24                                                                                                    |
|              |                                                                                                                 |
|              | STD Z+9, R20                                                                                                    |
|              | STD Z+10, R21                                                                                                   |
|              | STD Z+11, R22                                                                                                   |
|              |                                                                                                                 |

### Larger Karatsuba multiplication

- ▶ 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores

### Larger Karatsuba multiplication

- ▶ 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores
- Karatsuba structure needs additional temporary storage
- ► Good performance needs careful scheduling and register allocation
- ▶ Very important is to compute  $\mathbf{H} = H + (l_{k+1}, \dots, l_{n-1})$  on the fly

### Larger Karatsuba multiplication

- ▶ 48-bit Karatsuba is friendly; everything fits into registers
- Remember that previous speed records were achieved by eliminating loads/stores
- Karatsuba structure needs additional temporary storage
- Good performance needs careful scheduling and register allocation
- ▶ Very important is to compute  $\mathbf{H} = H + (l_{k+1}, \dots, l_{n-1})$  on the fly
- ▶ Use 1-level Karatsuba for 48-bit, 64-bit, 80-bit, 96-bit inputs
- ▶ Use 2-level Karatsuba for 128-bit, 160-bit, 192-bit inputs
- Use 3-level Karatsuba for 256-bit inputs

### Results

### Cycle counts for n-bit multiplication

|                       |     | Input size n |     |     |      |      |      |      |
|-----------------------|-----|--------------|-----|-----|------|------|------|------|
| Approach              | 48  | 64           | 80  | 96  | 128  | 160  | 192  | 256  |
| Product scanning:     | 235 | 395          | 595 | 836 |      |      |      |      |
| Hutter, Wenger, 2011: | _   |              | —   |     |      | 2393 | 3467 | 6121 |
| Seo, Kim, 2012:       | _   | —            | —   |     | 1532 | 2356 | 3464 | 6180 |
| Seo, Kim, 2013:       | _   | —            | —   |     | 1523 | 2341 | 3437 | 6115 |
| Karatsuba:            | 217 | 360          | 522 | 780 | 1325 | 1976 | 2923 | 4797 |
| — w/o branches:       | 222 | 368          | 533 | 800 | 1369 | 2030 | 2987 | 4961 |

▶ 160-bit multiplication now > 18% faster

▶ 256-bit multiplication now > 23% faster

### Resources online

#### Paper:

Michael Hutter, Peter Schwabe. "Multiprecision multiplication on AVR revisited".

http://cryptojedi.org/papers/#avrmul

Software: http://cryptojedi.org/crypto/#avrmul

### Resources online

#### Paper:

Michael Hutter, Peter Schwabe. "Multiprecision multiplication on AVR revisited".

http://cryptojedi.org/papers/#avrmul

Software: http://cryptojedi.org/crypto/#avrmul

Code generator for operand-scanning, product-scanning, hybrid, and operand-caching by Hutter and Wenger: http://mhutter.org/research/avr/index.htm#mulopcache

# PART II

Joint work with Daniel J. Bernstein, Chitchanok Chuengsatiansup, and Tanja Lange

### Main differences (for us)

▶ Arithmetic on larger (64-bit) integers

### Main differences (for us)

- ▶ Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers

### Main differences (for us)

- ▶ Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers
- Arithmetic on vectors

### Main differences (for us)

- ▶ Arithmetic on larger (64-bit) integers
- Arithmetic on floating-point numbers
- Arithmetic on vectors
- Pipelined and superscalar execution

# $\mathsf{Radix}\text{-}2^{64}$ representation

- $\blacktriangleright$  Let's consider representing 255-bit integers
- Obvious choice: use 4 64-bit integers  $a_0, a_1, a_2, a_3$  with

$$A = \sum_{i=0}^{3} a_i 2^{64i}$$

Arithmetic works just as before (except with larger registers)

# $\mathsf{Radix}\text{-}2^{51}$ representation

- $\blacktriangleright$  Radix- $2^{64}$  representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries

# $\mathsf{Radix}\text{-}2^{51}$ representation

- $\blacktriangleright$  Radix-2<sup>64</sup> representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- ► Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!)

## $\mathsf{Radix}\text{-}2^{51}$ representation

- $\blacktriangleright$  Radix-2<sup>64</sup> representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- ► Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)

#### Radix- $2^{51}$ representation

- $\blacktriangleright$  Radix-2<sup>64</sup> representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- ► Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent A as  $(a_0, a_1, a_2, a_3, a_4)$  with

$$A = \sum_{i=0}^{4} a_i 2^{51 \cdot i}$$

▶ This is called radix-2<sup>51</sup> representation

#### Radix- $2^{51}$ representation

- $\blacktriangleright$  Radix-2<sup>64</sup> representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- ► Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent A as  $(a_0, a_1, a_2, a_3, a_4)$  with

$$A = \sum_{i=0}^{4} a_i 2^{51 \cdot i}$$

- $\blacktriangleright$  This is called radix-2<sup>51</sup> representation
- Multiple ways to write the same integer A, for example  $A = 2^{52}$ :
  - $\blacktriangleright$  (2<sup>52</sup>, 0, 0, 0, 0)
  - (0, 2, 0, 0, 0)

#### Radix- $2^{51}$ representation

- $\blacktriangleright$  Radix-2<sup>64</sup> representation works and is sometimes a good choice
- Highly depends on the efficiency of handling carries
- ► Example 1: Intel Nehalem can do 3 additions every cycle, but only 1 addition with carry every two cycles (carries cost a factor of 6!)
- Example 2: When using vector arithmetic, carries are typically lost (very expensive to recompute)
- Let's get rid of the carries, represent A as  $(a_0, a_1, a_2, a_3, a_4)$  with

$$A = \sum_{i=0}^{4} a_i 2^{51 \cdot i}$$

- ▶ This is called radix-2<sup>51</sup> representation
- Multiple ways to write the same integer A, for example  $A = 2^{52}$ :
  - $\blacktriangleright (2^{52}, 0, 0, 0, 0)$
  - (0, 2, 0, 0, 0)

# ▶ Let's call a representation $(a_0, a_1, a_2, a_3, a_4)$ reduced, if all $a_i \in [0, \dots, 2^{52} - 1]$

```
typedef struct{
   unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                             const bigint255 *x,
                             const bigint255 *y)
ł
   r \rightarrow a[0] = x \rightarrow a[0] + y \rightarrow a[0];
   r \rightarrow a[1] = x \rightarrow a[1] + y \rightarrow a[1];
   r \rightarrow a[2] = x \rightarrow a[2] + y \rightarrow a[2];
   r > a[3] = x - a[3] + y - a[3];
   r \rightarrow a[4] = x \rightarrow a[4] + y \rightarrow a[4];
}
```

```
typedef struct{
   unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                             const bigint255 *x,
                             const bigint255 *y)
ł
   r \rightarrow a[0] = x \rightarrow a[0] + y \rightarrow a[0];
   r \rightarrow a[1] = x \rightarrow a[1] + y \rightarrow a[1];
   r \rightarrow a[2] = x \rightarrow a[2] + y \rightarrow a[2];
   r > a[3] = x - a[3] + y - a[3];
   r \rightarrow a[4] = x \rightarrow a[4] + y \rightarrow a[4];
}
```

This definitely works for reduced inputs

```
typedef struct{
   unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                             const bigint255 *x,
                             const bigint255 *y)
ł
   r \rightarrow a[0] = x \rightarrow a[0] + y \rightarrow a[0];
   r \rightarrow a[1] = x \rightarrow a[1] + y \rightarrow a[1];
   r \rightarrow a[2] = x \rightarrow a[2] + y \rightarrow a[2];
   r > a[3] = x - a[3] + y - a[3];
   r \rightarrow a[4] = x \rightarrow a[4] + y \rightarrow a[4];
}
```

This definitely works for reduced inputs

▶ This actually works as long as all coefficients are in  $[0, \ldots, 2^{63} - 1]$ 

```
typedef struct{
   unsigned long long a[5];
} bigint255;
void bigint255_add(bigint255 *r,
                             const bigint255 *x,
                             const bigint255 *y)
ł
   r \rightarrow a[0] = x \rightarrow a[0] + y \rightarrow a[0];
   r \rightarrow a[1] = x \rightarrow a[1] + y \rightarrow a[1];
   r \rightarrow a[2] = x \rightarrow a[2] + y \rightarrow a[2];
   r > a[3] = x - a[3] + y - a[3];
   r \rightarrow a[4] = x \rightarrow a[4] + y \rightarrow a[4];
}
```

- This definitely works for reduced inputs
- ▶ This actually works as long as all coefficients are in  $[0, \ldots, 2^{63} 1]$
- ▶ We can do quite a few additions before we have to carry (reduce)

#### Subtraction of two bigint255

```
typedef struct{
   signed long long a[5];
} bigint255;
void bigint255_sub(bigint255 *r,
                             const bigint255 *x,
                             const bigint255 *y)
ł
   r \rightarrow a[0] = x \rightarrow a[0] - y \rightarrow a[0];
   r \rightarrow a[1] = x \rightarrow a[1] - y \rightarrow a[1];
   r > a[2] = x - a[2] - y - a[2];
  r \rightarrow a[3] = x \rightarrow a[3] - y \rightarrow a[3];
  r \rightarrow a[4] = x \rightarrow a[4] - y \rightarrow a[4];
}
```

 Slightly update our bigint255 definition to work with signed 64-bit integers

#### Subtraction of two bigint255

```
typedef struct{
  signed long long a[5];
} bigint255;
void bigint255_sub(bigint255 *r,
                           const bigint255 *x,
                           const bigint255 *y)
ł
  r \rightarrow a[0] = x \rightarrow a[0] - y \rightarrow a[0];
  r \rightarrow a[1] = x \rightarrow a[1] - y \rightarrow a[1];
  r > a[2] = x - a[2] - y - a[2];
  r > a[3] = x - a[3] - y - a[3];
  r \rightarrow a[4] = x \rightarrow a[4] - y \rightarrow a[4];
}
```

 Slightly update our bigint255 definition to work with signed 64-bit integers

• Reduced if coefficients are in  $[-2^{52}+1, 2^{52}-1]$ 

# Carrying in radix- $2^{51}$

- $\blacktriangleright$  With many additions, coefficients may grow larger than 63 bits
- They grow even faster with multiplication

# Carrying in radix- $2^{51}$

- $\blacktriangleright$  With many additions, coefficients may grow larger than 63 bits
- They grow even faster with multiplication
- Eventually we have to carry en bloc:

```
signed long long carry = r.a[0] >> 51;
r.a[1] += carry;
carry <<= 51;
r.a[0] -= carry;
```

► Note: Addition code would look *exactly* the same for 5-coefficient polynomial addition

- ► Note: Addition code would look *exactly* the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in  $\mathbb{Z}[x]$
- ▶ Inputs to addition are 5-coefficient polynomials

- ► Note: Addition code would look *exactly* the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in  $\mathbb{Z}[x]$
- ▶ Inputs to addition are 5-coefficient polynomials
- Nice thing about arithmetic  $\mathbb{Z}[x]$ : no carries!

- ► Note: Addition code would look *exactly* the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in  $\mathbb{Z}[x]$
- ▶ Inputs to addition are 5-coefficient polynomials
- ▶ Nice thing about arithmetic Z[x]: no carries!
- ► To go from Z[x] to Z, evaluate at the radix (this is a ring homomorphism)
- Carrying means evaluating at the radix

- ► Note: Addition code would look *exactly* the same for 5-coefficient polynomial addition
- This is no coincidence: We actually perform arithmetic in  $\mathbb{Z}[x]$
- ▶ Inputs to addition are 5-coefficient polynomials
- Nice thing about arithmetic  $\mathbb{Z}[x]$ : no carries!
- ► To go from Z[x] to Z, evaluate at the radix (this is a ring homomorphism)
- Carrying means evaluating at the radix
- Thinking of multiprecision integers as polynomials is very powerful for efficient arithmetic

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

- For double-precision floats:
  - ▶  $s \in \{0, 1\}$  "sign bit"
  - m = 52 "mantissa bits"
  - $e \in \{1, ..., 2046\}$  "exponent"
  - ▶ t = 1023

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

- For double-precision floats:
  - ▶  $s \in \{0, 1\}$  "sign bit"
  - m = 52 "mantissa bits"
  - $e \in \{1, ..., 2046\}$  "exponent"
  - ▶ t = 1023
- ► For single-precision floats:
  - ▶  $s \in \{0,1\}$  "sign bit"
  - ▶ m = 23 "mantissa bits"
  - ▶  $e \in \{1, \dots, 254\}$  "exponent"
  - ▶ t = 127

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

 $(-1)^{s} \cdot (1.b_{m-1}b_{m-2}\dots b_0) \cdot 2^{e-t}$  with  $b_i \in \{0,1\}$ 

- For double-precision floats:
  - ▶  $s \in \{0, 1\}$  "sign bit"
  - m = 52 "mantissa bits"
  - $e \in \{1, ..., 2046\}$  "exponent"
  - ▶ t = 1023
- For single-precision floats:
  - ▶  $s \in \{0,1\}$  "sign bit"
  - ▶ m = 23 "mantissa bits"
  - ▶  $e \in \{1, \dots, 254\}$  "exponent"
  - ▶ t = 127

• Exponent = 0 used to represent 0

- On some microarchitectures floating-point arithmetic is much faster than integer arithmetic
- An IEEE-754 floating-point number has value

- ► For double-precision floats:
  - ▶  $s \in \{0, 1\}$  "sign bit"
  - m = 52 "mantissa bits"
  - $e \in \{1, ..., 2046\}$  "exponent"
  - ▶ t = 1023
- For single-precision floats:
  - ▶  $s \in \{0,1\}$  "sign bit"
  - ▶ m = 23 "mantissa bits"
  - ▶  $e \in \{1, \dots, 254\}$  "exponent"
  - ▶ t = 127
- Exponent = 0 used to represent 0
- Any number that can be represented like this, will be precise
- > Other numbers will be *rounded*, according to a rounding mode

#### Addition and subtraction

```
typedef struct{
  double a[12];
} bigint255;
void bigint255_add(bigint255 *r,
                     const bigint255 *x,
                     const bigint255 *y)
{
  int i;
  for(i=0;i<12;i++)</pre>
    r - a[i] = x - a[i] + y - a[i];
}
void bigint255_sub(bigint255 *r,
                     const bigint255 *x,
                     const bigint255 *y)
Ł
  int i;
  for(i=0;i<12;i++)</pre>
    r - a[i] = x - a[i] - y - a[i];
}
```



▶ For carrying integers we used a right shift (discard lowest bits)

# Carrying

- ▶ For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix  $2^{22}$ , multiply by  $2^{-22}$
- This does not cut off lowest bits, need to round

# Carrying

- ▶ For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix  $2^{22}$ , multiply by  $2^{-22}$
- This does not cut off lowest bits, need to round
- ▶ Some processors have efficient rounding instructions, e.g., vroundpd

# Carrying

- ▶ For carrying integers we used a right shift (discard lowest bits)
- For floating-point numbers we can use multiplication by the inverse of the radix
- Example: Radix  $2^{22}$ , multiply by  $2^{-22}$
- This does not cut off lowest bits, need to round
- ▶ Some processors have efficient rounding instructions, e.g., vroundpd
- Otherwise (for double-precision):
  - add constant  $2^{52} + 2^{51}$
  - subtract constant  $2^{52} + 2^{51}$
  - This will round the number to an integer according to the rounding mode (to nearest, towards zero, away from zero, or truncate)

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- ▶ Much larger computational power, e.g., Intel Nehalem:

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- ▶ Much larger computational power, e.g., Intel Nehalem:
  - ► 32-bit load throughput: 1 per cycle
  - 32-bit add throughput: 3 per cycle
  - 32-bit store throughput: 1 per cycle

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- ▶ Much larger computational power, e.g., Intel Nehalem:
  - ► 32-bit load throughput: 1 per cycle
  - 32-bit add throughput: 3 per cycle
  - 32-bit store throughput: 1 per cycle
  - 128-bit load throughput: 1 per cycle
  - ▶ 4× 32-bit add throughput: 2 per cycle
  - 128-bit store throughput: 1 per cycle

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
  - 32-bit load throughput: 1 per cycle
  - 32-bit add throughput: 3 per cycle
  - 32-bit store throughput: 1 per cycle
  - 128-bit load throughput: 1 per cycle
  - ▶ 4× 32-bit add throughput: 2 per cycle
  - 128-bit store throughput: 1 per cycle

Vector instructions are almost as fast as scalar instructions but do 4× the work

- Most modern processors have vector units
- Work on vectors of floats or ints instead of scalars
- Much larger computational power, e.g., Intel Nehalem:
  - 32-bit load throughput: 1 per cycle
  - 32-bit add throughput: 3 per cycle
  - 32-bit store throughput: 1 per cycle
  - 128-bit load throughput: 1 per cycle
  - ▶ 4× 32-bit add throughput: 2 per cycle
  - 128-bit store throughput: 1 per cycle
- Vector instructions are almost as fast as scalar instructions but do 4× the work
- Some things are cost extra:
  - Variably indexed loads (lookups) into vectors
  - Data-dependent branches are expensive in SIMD
  - Shuffeling entries in vectors

## Vectorizing EC scalar multiplication

#### Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set

# Vectorizing EC scalar multiplication

#### Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set

#### Parallelism inside multiprecision arithmetic

- Addition (in redundant representation) is trivially vectorized
- Vectorizing multiplication needs many shuffles
- Vectorization "eats up" instruction-level parallelism

# Vectorizing EC scalar multiplication

#### Computing multiple scalar multiplications

- Changes the rules of the game
- Increases size of active data set

#### Parallelism inside multiprecision arithmetic

- Addition (in redundant representation) is trivially vectorized
- Vectorizing multiplication needs many shuffles
- Vectorization "eats up" instruction-level parallelism

#### Parallelism inside EC arithmetic

- Vectorize independent multiplications in EC addition
- May still need some shuffles (after each block of operations)
- Efficiency depends on EC formulas

# Example: Montgomery ladder

function ladderstep(
$$x_{Q-P}, X_P, Z_P, X_Q, Z_Q$$
)  
 $t_1 \leftarrow X_P + Z_P$   
 $t_6 \leftarrow t_1^2$   
 $t_2 \leftarrow X_P - Z_P$   
 $t_7 \leftarrow t_2^2$   
 $t_5 \leftarrow t_6 - t_7$   
 $t_3 \leftarrow X_Q + Z_Q$   
 $t_4 \leftarrow X_Q - Z_Q$   
 $t_8 \leftarrow t_4 \cdot t_1$   
 $t_9 \leftarrow t_3 \cdot t_2$   
 $X_{P+Q} \leftarrow (t_8 + t_9)^2$   
 $Z_{P+Q} \leftarrow x_{Q-P} \cdot (t_8 - t_9)^2$   
 $X_{[2]P} \leftarrow t_5 \cdot (t_7 + ((A+2)/4) \cdot t_5)$   
return ( $X_{[2]P}, Z_{[2]P}, X_{P+Q}, Z_{P+Q}$ )  
end function

# Example: Montgomery ladder

**function** ladderstep
$$(x_{Q-P}, X_P, Z_P, X_Q, Z_Q)$$
  
 $t_1 \leftarrow X_P + Z_P; t_2 \leftarrow X_P - Z_P; t_3 \leftarrow X_Q + Z_Q; t_4 \leftarrow X_Q - Z_Q$   
 $t_6 \leftarrow t_1 \cdot t_1; t_7 \leftarrow t_2 \cdot t_2; t_8 \leftarrow t_4 \cdot t_1; t_9 \leftarrow t_3 \cdot t_2$   
 $t_{10} \leftarrow ((A+2)/4) \cdot t_6$   
 $t_{11} \leftarrow ((A+2)/4 - 1) \cdot t_7$   
 $t_5 \leftarrow t_6 - t_7; t_4 \leftarrow t_{10} - t_{11}; t_1 \leftarrow t_8 - t_9; t_0 \leftarrow t_8 + t_9$   
 $Z_{[2]P} \leftarrow t_5 \cdot t_4; X_{P+Q} \leftarrow t_0^2; X_{[2]P} \leftarrow t_6 \cdot t_7; t_2 \leftarrow t_1 \cdot t_1$   
 $Z_{P+Q} \leftarrow x_{Q-P} \cdot t_2$ 

return  $(X_{[2]P}, Z_{[2]P}, X_{P+Q}, Z_{P+Q})$  end function

 Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
  - Group modulo negation
  - Map from group to Kummer surface by rational map X
  - Elements represented projectively as (x : y : z : t)
  - (x:y:z:t) = (rx:ry:rz:rt) for any  $r \neq 0$
  - Efficient doubling and efficient differential addition

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
  - Group modulo negation
  - Map from group to Kummer surface by rational map X
  - Elements represented projectively as (x : y : z : t)
  - (x:y:z:t) = (rx:ry:rz:rt) for any  $r \neq 0$
  - Efficient doubling and efficient differential addition
- Ladderstep: gets as input  $X(P) = (x_2 : y_2 : z_2 : t_2)$ ,  $X(Q) = (x_3 : y_3 : z_3 : t_3)$ , and  $X(Q - P) = (x_1 : y_1 : z_1 : t_1)$ 
  - Computes  $X(2P) = (x_4 : y_4 : z_4 : t_4)$
  - Computes  $X(P+Q) = (x_5 : y_5 : z_5 : t_5)$

- Think of a Kummer surface as the Jacobian of a hyperelliptic curve modulo negation
- Easier way to think about it:
  - Group modulo negation
  - Map from group to Kummer surface by rational map X
  - Elements represented projectively as (x : y : z : t)
  - (x:y:z:t) = (rx:ry:rz:rt) for any  $r \neq 0$
  - Efficient doubling and efficient differential addition
- Ladderstep: gets as input  $X(P) = (x_2 : y_2 : z_2 : t_2)$ ,  $X(Q) = (x_3 : y_3 : z_3 : t_3)$ , and  $X(Q - P) = (x_1 : y_1 : z_1 : t_1)$ 
  - Computes  $X(2P) = (x_4 : y_4 : z_4 : t_4)$
  - Computes  $X(P+Q) = (x_5 : y_5 : z_5 : t_5)$
- Coordinates are elements of a (large) finite field
- Efficient Kummer surface arithmetic wants efficient multiprecision arithmetic

## Arithmetic on the Kummer surface



 $10\mathbf{M} + 9\mathbf{S} + 6\mathbf{m}$  ladder formulas

#### Arithmetic on the Kummer surface



 $10\mathbf{M} + 9\mathbf{S} + 6\mathbf{m}$  ladder formulas



 $7\mathbf{M} + 12\mathbf{S} + 9\mathbf{m}$  ladder formulas

# The "squared Kummer surface"

- In fact, we use arithmetic on a different, "squared" surface
- ▶ Each point (x:y:z:t) on the original surface corresponds to  $(x^2:y^2:z^2:t^2)$  on the squared surface
- No operation-count advantages
- Easier to construct squared surface with small constants
- $\blacktriangleright$  In the following rename  $(x^2:y^2:z^2:t^2)$  to (x:y:z:t)

## Arithmetic on the squared Kummer surface



 $10\mathbf{M} + 9\mathbf{S} + 6\mathbf{m}$  ladder formulas

#### Arithmetic on the squared Kummer surface



 $10\mathbf{M} + 9\mathbf{S} + 6\mathbf{m}$  ladder formulas



7M + 12S + 9m ladder formulas

## Arithmetic on the (original) Kummer surface







7M + 12S + 9m ladder formulas

▶ Formulas for efficient Kummer surface arithmetic known for a while

- Originally proposed by Chudnovsky, Chudnovsky, 1986
- ▶ 10M + 9S + 6m formulas by Gaudry, 2006
- ▶ 7M + 12S + 9m formulas by Bernstein, 2006

▶ Formulas for efficient Kummer surface arithmetic known for a while

- ▶ Originally proposed by Chudnovsky, Chudnovsky, 1986
- ▶ 10M + 9S + 6m formulas by Gaudry, 2006
- ▶ 7M + 12S + 9m formulas by Bernstein, 2006

▶ Problem: find cryptographically secure surface with small constants

Formulas for efficient Kummer surface arithmetic known for a while

- ▶ Originally proposed by Chudnovsky, Chudnovsky, 1986
- ▶ 10M + 9S + 6m formulas by Gaudry, 2006
- ▶ 7M + 12S + 9m formulas by Bernstein, 2006

Problem: find cryptographically secure surface with small constants

- ▶ Gaudry, Schost, 2012: suitable (squared) surface:
  - Defined over the field  $\mathbb{F}_{2^{127}-1}$

• 
$$(1:a^2/b^2:a^2/c^2:a^2/d^2) = (-114:57:66:418)$$

•  $(1: A^2/B^2: A^2/C^2: A^2/D^2) = (-833: 2499: 1617: 561)$ 

Formulas for efficient Kummer surface arithmetic known for a while

- ▶ Originally proposed by Chudnovsky, Chudnovsky, 1986
- ▶ 10M + 9S + 6m formulas by Gaudry, 2006
- ▶ 7M + 12S + 9m formulas by Bernstein, 2006
- Problem: find cryptographically secure surface with small constants
- ▶ Gaudry, Schost, 2012: suitable (squared) surface:
  - Defined over the field  $\mathbb{F}_{2^{127}-1}$
  - $(1:a^2/b^2:a^2/c^2:a^2/d^2) = (-114:57:66:418)$
  - $(1: A^2/B^2: A^2/C^2: A^2/D^2) = (-833: 2499: 1617: 561)$
- ▶ Finding this surface cost 1000000 CPU hours
- The same surface has been used by Bos, Costello, Hisil, and Lauter (Eurocrypt 2013)

# Sandy Bridge/Ivy Bridge

- Pre-latest microarchitectures by Intel
- Very powerful vector instructions: AVX
- Operations on 256-bit vector registers

# Sandy Bridge/Ivy Bridge

- Pre-latest microarchitectures by Intel
- Very powerful vector instructions: AVX
- Operations on 256-bit vector registers
- Can only do vectors of (single- or double-precision) floats
- Performance:
  - ▶ 1 vectorized double-precision multiplication per cycle, plus
  - ▶ 1 vectorized double-precision addition per cycle
  - ► A total of 8 double-precision operations per cycle!

## Representing elements of $\mathbb{F}_{2^{127}-1}$

- Represent an element A in radix- $2^{127/6}$
- Write A as  $a_0, a_1, a_2, a_3, a_4, a_5$ , where
  - $a_0$  is a small multiple of  $2^0$
  - ▶ a<sub>1</sub> is a small multiple of 2<sup>22</sup>
  - $a_2$  is a small multiple of  $2^{43}$
  - ▶ a<sub>3</sub> is a small multiple of 2<sup>64</sup>
  - $a_4$  is a small multiple of  $2^{85}$
  - $a_5$  is a small multiple of  $2^{106}$

- $\blacktriangleright$  Consider multiplication of A and B with reduction mod  $2^{127}-1$
- Make use of the fact that  $2^{127} \equiv 1$
- With radix  $2^{127/6}$  we obtain:

$$\begin{split} r_0 &= a_0 b_0 + 2^{-127} a_1 b_5 + 2^{-127} a_2 b_4 + 2^{-127} a_3 b_3 + 2^{-127} a_4 b_2 + 2^{-127} a_5 b_1 \\ r_1 &= a_0 b_1 + & a_1 b_0 + 2^{-127} a_2 b_5 + 2^{-127} a_3 b_4 + 2^{-127} a_4 b_3 + 2^{-127} a_5 b_2 \\ r_2 &= a_0 b_2 + & a_1 b_1 + & a_2 b_0 + 2^{-127} a_3 b_5 + 2^{-127} a_4 b_4 + 2^{-127} a_5 b_3 \\ r_3 &= a_0 b_3 + & a_1 b_2 + & a_2 b_1 + & a_3 b_0 + 2^{-127} a_4 b_5 + 2^{-127} a_5 b_4 \\ r_4 &= a_0 b_4 + & a_1 b_3 + & a_2 b_2 + & a_3 b_1 + & a_4 b_0 + 2^{-127} a_5 b_5 \\ r_5 &= a_0 b_5 + & a_1 b_4 + & a_2 b_3 + & a_3 b_2 + & a_4 b_1 + & a_5 b_0 \end{split}$$

- Consider multiplication of A and B with reduction mod  $2^{127} 1$
- Make use of the fact that  $2^{127} \equiv 1$
- With radix  $2^{127/6}$  we obtain:

$$\begin{split} r_0 &= a_0 b_0 + 2^{-127} a_1 b_5 + 2^{-127} a_2 b_4 + 2^{-127} a_3 b_3 + 2^{-127} a_4 b_2 + 2^{-127} a_5 b_1 \\ r_1 &= a_0 b_1 + a_1 b_0 + 2^{-127} a_2 b_5 + 2^{-127} a_3 b_4 + 2^{-127} a_4 b_3 + 2^{-127} a_5 b_2 \\ r_2 &= a_0 b_2 + a_1 b_1 + a_2 b_0 + 2^{-127} a_3 b_5 + 2^{-127} a_4 b_4 + 2^{-127} a_5 b_3 \\ r_3 &= a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0 + 2^{-127} a_4 b_5 + 2^{-127} a_5 b_4 \\ r_4 &= a_0 b_4 + a_1 b_3 + a_2 b_2 + a_3 b_1 + a_4 b_0 + 2^{-127} a_5 b_5 \\ r_5 &= a_0 b_5 + a_1 b_4 + a_2 b_3 + a_3 b_2 + a_4 b_1 + a_5 b_0 \end{split}$$

 $\blacktriangleright$  Obviously, we always perform this whole thing  $4\times$  in parallel

- $\blacktriangleright$  Consider multiplication of A and B with reduction mod  $2^{127}-1$
- Make use of the fact that  $2^{127} \equiv 1$
- With radix  $2^{127/6}$  we obtain:

$$\begin{split} r_0 &= a_0 b_0 + 2^{-127} a_1 b_5 + 2^{-127} a_2 b_4 + 2^{-127} a_3 b_3 + 2^{-127} a_4 b_2 + 2^{-127} a_5 b_1 \\ r_1 &= a_0 b_1 + & a_1 b_0 + 2^{-127} a_2 b_5 + 2^{-127} a_3 b_4 + 2^{-127} a_4 b_3 + 2^{-127} a_5 b_2 \\ r_2 &= a_0 b_2 + & a_1 b_1 + & a_2 b_0 + 2^{-127} a_3 b_5 + 2^{-127} a_4 b_4 + 2^{-127} a_5 b_3 \\ r_3 &= a_0 b_3 + & a_1 b_2 + & a_2 b_1 + & a_3 b_0 + 2^{-127} a_4 b_5 + 2^{-127} a_5 b_4 \\ r_4 &= a_0 b_4 + & a_1 b_3 + & a_2 b_2 + & a_3 b_1 + & a_4 b_0 + 2^{-127} a_5 b_5 \\ r_5 &= a_0 b_5 + & a_1 b_4 + & a_2 b_3 + & a_3 b_2 + & a_4 b_1 + & a_5 b_0 \end{split}$$

- $\blacktriangleright$  Obviously, we always perform this whole thing  $4\times$  in parallel
- Obviously, we specialize squaring

- Consider multiplication of A and B with reduction mod  $2^{127} 1$
- Make use of the fact that  $2^{127} \equiv 1$
- With radix  $2^{127/6}$  we obtain:

$$\begin{split} r_0 &= a_0 b_0 + 2^{-127} a_1 b_5 + 2^{-127} a_2 b_4 + 2^{-127} a_3 b_3 + 2^{-127} a_4 b_2 + 2^{-127} a_5 b_1 \\ r_1 &= a_0 b_1 + & a_1 b_0 + 2^{-127} a_2 b_5 + 2^{-127} a_3 b_4 + 2^{-127} a_4 b_3 + 2^{-127} a_5 b_2 \\ r_2 &= a_0 b_2 + & a_1 b_1 + & a_2 b_0 + 2^{-127} a_3 b_5 + 2^{-127} a_4 b_4 + 2^{-127} a_5 b_3 \\ r_3 &= a_0 b_3 + & a_1 b_2 + & a_2 b_1 + & a_3 b_0 + 2^{-127} a_4 b_5 + 2^{-127} a_5 b_4 \\ r_4 &= a_0 b_4 + & a_1 b_3 + & a_2 b_2 + & a_3 b_1 + & a_4 b_0 + 2^{-127} a_5 b_5 \\ r_5 &= a_0 b_5 + & a_1 b_4 + & a_2 b_3 + & a_3 b_2 + & a_4 b_1 + & a_5 b_0 \end{split}$$

- $\blacktriangleright$  Obviously, we always perform this whole thing  $4\times$  in parallel
- Obviously, we specialize squaring
- Obviously, we specialize multiplications by small constants

## The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive

# The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive

#### Permuted and negated Hadamard

- Allow generalized Hadamard to output permuted vector
- Self-inverting permutation "cleans" after two generalized Hadamards

# The Hadamard transform



- Only shuffeling operation in Kummer arithmetic
- AVX has limited shuffeling across left and right half
- Plain Hadamard turns out to be expensive

#### Permuted and negated Hadamard

- Allow generalized Hadamard to output permuted vector
- Self-inverting permutation "cleans" after two generalized Hadamards
- Allow generalized Hadamard to negate vector entries
- "Clean" negations by multiplication by negated constants

#### Arithmetic on the squared Kummer surface



## Results

#### 128-bit secure, constant-time scalar multiplication

| arch  | cycles  | open | g | source of software               |
|-------|---------|------|---|----------------------------------|
| Sandy | 194036  | yes  | 1 | Bernstein–Duif–Lange–Schwabe–    |
|       |         |      |   | Yang CHES 2011                   |
| Sandy | 153000? | no   | 1 | Hamburg                          |
| Sandy | 137000? | no   | 1 | Longa–Sica Asiacrypt 2012        |
| Sandy | 122716  | yes  | 2 | Bos–Costello–Hisil–Lauter Euro-  |
|       |         |      |   | crypt 2013                       |
| Sandy | 119904  | yes  | 1 | Oliveira–López–Aranha–Rodríguez- |
|       |         |      |   | Henríquez CHES 2013              |
| Sandy | 96000?  | no   | 1 | Faz-Hernández–Longa–Sánchez CT-  |
|       |         |      |   | RSA 2014                         |
| Sandy | 92000?  | no   | 1 | Faz-Hernández–Longa–Sánchez      |
|       |         |      |   | July 2014                        |
| Sandy | 88916   | yes  | 2 | new (our results)                |

## Results

#### 128-bit secure, constant-time scalar multiplication

| arch | cycles  | open | g | source of software                  |
|------|---------|------|---|-------------------------------------|
| lvy  | 182708  | yes  | 1 | Bernstein–Duif–Lange–Schwabe–Yang   |
|      |         |      |   | CHES 2011                           |
| lvy  | 145000? | yes  | 1 | Costello–Hisil–Smith Eurocrypt 2014 |
| lvy  | 119032  | yes  | 2 | Bos–Costello–Hisil–Lauter Euro-     |
|      |         |      |   | crypt 2013                          |
| lvy  | 114036  | yes  | 1 | Oliveira–López–Aranha–Rodríguez-    |
|      |         |      |   | Henríquez CHES 2013                 |
| lvy  | 92000?  | no   | 1 | Faz-Hernández–Longa–Sánchez CT-     |
|      |         |      |   | RSA 2014                            |
| lvy  | 89000?  | no   | 1 | Faz-Hernández–Longa–Sánchez         |
|      |         |      |   | July 2014                           |
| lvy  | 88448   | yes  | 2 | new (our results)                   |

## More results

#### Also optimized for Intel Haswell

| arch    | cycles | open | g | source of software        |
|---------|--------|------|---|---------------------------|
| Haswell | 145907 | yes  | 1 | Bernstein–Duif–Lange–     |
|         |        |      |   | Schwabe–Yang CHES 2011    |
| Haswell | 100895 | yes  | 2 | Bos–Costello–Hisil–Lauter |
|         |        |      |   | Eurocrypt 2013            |
| Haswell | 55595  | no   | 1 | Oliveira–López–Aranha–    |
|         |        |      |   | Rodríguez-Henríquez       |
|         |        |      |   | CHES 2013                 |
| Haswell | 54389  | yes  | 2 | new (our results)         |

#### Also optimized for ARM Cortex-A8

| arch    | cycles | open | g | source of software          |
|---------|--------|------|---|-----------------------------|
| A8-slow | 497389 | yes  | 1 | Bernstein-Schwabe CHES 2012 |
| A8-slow | 305395 | yes  | 2 | new (our result)            |
| A8-fast | 460200 | yes  | 1 | Bernstein-Schwabe CHES 2012 |
| A8-fast | 273349 | yes  | 2 | new (our result)            |

#### Resources online

#### Paper:

Daniel J. Bernstein, Chitchanok Chuengsatiansup, Tanja Lange, Peter Schwabe. *"Kummer strikes back: new DH speed records"*. To be presented at Asiacrypt 2014 http://cryptojedi.org/papers/#kummer

#### Software:

Included in SUPERCOP, subdirectory crypto\_scalarmult/kummer/ http://bench.cr.yp.to/supercop.html