#### Full accounting for verifiable outsourcing

Riad S. Wahby\*, Ye Ji°, Andrew J. Blumberg<sup>†</sup>, abhi shelat<sup>‡</sup>, Justin Thaler<sup>△</sup>, Michael Walfish°, and Thomas Wies°

> \*Stanford University °New York University <sup>†</sup>The University of Texas at Austin <sup>‡</sup>Northeastern University <sup>△</sup>Georgetown University

> > July 6<sup>th</sup>, 2017





#### Approach: Server's response includes short proof of correctness.

[Babai85, GMR85, BCC86, BFLS91, FGLSS91, ALMSS92, AS92, Kilian92, LFKN92, Shamir92, Micali00, BG02, BS05, GOS06, BGHSV06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR13, BCCT13, Thaler13, KRR14, ...]



Approach: Server's response includes short proof of correctness.

[Babai85, GMR85, BCC86, BFLS91, FGLSS91, ALMSS92, AS92, Kilian92, LFKN92, Shamir92, Micali00, BG02, BS05, GOS06, BGHSV06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR13, BCCT13, Thaler13, KRR14, ...]



**Goal:** outsourcing should be less expensive than just executing the computation

# Verifier: can easily check proof (asymptotically)

Verifier: can easily check proof (asymptotically) Prover: has massive overhead ( $\approx 10,000,000 \times$ )

Verifier: can easily check proof (asymptotically) Prover: has massive overhead ( $\approx 10,000,000 \times$ ) Precomputation: proportional to computation size

Verifier: can easily check proof (asymptotically) Prover: has massive overhead ( $\approx 10,000,000 \times$ ) Precomputation: proportional to computation size

How do systems handle these costs?

Verifier: can easily check proof (asymptotically) Prover: has massive overhead ( $\approx 10,000,000 \times$ ) Precomputation: proportional to computation size

How do systems handle these costs? Precomputation: amortize over many instances

Verifier: can easily check proof (asymptotically) Prover: has massive overhead ( $\approx 10,000,000 \times$ ) Precomputation: proportional to computation size

How do systems handle these costs? Precomputation: amortize over many instances Prover: assume  $> 10^8 \times$  cheaper than verifier

# Giraffe: first system to consider all costs and win.

Giraffe: first system to consider all costs and win.

In Giraffe,  $\mathcal{P}$  really is  $10^8 \times$  cheaper than  $\mathcal{V}$ ! (setting: building trustworthy hardware)

Giraffe: first system to consider all costs and win.

In Giraffe,  $\mathcal{P}$  really is  $10^8 \times$  cheaper than  $\mathcal{V}$ ! (setting: building trustworthy hardware)

Giraffe extends Zebra [WHGsW, Oakland16] with:

- an asymptotically optimal proof protocol that improves on prior work [Thaler, CRYPTO13]
- a compiler that generates optimized hardware designs from a subset of C

Giraffe: first system to consider all costs and win.

In Giraffe,  $\mathcal{P}$  really is  $10^8 \times$  cheaper than  $\mathcal{V}$ ! (setting: building trustworthy hardware)

Giraffe extends Zebra [WHGsW, Oakland16] with:

- an asymptotically optimal proof protocol that improves on prior work [Thaler, CRYPTO13]
- a compiler that generates optimized hardware designs from a subset of C

Bottom line: Giraffe makes outsourcing worthwhile

Giraffe: first system to consider all costs and win.

In Giraffe,  $\mathcal{P}$  really is  $10^8 \times$  cheaper than  $\mathcal{V}$ ! (setting: building trustworthy hardware)

Giraffe extends Zebra [WHGsW, Oakland16] with:

- an asymptotically optimal proof protocol that improves on prior work [Thaler, CRYPTO13]
- a compiler that generates optimized hardware designs from a subset of C

**Bottom line:** Giraffe makes outsourcing worthwhile (... sometimes).

Roadmap

# 1. Verifiable ASICs

## 2. Giraffe: a high-level view

# 3. Evaluation

Roadmap

# 1. Verifiable ASICs

#### 2. Giraffe: a high-level view

3. Evaluation

# How can we build trustworthy hardware?



e.g., a custom chip for network packet processing whose manufacture we outsource to a third party



#### What if the chip's manufacturer inserts a **back door**?



What if the chip's manufacturer inserts a **back door**? Threat: incorrect execution of the packet filter (Other concerns, e.g., secret state, are important but orthogonal)



#### What if the chip's manufacturer inserts a **back door**?

#### The Cybercrime Economy

Fake tech gear has infiltrated the U.S. government

by David Goldman @DavidGoldmanCNN

November 8, 2012: 3:10 PM ET







US DoD controls supply chain with trusted foundries.

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., Oakland16; Stealthy Dopant-Level Trojans, Becker et al., CHES13]

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., Oakland16; Stealthy Dopant-Level Trojans, Becker et al., CHES13]

But trusted fabrication is not a panacea:

- X Only 5 countries have cutting-edge fabs on-shore
- ✗ Building a new fab takes \$\$\$\$\$\$, years of R&D

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., Oakland16; Stealthy Dopant-Level Trojans, Becker et al., CHES13]

But trusted fabrication is not a panacea:

- X Only 5 countries have cutting-edge fabs on-shore
- ✗ Building a new fab takes \$\$\$\$\$\$, years of R&D
- Semiconductor scaling: chip area and energy go with square and cube of transistor length ("critical dimension")
- X So using an old fab means an enormous performance hit e.g., India's best on-shore fab is  $10^8 \times$  behind state of the art

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., Oakland16; Stealthy Dopant-Level Trojans, Becker et al., CHES13]

But trusted fabrication is not a panacea:

- X Only 5 countries have cutting-edge fabs on-shore
- ✗ Building a new fab takes \$\$\$\$\$\$, years of R&D
- Semiconductor scaling: chip area and energy go with square and cube of transistor length ("critical dimension")
- ✗ So using an old fab means an enormous performance hit e.g., India's best on-shore fab is 10<sup>8</sup> × behind state of the art

Idea: outsource computations to untrusted chips

 $\begin{array}{l} \text{Principal} \\ \text{F} \rightarrow \text{designs} \\ \text{for } \mathcal{P}, \mathcal{V} \end{array}$ 











 ${\mathcal V}$  overhead: checking proof is cheap



 $\mathcal{V}$  overhead: checking proof is cheap  $\mathcal{P}$  overhead: high compared to cost of F...



\$\mathcal{V}\$ overhead: checking proof is cheap
\$\mathcal{P}\$ overhead: high compared to cost of F...
...but \$\mathcal{P}\$ uses an advanced circuit technology



\$\mathcal{V}\$ overhead: checking proof is cheap
\$\mathcal{P}\$ overhead: high compared to cost of F...
...but \$\mathcal{P}\$ uses an advanced circuit technology

# Can Verifiable ASICs be practical?



\$\mathcal{V}\$ overhead: checking proof is cheap
\$\mathcal{P}\$ overhead: high compared to cost of F...
...but \$\mathcal{P}\$ uses an advanced circuit technology
Precomputation: proportional to cost of F

$$\begin{array}{c} \mathsf{Prior work:} \\ \mathcal{V} + \mathcal{P} + \mathsf{Precomp} > \mathsf{F} \end{array}$$

# Can Verifiable ASICs be practical?



\$\mathcal{V}\$ overhead: checking proof is cheap
\$\mathcal{P}\$ overhead: high compared to cost of F...
...but \$\mathcal{P}\$ uses an advanced circuit technology
Precomputation: proportional to cost of F
Prior work assumes this away

 $\begin{array}{c} \mathsf{Prior} \ \mathsf{work:} \\ \mathcal{V} + \mathcal{P} + \mathsf{Precomp} > \mathsf{F} \end{array}$ 

# Can Verifiable ASICs be practical?



\$\mathcal{V}\$ overhead: checking proof is cheap
\$\mathcal{P}\$ overhead: high compared to cost of F...
...but \$\mathcal{P}\$ uses an advanced circuit technology
Precomputation: proportional to cost of F
Prior work assumes this away

$$\mathsf{Our} \; \mathsf{goal}: \ \mathcal{V} + \mathcal{P} + \mathsf{Precomp} < \mathsf{F}$$

Roadmap

# 1. Verifiable ASICs

# 2. Giraffe: a high-level view

3. Evaluation

GKR08 base protocol

GKR08 base protocol

CMT12 reduces  ${\mathcal P}$  and precomp costs for all ckts

GKR08 base protocol

CMT12 reduces  $\mathcal{P}$  and precomp costs for all ckts

Thaler13 reduces precomp for structured circuits

GKR08 base protocol

CMT12 reduces  $\mathcal{P}$  and precomp costs for all ckts

Thaler13 reduces precomp for structured circuits

Giraffe reduces  $\mathcal{P}$  cost for structured circuits (plus optimizations for  $\mathcal{V}$ ; see paper)

GKR08 base protocol

CMT12 reduces  $\mathcal{P}$  and precomp costs for all ckts

Thaler13 reduces precomp for structured circuits

Giraffe reduces  $\mathcal{P}$  cost for structured circuits (plus optimizations for  $\mathcal{V}$ ; see paper)

Let's take a high-level look at how these optimizations work. (The following all use a nice simplification [Thaler15].)

For each layer of an arithmetic circuit,  ${\cal P}$  and  ${\cal V}$  engage in a sum-check protocol.



For each layer of an arithmetic circuit,  ${\cal P}$  and  ${\cal V}$  engage in a sum-check protocol.

In the first round,  $\mathcal{P}$  computes  $(q \in \mathbb{F}^{\log G})$ :

$$\sum_{h_0 \in \{0,1\}^{\log G}} \sum_{h_1 \in \{0,1\}^{\log G}} \left( \tilde{\mathsf{add}}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) + \tilde{\mathsf{V}}(h_1) \right) + \tilde{\mathsf{mul}}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) \cdot \tilde{\mathsf{V}}(h_1) \right) \right)$$



For each layer of an arithmetic circuit,  ${\cal P}$  and  ${\cal V}$  engage in a sum-check protocol.

In the first round,  $\mathcal{P}$  computes  $(q \in \mathbb{F}^{\log G})$ :

$$\sum_{h_0 \in \{0,1\}^{\log G}} \sum_{h_1 \in \{0,1\}^{\log G}} \left( \tilde{\operatorname{add}}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) + \tilde{\mathsf{V}}(h_1) \right) + \tilde{\operatorname{mul}}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) \cdot \tilde{\mathsf{V}}(h_1) \right) \right)$$

This has  $2^{2 \log G} = G^2$  terms. In total,  $\mathcal{P}$ 's work is O(poly(G)).



For each layer of an arithmetic circuit,  ${\cal P}$  and  ${\cal V}$  engage in a sum-check protocol.

In the first round,  $\mathcal{P}$  computes  $(q \in \mathbb{F}^{\log G})$ :

$$\sum_{h_0 \in \{0,1\}^{\log G}} \sum_{h_1 \in \{0,1\}^{\log G}} \left( \operatorname{add}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) + \tilde{\mathsf{V}}(h_1) \right) + \operatorname{mul}(q, h_0, h_1) \left( \tilde{\mathsf{V}}(h_0) \cdot \tilde{\mathsf{V}}(h_1) \right) \right)$$

This has  $2^{2 \log G} = G^2$  terms. In total,  $\mathcal{P}$ 's work is O(poly(G)).

Precomputation is one evaluation of add and mul, costing O(poly(G)).



 $add(g_O, g_L, g_R) = 0$  except when  $g_O$  is + with inputs  $g_L, g_R$ 







add $(g_O, g_L, g_R) = 0$  except when  $g_O$  is + with inputs  $g_L, g_R$ This means we can rewrite  $\mathcal{P}$ 's sum in the first round as:

$$\sum_{\substack{(h_0,h_1)\in S_{add}}} \tilde{add}(q, h_0, h_1) \left(\tilde{V}(h_0) + \tilde{V}(h_1)\right) + \\\sum_{\substack{(h_0,h_1)\in S_{mul}}} \tilde{mul}(q, h_0, h_1) \left(\tilde{V}(h_0) \cdot \tilde{V}(h_1)\right)$$



add $(g_O, g_L, g_R) = 0$  except when  $g_O$  is + with inputs  $g_L, g_R$ This means we can rewrite  $\mathcal{P}$ 's sum in the first round as:

$$\sum_{(h_0,h_1)\in S_{add}} \tilde{add}(q, h_0, h_1) \left(\tilde{V}(h_0) + \tilde{V}(h_1)\right) + \\\sum_{(h_0,h_1)\in S_{mul}} \tilde{mul}(q, h_0, h_1) \left(\tilde{V}(h_0) \cdot \tilde{V}(h_1)\right)$$

*G* terms/round for  $2 \log G$  rounds:  $\mathcal{P}$ 's work is  $O(G \log G)$ .



add $(g_O, g_L, g_R) = 0$  except when  $g_O$  is + with inputs  $g_L, g_R$ This means we can rewrite  $\mathcal{P}$ 's sum in the first round as:

$$\sum_{(h_0,h_1)\in S_{add}} \tilde{add}(q, h_0, h_1) \left(\tilde{V}(h_0) + \tilde{V}(h_1)\right) + \\\sum_{(h_0,h_1)\in S_{mul}} \tilde{mul}(q, h_0, h_1) \left(\tilde{V}(h_0) \cdot \tilde{V}(h_1)\right)$$

*G* terms/round for  $2 \log G$  rounds:  $\mathcal{P}$ 's work is  $O(G \log G)$ .

Using a related trick, precomputing add and mul costs O(G) in total.



Idea: for a batch of identical subckts, add and mul can be "small."



Idea: for a batch of identical subckts, add and mul can be "small."



Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!



Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!

Now  $\mathcal{P}$ 's sum in the first round is  $(q' \in \mathbb{F}^{\log N})$ :

$$\sum_{\substack{(h_0,h_1)\in S_{\text{add}}}} \tilde{\operatorname{add}}(q, h_0, h_1) \sum_{\substack{h'\in\{0,1\}^{\log N}}} \tilde{\operatorname{eq}}(q', h') \left(\tilde{\mathsf{V}}(h', h_0) + \tilde{\mathsf{V}}(h', h_1)\right) + \\\sum_{\substack{(h_0,h_1)\in S_{\text{mul}}}} \tilde{\operatorname{mul}}(q, h_0, h_1) \sum_{\substack{h'\in\{0,1\}^{\log N}}} \tilde{\operatorname{eq}}(q', h') \left(\tilde{\mathsf{V}}(h', h_0) \cdot \tilde{\mathsf{V}}(h', h_1)\right)$$



Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!

Now  $\mathcal{P}$ 's sum in the first round is  $(q' \in \mathbb{F}^{\log N})$ :



Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!

Now  $\mathcal{P}$ 's sum in the first round is  $(q' \in \mathbb{F}^{\log N})$ :





Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!

Now  $\mathcal{P}$ 's sum in the first round is  $(q' \in \mathbb{F}^{\log N})$ :



For each gate, sum over each subcircuit.



Idea: for a *batch* of identical subckts, add and mul can be "small."  $\rightarrow$  Precomp costs O(G), amortized over N copies!

Now  $\mathcal{P}$ 's sum in the first round is  $(q' \in \mathbb{F}^{\log N})$ :

$$\sum_{\substack{(h_0,h_1)\in S_{\text{add}}}} \tilde{\operatorname{add}}(q, h_0, h_1) \sum_{\substack{h'\in\{0,1\}^{\log N}}} \tilde{\operatorname{eq}}(q', h') \left(\tilde{\mathsf{V}}(h', h_0) + \tilde{\mathsf{V}}(h', h_1)\right) + \\\sum_{\substack{(h_0,h_1)\in S_{\text{mul}}}} \tilde{\operatorname{mul}}(q, h_0, h_1) \sum_{\substack{h'\in\{0,1\}^{\log N}}} \tilde{\operatorname{eq}}(q', h') \left(\tilde{\mathsf{V}}(h', h_0) \cdot \tilde{\mathsf{V}}(h', h_1)\right)$$

*NG* terms/round in first  $2 \log G$  rounds:  $\mathcal{P}$ 's work is  $\Omega(NG \log G)$ .



Idea: arrange for copies to "collapse" during sum-check protocol.



$$\sum_{\substack{h' \in \{0,1\}^{\log N} \\ h' \in \{0,1\}^{\log N}}} \tilde{eq}(q',h') \sum_{\substack{(h_0,h_1) \in S_{add} \\ (h_0,h_1) \in S_{add}}} \tilde{add}(q,h_0,h_1) \left(\tilde{V}(h',h_0) + \tilde{V}(h',h_1)\right) + \sum_{\substack{h' \in \{0,1\}^{\log N} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) \cdot \tilde{V}(h',h_1)\right)$$



Idea: arrange for copies to "collapse" during sum-check protocol. Rewriting the prior sum and changing sumcheck order:

$$\sum_{\substack{h' \in \{0,1\}^{\log N} \\ h' \in \{0,1\}^{\log N}}} \tilde{eq}(q',h') \sum_{\substack{(h_0,h_1) \in S_{add} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) \cdot \tilde{V}(h',h_1)\right) + \tilde{V}(h',h_1) \left(\tilde{V}(h',h_0) \cdot \tilde{V}(h',h_1)\right)$$

#### For each subcircuit,



Idea: arrange for copies to "collapse" during sum-check protocol. Rewriting the prior sum and changing sumcheck order:



### For each subcircuit, sum over each gate.









Idea: arrange for copies to "collapse" during sum-check protocol. Rewriting the prior sum and changing sumcheck order:

$$\sum_{\substack{h' \in \{0,1\}^{\log N} \\ h' \in \{0,1\}^{\log N}}} \tilde{eq}(q',h') \sum_{\substack{(h_0,h_1) \in S_{add} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) + \tilde{V}(h',h_1)\right) + \sum_{\substack{h' \in \{0,1\}^{\log N} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) \cdot \tilde{V}(h',h_1)\right)$$

 $\mathcal{P}$  does  $\left(N + \frac{N}{2} + \frac{N}{4} + ...\right) G + 2G \log G = O(NG + G \log G)$  work.



Idea: arrange for copies to "collapse" during sum-check protocol. Rewriting the prior sum and changing sumcheck order:

$$\sum_{\substack{h' \in \{0,1\}^{\log N} \\ h' \in \{0,1\}^{\log N}}} \tilde{eq}(q',h') \sum_{\substack{(h_0,h_1) \in S_{add} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) + \tilde{V}(h',h_1)\right) + \sum_{\substack{h' \in \{0,1\}^{\log N} \\ (h_0,h_1) \in S_{mul}}} \tilde{mul}(q,h_0,h_1) \left(\tilde{V}(h',h_0) \cdot \tilde{V}(h',h_1)\right)$$

 $\mathcal{P} \operatorname{does} \left(N + \frac{N}{2} + \frac{N}{4} + \ldots\right) G + 2G \log G = O(NG + G \log G) \text{ work.}$  $\Rightarrow \operatorname{Linear in size of computation when } N > \log G!$ 



Roadmap

# 1. Verifiable ASICs

## 2. Giraffe: a high-level view

3. Evaluation

# Giraffe is an end-to-end hardware generator:

Giraffe is an end-to-end hardware generator:

a hardware design template given computation, chip parameters (technology, size, ...), produces optimized hardware designs for  $\mathcal{P}$  and  $\mathcal{V}$  Giraffe is an end-to-end hardware generator:

a hardware *design template* 

given computation, chip parameters (technology, size,  $\ldots$ ), produces optimized hardware designs for  ${\cal P}$  and  ${\cal V}$ 

a (subset of) C compiler produces the representation used by the design template How does Giraffe perform on real-world computations?

1. Curve25519 point multiplication

2. Image matching

How does Giraffe perform on real-world computations?

1. Curve25519 point multiplication

2. Image matching

**Goal:** total cost of  $\mathcal{V}$ ,  $\mathcal{P}$ , and precomputation should be less than building F on a trusted chip



Baselines: Zebra; implementation of F in same technology as  $\mathcal{V}$ 



Baselines: Zebra; implementation of F in same technology as  $\mathcal{V}$ 

Metric: total energy consumption



Baselines: Zebra; implementation of F in same technology as  $\mathcal V$ 

Metric: total energy consumption

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models

Charge for  $\mathcal{V}$ ,  $\mathcal{P}$ , communication; precomputation; PRNG



Baselines: Zebra; implementation of F in same technology as V

Metric: total energy consumptie

Measurements: based on circuit published chip designs, and CM 350 nm: 1997 (Pentium II) 7 nm:  $\approx$  2018

 $\approx$  20 year gap between trusted and untrusted fab

Charge for  $\mathcal{V}$ ,  $\mathcal{P}$ , communication; precomputation; PRN

Constraints: trusted fab = 350 nm; untrusted fab = 7 nm 200 mm<sup>2</sup> max chip area; 150 W max total power Application #1: Curve25519 point multiplication

## Curve25519: a commonly-used elliptic curve

Point multiplication: primitive, e.g., for ECDH



#### Application #1: Curve25519 point multiplication

Application #2: Image matching

## Image matching via Fast Fourier transform

C implementation, compiled by Giraffe's front-end to  $\mathcal{V}$  and  $\mathcal{P}$  hardware designs—no hand tweaking!

Application #2: Image matching







# **X** Giraffe is restricted to batched computations



X Giraffe is restricted to batched computations

Giraffe's front-end includes two static analysis passes:

**Slicing** extracts only the parts of programs that can be efficiently outsourced **Squashing** extracts batch-parallelism from serial computations



- **X** Giraffe is restricted to batched computations
- ✓ Giraffe's proof protcol and optimizations save orders of magnitude compared to prior work



- **X** Giraffe is restricted to batched computations
- ✓ Giraffe's proof protcol and optimizations save orders of magnitude compared to prior work
- ✓ Giraffe is the first system in the literature to account for *all costs*—and win.



- **X** Giraffe is restricted to batched computations
- ✓ Giraffe's proof protcol and optimizations save orders of magnitude compared to prior work
- ✓ Giraffe is the first system in the literature to account for *all costs*—and win.

Giraffe is a step, but much work remains!



- **X** Giraffe is restricted to batched computations
- ✓ Giraffe's proof protcol and optimizations save orders of magnitude compared to prior work
- ✓ Giraffe is the first system in the literature to account for *all costs*—and win.

Giraffe is a step, but much work remains!

https://giraffe.crypto.fyi
http://www.pepper-project.org