hey, I've finally found some time to try using fiat-crypto for ECDSA (starting with NIST P256), as suggested in https://github.com/mit-plv/fiat-crypto/issues/885 I'm now running `./word_by_word_montgomery --static 'p256' '64' '0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551'`

-- while using the P256 curve there takes < 1 minute, this is now running since 20 CPU minutes (still < 40MB memory) without any output. the number (given in hex above) is the group order (n) -- should that be passed as polynomial instead?

I think we will have to ping @Jason Gross for this one.

Thanks Karl -- sorry first time zulip user here -- please tell if I'm doing something wrong.

I'm looking into it now, I suspect the issue is an asymptotic complexity problem in the hex string parser

@Rasmus Holdsbjerg-Larsen and @Benjamin Salling Hvass may also be interested/have insights.

This is definitely a parsing issue. `Time Compute parse_num false "0xffffffff00000000ffffffffffffffffbce6f".`

take 0.056s, but adding `a`

to the end of the string results in it taking 2.303 s.

ok, 50 CPU minutes 50 MB memory, still running..

And adding one more `a`

to the string results in `vm_compute`

stack overflowing...

Is there an easy way out? E.g. manual conversion of the group order from hex into a polynomial? Or hardcoding the group order number in word_by_word_montgomery.ml and recompile (I'm more familiar with OCaml than Coq...)?

Ah, I figured it out. The issue is that I had the parser accepting `0x<num>e<num>`

as a notation for exponentiation, so it was computing `10^0x6faada7179e84f3b9cac2fc632551`

, which is a very very big number. (This is starting to become a theme, where taking lots of RAM and time is a result of somehow misinterpreting the user's input and treating something as an exponent when it should not be treated as such.) Anyway, I should have a commit fixing this pushed in a couple of minutes. In the meantime, if you write your number in decimal (or as any arithmetic expression where you avoid using `e`

and `E`

in hex numbers), that should work.

Okay, I've just pushed the commit https://github.com/mit-plv/fiat-crypto/commit/e31a36d5f1b20134e67ccc5339d88f0ff3cb0f86 which should fix the issue (not fully tested)

thanks Jason, I updated, running the `make standalone-ocaml`

again, and will report back results.

Ok, the C code generation (see invocation on top) finished < 1 second now. Thanks! :)

Great, I'm glad to have been able to solve this so easily!

hmm, and while trying out the generated C code, it looks like I'm unable to make good use of it (maybe due to a lack of understanding what representations are expected) -- I get a test vector from NIST (specifying a `k`

and a `k ^ -1`

-- both as hex numbers). When using zarith - an OCaml multi-precision integer library (based on GMP), I'm able to let k = Z.of_string "0x..." in let kinv = Z.invert k (Z.of_string "0x <the P256 group order>") in assert (kinv = Z.of_string "0x <nist-provided kinv>").

With the generated fiat code, when attempting to do the same (convert hex to binary, apply inverse -- using the inversion_template.c), I get a different output for kinv. Now questions include: should I need to call of_bytes/from_bytes (they seem to not to anything memory-wise (eventually C-type-wise?))? what about to/from_montgomery, should I call this before passing the value to inverse? May it be a big-endian/little-endian representation issue? I have tried many of these variants (to/from_montgomery, do a reverse on the bytes before/after) with no success in getting the NIST provided kinv.

I looked a bit further,and expected \forall i . inverse (inverse i) = i [with the side condition that 0 <= i < m] -- and this does not hold in my code for some reason.

I switched back to the p256_64.c (i.e. normal 256 NIST curve generated by fiat), and then instantiate the inversion_template.c as suggested in the ECDSA issue.

Hannes Mehnert said:

I switched back to the p256_64.c (i.e. normal 256 NIST curve generated by fiat), and then instantiate the inversion_template.c as suggested in the ECDSA issue.

I can't quite tell, does this solve the issue you're having?

Regarding from/to bytes and from/to montgomery: the from/to bytes are things to be done when you need to get numbers from the wire or send them out. The pre and postconditions of most of the functions should tell you when you need to convert from and to montgomery. e.g., `fiat_p256_mul`

has the post-condition `eval (from_montgomery out1) mod m = (eval (from_montgomery arg1) * eval (from_montgomery arg2)) mod m`

and `to_montgomery`

has the postcondition `eval (from_montgomery out1) mod m = eval arg1 mod m`

so you would need to write something like `from_montgomery (fiat_p256_mul(to_montgomery(x), to_montgomery(y)))`

to operate on non-Montgomery-form numbers.

No, I have some trouble with inversion, even with the "normal" P-256 prime (and not the group with the prime order I generated). I'll write that up tomorrow in a more concise form (including reproducible example) and put it into the GitHub issue.

thanks for your explanation how to read the pre/postconditions.

I figured out: (a) inversion requires an array with an additional limb set to 0, i.e. in[LIMBS + 1], in[LIMBS] = 0. (b) a from_montgomery should be preceeded before the call to inverse. with these changes, inverse(inverse x) = x.

I'm still a bit confused that -- using P256 (mod p) -- the result of 2^-1 mod p is not the same according to inverse vs Zarith's Z.invert:

0x800000007fffffff 0x8000000100000000 0xffffffff7fffffff 0xfffffffe80000000 (fiat's inverse)

0x7fffffff80000000 0x8000000000000000 0x80000000 0x0 (the big-endian output from zarith)

the input is 2, and { 0, 0, 0, 0x2 } for fiat (on a 64 bit computer)...

You I think you need to run from_montgomery(inverse(to_montgomery(2))) to get 2^-1 mod p (Z.invert). Is this what you're running?

I tried at https://github.com/hannesm/p256-test

at 0: inv_two 7fffffff80000000 inv 7fffffff80000001 mo 8000000300000002

at 1: inv_two 8000000000000000 inv 8000000080000000 mo 800000047ffffffb

at 2: inv_two 80000000 inv 7fffffff mo fffffffc00000000

at 3: inv_two 0 inv 0 mo fffffffc80000005

(with inv_two being the hardcoded constant, inv before the from_montgomery, and mo the value after the from_montgomery)

here we go, just messed up the endianness

{ 0, 0, 0, 0x2 } for fiat

Yeah, according to https://github.com/hannesm/p256-test/blob/63bc99c9530dbeac8301580d41f1fb21b3e9e5fe/p256_64.c#L15, this is `2*(1<<192)`

with https://github.com/hannesm/p256-test/commit/0d8c8b3735f8d74aaae1862080a4ffddd8f93c40 it works, so it is 2^-1 mod p == inverse (to_montgomery(2)) [no from_montgomery needed]

as an update here, I use the tests from https://github.com/google/wycheproof/ for testing ECDSA signing and ECDH [see https://github.com/mirage/mirage-crypto/pull/101 if interested)

- P256 (1071 tests) and P384 (1054 tests) work smoothly \o/
- P224 (1308 tests) ECDH fails one test case, ECDSA 6 test failures, need to investigate further
- SECP256K1 (1047 tests, 534 failures) there's something wrong, nearly all expect-to-be-valid test cases fail, need to investigate
- P521: here the Makefile in fiat-crypto uses "unsaturated_salinas" (not "word_by_word_montgomery") -- could someone point me to reasoning behind when to use what? (I just tried word_by_word_montgomery on P521 and the 32bit code wasn't generated after ~60 minutes CPU time, so I guess I'll use unsaturated_salinas then -- but would like to understand why and what the tradeoff is)

Thanks for the update! I'm glad to hear P256 and P384 are working smoothly!

Are you doing 32-bit P224 or 64-bit P224 or both? If the failures are only seen in 64-bit P224, I'd expect there's some sort of misalignment of assumptions, since, unlike P256, P384, and P224 32-bit, 64-bit P224 leaves half-a-limb unused. Perhaps there's some presumption somewhere of zeroing it out.

Montgomery reduction is saturated, i.e., the number of bits per limb must equal the bitwidth of the machine. So 64-bit P256 uses 256/64=4 limbs, 32-bit P224 uses 224/32=8 limbs. 64-bit 224 wastes half a limb because 224/64=3.5. When the base-2 log of the leading coefficient of the prime is not a multiple of the bitwidth, generally unsaturated Solinas is used, so that we don't need to carry as often. Unfortunately our Montgomery generation is quartic in the number of limbs (see https://github.com/mit-plv/fiat-crypto/issues/851#issuecomment-658350696), and P521 x32 has ceil(521/32)=17 limbs, so I expect it to take about 16 minutes. (How much time does 64-bit Montgomery generation for P521 take for you? For me it takes about 2 minutes with `./src/ExtractionOCaml/word_by_word_montgomery p521 64 '2^521 - 1'`

.)

thanks for your kind answer

Are you doing 32-bit P224 or 64-bit P224 or both?

I develop with 64bit (and run the tests there), though CI (atm 256 & 384 only, since I have not pushed 224/256K1) also runs them on 32bit.

zeroing out

that may very well be the case, thanks for reminding me that there are 32bit unused

and thanks for the explanation. the word_by_word_montgomery for 64bit and P521 was around 5 minutes on this computer.

Then I'd expect 32-bit P521 WBW Montgomery to complete within 40 minutes. (I am trying it on my machine currently with `ulimit -s unlimited`

; it stack overflowed after 5 minutes without that.)

@Hannes Mehnert It's good to do these test, but given that fiat is provably correct.

What would cause test failures? The specification being uncomplete?

I suspect it's a mismatch between the specification of the generated code and the implicit specification of the higher-level algorithm it's being plugged into. (Examples: whose responsibility it is to mask off high bits when converting from bytes; whether the representation is little-endian or big-endian; where to convert to/from montgomery)

An excellent question. There's a thin layer to actually do ECDH&ECDSA on top of the generated C code (plus my usage of calling the C code) -- the generated C code is basically "arithmetics in the group" - while there's still some scalar multiplication etc. to be done to do a full ECDH/ECDSA operation.

@Rasmus Holdsbjerg-Larsen and @Benjamin Salling Hvass have started to work on using bedrock2 to generate a little more of the operations. It might be relevant in your case too.

@Jason Gross indeed using the 32bit P224 code leads to all tests succeeding.

WBW Montgomery for P521 32-bit took 221m50s on my laptop

If you want the output for testing, I can send it to you, if that's helpful. It's almost 20,000 loc

I tried unsaturated_solinas, and figured there's no inversion for them (am I missing something?)

Last updated: Dec 07 2023 at 06:38 UTC