Is anyone [else] working on proving things in Coq about neural nets?
No, but I am working on numerics, including numerical linear algebra, so I am not that far away.
probably insiders know about this work already, but for other people reading this the following might be interesting: http://ace.cs.ohio.edu/~gstewart/mlcert.html
In particular, this "prove[s] generalization bounds [in Coq] on neural networks trained using TensorFlow on the extended MNIST data set" (https://github.com/OUPL/MLCert)
(I don't work in this area myself, but try to follow it)
the MLCert website is apparently not updated with the latest PLDI paper/code that looks at verification of sampling as used in TensorFlow.
http://ace.cs.ohio.edu/~gstewart/mlcert.html seems down?
works for me?
Good for you, but I tried from multiple VPN locations / browsers / hosts, and _all_ fail.
latest archive snapshot: https://web.archive.org/web/20230603202215/http://ace.cs.ohio.edu/~gstewart/mlcert.html
Afaik Gordon is now with BedRock, so it is natural that his university page was shut down (the complete personal page is gone).
GitHub at least proves more stable than university pages. It used to be that university pages were the most stable.
seemingly open-access canonical version of the 2019 paper: https://ojs.aaai.org/index.php/AAAI/article/view/4115
sigh, they must have pulled the ohio.edu website just today, after I posted the URL
@Michael Soegtrop Where's your codebase / what bits are you working on? Should we de-duplicate work?
I have some code here on arbitrary-rank tensors, indexing wrap-around, reshaping, broadcasting, round-tripping through primitive arrays (so computations can be forced at specific points to avoid duplicated work), and some other basic operations; some work on the fancy slicing that Python supports, including index wrap-around, slicing by ranges with and without steps, slicing by tensors of indices, and leaving the leading ranks implicit; and some preliminary support for Einstein summation notation. I also have a bunch of proofs of `Proper` instances, as well as some development of heterogeneous and dependently-typed `Proper` instances that I'm using to switch between doing linear algebra on primitive floats vs. linear algebra on Flocq floats. (I also have a small development of `for` loops on `uint63` based on proving well-foundedness of `ltb`, which I'm currently using for sum, mean, var, etc.) (I also have a development of `ln` and `exp` on primitive floats that currently has no proofs, which I'd be happy to replace with something more proven as long as it's as fast.)
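For concreteness, here is a minimal sketch of what an arbitrary-rank tensor type with wrap-around indexing can look like in Coq; the names `tensor`, `get_wrap`, and `example` are my own illustration of the design space, not the code described above:

```coq
From Coq Require Import List.
Import ListNotations.

(* A tensor is a curried function from indices to elements; the
   shape is the list of dimension sizes. *)
Fixpoint tensor (A : Type) (shape : list nat) : Type :=
  match shape with
  | [] => A
  | _ :: s => nat -> tensor A s
  end.

(* Wrap-around indexing on the leading dimension: reduce the index
   modulo the dimension size (a simplification of Python's
   negative-index rule, since nat has no negative indices). *)
Definition get_wrap {A} (d : nat) {s : list nat}
  (t : tensor A (d :: s)) (i : nat) : tensor A s :=
  t (i mod d).

(* A rank-2 identity-pattern tensor of shape [2; 3]. *)
Definition example : tensor nat [2; 3] :=
  fun i j => if Nat.eqb i j then 1 else 0.
```

Representing tensors as functions makes reshaping and broadcasting cheap to define, which is presumably why one then wants the round-trip through primitive arrays mentioned above: it forces the (otherwise lazily recomputed) function at a specific point.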
Some design choices I'm not sure about:
@Karl Palmskog I was not aware of that, and that looks like it might overlap significantly with what I'm working on currently, thanks!
MLCert doesn't support softmax, though, it seems?
Also it's for Coq 8.8 :-/
Maybe I should try to update it to 8.18 and get it in Coq's CI / opam?
@Jason Gross we have multiple people here on the Zulip from Bedrock where Gordon Stewart currently works. Maybe you could get in touch with Gordon about porting to 8.18, either directly or through them? We'd be happy to host MLCert in Coq-community if you get permission from authors (edit: I found Gordon on the Zulip)
@Gordon Stewart what is the status of MLCert (https://github.com/OUPL/MLCert)? Would you be open to getting it ported (by others) to recent Coq, and possibly having it hosted in Coq-community for future maintenance? See discussion above in this topic by Jason Gross.
@Jason Gross there is also this work in Lean that may be adjacent: https://arxiv.org/abs/1911.00385
We present a formal proof in Lean of probably approximately correct (PAC) learnability of the concept class of decision stumps. This classic result in machine learning theory derives a bound on error probabilities for a simple type of classifier. Though such a proof appears simple on paper, analytic and measure-theoretic subtleties arise when carrying it out fully formally. Our proof is structured so as to separate reasoning about deterministic properties of a learning function from proofs of measurability and analysis of probabilities.
https://github.com/jtristan/stump-learnable
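For context, the PAC definition being formalized there, paraphrased from the textbook statement rather than from the Lean code:

```latex
% (Realizable) PAC learnability of a concept class C over domain X:
% there is a learner A and a sample bound m(1/eps, 1/delta) such that
% for every eps, delta in (0,1), every distribution D over X, and
% every target c in C, after m(1/eps, 1/delta) i.i.d. labeled samples
% (x, c(x)) with x ~ D, the output hypothesis h = A(S) satisfies
\[
  \Pr_{S \sim D^m}\Bigl[\,\Pr_{x \sim D}\bigl[h(x) \neq c(x)\bigr]
    \le \varepsilon\,\Bigr] \ge 1 - \delta .
\]
```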
With machine learning in particular it seems that the culture in that field is about moving quickly, setting up something that seems to work, publishing a conference paper and moving on to the next idea. The major libraries are all in Python, which is unsuitable for static analysis. It seems that rapid prototyping and exploration are more important than correctness in machine learning. I wonder if this work will be appreciated as valuable within that community, it seems like a hard sell.
Honestly, insistence on correctness may be seen as splitting hairs. Many statistical fallacies are quite common and it's difficult to get people to take them seriously, e.g. https://arxiv.org/pdf/1603.05766
"The work is only exploratory and used as a guide to see what opportunities are possibly worth investigating, so it doesn't have to be perfect"
I'm trying to understand the gap between what I'm trying to do and what others have done, and trying to figure out if I should be reusing their code, etc. (Insights / opinions welcome!)
What I'm trying to do: run `native_compute` on the entire test set, when the neural net and test set are sufficiently small (currently it takes 10 minutes to run a ~13k-parameter network on a 64^2-size test set).

It looks like MLCert proves expected(?) generalization bounds on the outputs of training procedures. The theorem (and proof???) seem to be independent of the actual weights?
It looks like there's code for doing bounds analysis on the actual parameters in NNCert/net.v, but I can't tell if/how it gets run
again not familiar with details, but if I remember correctly this project was partially developed to do machine learning verification (not sure to what extent it got applied): https://github.com/jtassarotti/coq-proba
I was also looking at the cited "Developing Bug-Free Machine Learning Systems With Formal Mathematics" by Selsam et al in Lean, and it looks like they handwave the difference between floats and reals and axiomatize matrix semantics (which is maybe fine, and was a thing I was considering doing as a first pass, though I'm interested in spending a bit more time trying to get a formalization that actually does floats correctly without assuming false)
do you already know about VCFloat? https://github.com/VeriNum/vcfloat
I think the slogan is that it's like Gappa, but completely inside Coq
MLCert isn't actively maintained but I would be happy to see it ported. My student Alexander Bagnall was the primary developer. I will try to loop him in (not sure if he's on Zulip yet).
@Patrick Nicodemus The people working in ML alignment/steerability/safety are interested in formal verification; the main question, I think, is whether formal verification is powerful enough (yet) to prove anything interesting at all about NNs.
there is this quite general (but abstract) result about NNs proved in Isabelle/HOL (Expressiveness of Deep Learning): https://link.springer.com/article/10.1007/s10817-018-9481-5
Not sure how relevant it is for practical NN verification.
@Karl Palmskog I haven't seen vcfloat, thanks! Right now what I have is an expression in terms of Coq's primitive floats. I've managed to prove it equal to the same expression in terms of Flocq's `binary_float`s. My current goal is to establish (a) bounds on `kv(r(x, y))` as a function of `x - y` (just by brute-force enumeration, at first), and (b) that whenever `kv(r(x, y))` is within particular bounds, then `f` is within certain other bounds. Can vcfloat help with these?
Sorry about the MLCert website going down. I don't work at the university anymore and didn't receive notice that the site/server were being pulled.
What's the relationship between vcfloat and coq-interval?
I've only used Interval, but from what I remember from Andrew Appel's explanation to me, Interval is about reasoning about reals directly, i.e., your goals will only mention real numbers (I think the underlying machinery uses Coq's primitive floats). In contrast, VCFloat and Gappa (https://gitlab.inria.fr/gappa/coq) are about establishing connections between floats from Flocq and reals.
VCFloat actually uses significant parts of Interval under the hood, specifically some parts that are not exposed to regular Interval users.
proving absence of nan/inf/neg_inf seems (from a short discussion with Andrew) like something VCFloat or Coq-Gappa should be able to do, but I can unfortunately only refer to the manual and papers here: https://verinum.org/vcfloat/
@Jason Gross : unfortunately none of my work on verification of numerical C code is published: it is all about verification of proprietary devices, and a correctness proof obviously discloses how the stuff works. But I am available to share experience.
Regarding VCFloat vs. gappa: they serve a similar purpose, namely proving the deviation between an infinite-precision and a float (or fixed-point) computation. I would say that gappa is substantially more powerful than VCFloat in the sense that it can handle much higher complexity in a reasonable time. It might be, though, that for rather simple goals VCFloat is faster, because it doesn't need the external process. I guess there are cases where gappa finds tighter bounds, and there might also be cases where VCFloat finds tighter bounds (gappa is sometimes 1 LSB off the bound I can prove manually, and the way VCFloat works makes me believe it could handle this in some cases).

Btw: you don't have to trust external tools when using gappa; gappa creates a Coq proof script. Gappa is a bit awkward to use, though, and has some strange restrictions. E.g., the bounds you give must be exactly representable in binary floats of the precision you specify, which means it is a bad idea to give bounds that are negative powers of 10.

I am using VCFloat mostly when I have small terms with explicit epsilons in them. For larger terms, say higher-order polynomial approximations, I tend to use gappa.
@Jason Gross : can you point me to the code the two above questions refer to?
For absence of nan/inf for functions with not too many arguments, I would use coq-interval to prove ranges in ℝ and then gappa to prove that the computation in floats doesn't deviate too much from the computation in ℝ. For proving deviations of approximations from non-elementary functions, coq-interval is only usable for univariate functions, since it can do Taylor or differential analysis in only one variable. For proving bounds for non-overflow questions, it can handle larger numbers of variables; it then bisects in these variables. This can work well even for quite a few variables (I think the max I had was 6) if you need almost no bisections to prove your goal, that is, if your bounds are rather loose. If your bounds are rather tight, the effort of bisection obviously grows exponentially with the number of variables.
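To make the bisection mode concrete, here is a toy example of my own (assuming a recent coq-interval; the goal and constants are illustrative, not from this thread). Plain interval arithmetic bounds `x * (1 - x)` on [0, 1] only by [0, 1], which is too weak, but bisecting on `x` closes the deliberately loose bound 3/8:

```coq
From Coq Require Import Reals.
From Interval Require Import Tactic.
Open Scope R_scope.

Goal forall x, 0 <= x <= 1 -> x * (1 - x) <= 3/8.
Proof.
  intros x Hx.
  (* plain [interval] fails: [0,1] * [0,1] = [0,1] is not <= 3/8 *)
  interval with (i_bisect x).
Qed.
```

The looser the bound, the fewer sub-boxes are needed; as Michael says, tight bounds make the bisection effort blow up.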
fair warning that VCFloat is broken on Interval 4.8.0 (also the VCFloat repo last I checked). This is due to changes in Interval internals in 4.8.0. The latest VCFloat released package has bounds to reflect this.
Just as a bit of context, the main goal of MLCert/NNCert was proving generalization bounds on learning procedures modeled as probabilistic programs. The bounds were statistical in nature, applying results from learning theory like Hoeffding's inequality. This meant that they depended on the size of the training set and the size of the parameter space of the models being learned, but not on specifics of the learning procedure itself. While there are interesting aspects of this approach, I'm not sure how well it scales to large networks (the Limitations section of the paper explains some of the concerns, such as reliance on iid assumptions and on large training sets relative to model sizes).
We do instantiate the theorems to neural networks (see the NNCert directory). But our method may not be totally suitable for Jason's purpose. In particular, we axiomatize some things (floats, vectors), use dyadic rationals in some other places, and compute training and test error only under extraction (i.e., not in Coq itself via something like `native_compute`). Also, as far as I can recall right now, we supported only ReLU units, not softmax.
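For readers outside learning theory: a bound that depends only on training-set size and parameter-space size, as Gordon describes, is the flavor of the standard finite-class Hoeffding/union-bound result. The following is a textbook paraphrase, not MLCert's exact theorem:

```latex
% Finite hypothesis class H, m i.i.d. training samples, 0/1 error:
% with probability at least 1 - \delta over the draw of the training
% set S, every h in H generalizes up to the same additive slack:
\[
  \forall h \in H:\quad
  \bigl|\mathrm{err}_{\mathcal{D}}(h) - \widehat{\mathrm{err}}_S(h)\bigr|
  \;\le\; \sqrt{\frac{\ln(2\lvert H\rvert/\delta)}{2m}}
\]
% The slack grows with ln|H| (the parameter space) and shrinks with m
% (the training set), independently of how h was produced.
```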
@Gordon Stewart I don't see any mention of a license in the MLCert repo except for the `dataset.py` file. If Jason sees potential for reuse of the project, would you consider licensing it under some open source license so the Coq community could benefit? We give some recommendations here.
Sure, I'd be happy to license it (probably under a permissive license like MIT). Not doing so was an oversight. Right now, the project has no CI and doesn't build with current Coq (as Jason noted), so there are other problems that would also need to be fixed to make it re-usable :-/
sounds good. I'd say the ball is in Jason's court w.r.t. whether he wants to spend time updating it, then (given what has been said about details like ReLU vs. softmax, etc.).
@Michael Soegtrop Are gappa / vcfloat reflective? It looks like gappa is not, and I can't imagine this going well on my example, given that it takes `cbv` 4s just to execute the function on some arbitrarily picked inputs, using primitive floats the entire time. (The VM takes only 0.2s, and `lazy` takes about 2s, but I can't imagine Ltac being less than a couple of orders of magnitude slower than reduction.)
The code for the first question is here and here. Any insight / advice / help is welcome! The code for the second question doesn't really exist yet; I figured I should first figure out how to do simple manipulations like turning `abs (x / y - z) < w` into `abs (x - z * y) < w * y` (valid over the reals when `y > 0`) before trying to split the more complicated operation.
Michael Soegtrop said:

For proving bounds for non-overflow questions, it can handle larger numbers of variables; it then bisects in these variables. This can work well even for quite a few variables (I think the max I had was 6) if you need almost no bisections to prove your goal, that is, if your bounds are rather loose.
Are these variables like context variables, or variables like "intermediate values used multiple times"? I have (depending on how you want to look at it) two `Fin.t 64`-valued variables; or 128 bool-valued variables; or 64 float-valued variables that I can probably manually craft ranges on. (I am one-hot-encoding two tokens between 0 and 64 into a 32-dimensional space.) But as far as intermediate values used multiple times, I expect I have at least a couple hundred, possibly as many as tens of thousands. (Intermediates used at least once number at least in the tens of thousands.)
When are bisections needed? If I only need bisection to increase precision, then I expect that I won't need any bisections whatsoever. My numbers never exceed a couple hundred.
@Karl Palmskog @Gordon Stewart It seems like MLCert doesn't overlap enough with my goals (proving things about concrete NNs in Coq, being able to efficiently evaluate transformers with softmax, etc, and make use of the native compiler) to make code reuse make sense. But I expect I'll at some point be interested in the training procedures as well, so I'm glad to have taken a look at MLCert!
Karl Palmskog said:
fair warning that VCFloat is broken on Interval 4.8.0 (also the VCFloat repo last I checked). This is due to changes in Interval internals in 4.8.0. The latest VCFloat released package has bounds to reflect this.
I've proposed a fix at https://github.com/VeriNum/vcfloat/pull/17
@Gordon Stewart Just to make sure, am I right that your NN representation doesn't have support for reusing computed intermediates / for let-binding intermediate values? (I think this is a must-have for transformer skip-connections.)
@Jason Gross Yes, that's right. We didn't support skip connections.
@Jason Gross : gappa is an early work of Guillaume's and is written in C++. The tactic creates a gappa input file and calls the gappa executable (a separate opam package), which creates a Coq proof term in text form. Note that gappa itself is a bit more flexible than the tactic, but as long as one sticks to the rules, I haven't yet found a reason to write gappa files manually. For VCFloat I can't tell, but I don't think it is reflective; as far as I know it mostly works by calling other reflective tactics like field to rearrange the terms to get the epsilon terms to the top. I will have a look at your cases later today. As I said, I use VCFloat mostly for more analytic situations, where its way of working can be convenient.
Regarding the variable count:
For coq-interval: In general (afaik), if a variable occurs only once in a term, nothing bad can happen in interval arithmetic, as it would in `x - x`, which is obviously 0, but which interval arithmetic won't find. In this sense (again afaik) coq-interval can handle only one variable occurring multiple times and requiring differential analysis to compute sufficiently tight bounds (as in `x - x`). For the remaining variables, the situation must be solvable by interval bisection, which wouldn't give you much in the `x - x` case (the error decreases only linearly with the bisection effort), but it can be quite effective.
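As a toy illustration of that dependency problem (my own example, same assumptions as the earlier snippet): plain interval evaluation bounds `x - x` on [0, 1] by [-1, 1], bisection never closes the gap (a sub-box [a, b] still yields [a - b, b - a]), but the Taylor-model mode cancels the two occurrences exactly:

```coq
From Coq Require Import Reals.
From Interval Require Import Tactic.
Open Scope R_scope.

Goal forall x, 0 <= x <= 1 -> x - x <= 0.
Proof.
  intros x Hx.
  (* [interval] and [interval with (i_bisect x)] both fail here:
     interval arithmetic does not see that the two x's are correlated. *)
  interval with (i_taylor x).
Qed.
```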
For gappa: it is used to compute the difference between terms with and without roundoff error. Roundoff errors are usually uncorrelated even between different usages of the same variable: say you compute `(a*x + b) * (c*x + d)` for different constants a, b, c, d and variable x; the rounding errors of the two float additions will be quite uncorrelated. I don't think it is generally feasible to exploit such dependencies, so I don't think gappa attempts to do this. So for gappa it shouldn't matter if variables occur multiple times. I guess what counts is the number of rounding operations you have in one term. Where the limit is here, I can't tell.
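For the record, the kind of goal the `gappa` tactic consumes looks like the following, adapted from the example in the Gappa manual (module paths and the `rounding_float` notation may differ slightly across gappa/Flocq versions). On these ranges the double-precision subtraction is exact, so the roundoff error is bounded by zero:

```coq
From Coq Require Import Reals.
From Flocq Require Import Core.
From Gappa Require Import Gappa_tactic.
Open Scope R_scope.

(* binary64, round to nearest, ties to even *)
Definition rnd := rounding_float rndNE 53 (-1074).

Goal forall a_ b_ a b : R,
  a = rnd a_ -> b = rnd b_ ->        (* a and b are representable *)
  52 / 16 <= a <= 53 / 16 ->
  22 / 16 <= b <= 30 / 16 ->
  0 <= rnd (a - b) - (a - b) <= 0.   (* the subtraction is exact *)
Proof.
  unfold rnd; intros; gappa.
Qed.
```

Note that the bounds are dyadic rationals, in line with Michael's earlier remark that gappa wants bounds exactly representable in binary floats.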
@Guillaume Melquiond : Jason is working on verification of neural-network floating-point code and is exploring the capabilities of the various proof tools to help with reasoning about floats. We discussed how many variables coq-interval (for proving non-overflow) and gappa (for proving roundoff error) can handle. As far as I understand Jason, he has two float variables, but they occur many times in the terms (at least a few hundred occurrences). Maybe you can review my above comments.
I don't think "two float variables" is accurate. I have two length-32 float vector variables, each of which can take on 64 distinct values, though I'm currently planning to treat this as 64 float variables with a bound that is the smallest interval containing all 64 possible (concrete, known) values.
(You could say that I have two "integer index" variables which just index into a float matrix)
Hm, it seems that vcfloat's reflective expr language doesn't support let bindings?
And it looks like it also doesn't support things like min, max, cast to integer, explicit tests for nan, etc?
It is not about floating-point neural networks as far as I'm aware, but https://github.com/IBM/FormalML has a number of formalizations of theorems in probability theory and machine learning. They formally prove the correctness of various RL algorithms (not sure if any use deep RL), and Dvoretzky's approximation theorem, which can be used to prove theorems about stochastic gradient descent. I'm happy to connect you with the authors.
@Michael Soegtrop @Guillaume Melquiond (and others) If you're interested in taking a look at the concrete goal I'm struggling to prove, I factored out a theorem here saying that my program instantiated on Flocq's `binary_float`s is equal to the same program instantiated on `R` up to rounding error. (This file and its dependencies should build fine, but there are some other files in the repo that may run out of stack space or be very slow.)
@Jason Gross : I haven't had time yet to look at it (I'm working on a Coq Platform release), but I will have a look in the next few days.
AITP is happening this week. Not much information online. Is anyone from the Coq side involved?
http://aitp-conference.org/2023/
I don't know if you mean affiliated with Coq/Inria itself or Coq users, but Talia Ringer is listed there and she uses Coq.
I see that Liao Zhang is giving a talk on inductive logic programming in interactive theorem proving. That could potentially be relevant to Coq by way of Elpi.
I personally think that it would be interesting to try and implement some basic concepts of inductive logic programming in Elpi and see if you can learn patterns in proofs that way. But I don't know very much about inductive logic programming and I don't have enough of a clear idea of what we would get out of this.
extended_abstract.pdf
I wrote an extended abstract for AITP 2023, but I didn't submit it, as I felt I didn't know enough about ILP to bring the idea to fruition: I didn't have a confident vision of what this would look like in practice, or a clear enough picture of the end goal.
Bas Spitters said:
AITP is happening this week. Not much information online. Is anyone from the Coq side involved?
http://aitp-conference.org/2023/
Were the talks recorded? Are the videos available to the public?
The slides are available on the website.