I watched the talk with some delay, so I could not comment or ask questions live; I will do it here instead.
First, I am impressed by the results and I wish we could use it in the near future maybe as an informative CI step.
@Karl Palmskog you seem to be the only co-author with a Zulip account; would you mind answering or bringing your co-authors here?
I can answer questions here, and I can tell the main machine learning co-author to get an account as well
he is also speaking at the Coq workshop, so he might as well
Karl Palmskog said:
I can answer questions here, and I can tell the main machine learning co-author to get an account as well
Sorry, I posted the message by accident; I still had one sentence to complete in the first bullet, which I have now fixed.
Cyril Cohen said:
First, I am impressed by the results and I wish we could use it in the near future maybe as an informative CI step.
- You mention that one problem you often have is that Roosterize picks a name that is too generic; my intuition tells me that by doing so, it is likely to conflict with already defined theorems.
Thank you for the kind words.
On the one hand, yes, a too generic suggestion may indicate that the lemma itself is very generic. On the other hand, generic suggestions are a pervasive problem in transduction/translation systems in NLP. It is hard to make the system use "background knowledge" in the right context. So it could also be an artifact of how we designed our systems that some suggestions will be generic, in particular when one is applying the tool on some completely "new"/"different" lemma (as compared with training data). In our appendix, we show that performance improves a lot for suggesting names for Infotheo when we do some extra training on Infotheo lemmas (we started with only "MathComp core" training).
Karl Palmskog said:
Cyril Cohen said:
First, I am impressed by the results and I wish we could use it in the near future maybe as an informative CI step.
- You mention that one problem you often have is that Roosterize picks a name that is too generic; my intuition tells me that by doing so, it is likely to conflict with already defined theorems.
Thank you for the kind words.
On the one hand, yes, a too generic suggestion may indicate that the lemma itself is very generic. On the other hand, generic suggestions are a pervasive problem in transduction/translation systems in NLP. It is hard to make the system use "background knowledge" in the right context. So it could also be an artifact of how we designed our systems that some suggestions will be generic, in particular when one is applying the tool on some completely "new"/"different" lemma (as compared with training data). In our appendix, we show that performance improves a lot for suggesting names for Infotheo when we do some extra training on Infotheo lemmas (we started with only "MathComp core" training).
Sorry, what I meant is that name conflicts may indicate that the guessed name is unlikely to be correct... (and it's also the point of my second bullet)
Cyril Cohen said:
- Actually, one of the main ingredients of my personal heuristic for deciding how to chop the tree is disambiguation: if I would get a name conflict by picking fewer symbols, then I know I chopped the tree too much. Do you think it would be possible to refine your chopping heuristic in this way, i.e. by successively extending (i.e. unchopping) the tree until there are no more conflicts with other names in the same file (and in the rest of the library)?
Chopping heuristics are right now implemented as simple Python scripts operating on kernel term sexps, so implementing another chopping heuristic like the one you suggest should not be hard in itself (I would expect at most a working day). The time-consuming part (both human time and machine time) is setting up the evaluation to see whether it actually improves performance (for some dataset) - unfortunately, machine learning is to a large extent a black box whose behavior is hard to predict. I think integrating information about name collisions is an interesting direction to explore in general, and I could see us looking at this in a potential journal version of the paper. However, it could turn out to be very time consuming (in terms of machine time) to "iterate" the chopping "online"; I would have to check the specifics with Pengyu.
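For illustration, here is a minimal Python sketch of the loop described above, assuming a nested-list representation of the kernel-term sexp and a hypothetical suggest_name function wrapping the trained model; this is not the actual Roosterize chopping script:

# Illustrative sketch only, not the actual Roosterize chopping script.
# Assumes a nested-list representation of the kernel-term sexp and a
# hypothetical suggest_name(chopped_sexp) function wrapping the trained model.

def chop(sexp, depth):
    """Keep the sexp tree only down to `depth`, replacing deeper subtrees
    with a placeholder token (one simple way to "chop")."""
    if not isinstance(sexp, list):
        return sexp
    if depth == 0:
        return "_"
    return [chop(child, depth - 1) for child in sexp]

def suggest_until_unambiguous(sexp, existing_names, suggest_name, max_depth=12):
    """Successively "unchop" (increase the kept depth) until the suggested
    name no longer collides with names already defined in the file/library."""
    for depth in range(2, max_depth + 1):
        name = suggest_name(chop(sexp, depth))
        if name not in existing_names:
            return name, depth
    # Fall back to the deepest suggestion even if it still collides.
    return suggest_name(chop(sexp, max_depth)), max_depth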
Cyril Cohen said:
- As such, do you think it could be possible for the tool to self-assess its confidence, i.e., give a measure of how good the generated names are? I guess that when a theorem is extremely close to another one from the library, and the generated names reflect that, the confidence must be way higher than for the generated names of novel statements.
My understanding is that confidences/probabilities are already present in the "backend" of the tool, and could be displayed for any individual suggestion or for each entry in a list of top-X suggestions. My intuition is the same as yours, i.e., a name suggestion for a "new" theorem close to an existing theorem will have high confidence, but again there is a machine learning black box that could obfuscate this. For example, the so-called copy mechanism we use takes data directly from the input rather than "querying" background knowledge.
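As a rough sketch of how such confidences could be surfaced (the candidate names and scores below are made up, and the normalization is just one possible way to turn beam log-probabilities into a top-X display, not the actual Roosterize backend):

import math

# Illustrative sketch, not actual Roosterize output: turning beam-search
# log-probabilities (as produced by typical seq2seq backends) into
# per-suggestion confidences for a top-X list.

def top_x_with_confidence(candidates, x=3):
    """candidates: list of (name, log_prob) pairs from the decoder's beam."""
    scores = [(name, math.exp(lp)) for name, lp in candidates]
    total = sum(s for _, s in scores) or 1.0
    ranked = sorted(((name, s / total) for name, s in scores),
                    key=lambda p: p[1], reverse=True)
    return ranked[:x]

# Hypothetical candidates and scores, purely for illustration:
for name, conf in top_x_with_confidence(
        [("addnC", -0.3), ("addn_comm", -1.2), ("lemma_add", -2.5)]):
    print(f"{name}: {conf:.2f}")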
Cyril Cohen said:
- Maybe some names with a confidence rate below some threshold could be automatically rejected if there is no explicit comment from the developer to deactivate the failure and provide an explanation.
Rejecting (or simply not displaying) name suggestions which have a confidence level below a certain threshold should be very easy to set up, and we could definitely modify our tool with a command-line option to either completely suppress these low-confidence suggestions or have them come with a disclaimer. One caveat is that a good threshold is likely to be dataset-specific. So perhaps for models trained on only MathComp 1.10.0, a threshold of <= 0.6 may be most practical, etc.
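To make the idea concrete, a sketch of what such an option could look like; the flag names and the demo suggestions are invented for illustration and are not existing Roosterize command-line options:

import argparse

# Illustrative sketch; the --min-confidence and --mark-low flags are
# hypothetical, not existing Roosterize command-line options.

def filter_suggestions(suggestions, threshold, mark_low=False):
    """suggestions: list of (name, confidence) pairs. Either drop
    low-confidence suggestions or keep them with a disclaimer."""
    out = []
    for name, conf in suggestions:
        if conf >= threshold:
            out.append((name, conf, ""))
        elif mark_low:
            out.append((name, conf, "(low confidence)"))
    return out

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-confidence", type=float, default=0.6)
    parser.add_argument("--mark-low", action="store_true")
    args = parser.parse_args()
    demo = [("mulnC", 0.91), ("lemma1", 0.22)]  # made-up suggestions
    for name, conf, note in filter_suggestions(demo, args.min_confidence,
                                               args.mark_low):
        print(f"{name} ({conf:.2f}) {note}".rstrip())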
Cyril Cohen said:
- Some theorems are named after mathematicians (e.g. Cayley, Hamilton, Lagrange) and the user does this on purpose. Do you think one could tag such lemmas (cf. above) and let the whole training ignore them?
It's straightforward to curate the training set in this way by excluding certain lemma names that the model will not learn much from (because they are "special" in some way, rather than "structural"). If you would be willing to curate the inclusion/exclusion of lemmas in a certain MathComp release for use in training Roosterize, this is definitely something we could collaborate on. However, it's always hard to say beforehand how much of an effect this kind of curation will have (there are so many probabilities/vector operations involved). At best, it could bring a noteworthy BLEU increase, and at worst it could make no difference at all.
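As a rough illustration of what such curation could look like (the JSON-lines format, file names, and exclusion list below are assumptions for illustration, not the actual Roosterize data layout):

import json

# Illustrative sketch of training-set curation: drop lemmas whose names are
# "special" (e.g. named after mathematicians) before training. The JSON-lines
# format, file names, and exclusion list are assumptions for illustration.

EXCLUDE = {"Cayley_Hamilton", "Frobenius_aut", "Wilson"}  # hypothetical curated list

def curate(pairs_path, out_path, exclude=EXCLUDE):
    with open(pairs_path) as src, open(out_path, "w") as dst:
        for line in src:
            example = json.loads(line)  # e.g. {"name": ..., "sexp": ...}
            if example["name"] not in exclude:
                dst.write(line)

# curate("mathcomp_train.jsonl", "mathcomp_train_curated.jsonl")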
Karl Palmskog said:
Cyril Cohen said:
- Maybe some names with a confidence rate below some threshold could be automatically rejected if there is no explicit comment from the developer to deactivate the failure and provide an explanation.
Rejecting (or simply not displaying) name suggestions which have a confidence level below a certain threshold should be very easy to set up, and we could definitely modify our tool with a command-line option to either completely suppress these low-confidence suggestions or have them come with a disclaimer. One caveat is that a good threshold is likely to be dataset-specific. So perhaps for models trained on only MathComp 1.10.0, a threshold of <= 0.6 may be most practical, etc.
Actually, I was thinking of it for CI purposes, to evaluate PRs, and thus rejecting the names chosen by the user according to their distance to high-confidence automatically generated names.
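For illustration, a minimal sketch of such a CI check, flagging a user-chosen name when it is far (by edit distance) from every high-confidence suggestion; the thresholds and the source of suggestions are assumptions:

# Illustrative sketch of a CI-style check: flag a user-chosen lemma name when
# it is far (by edit distance) from every high-confidence suggestion.
# The thresholds and the suggestion source are assumptions for illustration.

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def check_name(user_name, suggestions, min_conf=0.7, max_dist=4):
    """suggestions: list of (name, confidence) pairs. Return (ok, message)."""
    confident = [n for n, c in suggestions if c >= min_conf]
    if not confident:
        return True, "no high-confidence suggestion; skipping check"
    best = min(confident, key=lambda n: edit_distance(user_name, n))
    dist = edit_distance(user_name, best)
    if dist > max_dist:
        return False, f"'{user_name}' is far from suggested '{best}' (distance {dist})"
    return True, "ok"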
Actually, I was thinking of annotations in the comments, something like:
#[roosterize(reference)]
Lemma naming_that_serves_as_a_reference : ...
#[roosterize(disruptive)]
Lemma custom_naming_that_does_not_follow_conventions_voluntarily : ...
I see, so you're considering annotating lemmas in MathComp itself with naming style/quality. This is definitely something that we could use (the annotations would show up in our serialized data and we could use them to automatically exclude/include theorems in the training phase). However, maybe there should be a general "naming annotation" language for Coq that could also be used by mechanical linters? One might also consider using (some variant of) this for regular definitions.
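As a sketch of how such annotations could be consumed, assuming the hypothetical #[roosterize(...)] attribute syntax from above (not an existing Coq feature) and a simple regex-based scrape of .v files:

import re
from pathlib import Path

# Illustrative sketch: scrape the hypothetical #[roosterize(...)] annotations
# suggested above from .v files, producing tag -> lemma-name lists that a
# training pipeline or a linter could consume. Not an existing Coq feature.

ATTR_RE = re.compile(r"#\[roosterize\((\w+)\)\]\s*Lemma\s+(\w+)")

def collect_annotations(root):
    tags = {}
    for vfile in Path(root).rglob("*.v"):
        for tag, lemma in ATTR_RE.findall(vfile.read_text()):
            tags.setdefault(tag, []).append(lemma)
    return tags

# e.g. collect_annotations("mathcomp/ssreflect") might yield
# {"reference": [...], "disruptive": [...]}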
Karl Palmskog said:
I see, so you're considering annotating lemmas in MathComp itself with naming style/quality. This is definitely something that we could use (the annotations would show up in our serialized data and we could use them to automatically exclude/include theorems in the training phase). However, maybe there should be a general "naming annotation" language for Coq that could also be used by mechanical linters? One might also consider using (some variant of) this for regular definitions.
Yes I think naming is so important that annotating names could constitute a big improvement!
for others who are seeing/following this topic, here is the full talk for the Roosterize paper on YouTube for background: https://youtu.be/Ysd-dcizw1A?list=PLl1dj5prwUJz6KV4Q1EPG9Gx0bjiKa_ey&t=3462