{1577} revision 5 modified: 11-11-2023 20:08 gmt

Sketch - Program synthesis by sketching

  • Sketch consists of a C-like language with the additional primitives: integer-valued holes ??, reorder, generator (macros where the holes can vary), assert, repeat (also a variable-hole parameterized macro), regular-expression generators for selecting substitute programming primitives, and atomics for concurrency-focused synthesis (fork and compare-and-swap).
    • The other common programming primitives are there, too - if, else, while, procedures, arrays.
      • Loops, arrays, and recursion are limited to a fixed size to constrain the search space.
  • Sketch is inductive - working solutions are derived from the skeleton provided by the programmer + counterexamples provided by a verification function.
    • This is compared to previous work, which focused on deductive synthesis: programs are derived from a constructive proof of the satisfiabiltiy of a problem.
      • This 'proved' challenging in practice -- programmers are much better versed in construction rather than abstract mathematical derivation.
    • "By giving up the deductive approach, we lose correctness by construction, and we are forced to rely on an external validation procedure."
  • So they use CEGIS: CounterExample Guided Inductive Synthesis.
    • For most sketches, only a small set of counterexamples are needed to fully constrain the solution.
    • The counterexamples are fed to the inductive synthesizer so that the next candidate will be guaranteed to work correctly for all.
    • Each counterexample reduces the search space by (typically) 2^15 -- much information for each -- so only a handful of validation calls are usually needed, even when the solution space is very large.
      • E.g. 40 iterations when the candidate space is 2^600
        • Illuminating that computers can so readily handle such large spaces with prefect fidelity!
    • Counterexamples are much easier to find than an abstact model (as in the decuctive approach)
  • To do the synthesis, the sketch is expanded to a 'control': a set of variables, holes and calling contexts.
  • This control includes parameterized values that track the value of each variable, along with operators that define their relation to other variables, holes, and calling contexts.
    • The denotation function translates any expression to a function from a state to a parameterized value.
  • The program sketch is parsed (partially evaluated) to iteratively update σ,Φ\sigma, \Phi , where σ\sigma is a mapping from the set of variable names to parameterized values, and Φ\Phi is the set of control functions (holes and calling contexts).
    • With control flow like while and if, the set of parameterized values σ\sigma (with calling context) and control functions Φ\Phi are unrolled (to a limit) and unioned for the two branches (if & else).
    • Similarly for procedures and generators -- iterative state σ,Φ\sigma, \Phi is updated with additional calling context τ\tau (and bounded recursion / bounded evaluation! - kk is a hyperparameter)
    • Assertions and "ifs" are modeled as intersections on the iterative state.
      • In this way, semantics can model the behavior of all candidate solutions simultaneously -- e.g. Φ\Phi is a set of all possible assignments to the holes.
  • (Chapter 5) In practice: control is represented in an intermediate language, itself represented as a DAG.
    • Several optimizations are applied to make the DAG simpler, e.g. structural hashing to remove common sub-trees, and constant propagation (which may be redundant with internal SAT rules).
    • There seems to be a tension between Chapter 3 (abstract denotational semantics) and Chapter 5 (concrete DAG manipulations); how exactly they relate could have been spelled out better.
      • I.e.: how do the context, parameterized values, and control functions directly lead to the DAG??
    • I infer: the DAG seems to be a representation of the data & code flow graph that makes explicit the relationships between variables, holes, and control structures. (this is never mentioned, but seems to be the meaning of 'denotation'.)
  • The denotation function (compilation or partial evaluation) only needs to be computed once -- subsequent CEGIS loop evaluations only need to update the SAT problem from the counter-examples.
  • Integers are represented via mutually-exclusive & ordered guard bits: if a bit is 1, then the hole or variable evaluates to that integer. Aka one-hot encoding. This permits translation from DAG to CNF.
  • CEGIS starts by finding a solution = integer assignment of holes & permutations that make the SAT problem evaluate to True.
  • Based on this control vector, it then looks for a solution (value of the input variables) to the SAT problem that evaluates as False.
    • Clever re-use of the SAT solver & representation!
  • This counter-example is used to improve the SAT problem = add statements (system must evaluate to true with all input-output counter-examples) = remove hypotheses from Φ\Phi .
  • Did not delve much into the concurrency-related chapters (7-9) or the Stencil scientific computing chapters (10-11).

The essential algorithm, in words:

Take the sketch, expand it to a set of parameterized variables, holes, and calling contexts. Convert these to a DAG aka (?) data-code flow graph w/ dependencies. Try to simplify the DAG, one-hot encode integers, and convert to either a conjunctive-normal-form (CNF) SAT problem for MiniSat, or to a boolean circuit for the ABC solver.

Apply MiniSat or ABC to the problem to select a set of control values = values for the holes & permutations that satisfy the boolean constraints. Using this solution, use the SAT solver to find a input variable configuration that does not satisfy the problem. This serves as a counter-example. Run this through the validator function (oracle) to see what it does; use the counter-example and (inputs and outputs) to add clauses to the SAT problem. Run several times until either no counter-examples can be found or the problem is `unsat`.

Though the thesis describes a system that was academic & relatively small back in 2008, Sketch has enjoyed continuous development, and remains used. I find the work that went into it to be remarkable and impressive -- even with incremental improvements, you need accurate expansion of the language & manipulations to show proof-of-principle. Left wondering what limits its application to even larger problems -- need for a higher-level loop that further subdivides / factorizes the problem, or DFS for filling out elements of the sketch?

Interesting links discovered in while reading the dissertation:

  • Lock-free queues using compare_exchange_weak in C++ (useful!)
  • Karatsuba algorithm for multiplying large n-digit numbers with O(n 1.58)O(n^{1.58}) complexity. (Trick: divide and conquer & use algebra to re-use intermediate results via addition and subtraction)