*This is part one of a two part series on automata learning and satisfiability modulo theories. Part two will appear soon.*

Deterministic finite automata (DFAs) are useful in a variety of applications. However, the problem of learning a DFA of minimal size from positive (accepted) and negative (rejected) strings can be very hard. In fact, it is the optimization variant of the problem of finding a consistent DFA of a fixed size, which has been shown to be NP-complete. In 2010, Marijn Heule and Sicco Verwer presented an algorithm that encodes the problem of learning a DFA from labeled strings as a satisfiability (SAT) problem. Their algorithm has since won the StaMinA competition, and has led to the creation of the dfasat tool (for which Chris has created an exellent tutorial).

In this post, I present an encoding that takes a satisfiability *modulo theories* (SMT) perspective. This encoding is faster than the one used in dfasat, and benefits from the continuous efforts by fellow researchers on making SMT solvers more powerful. Moreover, I find it more natural, because it makes a distinction between the logic that is required to solve the problem, and the logic imposed by the background theories.

## Automata learning and SAT

A long, long time ago François Coste and Jacques Nicolas have shown that the problem of learning a DFA from labeled data can be encoded as a graph coloring problem. The intuition is as follows. First, a tree-shaped DFA is constructed that accepts exactly the positive examples and rejects exactly the negative ones. Each state in this DFA is represented by a vertex in a conflict graph. Two vertices in the graph are connected by an edge if one vertex represents an accepting state and the other represents a rejecting state. Now, the problem at hand is to color this graph, with the additional constraint that for states that are represented by vertices of the same color, their parents have to be represented by vertices of the same color as well. For such a coloring, a minimal DFA can be constructed in which each state is represented by a different color.

In their paper and tool, Heule and Verwer encode this graph coloring problem in propositional logic. *Satisfiability*, or SAT, is the problem of deciding if there exists an assignment to a propositional logic formula that makes it true. To prove that the minimal size of a DFA is *n*, Heule and Verwer use an iterative procedure to determine that an encoding for *n* colors is satisfiable, but an encoding for *n − 1* colors is unsatisfiable.

## Introduction to SMT

For many applications, encoding the problems into propositional logic (i.e. SAT) is not the right choice. Frequently, a better alternative is to express the problems in a richer logic. This is what *satisfiability modulo theories* (SMT) is about. It is the decision problem of deciding the satisfiability of a formula with respect to one or more background theories expressed in first-order logic, that is, if there exists a SAT assignment consistent with these theories.

First-order logic is about *formulae*, *atoms*, *terms* and *variables*, and a *theory* that defines a set of formation rules for these things. We are concerned with the theory of *equality and uninterpreted functions* (EUF), for which the formation rules are as follows:

- A
*term**t*is inductively defined as a variable, or as a function*f(t1, …, tn)*over terms*t1, …, tn*. - An
*atom*can be true or false, and is either an equality (=) or a predicate (in our case <, >, ≤ or ≥) for two terms. - A
*formula*is inductively defined as an atom or its negation, or as a disjunction (∨) or conjunction (∧) of formulae.

So, *formulae are clauses constructed over atoms, which are predicates over terms that are constructed over variables and functions.* Mind you that we are talking about a function in the mathematical sense here. This means that it is *complete* and *functionally consistent*. Apart from that, EUF is not concerned with the semantics of a function.

Given a formula, an SMT solver answers the following question:

Is there an assignment to the variables of this formula that makes the formula true?

If the answer to this question is *yes*, then the formula is *satisfiable*. Otherwise, it is *unsatisfiable*.

Still a bit unclear? Take a look at the following example.

*f(x) ≠ x* ∧ *f(f(x)) = x* ∧ *f(f(f(x))) = x*

Let’s start with the leftmost atom; *f(x)* is not equal to *x* but to something else, let’s say *f(x) = a*. If we plug this in the first term of the second atom, we get *f(a) = x* because *f(f(x)) = f(a)*. So far, so good. If try to plug this information in the first term of the last atom, we get *f(x) = x *because *f(f(f(x))) = f(f(a)) = f(x)*. This is a contradiction! After all, we know from the first term that *f(x) ≠ x*. Therefore, this formula is *unsatisfiable*; there is no value for *x* which makes the formula true.

Let’s take a look at a strikingly similar example.

*f(x) = y* ∧ *f(f(x)) = x* ∧ *x ≠ y*

If we plug the information of the first atom in the second we get *f(y) = x*, because *f(x) = y* (first atom). Now, the third atom tells us that *x *and *y *are not equal, but this is totally fine! They do not have to be equal (or inequal) as far as the rest of the formula is concerned. We can therefore assign any two (different!) values to *x *and *y *(say, 0 and 1), and the formula is true (remember that EUF is not concerned with the semantics of a function). Therefore, it is *satisfiable*.

So, an SMT solver will tell us whether or not a given formula is satisfiable, and if it is it will give us an assignment to the variables that makes the formula true. Such an assignment is called a *model* for the formula.

## An automaton as a pair of functions

Now that we know how to use variables and functions in formulae, let us look at automata. Consider the following example.

We can describe this DFA using variables (states and symbols), and functions (transitions and output)! Unsurprisingly, this is exactly how we formally describe a DFA. It is a tuple (Σ, Q, q0, δ, λ) where:

- Σ is an alphabet of symbols,
- Q is a set of states,
- q0 is the start state,
- δ is a transition function from states and symbols to states, and
- λ is an output function for states.

This allows us to construct a formula that describes (a part of) this DFA. For the constraints described in the image, for example, we would say:

q0 = s1 ∧ δ(s1, 0) = s2 ∧ δ(s1, 1) = s4 ∧ δ(s2, 0) = s3 ∧ λ(s1) = 0 ∧ λ(s2) = 1 ∧ λ(s3) = 0

## Automata learning and SMT

Let’s recap the story so far.

- Our goal is to find a minimal DFA for a given set of labeled strings.
- SMT is the problem of deciding if there is a a valid assignment (model) for a logic formula.

Now, the key insight is the following:

We make a formula from a set of labeled strings and a maximum number of states. If and only if the formula is satisfiable a DFA with this many states exists, and the solver gives us a description of this DFA in its model!

Let’s say that our states are integers, and that the start state is 0. The first thing we want to make sure is that the DFA we will obtain from the solver’s model is consistent with our labeled strings. Let’s say we have `abaa`

as an accepted string, then we would add the following to our formula:

λ( δ( δ( δ( δ(0, a), b), a), a) ) = 1

Say we have `bb`

as a rejected string, then we add:

λ( δ( δ(0, b), b) ) = 0

At this point our formula is a conjunction (∧) of atoms like these, one for each labeled string. Recall our example formula from before and take a moment to convince yourself that this ensures that our DFA is consistent with our data.

If we would ask the SMT solver to come up with a model at this point, it would give us a description of the transition and output functions of a consistent DFA. However, we want to find the smallest DFA, so we have to restrict the number of states. Consider the states to be integers from* 0 to n (exclusive)*, then

∀ q ∈ [0, n) a ∈ Σ δ(q, a) ≥ 0 ∧ δ(q, a) < n

ensures that our DFA has at most *n *states (we can easily quantify over states and symbols because they are both finite domains).

Now, what remains to be done is to

Minimize the value of n for which the formula is satisfiable!

So we first try adding *n = 1*, then *n = 2*, then *n = 3*…

That’s it. We have just encoded the problem of learning a minimal consistent DFA from labeled strings in SMT!

*Curious about how this helps us in modelling real-world systems? Part two of this series will be online soon. In this hands-on session we will see how we can learn models of bank cards (yes, really). Want to get a head start? Check out the Python implementation on top of Z3!*