I prepared a notebook that installs all dependencies, wraps the resulting binaries in Python functions (view on GitHub, view on Google Colab), and provides some short usage examples using the StaMinA competition data. Due to recent changes in the Boost.Python library, it is not yet possible to compile the Python package (as described in this paper).

If you run into any problems with flexfringe on Colab, contact me.

Deterministic finite automata (DFAs) are useful in a variety of applications. However, the problem of learning a DFA of minimal size from positive (accepted) and negative (rejected) strings can be very hard. In fact, it is the optimization variant of the problem of finding a consistent DFA of a fixed size, which has been shown to be NP-complete. In 2010, Marijn Heule and Sicco Verwer presented an algorithm that encodes the problem of learning a DFA from labeled strings as a satisfiability (SAT) problem. Their algorithm has since won the StaMinA competition, and has led to the creation of the dfasat tool (for which Chris has created an excellent tutorial).

In this post, I present an encoding that takes a satisfiability *modulo theories* (SMT) perspective. This encoding is faster than the one used in dfasat, and benefits from the continuous efforts by fellow researchers on making SMT solvers more powerful. Moreover, I find it more natural, because it makes a distinction between the logic that is required to solve the problem, and the logic imposed by the background theories.

A long, long time ago, François Coste and Jacques Nicolas showed that the problem of learning a DFA from labeled data can be encoded as a graph coloring problem. The intuition is as follows. First, a tree-shaped DFA is constructed that accepts exactly the positive examples and rejects exactly the negative ones. Each state in this DFA is represented by a vertex in a conflict graph. Two vertices in the graph are connected by an edge if one vertex represents an accepting state and the other represents a rejecting state. Now, the problem at hand is to color this graph, with the additional constraint that if two states are represented by vertices of the same color, their parents have to be represented by vertices of the same color as well. From such a coloring, a minimal DFA can be constructed in which each state is represented by a different color.
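To make this intuition concrete, here is a small pure-Python sketch (the representation and the function names are my own, not from the paper) that builds the tree-shaped DFA as a set of string prefixes and derives the conflict edges:

```python
def build_apta(positive, negative):
    """Build the tree-shaped DFA (a prefix tree acceptor): every state is a
    string prefix; `label` marks accepting (1) and rejecting (0) states."""
    states, label = {""}, {}
    for word, lab in [(w, 1) for w in positive] + [(w, 0) for w in negative]:
        for i in range(1, len(word) + 1):
            states.add(word[:i])
        label[word] = lab
    return states, label

def conflict_edges(states, label):
    """Two vertices are connected iff one represents an accepting state and
    the other a rejecting state; such states may never share a color."""
    return {(p, q) for p in states for q in states
            if label.get(p) == 1 and label.get(q) == 0}

states, label = build_apta(positive={"a", "ab"}, negative={"b"})
edges = conflict_edges(states, label)   # {("a", "b"), ("ab", "b")}
```

Any valid coloring of this graph that also respects the parent constraint yields a merged, smaller DFA.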

In their paper and tool, Heule and Verwer encode this graph coloring problem in propositional logic. *Satisfiability*, or SAT, is the problem of deciding if there exists an assignment to a propositional logic formula that makes it true. To prove that the minimal size of a DFA is *n*, Heule and Verwer use an iterative procedure to determine that an encoding for *n* colors is satisfiable, but an encoding for *n − 1* colors is unsatisfiable.

For many applications, encoding the problems into propositional logic (i.e. SAT) is not the right choice. Frequently, a better alternative is to express the problems in a richer logic. This is what *satisfiability modulo theories* (SMT) is about. It is the problem of deciding the satisfiability of a formula with respect to one or more background theories expressed in first-order logic; that is, whether there exists a satisfying assignment consistent with these theories.

First-order logic is about *formulae*, *atoms*, *terms* and *variables*, and a *theory* that defines a set of formation rules for these things. We are concerned with the theory of *equality and uninterpreted functions* (EUF), for which the formation rules are as follows:

- A *term* *t* is inductively defined as a variable, or as a function *f(t1, …, tn)* over terms *t1, …, tn*.
- An *atom* can be true or false, and is either an equality (=) or a predicate (in our case <, >, ≤ or ≥) over two terms.
- A *formula* is inductively defined as an atom or its negation, or as a disjunction (∨) or conjunction (∧) of formulae.

So, *formulae are clauses constructed over atoms, which are predicates over terms that are constructed over variables and functions.* Note that we are talking about a function in the mathematical sense here. This means that it is *complete* and *functionally consistent*. Apart from that, EUF is not concerned with the semantics of a function.

Given a formula, an SMT solver answers the following question:

Is there an assignment to the variables of this formula that makes the formula true?

If the answer to this question is *yes*, then the formula is *satisfiable*. Otherwise, it is *unsatisfiable*.

Still a bit unclear? Take a look at the following example.

*f(x) ≠ x* ∧ *f(f(x)) = x* ∧ *f(f(f(x))) = x*

Let’s start with the leftmost atom; *f(x)* is not equal to *x* but to something else, say *f(x) = a*. If we plug this into the first term of the second atom, we get *f(a) = x*, because *f(f(x)) = f(a)*. So far, so good. If we try to plug this information into the first term of the last atom, we get *f(x) = x*, because *f(f(f(x))) = f(f(a)) = f(x)*. This is a contradiction! After all, we know from the first atom that *f(x) ≠ x*. Therefore, this formula is *unsatisfiable*; there is no value for *x* that makes the formula true.

Let’s take a look at a strikingly similar example.

*f(x) = y* ∧ *f(f(x)) = x* ∧ *x ≠ y*

If we plug the information of the first atom into the second, we get *f(y) = x*, because *f(x) = y* (first atom). Now, the third atom tells us that *x* and *y* are not equal, but this is totally fine! They do not have to be equal (or unequal) as far as the rest of the formula is concerned. We can therefore assign any two (different!) values to *x* and *y* (say, 0 and 1), and the formula is true (remember that EUF is not concerned with the semantics of a function). Therefore, it is *satisfiable*.

So, an SMT solver will tell us whether or not a given formula is satisfiable, and if it is it will give us an assignment to the variables that makes the formula true. Such an assignment is called a *model* for the formula.
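To see this mechanically, here is a brute-force toy stand-in for an SMT solver in Python. It exploits the fact that our quantifier-free EUF examples only mention a handful of terms, so searching all interpretations of *f*, *x* and *y* over a small finite domain suffices to illustrate the two verdicts (a real solver uses congruence closure, not enumeration):

```python
from itertools import product

def find_model(formula, domain_size=4):
    """Search every interpretation of f: D -> D and every x, y in D.
    Illustration only: enumeration over a fixed finite domain is not a
    general EUF decision procedure."""
    D = range(domain_size)
    for table in product(D, repeat=domain_size):   # all functions f: D -> D
        f = lambda v, t=table: t[v]
        for x, y in product(D, D):
            if formula(f, x, y):
                return {"f": table, "x": x, "y": y}
    return None

# f(x) ≠ x ∧ f(f(x)) = x ∧ f(f(f(x))) = x  —  no model exists
assert find_model(lambda f, x, y: f(x) != x and f(f(x)) == x
                  and f(f(f(x))) == x) is None
# f(x) = y ∧ f(f(x)) = x ∧ x ≠ y  —  satisfiable, e.g. f swaps x and y
assert find_model(lambda f, x, y: f(x) == y and f(f(x)) == x
                  and x != y) is not None
```

The returned dictionary plays the role of the model: a concrete function table and concrete values for the variables.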

Now that we know how to use variables and functions in formulae, let us look at automata. Consider the following example.

We can describe this DFA using variables (states and symbols) and functions (transitions and output)! Unsurprisingly, this is exactly how we formally describe a DFA. It is a tuple (Σ, Q, q0, δ, λ) where:

- Σ is an alphabet of symbols,
- Q is a set of states,
- q0 is the start state,
- δ is a transition function from states and symbols to states, and
- λ is an output function for states.

This allows us to construct a formula that describes (a part of) this DFA. For the constraints described in the image, for example, we would say:

q0 = s1 ∧ δ(s1, 0) = s2 ∧ δ(s1, 1) = s4 ∧ δ(s2, 0) = s3 ∧ λ(s1) = 0 ∧ λ(s2) = 1 ∧ λ(s3) = 0
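As a sanity check, we can spell out the same constraints as plain Python mappings and evaluate them (the state and symbol names are simply those from the example above):

```python
# The constraints from the formula above, as concrete mappings.
q0 = "s1"
delta = {("s1", 0): "s2", ("s1", 1): "s4", ("s2", 0): "s3"}
lam = {"s1": 0, "s2": 1, "s3": 0}

assert lam[delta[(q0, 0)]] == 1               # λ(δ(q0, 0)) = λ(s2) = 1
assert lam[delta[(delta[(q0, 0)], 0)]] == 0   # λ(δ(δ(q0, 0), 0)) = λ(s3) = 0
```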

Let’s recap the story so far.

- Our goal is to find a minimal DFA for a given set of labeled strings.
- SMT is the problem of deciding if there is a valid assignment (model) for a logic formula.

Now, the key insight is the following:

We make a formula from a set of labeled strings and a maximum number of states. If and only if the formula is satisfiable, a DFA with this many states exists, and the solver gives us a description of this DFA in its model!

Let’s say that our states are integers, and that the start state is 0. The first thing we want to make sure of is that the DFA we will obtain from the solver’s model is consistent with our labeled strings. Say we have `abaa` as an accepted string; then we would add the following to our formula:

λ( δ( δ( δ( δ(0, a), b), a), a) ) = 1

Say we have `bb` as a rejected string; then we add:

λ( δ( δ(0, b), b) ) = 0

At this point our formula is a conjunction (∧) of atoms like these, one for each labeled string. Recall our example formula from before and take a moment to convince yourself that this ensures that our DFA is consistent with our data.
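In code, checking that a candidate DFA (such as the solver would return) is consistent with the data is just a fold over δ followed by a lookup in λ. A sketch, with a hand-made 3-state DFA of my own that accepts `abaa` and rejects `bb`:

```python
def run(delta, lam, word, q0=0):
    """Evaluate λ(δ(…δ(q0, w1)…, wn)) for a word w1…wn."""
    q = q0
    for symbol in word:
        q = delta[(q, symbol)]
    return lam[q]

# A DFA that rejects exactly the strings containing "bb":
# state 0 = no trailing b, 1 = one trailing b, 2 = seen "bb" (sink).
delta = {(0, "a"): 0, (0, "b"): 1,
         (1, "a"): 0, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}
lam = {0: 1, 1: 1, 2: 0}

assert run(delta, lam, "abaa") == 1   # accepted
assert run(delta, lam, "bb") == 0     # rejected
```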

If we asked the SMT solver to come up with a model at this point, it would give us a description of the transition and output functions of a consistent DFA. However, we want to find the smallest DFA, so we have to restrict the number of states. Consider the states to be integers from *0* to *n* (exclusive); then

∀ q ∈ [0, n), ∀ a ∈ Σ: δ(q, a) ≥ 0 ∧ δ(q, a) < n

ensures that our DFA has at most *n* states (we can easily quantify over states and symbols because they are both finite domains).

Now, what remains to be done is to

Minimize the value of n for which the formula is satisfiable!

So we first try *n = 1*, then *n = 2*, then *n = 3*…
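This iterative loop can be sketched in pure Python by replacing the SMT solver with naive enumeration of all DFAs with n states (hopelessly slow beyond toy sizes, but it shows exactly what the solver decides for us):

```python
from itertools import product

def run(delta, lam, word, q0=0):
    """Evaluate λ(δ(…δ(q0, w1)…, wn))."""
    q = q0
    for symbol in word:
        q = delta[(q, symbol)]
    return lam[q]

def minimal_dfa(accepted, rejected, alphabet=("a", "b"), max_n=4):
    """Try n = 1, 2, 3, … and return the first (hence minimal) n together
    with a consistent DFA. An SMT solver performs this satisfiability
    check symbolically instead of by brute-force enumeration."""
    for n in range(1, max_n + 1):
        keys = [(q, a) for q in range(n) for a in alphabet]
        for targets in product(range(n), repeat=len(keys)):      # all δ
            delta = dict(zip(keys, targets))
            for outputs in product((0, 1), repeat=n):            # all λ
                lam = dict(enumerate(outputs))
                if (all(run(delta, lam, w) == 1 for w in accepted) and
                        all(run(delta, lam, w) == 0 for w in rejected)):
                    return n, delta, lam
    return None

n, delta, lam = minimal_dfa(accepted=["abaa"], rejected=["bb"])
assert n == 2   # one state cannot both accept and reject
```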

That’s it. We have just encoded the problem of learning a minimal consistent DFA from labeled strings in SMT!

*Curious about how this helps us in modelling real-world systems? Part two of this series will be online soon. In this hands-on session we will see how we can learn models of bank cards (yes, really). Want to get a head start? Check out the Python implementation on top of Z3!*

The research presented here was accepted in a paper titled *“Car-following Behavior Model Learning Using Timed Automata”* at *the 20th World Congress of the International Federation of Automatic Control*, one of the three top conferences in the area of automatic control.

We learn a timed automaton model from the Next Generation SIMulation dataset on the I-80 highway. This dataset is from a program funded by the U.S. Federal Highway Administration. It contains car trajectory data, and is so far unique in the history of traffic research, providing a great and valuable basis for validation and calibration of microscopic traffic models. A timed automaton is essentially a finite state machine: a finite set of states connected by transitions labeled with symbols from a finite alphabet. A timed automaton additionally has a guard on each transition that imposes a time restriction in the form of an interval: if the time passed since arriving in the state falls within the interval, the guard is active; otherwise the inactive guard blocks the transition. This imposes a semi-Markov condition on the time passed since the last event. The input to a timed automaton is a “timed word”: a sequence of symbols (each representing a discrete event, like acceleration) annotated with the time passed since the last symbol.
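A minimal sketch of these semantics in Python (the representation, state names, and example guards are hypothetical, not taken from our paper):

```python
class TimedAutomaton:
    """States, symbol-labeled transitions, and a [lo, hi] time guard on
    each transition."""
    def __init__(self, transitions, start):
        # transitions: (state, symbol) -> list of (guard_lo, guard_hi, target)
        self.transitions, self.start = transitions, start

    def accepts(self, timed_word):
        q = self.start
        for symbol, dt in timed_word:        # dt = time since the last event
            for lo, hi, target in self.transitions.get((q, symbol), []):
                if lo <= dt <= hi:           # guard is active
                    q = target
                    break
            else:
                return False                 # every guard blocks the transition
        return True

ta = TimedAutomaton({("s0", "accel"): [(0, 5, "s1")],
                     ("s1", "brake"): [(2, 10, "s0")]}, start="s0")
assert ta.accepts([("accel", 3), ("brake", 4)])
assert not ta.accepts([("accel", 7)])        # guard [0, 5] blocks dt = 7
```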

The model we learn from traces of discrete events extracted from the dataset is highly succinct and interpretable for car-following behavior analysis. Using a subsequence clustering technique on the states of the automaton model (i.e., the learned latent state space), the timed automaton is partitioned into regions. Each identified cluster has an interpretation as a semantic pattern, e.g. “approaching” or “short/medium/long distance car-following”. A complete car-following period consists of multiple such patterns. Figure 1 below shows the timed automaton we learned. The clusters (indicating patterns) are distinguished by different colors.

There are loops with significantly large occurrence counts in cluster 6, e.g. the state sequence 1-6-11-16-1 with the symbolic transition loop d-j-c-j. We use clustering as a symbolic representation of the original numeric data; see the code book in Figure 2. The relative distances of “c” and “d” are very close (see the code book in Table 2), but negative and positive respectively. They are associated with “j”, which has a very small speed difference. This sequence can be interpreted as steady car-following behavior at short distances, i.e., keeping the speed difference with the lead vehicle around 0. Similarly interesting and significant loops can also be seen in cluster 2 and cluster 4, which are steady long-distance and steady medium-distance car-following behaviors, respectively. An intermediate state S15 in cluster 5 has many incoming transitions, which explains how transfers between clusters happen. Take, for example, S6-S15-S4 with transitions “h, i”, i.e., slowing down and then speeding up to catch up: it leads from the short-distance following in cluster 6 to the medium-distance following in cluster 4. A time split can also be seen in the two branches [0, 37] i and [38, 542] i from S15. They share the same symbolic transition condition but have distinct time guards. This means the “i” speed-up action follows either a short or a long duration of “h”, i.e., it captures after how much time the subject vehicle’s driver notices that the relative distance to the lead vehicle has increased and begins to catch up.

Figure 3 illustrates a complete car-following example in our dataset.

It starts from the bottom (colored orange), passes through clusters 6, 5, and 3, then finishes in cluster 4. In the beginning, the subject vehicle is following the lead vehicle at short distances. Then the lead vehicle speeds up; see the positive relative speed and the increasing relative distance in cluster 5. The subject vehicle then also speeds up to approach the lead vehicle; see the negative relative speed and the decreasing relative distance in cluster 3. Finally, it follows the lead vehicle at medium distances in cluster 4. We can see that in cluster 6 and cluster 4, the subject car enters an unconscious reaction region, also called a steady car-following episode, i.e., the relative distance and the relative speed are both bounded in a small area. Clusters 3 and 5 can both be treated as intermediate transition processes. Source code as well as an animated video can be found in our code repository on Bitbucket.

Imagine that the vehicle under observation is following another car. Its driving status, e.g. *approaching*, *short distance following*, or *long distance following*, can be recognized by tracking its states and the corresponding cluster in our model. In future work, we will consider more complex driving scenarios including behaviors such as lane changing, turning, etc. Precise recognition or identification helps autonomous vehicles to better understand their surrounding environment and other traffic.

Another interesting further application of our work is *human-like cruise controller design*. The drawbacks of current automatic cruise control (ACC) systems lie in inconsistencies between systems and human drivers: 1) the driver’s overconfidence in or distrust of the system; 2) a mode awareness error when the system consists of two types of ACC, e.g. a high-speed range ACC and a low-speed range ACC; 3) a difference in the timing of acceleration/deceleration between drivers and the system [1]. The reason is that the control algorithm of an ACC focuses more on mathematical optimization of safety or comfort than on driving behaviors.

Note that in this line of our work, the model is learned from a large population of drivers’ car-following data. However, it is possible to learn such a controller from a single driver if enough of his/her driving data are available. This is a promising approach for designing a specialized car-following controller that actually mimics an individual driver’s driving behavior and habits! Another advantage of our model is an active control strategy: e.g., we can force a state switch from short-distance following to medium-distance following in the automaton. We have already done this part of the simulation in our journal version.


Many people I talk to are very surprised that a black box learning technique is even able to find any bugs at all. When explaining the learning algorithm, it seems at first that no guarantees can be given. The algorithm keeps refining some hypothesis, and many people ask: “When do you stop learning?” Luckily, the field of computational learning theory is rather mature, and we can give very precise theoretical guarantees. In this post I will highlight some of these results and why they matter to us. Before I do so, let’s recap the theoretical learning framework used in the CACM article. (I strongly suggest you also read the CACM paper at some point! Or at least watch the accompanying video.)

The type of learning we will investigate is called active learning (also called exact learning, or query learning). We suppose there is some alphabet *A* and an unknown language *L* ⊆ *A** (where *A** is the set of all words over *A*). It is the task of the learning algorithm to infer an automaton accepting the language *L*. The learner has access to an oracle which can answer two types of queries:

- **Membership queries**: The learner provides a word; the oracle answers whether that word is accepted by the language or not.
- **Equivalence queries**: The learner provides a hypothetical automaton for the language; the oracle either replies positively (the hypothesis is correct) or negatively, and in the latter case additionally gives a word (a counterexample) for which the automaton has a different acceptance than the language *L*.

In the context of software, acceptance is generalized to arbitrary output (in particular the membership queries reply with actual output of the software). So instead of working with DFAs, one works with Moore or Mealy type automata.

Dana Angluin showed that any regular language can be learned in this model, using only polynomially many queries. As a consequence we can learn the behaviour of unknown software as long as it has some regular behaviour. However, there are some objections one might have.

For a black box system we don’t have such an oracle.

The great thing about software is that we can in fact run it. It is interactive. This means that membership queries can be answered immediately by the software itself!

For equivalence queries, however, this is a reasonable objection. How can we know whether the hypothesis is correct w.r.t. an unknown system? Luckily, any efficient learning algorithm with membership queries and equivalence queries can be transformed into an efficient **probably approximately correct** (PAC) algorithm. (This is a fun exercise in the book by Kearns & Vazirani. Their book provides a very good introduction to PAC theory.) In practice it means that we can randomly sample test cases (with whatever distribution we see fit) and obtain a good approximation. The bounds on accuracy are quantitative and can be set arbitrarily high (at polynomial cost only).
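A sketch of such a sampled equivalence query in Python. The number of tests for the i-th equivalence query follows the standard bound from the PAC reduction, ⌈(ln(1/δ) + i·ln 2)/ε⌉; the word distribution used here (uniform random words up to a maximum length) is just one arbitrary choice, and the helper names are mine:

```python
import math
import random

def pac_equivalence_query(hypothesis, system, alphabet, epsilon, delta,
                          round_no=1, max_len=10, rng=random):
    """Approximate an equivalence query by random testing: if the
    hypothesis disagrees with the system on more than an epsilon fraction
    of words (under the sampling distribution), this finds a counterexample
    with probability at least 1 - delta."""
    n_tests = math.ceil((math.log(1 / delta) + round_no * math.log(2)) / epsilon)
    for _ in range(n_tests):
        word = tuple(rng.choice(alphabet)
                     for _ in range(rng.randint(0, max_len)))
        if hypothesis(word) != system(word):
            return word            # counterexample found
    return None                    # hypothesis is probably approximately correct

system = lambda w: w.count("a") % 2 == 0     # toy target: even number of a's
random.seed(0)
assert pac_equivalence_query(system, system, "ab", 0.05, 0.05) is None
cex = pac_equivalence_query(lambda w: True, system, "ab", 0.05, 0.05)
assert cex is not None and not system(cex)   # a word the hypothesis gets wrong
```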

Since we are doing machine learning we need big data. It takes ages to gather so many samples.

Indeed, learning neural networks, for example, is very hard in typical learning settings. In fact, in the PAC framework it is proven to be intractable (unless some cryptographic assumptions fail). As a consequence, if you want to increase accuracy you’ll need more than polynomially many additional samples. That is, you need big data.

Our setting, however, is slightly different from the typical machine learning. We are saved by the membership queries. The learner does not rely on just any sample, it relies on precisely the samples it chooses to query. This makes all the difference and gives the polynomial bounds mentioned earlier.

I once asked Falk Howar (one of the developers of LearnLib) whether he considers himself as a machine learning scientist. Indeed he does so, “but”, he added, “we do small data, not big data.”

All this talk on polynomial bounds doesn’t apply in practice.

The celebrated survey paper P versus NP by Scott Aaronson already answers this objection. Somehow the gap between P and NP seems to be the gap between feasible and infeasible, partly because algorithms in P often have a low-order polynomial running time.

I experience that learning algorithms are no exception. The algorithm by Dana Angluin is polynomial (counting the number of queries here), with a low-order polynomial. Both the asymptotics and the constants have been improved in recent years, as mentioned in the CACM article. It is just very efficient. All of this is witnessed by the many applications mentioned in the CACM article.

A while ago, I posted about the RERS challenge, a challenge where software had to be analysed. This was a white-box challenge, but we used exclusively black-box techniques and got very good results. In particular, we used a fuzzer and a learning algorithm. We noticed that their results were very comparable (the fuzzer was slightly better). However, the fuzzer afl-lop executed a couple of orders of magnitude more traces than the learner. I have no doubts any more that learning is efficient in practice!

So… When **do** you stop learning?

One way is to use the PAC framework. Specify the accuracy, compute the number of samples needed, and run the algorithm. This might not be completely satisfactory, though. What does it mean when a hypothesis is epsilon-close to the real software?

There is another way. In testing literature there are many conformance testing algorithms. Instead of giving probabilistic bounds, these algorithms give more concrete bounds on the hypothesis. A typical guarantee is: The black box system is equivalent to the hypothesis, unless it has at least k more states. The parameter k can be set as high as you want (this time at exponential cost).
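A naive sketch of such a guarantee in Python: if the hypothesis has n states and the black box has at most n + k states, then by a classical result on distinguishing words for DFAs, any disagreement shows up on some word of length at most 2n + k − 2, so testing all words up to that bound settles equivalence under the assumption, at a cost exponential in the bound. Real conformance algorithms such as the W-method are far smarter about which words to try; this exhaustive version is mine, for illustration only:

```python
from itertools import product

def bounded_equivalence_check(hypothesis, system, alphabet, n_states, k):
    """Test all words up to the length bound 2*n_states + k - 2. If no
    counterexample is found, the system is equivalent to the hypothesis
    unless it has more than n_states + k states."""
    bound = 2 * n_states + k - 2
    for length in range(bound + 1):
        for word in product(alphabet, repeat=length):
            if hypothesis(word) != system(word):
                return word
    return None

even_as = lambda w: w.count("a") % 2 == 0      # a 2-state behaviour
assert bounded_equivalence_check(even_as, even_as, "ab", 2, 1) is None
cex = bounded_equivalence_check(even_as, lambda w: True, "ab", 2, 1)
assert cex == ("a",)                           # shortest disagreement
```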

In our work we often combine the two approaches. We have probabilistic guarantees and additionally, have equivalence up to k states, for a small k. For many experiments we do, we actually obtain the equivalent automaton.

Software is often Turing complete, and cannot be captured by regular behaviour.

This is a deeper problem. The CACM article touches on this and already gives some solutions. There are (at least) two ways to generalize from automata to richer classes of behaviour.

One is to consider richer automata, still having a regular structure as a core. Examples are register automata (which add data-flow and predicates on transitions) and weighted automata (which add some arithmetic to the transitions). Learning algorithms have been described for such automata.

Another direction is going up the Chomsky hierarchy. It is known that visibly pushdown automata can be learned in a similar fashion. Currently, we’re not able to go further up the hierarchy.

However, in the many applications we see that the regular universe is big enough. This is partly because one can choose an alphabet small enough (or abstract enough) to exhibit regular behaviour. By choosing such an alphabet we might get a smaller, more abstract view of the actual behaviour. But it’s often enough for bug finding, and additionally it gives a model for which we have very good tooling, theory, and interpretability (see Chris’ post).

Using black box techniques is silly if you actually have source code!

I totally agree! Despite the effectiveness of learning, we can imagine it being much more successful if we incorporate information from the source code. Frits Vaandrager likes to call this “grey box methods”. A good example is found in work by Andreas Zeller. He learns grammars more effectively by instrumenting the source code, tracking certain values through the execution. I hope we will see more of these combined techniques, utilizing the best of both worlds.

I hope I gave a nice overview of some of the theoretical guarantees, making clear that computational learning has a good foundation. Maybe you have some ideas of how to apply learning in your own application. Or you have some ideas on how to combine it with white box methods. We are very interested in hearing these ideas! It always inspires us when people contact us with new applications of the theory!

In recent years, neural models, especially deep networks, have outperformed automata models when solving real-world tasks. Deep networks perform particularly well on large datasets. Interestingly, recent developments in parts of the deep learning community took renewed inspiration from the field of automata and formal models to improve RNN- and LSTM-based deep networks for sequence prediction and transduction tasks. This isn’t the first time the two fields have met, see e.g. Giles et al.’s work from the early 90s on neural stacks, but it is the first time deep networks are used in practice at large scale, offering the best performance.

The key idea behind all proposals is to extend neural networks with memory managed by a controller. The controller, which manages access to and use of the memory, is built to be a differentiable operator (e.g. another kind of network with differentiable access operators). The resulting network can be trained using standard optimization algorithms and frameworks, benefiting from the same GPU acceleration as other networks.

To my limited knowledge, the increase in interest in these models came with *neural Turing machines (NTMs)* by Graves et al. at DeepMind, and *memory networks* by Weston et al. at Facebook, proposed at roughly the same time. Both approaches extend neural networks with a read-write memory block. While the NTM paper focuses on program inference and solving algorithmic tasks, the memory network paper focuses on increasing performance on language problems. Since other blogs already offer nice high-level summaries of NTMs and memory networks, I will not go into more detail. Moreover, at this year’s NAMPI workshop at NIPS, Graves extended the idea of memory access by additionally learning how many computation steps are required to finish computation and output a decision.

The paper I am focusing on is Learning to Transduce with Unbounded Memory by Grefenstette et al. The paper’s goal is to provide a middle ground between the fully random access memory of NTM and the static memory of RNNs. The abstract says:

Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.

The key data structure implemented is a *“continuous”* stack. Its read and write operations are not discrete, but lie on a continuum in (0, 1), modeling the certainty of wanting to push onto or pop from the stack. The data objects are vectors. The stack is modeled by two components: a value matrix V and a strength vector s. The value matrix grows with each time step by appending a new row, and models an append-only memory. The logical stack is extracted using the strength vector s. A controller acts on the tuple of value matrix and strength vector (V, s). It takes in a pop signal u, a push signal d, and a value v, and produces an (output) read vector r. The quantities u and d are used to update the strength vector s, whereas v is appended to the value matrix V, and the read vector r is a weighted sum of the rows of the value matrix V.

The following figure illustrates the initial push of v_1 onto the stack, a very “weak” push of v_2, and then a pop operation and another push operation of a value v_3 (the exact equations and rules to modify s and read r are stated in the paper).
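Here is a small pure-Python sketch of one step of the continuous stack, following my reading of the update rules (the function and variable names are mine; consult the paper for the authoritative equations). Values are scalars here instead of vectors, for readability:

```python
def stack_step(values, s, d, u, v):
    """One step of the continuous stack.
    values: list of stored values, s: strength vector,
    d: push signal, u: pop signal, v: value to append."""
    # Pop: consume up to u units of strength, starting from the top.
    new_s = [max(0.0, s[i] - max(0.0, u - sum(s[i + 1:])))
             for i in range(len(s))]
    new_s.append(d)                 # push: the new value enters with strength d
    values = values + [v]           # the value matrix is append-only
    # Read: a weighted sum over the top-most total unit of strength.
    r = sum(min(new_s[i], max(0.0, 1.0 - sum(new_s[i + 1:]))) * values[i]
            for i in range(len(values)))
    return values, new_s, r

V, s = [], []
V, s, r = stack_step(V, s, d=1.0, u=0.0, v=1.0)   # push 1.0
assert r == 1.0
V, s, r = stack_step(V, s, d=1.0, u=0.0, v=2.0)   # push 2.0; read the top
assert r == 2.0
V, s, r = stack_step(V, s, d=0.0, u=1.0, v=0.0)   # pop: 1.0 is on top again
assert r == 1.0
```

With fully discrete signals (d, u ∈ {0, 1}) the construction behaves exactly like an ordinary stack, which is a useful sanity check.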

The next figure illustrates the setup: the memory at the center and the controller input values d for pushing, u for popping, and the value v. Moreover, the previous value matrix and previous strength vector are used. The outputs are the next value matrix and strength vector as well as the read vector. This construction yields a differentiable memory block containing a stack, but there are no free parameters to optimize its behavior. By viewing the previous value matrix, strength vector, and read vector as the state output of an RNN that receives an input vector i, the authors obtain a trainable system with free parameters.

But what advantage does such a system offer? To determine its effectiveness, the authors consider several simple tasks (copying a sequence, reversing a sequence, and inverting bigrams in a sequence) and tasks from linguistics (using inversion transduction grammars, a subclass of context-free grammars). The network enhanced with a stack is compared to a deep LSTM network. Overall, the stack-enhanced network not only performs better but also converges faster.

Unfortunately, the authors don’t provide an analysis of the stack usage. I think it would be interesting to see how the LSTM controller learns how to use the stack and compare the results with traditional pushdown automata. In grammatical inference, the usual goal is to find the smallest possible automaton. How different is this goal from learning a stack-enhanced LSTM? Can we understand the model, and does it offer some insight? The ability to interpret automata (and their use as a specification language in formal systems) is a huge motivating factor for our own work (see e.g. our paper on interpreting automata for sequential data). What can we learn from others?

For example, in psychology, questionnaires and experiments are typically given to other students. On top of the data collection, supervised applications require the data to be labeled. In many cases, human labeling can introduce further errors, e.g. through mislabeling, omission, or misinterpretation of the data sample. Moreover, effects and correlations present in society, e.g. those caused by sexism, racism, or poverty, can be preserved or amplified in collected data.

All in all, these problems lead to vague demands to (be able to) *understand* what our predictive models are doing, and why they are doing it. Responses to this demand have been diverse, and have led to the creation of workshops such as Interpretable ML@NIPS and WHI@ICML. Initiatives like the workshop on Fairness, Accountability and Transparency (FATML) can also be seen in this light. A paper I really like, *The Mythos of Model Interpretability*, sheds some light on the different definitions, needs, and motivations researchers and practitioners bring to the table. I think one key point made in this paper, despite seeming trivial, is:

If you don’t specify your needs for interpretation or explanations, you cannot expect your needs to be met by the model.

It seems that computer scientists tend to forget this. It is not too much of a surprise: we’re used to extracting meaning from syntactical and mathematical structures because we use these structures to describe how computers work. But not every machine learning practitioner or recipient of a machine-learned decision is a computer scientist, and not every mathematical description is readily accessible and understandable to computer scientists either.

In our work, we use finite state machines, as depicted in the next figure. Most computer scientists are taught finite state machines very early on, as one of the first formal systems they encounter, only to never really hear of them again. They are related to other, more expressive automata models like push-down automata, Büchi automata, hidden Markov models, and other less well-known variants. In the field of grammatical inference/grammar learning, inferring such models from given data is the main task.

Finite state machines and their variants are generators (or acceptors) of sequence data. They can accept or reject a given string, and can therefore be used to cluster sequences. For a given string, seen as a prefix, an automaton can be used to obtain a list of possible continuations or a distribution over possible continuations. In this way, automata can be used for sequence prediction. Finite state machines are not Turing complete and have limited expressiveness. They will not approximate arbitrary functions very well. But in practice, a lot of problems are still described fairly well; in fact, they are almost as expressive as hidden Markov models, which have an internal memory that is logarithmic in the number of states. For problems that require limited memory, e.g. high-level descriptions of phenomena, they are a good choice. Very common use cases of automata are in software engineering, where they are used for specifying the desired behavior of systems to be implemented.
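As a small illustration of the prediction use case, here is a sketch (my own toy example: a DFA rejecting strings that contain "bb") that lists which continuations of a prefix the automaton accepts:

```python
from itertools import product

def continuations(delta, accepting, alphabet, prefix, q0=0, depth=1):
    """Follow `prefix` through the DFA, then list all extensions of at
    most `depth` symbols that end in an accepting state."""
    q = q0
    for sym in prefix:
        q = delta[(q, sym)]
    results = []
    for n in range(depth + 1):
        for ext in product(alphabet, repeat=n):
            state = q
            for sym in ext:
                state = delta[(state, sym)]
            if state in accepting:
                results.append("".join(ext))
    return results

# DFA rejecting strings containing "bb"; state 2 is the rejecting sink.
delta = {(0, "a"): 0, (0, "b"): 1,
         (1, "a"): 0, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}

assert continuations(delta, {0, 1}, "ab", "ab") == ["", "a"]
```

After reading `ab`, the automaton predicts that stopping or appending `a` stays in the language, while appending `b` does not.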

In terms of interpretation, I think that four key properties make automata very easy to work with:

**Automata have an easy graphical representation as cyclic, directed, labeled graphs, offering a hierarchical view of sequential data.**

Instead of looking at a large set of long sequences, we can look at a model that has loops and cycles: a much more compact representation of the same data.
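Producing such a picture is cheap. As a sketch (the helper function and the toy transitions are made up for illustration), an automaton can be serialized to Graphviz DOT text and rendered with standard tools:

```python
def to_dot(transitions, start, accepting):
    """Emit Graphviz DOT text for a transition map (state, symbol) -> state."""
    lines = ['digraph dfa {', '  rankdir=LR;']
    lines.append('  __start [shape=point];')
    lines.append(f'  __start -> "{start}";')
    states = {q for (q, _) in transitions} | set(transitions.values())
    for q in sorted(states):
        shape = 'doublecircle' if q in accepting else 'circle'
        lines.append(f'  "{q}" [shape={shape}];')
    for (q, symbol), target in sorted(transitions.items()):
        lines.append(f'  "{q}" -> "{target}" [label="{symbol}"];')
    lines.append('}')
    return '\n'.join(lines)

dot = to_dot({(0, 'a'): 1, (1, 'b'): 0}, start=0, accepting={1})
print(dot)  # render with `dot -Tpng`, or paste into any Graphviz viewer
```

Accepting states get the conventional double circle, so the hierarchical, labeled-graph view comes for free.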

**Computation of automata is transparent.**

Each step of the computation can be verified manually (e.g. visually) and compared to other computation paths through the latent state space. This makes it possible to analyze training samples and their contribution to the final model. It is also possible to answer questions like “What would happen if the data were different at this step of the sequence?” or “What other data leads to the same computation outcome?”.
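This transparency can be made concrete by logging the computation path itself. The following sketch (with an invented transition map) records every step a run takes, so each transition can be inspected and compared:

```python
def trace(transitions, start, string):
    """Return the list of (state, symbol, next_state) steps of a run."""
    steps, state = [], start
    for symbol in string:
        nxt = transitions.get((state, symbol))
        steps.append((state, symbol, nxt))
        if nxt is None:
            break  # the run gets stuck here: the string is rejected
        state = nxt
    return steps

transitions = {(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 1}
for state, symbol, nxt in trace(transitions, 0, 'aba'):
    print(f'{state} --{symbol}--> {nxt}')
```

Changing one symbol of the input and re-running `trace` answers the “what if the data were different at this step?” question directly.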

**Automata are generative models.**

Sampling from the model, e.g. “pressing play”, helps to understand what it describes. By generating a wide range of possible computation paths, tools like model checkers can be used to query properties of the model in a formal way, e.g. using temporal logic.
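“Pressing play” on a probabilistic automaton is a few lines of code. The PDFA below is invented for illustration (this is not dfasat's representation); `'#'` is an artificial end-of-string symbol:

```python
import random

# Each state maps a symbol to (probability, next state); '#' ends the string.
pdfa = {
    0: {'a': (0.6, 1), '#': (0.4, None)},
    1: {'b': (0.5, 0), 'a': (0.3, 1), '#': (0.2, None)},
}

def sample(pdfa, start, rng, max_len=20):
    """Generate one string by walking the PDFA according to its probabilities."""
    state, out = start, []
    for _ in range(max_len):
        symbols = list(pdfa[state])
        weights = [pdfa[state][s][0] for s in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        if symbol == '#':
            break
        out.append(symbol)
        state = pdfa[state][symbol][1]
    return ''.join(out)

rng = random.Random(1)
print([sample(pdfa, 0, rng) for _ in range(5)])
```

A batch of such samples is often the quickest sanity check of what a learned model has actually captured.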

**Automata are well studied in theory and practice.**

We know a lot about composition and closure properties of automata and their sub-classes. We can relate them to equally expressive formalisms. In many cases, this allows us to think about the model as a composition of smaller parts and makes it easy for humans to transfer their knowledge onto it: The model is frequently used in system design as a way to describe system logic. We can use this knowledge to understand a learned model, and relate it to known functions.
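One of those closure properties in action: the classic product construction yields a DFA for the intersection of two DFA languages. The two toy automata below are made up for illustration:

```python
def product(t1, t2):
    """Transition map of the product automaton of two complete DFAs."""
    prod = {}
    for (q1, s1), r1 in t1.items():
        for (q2, s2), r2 in t2.items():
            if s1 == s2:  # both automata read the same symbol in lockstep
                prod[((q1, q2), s1)] = (r1, r2)
    return prod

# DFA A tracks parity of 'a's; DFA B tracks whether the last symbol was 'a'.
ta = {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}
tb = {(0, 'a'): 1, (1, 'a'): 1, (0, 'b'): 0, (1, 'b'): 0}
prod = product(ta, tb)

# A pair state like (0, 1) means "even number of 'a's AND ends in 'a'".
print(prod[((0, 0), 'a')])  # (1, 1)
```

Because a learned automaton supports the same construction, it can be composed with a hand-written specification automaton, which is exactly how knowledge from system design transfers onto the learned model.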

We try to summarize these points, together with some more examples, in our paper, available online on arXiv. The abstract reads:

Automaton models are often seen as interpretable models. Interpretability itself is not well defined: it remains unclear what interpretability means without first explicitly specifying objectives or desired attributes. In this paper, we identify the key properties used to interpret automata and propose a modification of a state-merging approach to learn variants of finite state automata. We apply the approach to problems beyond typical grammar inference tasks. Additionally, we cover several use-cases for prediction, classification, and clustering on sequential data in both supervised and unsupervised scenarios to show how the identified key properties are applicable in a wide range of contexts.

I am very happy and grateful to receive your thoughts and feedback on it. What do you think about the interpretability and understandability of automata?

]]>The notebook walks you through basic usage and parameter settings. It also contains a small task to familiarize the user with the effect of different parameter settings. At the moment, dfasat has about 30 different options to choose from. Some can be combined, whereas other combinations have never been tried. The easiest way to use the introduction is to download the virtual appliance for VirtualBox (3GB download, password for user winter/sudo: ‘iscoming’). It contains the practical data sets and the Python notebook (ipynb/html). You can also download the files separately, and clone the dfasat repository or install the dfasat Python package. I personally recommend using the virtual appliance: it was well tested by 20 students during the session at the winter school. Please contact me for assistance; my email address is included in the notebook. ]]>

The task of the competition was to predict a (ranked) list of the most likely continuations (a_1, …, a_5) for a given prefix (y_0, …, y_i), based on learning from a training set of complete words.
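Stripped of the learning part, the prediction step is simple: given a model's next-symbol distribution after a prefix, return the five highest-ranked symbols. The sketch below uses a made-up distribution (it is not the competition's scoring code or dfasat's interface):

```python
def top_k(next_symbol_probs, k=5):
    """Rank symbols by probability, highest first, and keep the top k."""
    ranked = sorted(next_symbol_probs.items(), key=lambda kv: -kv[1])
    return [symbol for symbol, _ in ranked[:k]]

# Hypothetical next-symbol distribution after some prefix (y_0, .., y_i).
probs = {'a': 0.35, 'b': 0.25, 'c': 0.2, 'd': 0.1, 'e': 0.07, 'f': 0.03}
print(top_k(probs))  # ['a', 'b', 'c', 'd', 'e']
```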

One of my students (competing as team PING) placed 7th, using the dfasat tool. The main goal was to test a Python interface for dfasat (early release here). But what can we take away from placing 7th? Is PDFA learning not competitive for sequence prediction? The answer is a solid jein (German for “yes and no”): by using dfasat, we assumed that all problem sets were generated by a probabilistic deterministic finite automaton (PDFA). In practice, most problem sets were generated by HMMs or contained linguistic data. Neither data type can necessarily be learned very well by our PDFA models. The results reflect this, as outlined in the following table. For the HMM problems, we obtain OK scores. That is expected: our PDFA models are not quite as expressive as the HMMs used to generate the data, but the gap is not too large. On the linguistic data, we really struggle to obtain reasonable scores (e.g. problem 10).

But problem 9 is a very interesting case: it contains software traces. For this problem type, our PDFA models obtained the second-best score and beat most of the RNN and CNN approaches. I expect that LSTM/RNN approaches can obtain equally good or better scores, but would require a lot more data to learn a model of equal predictive quality. I am planning to analyze the character-level networks used by the competitors (e.g. with methods used here) to understand better what aspects they managed to learn.

I will add a more detailed description of the problem sets later on.

]]>This year, however, automata learning was applied with great success. For the problems where LTL formulas had to be proven or disproven, the team managed to get a perfect score; other teams did not manage to obtain as many results here. For the reachability problems, they performed well but did not top the rankings. The team applied state-of-the-art learning algorithms, but did not tweak or alter them for the challenge.

It is interesting that a black-box technique can achieve such good scores compared to white-box methods. Indeed, less information is used, and black-box techniques cannot give 100% guarantees, but more results are obtained. It seems one can trade confidence for scaling to bigger problems.

More information will follow.

]]>