This is part one of a two part series on automata learning and satisfiability modulo theories. Part two will appear soon.
Deterministic finite automata (DFAs) are useful in a variety of applications. However, the problem of learning a DFA of minimal size from positive (accepted) and negative (rejected) strings can be very hard. In fact, it is the optimization variant of the problem of finding a consistent DFA of a fixed size, which has been shown to be NP-complete. In 2010, Marijn Heule and Sicco Verwer presented an algorithm that encodes the problem of learning a DFA from labeled strings as a satisfiability (SAT) problem. Their algorithm has since won the StaMinA competition, and has led to the creation of the dfasat tool (for which Chris has created an exellent tutorial).
In this post, I present an encoding that takes a satisfiability modulo theories (SMT) perspective. This encoding is faster than the one used in dfasat, and benefits from the continuous efforts by fellow researchers on making SMT solvers more powerful. Moreover, I find it more natural, because it makes a distinction between the logic that is required to solve the problem, and the logic imposed by the background theories.
This is our team’s first attempt at applying automata learning theory to problems in autonomous driving. Our starting point, learning patterns of car-following behavior, is the most fundamental task in daily driving scenarios. It has been widely recognized the first milestone towards autonomous or semi-autonomous driving is a good cruise-controller for save car-following. The goal is to succinctly describe how a car follows the car in front of it, often called the lead car.
The research present here was accepted in a paper titled “Car-following Behavior Model Learning Using Timed Automata” at The 20th World Congress of the International Federation of Automatic Control, one of the three top conferences in area of automatic control.
We learn a timed automaton model from the Next Generation SIMulation dataset on the I-80 highway. This dataset is from a program funded by the U.S. Federal Highway Administration. It contains car trajectory data, and is so far unique in the history of traffic research, providing a great and valuable basis for validation and calibration of microscopic traffic models. A timed automaton is essentially a finite state machine, consisting of a finite set of states describing the current states, connected by transitions labeled from a finite alphabet. A timed automaton additionally has a guard on each transitions that imposes a time restriction in form of an interval: If the time passed since arriving in the state falls within the interval, the guard is active, otherwise the inactive guard blocks the transition. It imposes a semi-markov condition on the time passed since the last event. The input to a timed automaton is a “time word”: A sequence of symbols (representing a discrete event, like acceleration) annotated with the time passed since the last symbol.
The model we learn from traces of discrete events extracted from the dataset is highly succinct and interpretable for car-following behavior analysis. Using a subsequence clustering technique on the states of the automaton model (i.e., the learned latent state space), the timed automation is partitioned into some regions. Each identified cluster has an interpretation as a semantic pattern, e.g. representing “approaching” and “short/medium/long distance car-following”. A complete car-following period consists of multiple such patterns. The following Figure 1 shows the timed automaton we learned. All clusters (indicating patterns) are distinguished with different colors.
There are loops with signicantly large occurrences in cluster 6, e.g., state sequence: 1-6-11-16-1 with symbolic transitions loop: d-j-c-j. We use a clustering as a symbolic representation for the original numeric data, see the code book in Figure 2. The relative distances of “c” and “d” are very close, see the code book in Table 2, but negative and positive respectively. They are associated with “j”, which has a very small speed difference. This sequence can be interpreted as the steady car-following behavior at short distances, i.e., adapting the speed difference with the lead vehicle around 0. Similarly interesting and signicant loops can also be seen in cluster 2 and cluster 4, which are steady long distance and steady medium distance car-following behaviors respectively. An intermediate state S15 in cluster 5 has many incoming transitions, which explains how to transfer between clusters. For the example S6-S15-S4 with transitions “h, i”, i.e., slowing down and speeding up to catch up, from the short distance following in cluster 6 to the medium distance following in cluster 4. The time split can also be seen in two branches of [0, 37] i and [38, 542] i from S15. They share the same symbolic transition condition but have distinct time guards. It means the “i” speed up action followed by short or long duration of “h”, i.e., after how much time the subject vehicle driver notices that their relative distance has been expanded by the lead vehicle and begins to catch up.
Figure 3 illustrates a complete car-following example in our dataset.
It starts from the bottom (colored orange), passes through clusters 6, 5, and 3, then finishes in cluster 4. In the beginning, the subject vehicle is following the lead vehicle at short distances. Then the lead vehicle speeds up, see the positive relative speed and the increasing relative distance in cluster 5. The subject vehicle then also speeds up to approach the lead vehicle, see the negative relative speed and the decreasing relative distance in cluster 3. Finally, it follows the lead vehicle at medium distances in cluster 4 2 . We can see that in cluster 6 and cluster 4, the subject car enters an unconscious reaction region, also called a steady car-following episode, i.e., the relative distance and the relative speed are both bounded in a small area. Cluster 3 and 5 can be both treated as intermediate transition processes. Source code as well as an animated video can be found in our code repository on bitbucket.
Imagine that the vehicle under observation is following another car. Its driving status, e.g. approaching, short distance following, or long distance following can be recognized by tracking its states and the corresponding cluster in our model. In future work, we will consider more complex driving scenarios including behaviors such as lane changing, turning, etc. Precise recognition or identification helps autonomous vehicles to better understand its surrounding environment and other traffic.
Another interesting further application of our work is on human-like cruise controller design. The drawbacks of current automatic cruise control (ACC) system lie in inconsistencies between systems and human drivers: 1) driver’s overconfidence or distrust on the system; 2) a mode awareness error when the system consists of two types of ACCs e.g., a high-speed range ACC and a low-speed range ACC; 3) a difference in timing of acceleration/deceleration between drivers and system . The reason is that the control algorithm of an ACC focuses more on mathematical optimization of safety or comfort rather than driving behaviors.
Note that in this line of our work, the model is learned from a large population of drivers’ car-following data. However, it is possible to learn such a controller from a single driver if enough of his/her driving data are available. This is a promising approach for designing a specified car-following controller that actually mimics an individual driver’s driving behavior and habit! Another advantage of our model is an active control strategy, e.g., we can force a state switching from short-distance following to a medium distance in the automaton. We have already done this part of simulation in our journal version.
The Communications of ACM published an article on automata learning in software engineering last February. The techniques described in the article are used to obtain models for the (input/output) behaviour of software. Even without access to source code, one can now use model checking or other bug finding tools on these models. The article shows many successful applications. Why is this possible at all?
In linguistic applications, tasks typically are translating a sentence, or deciding whether a given string belongs to a specific language. In the past, popular models to learn rules were finite state machines, pushdown automata, and hidden Markov models. We understand these models fairly well, and they each describe a class in the Chomsky hierarchy. This makes them very apt to model formal systems. But when it comes to describing natural language and solving problems in NLP, the rules imposed by formal grammars are often too strict and limited to model human writing and speech.
In recent years, neural models outperform automata models, especially deep networks, when solving real-world tasks. Deep networks perform particularly well on large datasets. Interestingly, recent developments in parts of the deep learning community took renewed inspiration from the field of automata and formal models to improve RNN- and LSTM-based deep network for sequence prediction and transducing tasks. This isn’t the first time the two fields meet, see such as Giles et al.’s work from the early 90ies on neural stacks, but it is the first time deep networks are used in practice at large-scale, offering the best performance.
The key idea behind all proposals is to extend neural networks with memory managed by a controller. The controller, managing access and use to the memory, is built to be a differentiable operator (e.g. another kind of network with differentiable access operators). The resulting network can be trained using standard optimization algorithms and frameworks, benefiting from the same GPU acceleration other networks do, too.
To my limited knowledge, the increase in interest in these models came with Graves et al. at Deep Mind neural turing machines (NTM)., and Weston et al. at Facebook memory networks, roughly proposed at the same time. Both approaches extend neural networks with a read-write memory block. While the NTM paper focuses on program inference and solving algorithmic tasks, the memory network paper focuses on increasing performance on language problems. Since other blogs already offer nice high-level summaries on NTMs and memory networks, I will not go into more details. Moreover, at this year’s nampi workshop at NIPS, Graves extended the idea of memory access by additionally learning how many computation steps are required to finish computation and output a decision.
Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.
The key data structure implemented is a “continuous” stack. Its read- and write-operations are not discrete, but on a continuum in (0,1), modeling the certainty of wanting to push or pop onto the stack. The data objects are vectors. The stack is modeled by two components: a value matrix V, and a strength vector s. The value matrix grows with each time step by appending a new row and models an append-only memory. The logical stack is extracted by using the strength vector s. A controller acts on the tuple of value matrix and strength vector (V, s). It takes in a pop signal u, a push signal d, a value v, and produces an (output) read vector r. The quantities u and d are used to update the strength vector s, whereas v is appended to the value matrix V, and the read vector r is a weighted sum of the rows of the value matrix V.
The following figure illustrates the initial push of v_1 onto the stack, a very “weak” push of v_2, and then a pop operation and another push operation of a value v_3 (you can find the exact equations and rules to modify s and read r are stated in the paper).
The next figure illustrates the setup: the memory at the center and the controller input values d for pushing, u for popping, and the value v. Moreover, the previous value matrix and previous strength vector are used. The outputs are the next value matrix and strength vector as well as the read vector. This construction a differentiable memory block containing a stack. But there are no free parameters to optimize its behavior. By viewing the previous value matrix, strength vector, and read vector as a state output of an RNN that receives an input vector i, the authors obtain a trainable system with free parameters.
But what advantage does such a system offer? To determine its effectiveness, the authors consider several simple tasks (copying a sequence, reversing a sequence, and inverting bigrams in a sequence) and tasks from linguistics (using inversion transduction grammars, a subclass of context-free grammars). The network enhanced with a stack is compared to a deep LSTM network. Overall, the stack enhanced network not only performs better but also converges faster.
Unfortunately, the authors don’t provide an analysis of the stack usage. I think it would be interesting to see how the LSTM controller learns how to use the stack and compare the results with traditional pushdown automata. In grammatical inference, the usual goal is to find the smallest possible automaton. How different is this goal from learning a stack-enhanced LSTM? Can we understand the model, and does it offer some insight? The ability to interpret automata (and their use as a specification language in formal systems) is a huge motivating factor for our own work (see e.g. our paper on interpreting automata for sequential data). What can we learn from others?
The precision, speed and deterministic, algorithmic problem-solving strategies of computers are often idealized. Consequently, computers are often seen as unbiased and objective. This view is also transferred to automated decision making using machine-learned models. But this is dangerous for multiple reasons: Between false positives and false negatives, models can be wrong in more than one way. The effect of the human component in the data can be severe and is often ignored or grossly underestimated (see for example this paper here): The data we collect in real life has some context, and this context can introduce a bias.
For example in psychology, questionnaires and experiments are typically given to other students. On top of the data collection, to use supervised applications, data needs to be labeled. In many cases, human labeling can introduce more errors, e.g. by mislabeling, omission, or misinterpretation of the data sample. Moreover, effects and correlations present in society, e.g. caused by sexism, racism, or poverty can be preserved or amplified in collected data.
All in all, these problems lead to vague demands to (be able to) understand what our predictive models are doing, and why they are doing it. Responses to this demand have been diverse, and lead to the creation of workshops such as Interpretable ML@NIPS and WHI@ICML. Initiatives like the workshop on Fairness, Accountability and Transparency (FATML) can also be seen in this light. A paper I really like, The Mythos of Model Interpretability, sheds some light on the different definitions, needs, and motivations researchers and practitioners bring to the table. I think one key made in this paper, despite seemingly trivial, is:
If you don’t specify your needs for interpretation or explanations, you cannot expect your needs to be met by the model.
It seems that computer scientists tend to forget this. It is not too much of a surprise: We’re used to extract meaning from syntactical and mathematical structures because we use these structures to describe how computers work. But not every machine learning practitioner or receiver of a machine learned decision is a computer scientist, and not every mathematical description is readily accessible and understandable to computer scientists either.
In our work, we use finite state machines, as depicted in the next figure. Most computer scientists are taught finite state machines very early on, as one of the first formal systems to encounter—only to never really hear of them again. They are related to other, more expressive automata models like push-down automata, Büchi machines, hiddenMarkovv models, and other less well-known variants. In the field of grammatical inference/grammar learning, inferring such models from given data is the main task.
Finite state machines and variants are generators (or acceptors) of sequence data. They can accept or reject a given string, and therefore be used to cluster sequences. For a given string, seen as a prefix, an automaton can be used to obtain a list of possible continuations or a distribution over possible continuations. In this way, automata can be used for sequence prediction. Finite state machines are not Turing complete and have a limited expressiveness. They will not approximate any function very well. But in practice, a lot of problems are still described fairly well; in fact, they are almost as expressive as hidden Markov models which have an internal memory that is logarithmic to the number of states. For problems that require limited memory, e.g. high-level description of phenomena, they are a good choice. Very common use cases of automata are in software engineering, where they are used for specifying the desired behavior of systems to be implemented.
In terms of interpretation, they I think that 4 key properties make them very easy:
Automata have an easy graphical representation as cyclic, directed, labeled graphs, offering a hierarchical view of sequential data.
Instead of looking a large set of long sequences, we can look at a model that has loops and cycles. It is a much more compact representation of the same data.
Computation of automata is transparent.
In each step of the computation can be verified manually (e.g. visually), and compared to other computation paths through the latent state space. This makes it possible to analyze training samples and their contribution to the final model. It is also possible to answer questions like “What would happen if the data were different at this stop of the sequence?” or “What other data leads to the same computation outcome?”.
Automata are generative models.
Sampling from the model, e.g. “pressing play”, helps to understand what it describes: By generating a wide range of possible computation paths, tools like model checkers can be used to query properties in a formal way, e.g. using temporal logic. This can help to analyze the properties of the model in a formal way.
Automata are well studied in theory and practice.
We know a lot about composition and closure properties of automata and their sub-classes. We can relate them to equally expressive formalisms. In many cases, this allows us to think about the model as a composition of smaller parts and makes it easy for humans to transfer their knowledge onto it: The model is frequently used in system design as a way to describe system logic. We can use this knowledge to understand a learned model, and relate it to known functions.
We try to summarize these points, together with some more examples, in our paper online on arxiv. The abstract reads:
Automaton models are often seen as interpretable models. Interpretability itself is not well defined: it remains unclear what interpretability means without first explicitly specifying objectives or desired attributes. In this paper, we identify the key properties used to interpret automata and propose a modification of a state-merging approach to learn variants of finite state automata. We apply the approach to problems beyond typical grammar inference tasks. Additionally, we cover several use-cases for prediction, classification, and clustering on sequential data in both supervised and unsupervised scenarios to show how the identified key properties are applicable in a wide range of contexts.
I am very happy and grateful to receive your thoughts and feedback on it. What do you think about the interpretability and understandability of automata?