Now it is even easier to use flexfringe (read our tool paper here), reproduce our experiments, or play with the flexible state-merging framework to learn extended variants of finite state machines, Mealy machines, or other regular/memory free automata thanks to Google Colab(oratory). Google Colab is a cloud-hosted Jupyter notebook environment (read more about Colab here). The notebooks run on a virtual machine powered by Ubuntu and allow to install new packages and dependencies.

I prepared a notebook that installs all dependencies, wraps the resulting binaries in Python functions (view on github, view on Google Colab) and provides some short usage examples using the Stamina competition data. Due to recent changes in the boost.Python library it is not yet possible to compile the Python package (as described in this paper).

If you run into any problems with flexfringe on Colab, contact me.

In linguistic applications, tasks typically are translating a sentence, or deciding whether a given string belongs to a specific language. In the past, popular models to learn rules were finite state machines, pushdown automata, and hidden Markov models. We understand these models fairly well, and they each describe a class in the Chomsky hierarchy. This makes them very apt to model formal systems. But when it comes to describing natural language and solving problems in NLP, the rules imposed by formal grammars are often too strict and limited to model human writing and speech.

In recent years, neural models outperform automata models, especially deep networks, when solving real-world tasks. Deep networks perform particularly well on large datasets. Interestingly, recent developments in parts of the deep learning community took renewed inspiration from the field of automata and formal models to improve RNN- and LSTM-based deep network for sequence prediction and transducing tasks. This isn’t the first time the two fields meet, see such as Giles et al.’s work from the early 90ies on neural stacks, but it is the first time deep networks are used in practice at large-scale, offering the best performance.

The key idea behind all proposals is to extend neural networks with memory managed by a controller. The controller, managing access and use to the memory, is built to be a differentiable operator (e.g. another kind of network with differentiable access operators). The resulting network can be trained using standard optimization algorithms and frameworks, benefiting from the same GPU acceleration other networks do, too.
To my limited knowledge, the increase in interest in these models came with Graves et al. at Deep Mind neural turing machines (NTM)., and Weston et al. at Facebook memory networks, roughly proposed at the same time. Both approaches extend neural networks with a read-write memory block. While the NTM paper focuses on program inference and solving algorithmic tasks, the memory network paper focuses on increasing performance on language problems. Since other blogs already offer nice high-level summaries on NTMs and memory networks, I will not go into more details. Moreover, at this year’s nampi workshop at NIPS, Graves extended the idea of memory access by additionally learning how many computation steps are required to finish computation and output a decision.

Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.

The key data structure implemented is a “continuous” stack. Its read- and write-operations are not discrete, but on a continuum in (0,1), modeling the certainty of wanting to push or pop onto the stack. The data objects are vectors. The stack is modeled by two components: a value matrix V, and a strength vector s. The value matrix grows with each time step by appending a new row and models an append-only memory. The logical stack is extracted by using the strength vector s. A controller acts on the tuple of value matrix and strength vector (V, s). It takes in a pop signal u, a push signal d, a value v, and produces an (output) read vector r. The quantities u and d are used to update the strength vector s, whereas v is appended to the value matrix V, and the read vector r is a weighted sum of the rows of the value matrix V.
The following figure illustrates the initial push of v_1 onto the stack, a very “weak” push of v_2, and then a pop operation and another push operation of a value v_3 (you can find the exact equations and rules to modify s and read r are stated in the paper).

The next figure illustrates the setup: the memory at the center and the controller input values d for pushing, u for popping, and the value v. Moreover, the previous value matrix and previous strength vector are used. The outputs are the next value matrix and strength vector as well as the read vector. This construction a differentiable memory block containing a stack. But there are no free parameters to optimize its behavior. By viewing the previous value matrix, strength vector, and read vector as a state output of an RNN that receives an input vector i, the authors obtain a trainable system with free parameters.

But what advantage does such a system offer? To determine its effectiveness, the authors consider several simple tasks (copying a sequence, reversing a sequence, and inverting bigrams in a sequence) and tasks from linguistics (using inversion transduction grammars, a subclass of context-free grammars). The network enhanced with a stack is compared to a deep LSTM network. Overall, the stack enhanced network not only performs better but also converges faster.

Unfortunately, the authors don’t provide an analysis of the stack usage. I think it would be interesting to see how the LSTM controller learns how to use the stack and compare the results with traditional pushdown automata. In grammatical inference, the usual goal is to find the smallest possible automaton. How different is this goal from learning a stack-enhanced LSTM? Can we understand the model, and does it offer some insight? The ability to interpret automata (and their use as a specification language in formal systems) is a huge motivating factor for our own work (see e.g. our paper on interpreting automata for sequential data). What can we learn from others?

The precision, speed and deterministic, algorithmic problem-solving strategies of computers are often idealized. Consequently, computers are often seen as unbiased and objective. This view is also transferred to automated decision making using machine-learned models. But this is dangerous for multiple reasons: Between false positives and false negatives, models can be wrong in more than one way. The effect of the human component in the data can be severe and is often ignored or grossly underestimated (see for example this paper here): The data we collect in real life has some context, and this context can introduce a bias.

For example in psychology, questionnaires and experiments are typically given to other students. On top of the data collection, to use supervised applications, data needs to be labeled. In many cases, human labeling can introduce more errors, e.g. by mislabeling, omission, or misinterpretation of the data sample. Moreover, effects and correlations present in society, e.g. caused by sexism, racism, or poverty can be preserved or amplified in collected data.
All in all, these problems lead to vague demands to (be able to) understand what our predictive models are doing, and why they are doing it. Responses to this demand have been diverse, and lead to the creation of workshops such as Interpretable ML@NIPS and WHI@ICML. Initiatives like the workshop on Fairness, Accountability and Transparency (FATML) can also be seen in this light. A paper I really like, The Mythos of Model Interpretability, sheds some light on the different definitions, needs, and motivations researchers and practitioners bring to the table. I think one key made in this paper, despite seemingly trivial, is:

If you don’t specify your needs for interpretation or explanations, you cannot expect your needs to be met by the model.

It seems that computer scientists tend to forget this. It is not too much of a surprise: We’re used to extract meaning from syntactical and mathematical structures because we use these structures to describe how computers work. But not every machine learning practitioner or receiver of a machine learned decision is a computer scientist, and not every mathematical description is readily accessible and understandable to computer scientists either.

In our work, we use finite state machines, as depicted in the next figure. Most computer scientists are taught finite state machines very early on, as one of the first formal systems to encounter—only to never really hear of them again. They are related to other, more expressive automata models like push-down automata, Büchi machines, hiddenMarkovv models, and other less well-known variants. In the field of grammatical inference/grammar learning, inferring such models from given data is the main task.

Finite state machines and variants are generators (or acceptors) of sequence data. They can accept or reject a given string, and therefore be used to cluster sequences. For a given string, seen as a prefix, an automaton can be used to obtain a list of possible continuations or a distribution over possible continuations. In this way, automata can be used for sequence prediction. Finite state machines are not Turing complete and have a limited expressiveness. They will not approximate any function very well. But in practice, a lot of problems are still described fairly well; in fact, they are almost as expressive as hidden Markov models which have an internal memory that is logarithmic to the number of states. For problems that require limited memory, e.g. high-level description of phenomena, they are a good choice. Very common use cases of automata are in software engineering, where they are used for specifying the desired behavior of systems to be implemented.

In terms of interpretation, they I think that 4 key properties make them very easy:

Automata have an easy graphical representation as cyclic, directed, labeled graphs, offering a hierarchical view of sequential data.

Instead of looking a large set of long sequences, we can look at a model that has loops and cycles. It is a much more compact representation of the same data.

Computation of automata is transparent.

In each step of the computation can be verified manually (e.g. visually), and compared to other computation paths through the latent state space. This makes it possible to analyze training samples and their contribution to the final model. It is also possible to answer questions like “What would happen if the data were different at this stop of the sequence?” or “What other data leads to the same computation outcome?”.

Automata are generative models.

Sampling from the model, e.g. “pressing play”, helps to understand what it describes: By generating a wide range of possible computation paths, tools like model checkers can be used to query properties in a formal way, e.g. using temporal logic. This can help to analyze the properties of the model in a formal way.

Automata are well studied in theory and practice.

We know a lot about composition and closure properties of automata and their sub-classes. We can relate them to equally expressive formalisms. In many cases, this allows us to think about the model as a composition of smaller parts and makes it easy for humans to transfer their knowledge onto it: The model is frequently used in system design as a way to describe system logic. We can use this knowledge to understand a learned model, and relate it to known functions.

We try to summarize these points, together with some more examples, in our paper online on arxiv. The abstract reads:

Automaton models are often seen as interpretable models. Interpretability itself is not well defined: it remains unclear what interpretability means without first explicitly specifying objectives or desired attributes. In this paper, we identify the key properties used to interpret automata and propose a modification of a state-merging approach to learn variants of finite state automata. We apply the approach to problems beyond typical grammar inference tasks. Additionally, we cover several use-cases for prediction, classification, and clustering on sequential data in both supervised and unsupervised scenarios to show how the identified key properties are applicable in a wide range of contexts.

I am very happy and grateful to receive your thoughts and feedback on it. What do you think about the interpretability and understandability of automata?

I am very happy to announce that we finally have a nice introduction to our dfasat tool: a short python notebook tutorial (html preview) originally developed for a 2-hour hands-on session at the 3TU BSR winter school.
The notebook works you through basic usage and parameter setting. It also contains a small task to familiarize the user with the effect of different parameter settings. At the moment, dfasat has about 30 different options to choose from. Some can be combined, whereas other combinations have never been tried in combination. The easiest way to use the introduction is to download the virtual appliance for VirtualBox (3GB download, password for user winter/sudo: ‘iscoming’). It contains the practical data sets and the python notebook (ipynb/html). You can also download the files separately, and clone the dfasat repository or install the dfasat python package. I personally recommend using the virtual appliance: It is well tested by 20 students during the session at the winter school. Please contact me for assistance. My email address is included in the notebook.

The Sequence PredIction ChallengE (SPICE), organized and co-located with the International Conference on Grammatical Inference (ICGI) 2016, was won by Chihiro Shibata, who combined LSTM neural networks and strictly piecewise grammars (SP-k, proposed by Heinz et al), the latter capturing long-term dependencies in the input words. The combination beat the competitors using “pure” LSTM- and CNN-based neural networks. Overall, all networks used were not very deep (2 hidden layers), and deeper networks decreased performance.

The task of the competition was to predict a (ranked) list of most likely continuations (a_1, …, a_5) for a given prefix (y_0, .., y_i), based on learning from a training set of complete words.

One of my students (competing as team PING) placed 7th, using the dfasat tool. The main goal was to test a python interface for dfasat (early release here). But what can we take away from placing 7th? Is PDFA-learning not competitive for sequence prediction? The answer is a solid jein: By using dfasat, we assumed that all problem sets were generated by a deterministic and probabilistic finite state automaton (PDFA). In practice, most problem sets were generated by HMMs or contained linguistic data. Both data types cannot necessarily be learned very well by our PDFA models. The results reflect this, as outlined in the following table. For the HMM problems, we obtain OK-scores. That is expected because our PDFA models are not quite as expressive as the HMMs used to generate the data, but the gap is not too large. On the linguistics data, we really struggle to obtain reasonable scores (e.g. problem 10).

But problem 9 is a very interesting case: it contains software traces. For this problem type, PDFA models obtained the second best score and beat most of the RNN and CNN approaches. I expect that LSTM/RNN approaches can obtain equally good or better scores, but require a lot more data to learn a model of equal predictive quality. I am planning to analyze the character-level networks (e.g. with methods used here) used by the competitors to understand better what aspects they managed to learn.

I will add a more detailed description of the problem sets later on.