Categories
spotlight Uncategorized

A passive automata learning tutorial with dfasat

I am very happy to announce that we finally have a nice introduction to our dfasat tool: a short python notebook tutorial  (html preview) originally developed for a 2-hour hands-on session at the 3TU BSR winter school.
The notebook works you through basic usage and parameter setting. It also contains a small task to familiarize the user with the effect of different parameter settings. At the moment, dfasat has about 30 different options to choose from. Some can be combined, whereas other combinations have never been tried in combination. The easiest way to use the introduction is to download the virtual appliance for VirtualBox (3GB download, password for user winter/sudo: ‘iscoming’). It contains the practical data sets and the python notebook (ipynb/html). You can also download the files separately, and clone the dfasat repository or install the dfasat python package. I personally recommend using the virtual appliance: It is well tested by 20 students during the session at the winter school. Please contact me for assistance. My email address is included in the notebook.

Categories
Announcement spotlight

The Performance of PDFA-Learning in the SPiCE Competition

The Sequence PredIction ChallengE (SPICE), organized and co-located with the International Conference on Grammatical Inference (ICGI) 2016,  was won by Chihiro Shibata, who combined LSTM neural networks and strictly piecewise grammars (SP-k, proposed by Heinz et al), the latter capturing long-term dependencies in the input words. The combination beat the competitors using “pure” LSTM- and CNN-based neural networks. Overall, all networks used were not very deep (2 hidden layers), and deeper networks decreased performance.

The task of the competition was to predict a (ranked) list of most likely continuations (a_1, …, a_5) for a given prefix (y_0, .., y_i), based on learning from a training set of complete words.

One of my students (competing as team PING) placed 7th, using the dfasat tool. The main goal was to test a python interface for dfasat (early release here). But what can we  take away from placing 7th? Is PDFA-learning not competitive for sequence prediction? The answer is a solid jein: By using dfasat, we assumed that all problem sets were generated by a deterministic and probabilistic finite state automaton (PDFA). In practice, most problem sets were generated by HMMs or contained linguistic data. Both data types cannot necessarily be learned very well by our PDFA models. The results reflect this, as outlined in the following table. For the HMM problems, we obtain OK-scores. That is expected because our PDFA models are not quite as expressive as the HMMs used to generate the data, but the gap is not too large. On the linguistics data, we really struggle to obtain reasonable scores (e.g. problem 10).

Table of the performance metrics.
Table of the performance metrics.

But problem 9 is a very interesting case: it contains software traces. For this problem type, PDFA models obtained the second best score and beat most of the RNN and CNN approaches. I expect that LSTM/RNN approaches can obtain equally good or better scores, but require a lot more data to learn a model of equal predictive quality. I am planning to analyze the character-level networks (e.g. with methods used here) used by the competitors to understand better what aspects they managed to learn.

I will add a more detailed description of the problem sets later on.