## Journée "optimisation et traitement automatique des langues" (16 décembre 2013)

Les méthodes d'optimisation sont de plus en plus utilisées en traitement automatique des langues pour concevoir des méthodes efficaces tant pour l'apprentissage de modèles que pour le décodage, notamment en offrant un cadre formel qui distingue clairement les contraintes globales des contraintes locales, tout en permettant de les combiner.

L'équipe RCLN du Laboratoire d'Informatique de Paris Nord s'intéresse de près à ces méthodes pour la modélisation conjointe des différents niveaux de description linguistique. Dans le cadre du pôle math/stic de Paris 13, avec le soutien du LABEX EFL, et en collaboration avec l'équipe AOC du LIPN, l'équipe RCLN organise une journée de séminaires autour des thèmes de l'optimisation et du traitement automatique des langues.

Cette journée aura lieu sur le campus de l'Université Paris 13 à Villetaneuse, dans l'amphithéâtre Euler.

- 09h30-10h00 :: Accueil (Amphi Euler)
- 10h00-11h00 :: Exposé d'
**André Martins**(Amphi Euler) - 11h15-12h15 :: Exposé de
**Sebastian Riedel**(Amphi Euler) - Pause déjeuner (LIPN)
- 13h45-14h45 :: Exposé de
**Xavier Carreras**(Amphi Euler) - 15h00-16h00 :: Exposé de
**Joseph Le Roux**(Amphi Euler) - 16h00 :: goûter au LIPN !

La participation à cette journée est libre. Si vous comptez y assister, nous vous demandons, pour des raisons pratiques (organisation du buffet notamment), de nous l'indiquer à cette adresse : http://doodle.com/87mnbcdbzdmytffc.

Résumé des quatre exposés prévus :

**André Martins**, Priberam, Lisbon

Title: **AD3: A New Decoder for Structured Prediction**

Abstract: In this talk, I will present AD3 ("Alternating Directions Dual Decomposition"), a new consensus-based decoder for problems representable as factor graphs. AD3 is an approximate decoder that ignores global effects caused by the cycles of the graph, solving a linear relaxation of the original problem. It can handle many scenarios often encountered in NLP and IR applications, such as models with constraints in first-order logic; models involving budget or knapsack constraints; and combinations of structured models which are individually tractable, but hard to decode jointly. Like other dual decomposition algorithms, AD3 has a modular architecture, where local subproblems are solved independently, and their solutions are gathered to compute a global update. The key characteristic of AD3 is that each local subproblem has a quadratic regularizer, leading to faster convergence (both theoretically and in practice). After providing closed-form solutions for several of these subproblems, I will proceed to discuss a recent active set method that works for arbitrary factors, requiring only a local maximization oracle (the same oracle required in subgradient-based dual decomposition). In the second part of the talk, I will discuss two recent applications of AD3 in NLP problems: dependency parsing and compressive summarization. I will present "Turbo Parser," an open source dependency parser, which was recently improved with AD3 and the active set method to permit fast decoding of non-projective third-order models. Experiments in 14 languages yield state-of-art results, with parsing speeds ranging between 700 and 4,000 tokens per second. For compressive summarization, the use of AD3 leads to a system which is modular in the three qualities that define a good summary (conciseness, informativeness, and grammaticality), with state-of-the-art ROUGE scores, and runtimes close to extractive summarizers. This work was done in collaboration with Noah Smith, Mário Figueiredo, Eric Xing, Pedro Aguiar, and Miguel Almeida.

**Sebastian Riedel**, UCL, London.

Title: **Predict, Price and Cut: Column and Row Generation for Structured Prediction**

Abstract: Many problems in NLP, and structured prediction in general, can be cast as finding high-scoring structures based on a large set of candidate parts. For example, In second order tagging, we have to select high-scoring transitions between tags in a globally consistent fashion. In second order graph-based dependency parsing we have to choose a quadratic number of first order and a cubic number of second order edges such that the graph is both high-scoring and a tree. What makes such problems challenging is the large number of possible parts to consider. This number not only affects the cost of search or optimization but also slows down the process of scoring parts before they enter the optimisation problem, and extracting features. In this talk I present an approach that can solve problems with large sets of candidate parts without considering all of these parts in either optimization or scoring. In contrast to most pruning heuristics, our algorithm can give certificates of optimality before having optimized over, or even scored, all parts. It does so without the need of auxiliary models or tuning of threshold parameters. This is achieved by a delayed column and row generation algorithm that iteratively solves an LP relaxation over a small subset of current candidate parts, and then finds new candidates with high scores that can be inserted into the current optimal solution without removing high scoring existing structure. The latter step subtracts from the cost of a part the price of resources the part requires, and is often referred as pricing. Sometimes parts may score highly after pricing, but are necessary in order to make the current solution feasible. We add such parts in a step that roughly amounts to violated cuts to the LP. We evaluate our approach on two applications: second order dependency parsing and first order tagging with large domains. In both cases we dramatically reduce the number of parts considered, and observe about an order of magnitude speed-up. This is possible without loss of optimality guarantees, and hence accuracy.

**Xavier Carreras**, UPC, Barcelona

Title: **Learning Automata and Grammars: From Spectral Algorithms to Convex Optimizations**

Abstract: There is an increasing interest in spectral methods to learn latent-variable language models in the form of weighted automata and context-free grammars. Spectral methods provide an algebraic formulation to the problem of inducing automata or grammars from data, and directly exploit the recurrence relations behind the model. I will review the spectral method from an algebraic perspective, making use of Hankel matrices as the key object behind the method: a Hankel matrix collects all necessary statistics of the distribution we want to learn; and finding a low-rank factorization of this matrix results in the automata or grammar. Under mild assumptions, it can be shown that this method nicely approximates the target model. From here, I will show how we can reformulate the spectral learning algorithm as a low-rank convex optimization. This will be useful to adapt the method to other settings, by adding linear constraints. I will focus in "unsupervised" induction of context-free grammars, that is, learning a grammar from plain strings. Our formulation involves optimizing for a low-rank Hankel matrix that is linearly constrained to satisfy inside-outside recursions. An analogous method method can be formulated to learn finite-state transducers from unaligned parallel strings.

**Joseph Le Roux**, LIPN, Paris

Title: **Combining PCFG-LA Models with Dual Decomposition: A case Study with Function Labels and Binarization**

Abstract: It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this talk, we present how PCFG-LAs (Probabilistic Context-Free Grammars with Latent Annotations, the state-of-the-art model for unlexicalized constituent parsing) are suitable for combination in this way. We first show how the intractable problem of exact PCFG-LA decoding is approximated with anchored PCFGs. Then we present a method for combining anchored PCFGs based on the partial superposition of tree structures. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or discarding function labels, left binarization versus right binarization) and achieve state-of-the-art parsing performance, with a labeled Parseval F-score of 92.4 on Wall Street Journal Section 23 – this represents an error reduction rate of 7% over a strong PCFG-LA product-model baseline. This work was done in collaboration with Antoine Rozenknop and Jennifer Foster.