Friday, November 18, 2016

Rich Statistical Parsing and Literary Language, PhD Thesis by A.W. van Cranenburgh


In which we introduce the topics & contributions of this thesis: syntax and literature, analyzed with computational models.

There is a traditional contrast in the study of language between linguistics and philology; this thesis presents computational work in both areas. Computational linguistics, and its applied variant Natural Language Processing (NLP), takes language as its object of study and uses computers as an instrument to develop and evaluate models. Evaluation of predictive models provides an important methodological heuristic that sets computational linguistics apart from other fields of linguistics and many branches of science. New tasks or improved models for existing tasks are benchmarked with quantitative metrics (this has been referred to as the common task framework, Liberman 2015b), giving an immediate indication of how much has been achieved and what remains to be done. One such task, considered in the first part of this thesis, is syntactic parsing, where the goal is to analyze the syntactic phrases or relations between words in a sentence.

Philology means the love of words, learning, interpretation, and literature, often from a historical perspective. The second part of this thesis deals with a topic in what could be called computational philology, more specifically, computational literary stylistics. The increasing availability of digitized texts and effective computational methods to study them offers new opportunities. These methods not only allow more data to be processed, they also suggest different questions. Machine learning and the broader field of data science offer the possibility of extracting knowledge from data in an automated, reproducible manner.
Research questions. Since this thesis covers two main topics, we state the research question for each topic, and another connecting the two. 
Parsing language: To what extent is it possible to create linguistically rich parsing models without resorting to handwritten rules by exploiting statistics from annotated data?
Probabilistic algorithms for parsing and disambiguation select the most probable analysis for a given sentence in accordance with a certain probability distribution. A fundamental property of such algorithms is thus the definition of the space of possible sentence structures that constitutes the domain of the probability distribution. Modern statistical parsers are often automatically derived from corpora of syntactically annotated sentences (“treebanks”). In this case, the “linguistic backbone” of the probabilistic grammar naturally depends on the convention for encoding syntactic structure that was used in annotating the corpus.
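
The idea of selecting the most probable analysis can be illustrated with a minimal sketch. It assumes a toy probabilistic context-free grammar whose rules and probabilities are invented for illustration (they are not taken from the thesis or from any treebank), and it scores two competing analyses of a sentence with an ambiguous prepositional phrase. The names `RULE_PROBS`, `rules`, `tree_prob`, and `most_probable` are hypothetical.

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# All values are made up for illustration; probabilities per left-hand side sum to 1.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("she",)): 0.3,
    ("NP", ("stars",)): 0.3,
    ("NP", ("telescopes",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("P", ("with",)): 1.0,
}

# A tree is a nested tuple (label, child, ...); a leaf is a plain string.
VP_ATTACH = ("S", ("NP", "she"),
             ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
                    ("PP", ("P", "with"), ("NP", "telescopes"))))
NP_ATTACH = ("S", ("NP", "she"),
             ("VP", ("V", "saw"),
                    ("NP", ("NP", "stars"),
                           ("PP", ("P", "with"), ("NP", "telescopes")))))


def rules(tree):
    """Yield every (lhs, rhs) rule application in a tree, top-down."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def tree_prob(tree):
    """Probability of a tree: the product of its rule probabilities."""
    return prod(RULE_PROBS[rule] for rule in rules(tree))


def most_probable(candidates):
    """Disambiguation: pick the candidate analysis with the highest probability."""
    return max(candidates, key=tree_prob)


best = most_probable([NP_ATTACH, VP_ATTACH])
print(best is VP_ATTACH, tree_prob(VP_ATTACH), tree_prob(NP_ATTACH))
```

Under these invented probabilities the analysis that attaches the prepositional phrase to the verb phrase wins (roughly 0.0038 against 0.0025). The set of candidate trees plays the role of the space of possible structures: a grammar that cannot express a given analysis can never select it, however the probabilities are estimated.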

Statistical parsers are effective but are typically limited to producing projective dependencies or constituents. On the other hand, linguistically rich parsers recognize non-local relations and analyze both form and function phenomena, but rely on extensive manual grammar engineering. We combine advantages of the two by building a statistical parser that produces richer analyses.
Markers of literariness: What sorts of syntactic and lexical patterns may correlate with and explain the concept of literature?
In contrast to genre fiction, literary novels do not deal with specific themes and topics. However, they may still share stylistic and other implicit characteristics that may be uncovered using text analysis and machine learning.

These two research questions are connected by the following question:
Syntax in literature: For what sorts of stylistic and stylometric tasks and under which conditions can morphosyntactic information be exploited fruitfully?
Previous work has shown that simple textual features that are easy to extract, in particular Bag-of-Words features, typically outperform structural features such as syntax, which are comparatively expensive to extract. Our aim is to see to what degree this holds in the case of (literary) fiction, and whether there are specific aspects for which syntax is important.
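
As a rough illustration of what Bag-of-Words features capture and what they discard, the sketch below counts word occurrences with Python's standard library. The tokenizer (a simple regular expression over lowercase letters and apostrophes) and the function name `bag_of_words` are assumptions made for this example, not the feature extraction used in the thesis.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, ignoring word order and punctuation.

    Assumption: a crude regex tokenizer stands in for real tokenization.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


a = bag_of_words("The dog bit the man.")
b = bag_of_words("The man bit the dog.")

print(a)
print(a == b)  # same bag, although the sentences differ in who did what
```

The two sentences receive identical feature vectors even though their syntactic structure assigns different roles to "dog" and "man". Structural features are meant to recover this kind of information, at the cost of running a parser first.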

Outline. The common themes in the two parts of this thesis are (a) the use of tree fragments as building blocks and predictive features, and (b) non-local and functional relations.

Part I deals with parsing and is concerned with general language use as made available in annotated data sets of several languages.

Part II deals with literature and focuses on contemporary Dutch novels and in particular on what differentiates literary language from the language of genre fiction. This work is done in the context of the project “The Riddle of Literary Quality,” which aims to investigate the concept of literature empirically by searching for textual features of literary conventions in contemporary novels.

The two parts of this thesis can be read independently. One exception is that the algorithm for extracting recurring tree fragments defined in part I is used in part II, and should be referred to for specifics on that method.

Contributions. The contributions of this thesis can be summarized as follows:

  • Efficient extraction of recurring patterns in parse trees, which can be used to build grammars, as features in machine learning tasks, and in linguistic research in general. The method that is presented provides a significant improvement in efficiency and makes it possible to handle much larger corpora.
  • A statistical parser automatically learned from treebanks, reproducing rich linguistic information from the treebanks, such as discontinuous constituency & function tags. The parsers are induced from data with minimal manual intervention and evaluated on several languages.
  • An investigation of what makes texts literary, using ratings from a large online survey and machine learning models of texts to predict those ratings.
  • These models, based on lexical, topical, and syntactic features, demonstrate that the concept of literature is non-arbitrary, and predictable from textual characteristics to a large extent.
