Human Language Technology: Applications to Information Access

Detailed schedule for autumn 2016

Andrei Popescu-Belis - Idiap Research Institute

N. Date Morning (10:15-12:00, room ELE 111) Afternoon (13:15-15:00, same room)
INTRODUCTION
1 Sep. 22, 2016 Introduction
  • Presentation of the topic. Recent applications of human language technology (HLT), focusing on accessing text-based information across three barriers: the quantity barrier, the cross-lingual barrier, and the subjective barrier. Objectives of the course, applications, resulting knowledge/skills.
  • Plan and structure of the course. Useful references: books, websites (for data and software), course page.
  • Organization, student duties, grading: exercises (one graded), article presentation (graded), personal project (requirements, defense).
  • Introduction to the data-driven HLT framework: machine learning, classifiers, features, training-development-test, data, evaluation.
  • Basics (or reminders) of NLP and CL theory: methods for language analysis, formalisms, linguistics.
 
I. OVERCOMING THE QUANTITY BARRIER
1 (cont.)   Classification
  • Document classification using lexical features.
Run a classifier on newswire data (Reuters), mainly as a test for getting working methods into place. Extract features (possibly with tokenization, stemming, etc.). Run Weka. Evaluate the results.
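As a rough illustration of the steps above (tokenize, extract lexical features, train, classify), here is a minimal sketch in plain Python. It is not the Weka workflow used in the session: it swaps in a small multinomial Naive Bayes classifier with add-one smoothing, and the toy texts, labels, and function names are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: texts and labels are invented, not taken from Reuters.
TRAIN = [
    ("oil prices rise as crude exports fall", "energy"),
    ("crude oil output cut by producers", "energy"),
    ("wheat harvest and grain exports increase", "grain"),
    ("corn and wheat prices drop after harvest", "grain"),
]
MODEL = train_naive_bayes(TRAIN)
```

For instance, `classify("crude oil producers", MODEL)` should select the "energy" class on this toy data. A real experiment would hold out development and test documents and report accuracy or F1 on them.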
-- Sep. 29, 2016
(APB absent)
No course. No practical work.
2 Oct. 6, 2016
  • Discussion of TP on classifying Reuters articles, including students' results.
Beyond information retrieval (part 1)
  • Basics of IR: boolean model, vector space model, probabilistic models. Extensions: relevance feedback; pseudo-relevance feedback; query expansion.
Experiments with Lucene: pre-process Reuters data, index it, search it with various queries.
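The vector space model mentioned above can be sketched in a few lines of plain Python, assuming a basic TF-IDF weighting and cosine similarity. This is only an illustration of the idea, not Lucene's scoring function; the documents, the query, and the function names are invented for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Build one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized_docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    tokenized = [doc.lower().split() for doc in docs]
    vectors, idf = tf_idf_vectors(tokenized)
    query_counts = Counter(query.lower().split())
    # Query terms absent from the collection carry no weight.
    query_vec = {t: c * idf[t] for t, c in query_counts.items() if t in idf}
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Invented toy collection.
DOCS = [
    "oil prices rise",
    "wheat harvest grows",
    "oil exports and wheat exports",
]
```

Calling `rank("oil prices", DOCS)` places the first document at the top, since it shares both query terms, and the second document last, since it shares none.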
3 Oct. 13, 2016 Beyond information retrieval (part 2)
  • Learning-to-rank: introductory notions.
  • Recommender systems: main models.
  • Just-in-time retrieval, implicit queries or query-free systems.

Design a simple text-based just-in-time retrieval system (over Reuters or Wikipedia) for a text editing framework (using the Java document listener model provided) and Lucene (or even Google). The system suggests useful documents while the user types a text, such as an article or an email.
Note: practical work on text-based just-in-time document retrieval will be graded. A brief report (about one page) is due before Friday, October 28, 2016, 23:59 Lausanne time, by email.
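One possible way to form the implicit query for such a system is sketched below in Python, purely as an illustration; the practical work itself uses the Java document listener and Lucene. The sketch assumes a simple heuristic (keep the most frequent content words among the last words typed), and the stopword list, the parameter names `window` and `max_terms`, and the function name are all hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "that", "this", "with", "as", "be", "are", "was",
})


def implicit_query(typed_text, window=30, max_terms=5):
    """Build a keyword query from the most recent words the user typed.

    Only the last `window` tokens are considered, stopwords and very short
    tokens are dropped, and the `max_terms` most frequent remaining words
    are returned (ties broken alphabetically).
    """
    tokens = re.findall(r"[a-z]+", typed_text.lower())
    recent = tokens[-window:]
    content = [t for t in recent if t not in STOPWORDS and len(t) > 2]
    counts = Counter(content)
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:max_terms]
```

The returned terms would then be joined into a query string and sent to the search engine each time the text changes enough, for example after every new sentence.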
4 Oct. 20, 2016 Deep learning for NLP: Word representation learning (by Nikolaos Pappas)
  • Basics of neural networks: perceptron, logistic regression, backpropagation, multilayer networks, overview of advanced networks.
  • Semantic similarity, traditional and recent approaches, evaluation.
Individual work on the optionally-graded practical exercise (TP), due Friday, October 28, 2016. This session will not be supervised.
Note: to send the query to Lucene (over the index created in Lesson 2), copy/adapt into DocumentEventDemo.java the code from SearchFiles.java, especially: declarations 020-038, initializations 090-092, then 100, 117, 152, 153, 177. Don't forget slides 44-47 of Lesson 3.
II. OVERCOMING THE CROSS-LINGUAL BARRIER
5 Oct. 27, 2016 Introduction to machine translation
Generalities, history of MT, typology of rule-based systems, introduction to statistical systems and to MT evaluation.
End of the TP on just-in-time retrieval, final questions and debugging. Reports due Friday, October 28. Optionally-graded means you can choose whether your TP grade (20%) comes from this TP or from the upcoming one on MT.
6 Nov. 3, 2016 Paper presentation by Trung PHAN: "Automatically building a stopword list for an information retrieval system", by Rachel Tsz-Wai Lo, Ben He, Iadh Ounis, Proceedings of the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05), Utrecht, 2005.

Translation models
  • Definition of translation models, discussion of generative modelling and learning from parallel data.
  • IBM Translation models, with emphasis on models 1 and 2. EM algorithm. Perplexity.
  • Phrase-based translation models and extensions.
  • Appendix: sentence and word alignment.
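To make the EM training of IBM Model 1 more concrete, here is a minimal Python sketch under simplifying assumptions: a uniform initialization, a NULL source word, a fixed number of iterations, and a three-sentence toy corpus invented for the example. The function names are hypothetical and the code is not tied to any toolkit used in the course.

```python
def train_ibm_model1(parallel_corpus, iterations=10):
    """Estimate lexical translation probabilities t(f|e) with EM.

    parallel_corpus is a list of (english_tokens, foreign_tokens) pairs.
    Returns a dict mapping (f, e) to t(f|e) for all co-occurring pairs.
    """
    foreign_vocab = {f for _, fs in parallel_corpus for f in fs}
    uniform = 1.0 / len(foreign_vocab)
    # Initialize t(f|e) uniformly for every pair seen in the same sentence pair.
    t = {}
    for es, fs in parallel_corpus:
        for f in fs:
            for e in ["NULL"] + es:
                t[(f, e)] = uniform

    for _ in range(iterations):
        count = {}  # expected count of f being aligned to e
        total = {}  # expected count of e being aligned to anything
        # E-step: distribute each foreign word over the source words.
        for es, fs in parallel_corpus:
            es_null = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    delta = t[(f, e)] / norm
                    count[(f, e)] = count.get((f, e), 0.0) + delta
                    total[e] = total.get(e, 0.0) + delta
        # M-step: renormalize the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, english_word):
    """Return the foreign word f that maximizes t(f|english_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


# Invented toy corpus (English source, German target).
CORPUS = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
T = train_ibm_model1(CORPUS)
```

On this toy corpus, `best_translation(T, "the")` yields "das" and `best_translation(T, "book")` yields "buch", because those pairs co-occur most often. Model 2 would add an alignment distribution that depends on word positions, which this sketch leaves out.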
Practical work: build your own SMT system.
The general goal of this series of practical sessions is to create a simple statistical MT system (e.g. EN/FR). See TP-7 on Nov. 10 for instructions.
7 Nov. 10, 2016 Paper presentation by Lesly MICULICICH on neural machine translation, including: "Neural Machine Translation by Jointly Learning to Align and Translate", by D. Bahdanau, K. Cho, Y. Bengio, Proceedings of the International Conference on Learning Representations (ICLR), 2015 (originally on Arxiv in 2014); and also: "Learning phrase representations using RNN encoder-decoder for statistical machine translation", by Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Language modeling: a major component of statistical MT

Decoding with phrase-based translation models
Practical work: phrase-based statistical MT with Moses

Following the instructions on "Machine Translation practical work", train Moses to create a translation model on a small parallel corpus. The Moses system is pre-installed on a Virtual Box Image which will be distributed.
8 Nov. 17, 2016 Presentation by Dhananjay RAM on Neural Network Language Models, based on three papers: "A Neural Probabilistic Language Model" by Y. Bengio et al. (JMLR 2003); "Recurrent neural network based language model" by Mikolov et al. (Interspeech 2010); and "LSTM Neural Networks for Language Modeling" by M. Sundermeyer et al. (Interspeech 2012).

Parameter tuning in phrase-based SMT

MT evaluation and applications
Texts for exercise: intuitive vs. analytic MT evaluation.
Continuation of the practical work on statistical machine translation: build an operational SMT system, train and test it under several conditions, and compare the resulting systems.
III. OVERCOMING THE SUBJECTIVE BARRIER
-- Nov. 24, 2016 No course. This will be replaced by a lecture session on December 15 by Nikolaos Pappas. Personal work (not in classroom): complete the optionally-graded practical work on statistical machine translation. Reports are due Friday, November 25, 2016. Optionally-graded means you can choose whether your TP grade (20%) comes from this TP or from the previous one on just-in-time recommendation.
9 Dec. 1, 2016 Paper presentation by Skanda Muralidhar on sentiment analysis of job interviews.

Introduction to sentiment analysis

Exercise on classifying positive vs. negative reviews using lexical features (see slide 25).
-- Dec. 8, 2016 No course (holiday in Valais and at Idiap). No practical work.
10 / 11 Dec. 15, 2016 Deep learning for NLP: Multilingual Word Sequence Modeling by Nikolaos Pappas
  • Multilingual word embeddings: cross-lingual alignment methods, evaluation tasks.
  • Multilingual word sequence modeling: essentials (RNN, LSTM, GRU, Attention), with applications to machine translation and document classification.
Last part of the presentation by Nikolaos Pappas.

Advising on individual projects (two groups).

Details about final project and oral presentation
CONCLUSION
12 Dec. 22, 2016 Paper presentation by Wissam Halimi on human-computer dialogue in a blocks world (SHRDLU), based on "A procedural model of language understanding" by Terry Winograd. In Computer Models of Thought and Language, edited by R. C. Schank and K. M. Colby, San Francisco, CA: W. H. Freeman, p. 114-151, 1973. Reprinted in Readings in natural language processing, edited by Barbara J. Grosz, Karen Sparck Jones, and Bonnie Lynn Webber, San Francisco, CA: Morgan Kaufmann Publishers, p. 249-266, 1986.

Analysis of human interactions

Meeting browsers

Conclusion and synthesis on HLT research: defining a problem, building reference data, finding features for machine learning algorithms, training the algorithms, evaluating and analyzing the performance.
Advising on individual projects (one group). See again the details about the final examination.

APB, September-December 2016