Text Mining Applied to Open-Ended Questions

Instructor: Matthias Schonlau, University of Waterloo

Time: 9.00-12.00 July 17th 2017

Text data from open-ended questions in surveys are difficult to analyze and are frequently ignored. Yet open-ended questions are important because they do not constrain respondents’ answer choices.  Where open-ended questions are necessary, sometimes multiple human coders hand-code answers into one of several categories. At the same time, computer scientists have made impressive advances in text mining that may allow automation of such coding.

The purpose of this short course is to show participants (a) how text can be converted into numerical n-gram variables and (b) how to run a statistical learning algorithm on such variables. An n-gram is a contiguous sequence of n words in a text. A new Stata program, ngram, will be introduced and made available to participants. The program supports a large number of European languages: Danish, German, English, Spanish, French, Italian, Dutch, Norwegian, Portuguese, Romanian, Russian, and Swedish. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds.
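As a rough illustration of what such n-gram variables look like, the following Python sketch counts unigrams and bigrams for a few made-up answers. It is not the Stata ngram program: the function names, the example answers, and the simple whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter

def ngrams(text, n):
    """Return the contiguous n-word sequences in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_matrix(texts, max_n=2):
    """Build one count column per n-gram (n = 1..max_n) and one row per text."""
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocab = sorted(set().union(*counts))           # all n-grams seen in any text
    rows = [[c[g] for g in vocab] for c in counts]  # Counter returns 0 if absent
    return vocab, rows

# Hypothetical answers to an open-ended question
answers = ["too expensive", "not too expensive", "expensive and slow"]
vocab, rows = ngram_matrix(answers)

for name, column in zip(vocab, zip(*rows)):
    print(f"{name:15s} {column}")
```

Each printed line is one n-gram variable with its count in each of the three answers. Even this toy example yields nine variables from three short answers, which suggests why real survey data can produce hundreds or thousands.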

We will use the n-gram variables to categorize answers to open-ended questions using a supervised learning algorithm (e.g., support vector machines). Examples will be given using Stata.
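To make the idea of supervised categorization concrete, here is a minimal Python sketch that learns from a handful of hand-coded answers and then assigns a category to a new answer. Note the substitutions: it uses a multinomial naive Bayes classifier on unigram counts rather than a support vector machine, because naive Bayes fits in a few lines without external libraries. The training answers, category labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Fit a multinomial naive Bayes model on word (unigram) counts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of training answers
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Add-one (Laplace) smoothing over the training vocabulary
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded answers (training data)
texts = ["too expensive", "price is too high",
         "very slow service", "slow and unhelpful staff"]
labels = ["cost", "cost", "service", "service"]

model = train_nb(texts, labels)
print(predict_nb(model, "the price is high"))
print(predict_nb(model, "slow staff"))
```

The workflow is the same whichever learning algorithm is used: human coders label a subset of answers, the model is trained on the word counts of that subset, and the remaining answers are categorized automatically.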

Learning objectives:

(1) explain the bag-of-words / n-gram approach to text mining,

(2) apply the bag-of-words / n-gram approach in Stata,

(3) apply a supervised learning method to the categorization of answers to open-ended questions in Stata.

About the instructor:

Matthias Schonlau is a Professor in the Department of Statistics and Actuarial Science at the University of Waterloo, Canada. Prior to his academic career, he spent 14 years at the RAND Corporation (USA), the Max Planck Institute for Human Development in Berlin (Germany), the German Institute for Economic Research (DIW), the National Institute of Statistical Sciences (USA), and AT&T Labs Research (USA). Dr. Schonlau’s current research focuses on applying statistical/machine learning algorithms to open-ended questions. He is a board member of the European Survey Research Association. He is the lead author of the book "Conducting Research Surveys via E-Mail and the Web". Dr. Schonlau has published more than 60 peer-reviewed articles.