Natural Language Processing
Principal Investigator: Dan Roth
Recent studies have shown that over 85% of the information that corporations handle is unstructured, the vast majority of which is textual. A multitude of techniques has to be used to enable intelligent access to this information and to support transforming it into forms that allow sensible use of it. The fundamental issue that all these techniques have to address is that of semantics: if one wants to access text based on its content (rather than just the words mentioned in it), there is a need to move toward understanding the text at an appropriate level. This may include understanding the topic of the text, the events described in it, the sentiments expressed in it, and "who is doing what to whom".
While there has been huge progress in research in these directions over the last few years, all commercial applications (e.g., those from Google, Yahoo, and Microsoft) make use of very shallow techniques based on keywords that are present in the text. One of the key reasons is computational: semantic analysis of text, even at the level of a single sentence or a single paragraph, may take seconds on a single-core machine. Analyzing the huge amounts of data required to support better search, information extraction, and other deeper analyses of text is therefore infeasible. Over the last few years, we have developed several algorithms and systems that support deeper analysis of natural language and that can serve as a basis for the current research.
Our work focuses on techniques for parallelizing natural language processing and on constructing a parallelized Natural Language Processing (NLP) framework that developers can use to create large-scale NLP-based applications. This involves creating parallelized functionality for optimization-based machine learning (very high-dimensional vector operations), constrained-optimization-based inference (including integer linear programming), and a large number of sub-graph isomorphism-like computations. Parallelization can be done at multiple levels: from data parallelism, to decomposing processes into independent threads, to algorithmic innovations that would facilitate more efficient learning and constrained optimization algorithms in very high dimensions.
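As a rough illustration of the simplest of these levels, data parallelism, the sketch below applies a per-sentence analysis to many sentences independently. It is a minimal sketch under stated assumptions, not the project's framework: `analyze_sentence` and `analyze_corpus` are hypothetical names, and the token counter merely stands in for a far more expensive semantic analysis.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Dict, List


def analyze_sentence(sentence: str) -> Dict[str, int]:
    """Hypothetical stand-in for an expensive per-sentence semantic analysis.

    It only counts lowercased tokens; a real analyzer would be far costlier.
    """
    counts: Dict[str, int] = {}
    for token in sentence.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts


def analyze_corpus(
    sentences: List[str],
    workers: int = 4,
    executor_factory: Callable[..., Executor] = ThreadPoolExecutor,
) -> List[Dict[str, int]]:
    """Analyze every sentence independently (data parallelism).

    Sentences share no state, so they can be handed to a pool of workers.
    Executor.map returns results in the same order as the inputs.
    """
    with executor_factory(max_workers=workers) as pool:
        return list(pool.map(analyze_sentence, sentences))
```

Threads keep the sketch easy to run anywhere; for CPU-bound analysis in CPython one would more likely pass `concurrent.futures.ProcessPoolExecutor` as `executor_factory`, so that the work is spread across cores.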
We foresee both server-based and client-centric NLP-based applications. Client-side applications include smart language translation on mobile devices, intelligent application user interfaces, and human-like interfaces to virtual characters. Such applications demand real-time, interactive performance, which places stringent demands on the underlying implementations and makes them an excellent driver for our parallelization technologies.