
Mini-Nuggets: Natural Language Understanding

Scott Fahlman,   May 12, 2008
Categories:  Natural Language    

[Photo: Rainbow]

Sometimes it is useful to post an article that is just a collection of mini-nuggets: a set of propositions, perhaps with some minimal explanation for each, on a given topic. The idea is to sketch out an overall approach or point of view quickly, without getting bogged down in a lot of detail or in lengthy justification. We can always come back later to fill in more detail, point by point.

This article concerns the Scone research group’s overall approach to natural language understanding. Some bits of this approach have been implemented in the natural-language tools we are building as a front end for the Scone knowledge-base system, but much of it is still planned or “under construction”. In any case, it seems useful to produce some kind of roadmap, even if it’s crude, general, incomplete, and subject to later revision.

I should mention that, while most of the points listed here seem obvious to me, many of them are controversial. Others may be generally accepted by researchers in language understanding, but not generally acted upon – that is, other researchers may agree that these are good ideas for the distant future, but in the meantime they feel it is more practical to follow some other path.

I should also mention that, in many cases, my rationale for favoring these propositions is rooted in introspection – what I can observe (or think I can observe) about how I handle various issues in language. Psychologists and linguists are justifiably wary of introspection arguments, which have often been shown to be incorrect or at least misleading. For example, introspection would suggest that multiplying two large numbers is computationally more demanding than vision or speech understanding, and we now know that is false – very false. However, when solid experimental information is lacking, I have found introspection to be a very powerful source of “search-guiding heuristics”, as we say in AI. A heuristic is a hint that gives us some guidance about where the good answers may be hiding in a large and confusing space of possibilities.

So: nobody should just accept these propositions as proven fact or even as a consensus within the field. I’m just trying to indicate my own best guesses, as of today, about the approaches that are most likely to lead to success. And I’m trying to show what part of the space my group is working in, thus providing some context for related posts that will come later.

One final point: I don’t claim that the point of view presented here is original or unique, though it is some distance from the current mainstream of natural-language research. These ideas have been gathered from many sources over the years, and in most cases it is hard for me to reconstruct the influences that went into this view. The work of Chris Riesbeck and his student Kevin Livingston on the DMAP system is pretty close in spirit to what we are doing.[1]

In case you’re wondering, the photo is of Rainbow the Guinea Pig, who is hereby designated as the patron saint of mini-nuggets. She produces them much faster than I can. OK, on with the show…

· The key problem is natural language understanding (NLU) – that is, converting spoken or written language to some language-independent, concept-based internal form that is easy to reason about. There are plenty of other natural-language problems one could work on, without going all the way to meaning, but NLU is the heart of the monster. If we don’t extract and represent the meaning, we can never understand the item well enough to answer questions, draw conclusions, or learn new deep knowledge from the text or speech. And if we do handle NLU, all the other natural-language applications – classification, summarization, even parsing – either fall out or can be done in a deeper and less brittle way. The NLU problem is difficult, but ultimately there is no way around it.

· The knowledge base must be a full partner in NLU, not just a place where you store the output of language processing. As I argued in a previous article, the language fragment itself contains only a minimal amount of information; in order to make sense of it, the listener must combine what was said with a large body of background information that is assumed to be known by both the speaker and the listener.

· Syntax is important – but mostly as a tie-breaker for possible meanings. For decades, a lot of NL researchers have focused on syntax and parsing, in part because of the influence of Noam Chomsky, and in part because this is an easier place to start than going after meaning. But syntax is just a tool, providing some hints about how a linear string of words should be folded into a tangled meaning network whose structure is far from linear. Syntax does play an important role in language: “Dog bites man” is different in meaning from “Man bites dog”, and syntax (or at least word-order) is the only difference between these two fragments. But an utterance like “You me lunch pizza tomorrow” is generally understandable, despite the almost complete lack of syntax. Meaning is doing all the work here: understanding how these words fit together into a larger structure would be impossible if “lunch” and “pizza” and “tomorrow” were concepts whose meanings were unknown to the listener.

· We want our NLU techniques to work well with non-grammatical text and speech. Linguists seem to be endlessly fascinated by the question of whether or not a sentence is correctly formed – in most of their papers, ill-formed sentences are consistently marked with the “asterisk of shame”. While it is interesting that people seem to be able to make consistent and confident correctness judgments about sentences (even if they speak a non-standard dialect of their language, so their judgments differ from those of the “experts”), that’s not nearly as important to us as the ability to extract as much meaning as possible, even from ill-formed utterances.

Recently, members of my research group have been working on understanding free-text fields in patient medical records. These are either hastily scribbled notes or notes that were hastily dictated and later transcribed by personnel with limited medical training. It is interesting that in a corpus of 100 such notes, there are only a few complete grammatical sentences. And yet other readers with medical training can almost always make sense of what the author was trying to say. That, I think, is the more interesting challenge.

· When in doubt, guess and go on, but be ready to backtrack and try again. A lot of NLP systems are organized as a strict pipeline: tokenization, morphology, parsing, and then semantic analysis (meaning), with no backtracking allowed. Other functions, such as spelling correction, ambiguity resolution, and reference resolution, may also be broken out into separate modules. I think that this one-way-flow approach is a bad one. In a strict pipeline, the upstream modules lack the information they need in order to come up with the best single result; instead, they must assemble and pass forward a data structure that lists all the “live” possibilities, with some sort of score for each. Then the downstream modules can narrow the list, throwing away the possibilities that don’t work at their level. The problem is that this data structure can grow to huge size, since the uncertainties tend to multiply. It is possible to come up with an efficient encoding for that large data structure, but the cost of encoding and decoding can be very high.

A better approach, I think, is for the upstream module to engage in a brief conversation with the downstream semantics (or knowledge base) module, trying to gather the information it needs in order to make a choice. And then it should just produce its best guess and pass that on. That guess will sometimes be wrong, so the upstream module must be ready to receive an error message kicked back from downstream: “That guess doesn’t work (for some specific reason). Please try again.” Implementing this loose, flexible control structure, supporting what we call non-local backtracking, does impose some cost, but I believe that this will be far less than the cost of building, passing forward, and unpacking a whole package of possibilities so that the upstream modules can wash their hands of any further responsibility. In most cases, the initial guess will turn out to be the right one, so “guess and go” can be very efficient.
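To make the control structure concrete, here is a minimal sketch of “guess and go” in Python. It is a toy, not our implementation: the lexicon, the function names (`ranked_senses`, `kb_accepts`, `interpret`), and the single sanity rule are all invented for illustration, and the backtracking is plain depth-first retry rather than an error message routed to the specific module responsible. The upstream side commits to its best guess for each word; the downstream check stands in for the knowledge base and can reject a partial reading, which sends control back to the most recent choice that still has an untried alternative.

```python
# Toy lexicon (hypothetical): senses for each word, most likely first.
SENSES = {
    "bass": ["bass/fish", "bass/instrument"],
    "played": ["play/perform"],
}


def ranked_senses(word):
    """Upstream module: candidate senses for a word, best guess first."""
    return SENSES.get(word, [word])


def kb_accepts(reading):
    """Downstream sanity check standing in for the knowledge base.

    The only rule in this toy: a fish cannot be played like an instrument.
    """
    return not ("bass/fish" in reading and "play/perform" in reading)


def interpret(words, accepts, chosen=(), rejected=None):
    """Commit to the best guess for each word in order of arrival.

    If the downstream check rejects a reading, try the next alternative;
    if none is left, return None so that an earlier choice is revisited.
    """
    if not words:
        return list(chosen)
    for sense in ranked_senses(words[0]):
        reading = chosen + (sense,)
        if not accepts(reading):
            if rejected is not None:
                rejected.append(reading)  # "That guess doesn't work."
            continue
        result = interpret(words[1:], accepts, reading, rejected)
        if result is not None:
            return result
    return None


rejected = []
reading = interpret(["the", "bass", "was", "played"], kb_accepts, rejected=rejected)
print(reading)
print(rejected)
```

In this run the first guess for “bass” (the fish) survives until “played” arrives; the rejection at that point is resolved by going back two words and choosing the instrument sense instead. Only one reading is ever rejected, and no list of all possible readings is built or passed forward.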

· Probably something like a Construction Grammar is the best tool for handling syntax, given our goals and general approach. Think of a construction as a pattern to be matched against the incoming text – a sequence of certain specific words and some variables that are restricted as to the syntactic type and/or the semantic category of the filler: ( ?X{person} kicks ?Y{physical object}). When the construction is matched, it can produce some new structure in the knowledge base or a new composite language-object (e.g. a noun phrase) that can be matched by some other construction. Or it can initiate some processing, for example to resolve an ambiguity or to choose between two matching constructions. Or all of the above.

Construction grammars seem to be a good fit for us for several reasons: they can process language fragments in a bottom-up way, instead of insisting on having a correct and complete sentence; we can easily create special constructions to represent idiosyncratic or non-grammatical constructs that we encounter frequently; the matching (or potential matching) of a construction is a good time for us to check with the knowledge base to see if the emerging meaning makes sense; and the proponents of construction grammar tend to view language learning mostly as a process of bottom-up generalization, starting with specific utterances that have been encountered and then trying to collect these into more general patterns.

Charles Fillmore and George Lakoff are often credited as the pioneers of construction grammars. A good treatment of CG, and of learning constructions by generalization, is the book Constructions at Work by Adele E. Goldberg. There are many variations on this theme: some of my students are fans of Radical Construction Grammar, a version proposed by William A. Croft; as of now, I’m agnostic about all these variations – I’m happy to go with whatever we can get working.
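As a rough illustration of the pattern ( ?X{person} kicks ?Y{physical object}) mentioned above, here is a small Python sketch of matching a construction against a fragment. Everything in it is an assumption made for the example: the tiny is-a table, the word-to-type lexicon, and the names `is_a`, `match`, and `KICKS` are hypothetical, and a real matcher would also handle determiners, composite phrases, and partial matches.

```python
# Hypothetical single-parent type table: child -> parent.
IS_A = {
    "person": "animal",
    "animal": "physical object",
    "ball": "physical object",
}

# Hypothetical lexicon: word -> semantic category of what it denotes.
WORD_TYPE = {
    "Fred": "person",
    "ball": "ball",
    "justice": "abstraction",
}


def is_a(kind, ancestor):
    """True if `kind` equals `ancestor` or lies below it in IS_A."""
    while kind is not None:
        if kind == ancestor:
            return True
        kind = IS_A.get(kind)
    return False


def match(pattern, tokens):
    """Match a construction against a token sequence of the same length.

    Pattern elements are either literal words (str) or (variable, category)
    pairs. Returns a dict of variable bindings, or None if there is no match.
    """
    if len(pattern) != len(tokens):
        return None
    bindings = {}
    for element, token in zip(pattern, tokens):
        if isinstance(element, str):
            if element != token:  # a literal word must match exactly
                return None
        else:
            variable, category = element
            kind = WORD_TYPE.get(token)
            if kind is None or not is_a(kind, category):
                return None  # the filler has the wrong semantic category
            bindings[variable] = token
    return bindings


# ( ?X{person} kicks ?Y{physical object} )
KICKS = [("X", "person"), "kicks", ("Y", "physical object")]

print(match(KICKS, ["Fred", "kicks", "ball"]))
print(match(KICKS, ["Fred", "kicks", "justice"]))
```

The first call binds X to “Fred” and Y to “ball”; the second fails because, in this toy lexicon, “justice” is not a physical object. The point at which the bindings are returned is where a fuller system could build new structure in the knowledge base or ask it whether the emerging meaning makes sense.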

· Robust speech understanding depends on knowledge. There is currently a lot of work on improving the signal-processing levels of speech understanding systems with the goal of making these systems more robust under noisy conditions. Any progress on that front is welcome, but I believe that future improvement in this area is more likely to come from the application of knowledge.[2] The reason I can make out what someone is saying in a very noisy environment – so noisy that some words are completely obliterated – is that I usually can predict what a person is likely to say next. And if I think I have heard (or deduced) what the person said, I can immediately sanity-check my understanding against the current conversational context and my extensive background knowledge. If my interpretation doesn’t make sense in meaning-space, I can go back to the signal-processing level and ask it to come up with some other hypotheses.

· Process the text or speech eagerly, in order of arrival. This proposition is based on introspection, though I suspect that it would be easy to confirm this experimentally – that may already have been done.

I have a very strong sense that I am processing a speech stream more or less as it arrives, rather than waiting until a whole sentence is in hand so that I can apply a more global, top-down style of analysis. I have the same sense when I am reading text: I process English text left-to-right as I read it. The processing may lag the input by a few words, so that we can work at the phrase level rather than the word level. We will occasionally be forced to defer processing of some phrase until we get more information, but I think that we human listeners do as much processing as we can, as soon as we can. This seems like the best strategy if we want to follow the previous advice about robust speech, using background knowledge to help us identify and sanity-check the words we hear.

A quick example. Here comes some text from a recipe, fragment by fragment, with the processing described in the commentary after each fragment:

“In a bowl…” OK, we’ve got a container here – a place where something is going to happen. The action or event is probably coming next.

“…mix together…” OK, the action of mixing ingredients often takes place in a bowl, so that fits our expectation. We’re probably going to get a list of ingredients next.

“…the flour, salt, and sugar.” Here’s the list we were expecting. As discussed in an earlier post, the word ‘the’ indicates that we should already know what measures of flour, salt, and sugar are being referred to here. Probably these ingredients were listed, with the quantity of each, at the start of the recipe. Check whether we have seen this list.

“Stir it gently.” The word ‘it’ here must refer to the mixture in the bowl – the stuff being stirred. The knowledge base knows that stirring is a kind of mixing, and now we’re being told more about the manner of the mixing.

And so it goes. In addition to trying to resolve the referents for various phrases as we go along, the eager processor may also be asking “Why is the speaker telling me this?” and “Do I believe it?”

· To fully represent the “meaning payload” of natural language we need a representation significantly more expressive than first-order logic (FOL). We will have a lot more to say about this in future articles. For now I’ll just say that things like the temporal site of an action (indicated by verb tense), intention, belief, supposition, and meta-information about whether we believe a given statement all rely on mechanisms in Scone that lie outside the domain of conventional first-order logic, and far outside the realm of description logics such as OWL.

· Linguistic knowledge is distinct from world-knowledge, but it can be stored using the same knowledge-base mechanisms. The world-knowledge representations used in Scone are expressed in terms of concepts and relations, independent of how they are described in any human language. These are organized in a multiple-inheritance type hierarchy with exceptions. A Scone concept may be linked to one or more names (words or phrases) in English, and to additional names in other languages and dialects. These, too, live in a type hierarchy, but a different one: an action-verb is a verb is a word, and so on.

The name/meaning relation is many-to-many: even within a single language, an object may have many names, and a word may have many meanings. Similarly, larger linguistic patterns (e.g. construction grammar patterns) may be stored in the knowledge base, occupying their own multiple-inheritance taxonomy. “X kicks the bucket” is a kind of “X kicks Y”, but it is more specialized and, in this case, produces a different meaning.
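For readers who think better in code, here is a toy Python sketch of the two ideas in this item: a multiple-inheritance hierarchy with exceptions, and a many-to-many relation between names and concepts. This is not Scone's API or its inference algorithm; the class `TinyKB`, its methods, and the example concepts are all invented for illustration. In particular, the lookup rule used here (breadth-first, nearest value wins) is just one simple assumption about how an exception on a subtype overrides a default on a supertype.

```python
from collections import defaultdict, deque


class TinyKB:
    """Toy knowledge base (hypothetical; not the Scone API)."""

    def __init__(self):
        self.parents = defaultdict(list)      # concept -> parent concepts
        self.local = defaultdict(dict)        # concept -> locally stated properties
        self.names_of = defaultdict(set)      # concept -> {(language, word)}
        self.meanings_of = defaultdict(set)   # (language, word) -> {concepts}

    def add_concept(self, concept, parents=(), **properties):
        self.parents[concept].extend(parents)
        self.local[concept].update(properties)

    def add_name(self, concept, word, language="en"):
        # Record the link in both directions, so the relation is many-to-many.
        self.names_of[concept].add((language, word))
        self.meanings_of[(language, word)].add(concept)

    def lookup(self, concept, prop):
        """Breadth-first search up the hierarchy; the nearest value wins,
        so a value stated on a subtype overrides the inherited default."""
        queue, seen = deque([concept]), set()
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if prop in self.local[node]:
                return self.local[node][prop]
            queue.extend(self.parents[node])
        return None


kb = TinyKB()

# World knowledge: defaults, multiple inheritance, and one exception.
kb.add_concept("bird", can_fly=True)
kb.add_concept("swimmer", can_swim=True)
kb.add_concept("sparrow", parents=["bird"])
kb.add_concept("penguin", parents=["bird", "swimmer"], can_fly=False)

# Linguistic knowledge lives in the same structure, in its own hierarchy.
kb.add_concept("word")
kb.add_concept("verb", parents=["word"])
kb.add_concept("action-verb", parents=["verb"])

# Names: one word with two meanings, one concept with several names.
kb.add_concept("bass (fish)")
kb.add_concept("bass (instrument)")
kb.add_name("bass (fish)", "bass")
kb.add_name("bass (instrument)", "bass")
kb.add_concept("dog")
kb.add_name("dog", "dog")
kb.add_name("dog", "hound")
kb.add_name("dog", "chien", language="fr")

print(kb.lookup("sparrow", "can_fly"), kb.lookup("penguin", "can_fly"))
print(sorted(kb.meanings_of[("en", "bass")]))
print(sorted(kb.names_of["dog"]))
```

A specialized construction such as “X kicks the bucket” could be stored the same way, as a child of “X kicks Y” that overrides the meaning it produces.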

· Ideally, we would like to use the same body of language knowledge for understanding and for generation. I’m not sure if we can do that efficiently, but it would certainly be a waste of space if we had separate knowledge structures for understanding and for speaking. In addition, if the input and output structures were distinct, we would have to worry about keeping them in synch.


[1] Livingston, K., and Riesbeck, C. K. (2007). Knowledge Acquisition from Simplified Text. In Proceedings of the International Conference on Intelligent User Interfaces 2007 (IUI’07), Honolulu, Hawaii, USA, January 2007.

[2] Actually, improved signal processing and the use of knowledge are complementary approaches, so we do not have to choose between them.
