Michael Kipp, Jan Alexandersson, Norbert Reithinger: Understanding Spontaneous Negotiation Dialogue
Comment 1 by Joachim Quantz (16.3.2000); answer by Michael Kipp et al. (14.4.2000).
C1. Joachim Quantz (14.3.00):
Let me start by saying that I enjoyed reading the paper - it is very well written and gives a good overview over the current status of Verbmobil's dialogue module.
My main criticism would be that it does not really address issues of scalability, reusability, robustness and performance, but then these are rather fundamental issues of dialogue processing or NLP in general and might be beyond the scope of the paper. In other words, I don't necessarily expect detailed answers to all of my comments.
Domain and Application:
Whereas the domain is clearly described in the paper, the application is less clear to me. In the first paragraph of Section 3, the authors state their assumption "that in a task-oriented dialogue it is sufficient to know the communicative function (dialogue act) and the propositional content of an utterance." Sufficient for what? They continue with "The criterion for a successful shallow translation is the conservation of dialogue act and propositional content [Levinson, 1993]", which seems to imply that the application in mind is shallow translation. However, in Section 5 the authors list three different consumers of the information provided by the dialogue module: the dialogue script generator, the semantic transfer, and the deep analysis. The dialogue script generator is described in most detail and is the most convincing application from my point of view. It would be interesting to have a general taxonomy of applications in which the dialogue module could be used and to see which of the applications could be supported by the solutions presented in the paper.
Reusability/Scalability:
Could mean, for example, use of the dialogue module in different systems or for different domains, applications, languages.
With respect to applications, I can see how the information is used by the dialogue script generator, and that this approach is somewhat generic and could be used for other domains and scenarios. However, the examples provided for the other two consumers are not very convincing. In particular, the Mr. Hallermann example seems to me very idiosyncratic. I fail to see the generic nature of the information provided by the dialogue component to disambiguate the examples. (Of course, this relates to the general problem of using background knowledge in NLP. It seems that every ambiguity example involving background knowledge requires its own set of inference rules...)
One major problem of reuse of NLP components is the lack of standards for interfaces and formats. Does the dialogue module offer an API, or how is information exchanged with the other modules? How easy/difficult would it be to integrate the module into another system (what exactly is the input needed, and what exactly is the output provided)?
How easy/difficult would it be to adapt the system to other domains and languages? What could be reused, and what would have to be redeveloped?
Robustness and Performance:
The authors mention success figures like "approximately 75% correctly translated contributions in the domain of appointment scheduling" (p. 1) or "Performance in dialogue act recognition achieves an accuracy of about 70% on unseen data" (p. 3). Though I think that quantitative evaluation is of rather high importance, it would be necessary to explain in more detail under which conditions these figures have been obtained. Also, it would be interesting to know what kind of consequences such performance figures have. Do they make the components useless in their current state, or would these rates already be useful for some applications? For example, what kind of impact will mistakes have on other components, e.g. does the recognition of the wrong act automatically lead to wrong summaries/translations? How good is the performance of the other components of the dialogue module, e.g. topic detection, data completion, etc.?
Finally it would be nice to get a rough idea of the runtime performance of the components mentioned. Are they (close to) real time?
Regards,
Joachim
A1. Michael Kipp et al. (14.4.00):
Dear Joachim,
thanks for your well-chosen comments. You've managed to put your finger on a
couple of problematic tasks in NLP! We agree that many of these topics in NLP at
large remain to be solved. This paper might suffer a bit from a lack of focus --
our main focus was dialogue processing, but to motivate and better describe our
approach we included related modules/problems/advantages/drawbacks/...
Joachim Quantz's comments:
Let me start by saying that I enjoyed reading the paper - it is very well written and gives a good overview over the current status of Verbmobil's dialogue module.
My main criticism would be that it does not really address issues of
scalability, reusability, robustness and performance, but then these are
rather fundamental issues of dialogue processing or NLP in general and
might be beyond the scope of the paper. In other words, I don't
necessarily expect detailed answers to all of my comments.
The authors reply:
We think that our approach (even where not explicitly pointed out in the paper) is
robust: we needed a way to deal with recognition problems and still be able to
capture the information in an utterance even when the recognized words are not
syntactically well-formed. It is a well-known fact that large-vocabulary,
speaker-independent speech recognizers are far from perfect. Moreover, humans do
not necessarily speak grammatically.
Our solution was twofold: a finite state transducer (FST) to extract the
propositional content, and a hidden Markov model (HMM) to determine the
dialogue act. As long as the users of Verbmobil stick to negotiation and do not
deviate too much, recent evaluation results have proven the success of this
approach.
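As a rough illustration of this two-step idea (not the Verbmobil implementation; the patterns, toy training data, and scoring below are invented for the example), propositional content can be harvested by pattern matching over possibly ungrammatical input, while a simple statistical model guesses the dialogue act:

```python
# Minimal sketch: pattern-based content extraction plus a statistical
# dialogue act guesser. All data and patterns here are illustrative.
import re
from collections import Counter

# "FST-like" extraction: find date expressions anywhere in the input,
# regardless of whether the utterance is syntactically well-formed.
DATE_PATTERN = re.compile(r"\b(monday|tuesday|wednesday|thursday|friday)\b", re.I)

def extract_content(utterance):
    """Return date tokens found in the (possibly ungrammatical) input."""
    return [m.lower() for m in DATE_PATTERN.findall(utterance)]

# Toy stand-in for the HMM: unigram counts per dialogue act.
TRAINING = {
    "SUGGEST": ["how about monday", "shall we meet tuesday"],
    "ACCEPT":  ["yes that is fine", "okay monday works"],
    "REJECT":  ["no that is bad", "monday is impossible"],
}
COUNTS = {act: Counter(w for u in utts for w in u.split())
          for act, utts in TRAINING.items()}

def classify_act(utterance):
    """Pick the act whose training vocabulary best covers the utterance."""
    words = utterance.lower().split()
    def score(act):
        total = sum(COUNTS[act].values())
        return sum(COUNTS[act].get(w, 0) / total for w in words)
    return max(COUNTS, key=score)
```

The point of the sketch is the division of labour: the extractor never needs a full parse, so a disfluent "uh Monday maybe er monday then" still yields its content tokens.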
As to robustness: in the ongoing VM evaluation, the dialogue-act-based
translation that uses this functionality was the only track that worked
at least partially when the recognizers delivered bad results. As to
performance: yes :-) As to scalability and reusability, we think that
in the currently manageable domains of NL applications, the approach
scales quite well and is reusable. When extending the domain from
scheduling to travel planning etc., we could use the existing
knowledge sources as building blocks to be extended -- not, of course, in
the sense of ideal universal knowledge sources, like a universal
grammar that produces nice trees or graphs. The scalability
and reusability of these "universal" solutions is, as we all know,
usually something of a hoax, because the burden and the real hard work are
deferred either to the modules that have to interpret the output of these
things and make sense of it, or to whatever has to make the input to these
modules as clean and correct as one rarely sees in real-life systems.
Joachim Quantz's comments:
Domain and Application: Whereas the domain is clearly described in the paper, the application is less clear to me. In the first paragraph of Section 3, the authors state their assumption "that in a task-oriented dialogue it is sufficient to know the communicative function (dialogue act) and the propositional content of an utterance." Sufficient for what? They continue with "The criterion for a successful shallow translation is the conservation of dialogue act and propositional content [Levinson, 1993]", which seems to imply that the application in mind is shallow translation. However, in Section 5 the authors list three different consumers of the information provided by the dialogue module: the dialogue script generator, the semantic transfer, and the deep analysis. The dialogue script generator is described in most detail and is the most convincing application from my point of view.
The authors reply:
The focus of the paper and of all our recent work in this project group is summarization, and you are right in criticizing its not being clearly stated in the paper. We *also* provide context information for translation (see below) but the focus of our work is summarization.
The task of the overall system, VERBMOBIL, is to translate and we motivate summarization in a translation system thus:
Translation needs context data. This is undisputed. In VERBMOBIL this data is provided by another module, the so-called Kontext module. The Kontext module answers direct disambiguation requests from the semantic transfer module. Now, our main contribution to this kind of context is that we send content and time objects to the Kontext module, which it tries to integrate into its overall representation of dialogue context. Kontext itself extracts content information from the syntactic-semantic representation that is used for transfer. But since Kontext comes very late in the module chain of the deep processing of VERBMOBIL, and has to deal with the cumulative error of the recognizer-syntax-semantics processing pipeline, the information we provide is not always correct, but it augments the information provided by the other processing tracks.
The other examples of "information consumers" stated in the paper are something like workarounds in situations where this method of passing data to the Kontext module is not sufficient for a specific task.
Joachim Quantz's comments:
It would be interesting to have a general taxonomy of applications in which the dialogue module could be used and to see which of the applications could be supported by the solutions presented in the paper.
The authors reply: see the consolidated answer marked (*) below.
Joachim Quantz's comments:
Reusability/Scalability: Could mean, for example, use of the dialogue module in different systems or for different domains, applications, or languages.
The authors reply: see the consolidated answer marked (*) below.
Joachim Quantz's comments:
With respect to applications, I can see how the information is used by the dialogue script generator, and that this approach is somewhat generic and could be used for other domains and scenarios. However, the examples provided for the other two consumers are not very convincing. In particular, the Mr. Hallermann example seems to me very idiosyncratic. I fail to see the generic nature of the information provided by the dialogue component to disambiguate the examples. (Of course, this relates to the general problem of using background knowledge in NLP. It seems that every ambiguity example involving background knowledge requires its own set of inference rules...)
One major problem of reuse of NLP components is the lack of standards
for interfaces and formats. Does the dialogue module offer an API or how
is information exchanged with the other modules. How easy/difficult
would it be to integrate the module into another system (what exactly is
the input needed, what exactly is the output provided).
The authors reply:
In the paper we described three distinct modules. Data flow is from (A) to (B)
to (C): module (A) communicates with (B) by sending direx expressions as a
string (the syntax is only hinted at in the paper) through a channel provided
by the VERBMOBIL system. Module (B) communicates with (C) by granting direct
access to Lisp objects and methods. (C) sends Lisp-like structures to the
final generator, again using a VERBMOBIL channel.
So, as you can see, none of the modules provides a clean API for easy
reuse. They are specifically designed to run together and to fit into the
VERBMOBIL architecture.
There are some reusable tools in the extraction module (A), though: tools
for training a statistical dialogue act recognizer, for building finite state
machines, and for manual dialogue act annotation of written data.
Of course the interface between (A) and (B), the direx expressions,
resembles that of e.g. C-STAR. However, standardization of even simple things
like dialogue acts (see e.g. SIGdial and DRI) is a good, but not easy, task.
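Since the direx syntax is only hinted at in the paper, the following is a purely hypothetical sketch of the general pattern: module (A) serializes a dialogue act plus content slots into a flat Lisp-like string for the channel, and module (B) parses it back. The format and function names are invented for illustration, not the actual direx definition:

```python
# Hypothetical sketch of string-based inter-module exchange.
# The real direx syntax is not specified here; this format is invented.

def encode_direx(act, content):
    """Serialize a (dialogue act, content slots) pair as a Lisp-like string."""
    slots = " ".join(f"({key} {value})" for key, value in content.items())
    return f"({act} {slots})"

def decode_direx(message):
    """Parse the string back into act and slots (assumes the toy format above)."""
    inner = message.strip()[1:-1]
    act, _, rest = inner.partition(" ")
    slots = {}
    if rest:
        slots = dict(pair.strip("()").split(" ", 1) for pair in rest.split(") ("))
    return act, slots
```

The design point being illustrated: a flat string over a channel is easy to log and transport, but without a published grammar for it, reuse by a third party is hard, which is exactly the API problem raised in the comment.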
Joachim Quantz's comments:
How easy/difficult would it be to adapt the system to other domains and languages? What could be reused, and what would have to be redeveloped?
The authors reply:
(*) This question, together with two others above, aims at the same thing: what happens if we want to change (I) the language, (II) the domain, or (III) the application? Here's what happens:
Now, what do we mean by "task", and is there a possible taxonomy of tasks? We like to call our dialogues "negotiation dialogues", and the task this implies is one of proposing objects, possibly explaining or modifying these objects, and commenting on them (accept/reject). You can build your taxonomy by checking whether a certain type of dialogue essentially consists of these operations or not. Positive examples are:
The dialogue structure certainly becomes much more intricate and confusing.
Joachim Quantz's comments:
Robustness and Performance: The authors mention success figures like "approximately 75% correctly translated contributions in the domain of appointment scheduling" (p. 1) or "Performance in dialogue act recognition achieves an accuracy of about 70% on unseen data" (p. 3). Though I think that quantitative evaluation is of rather high importance, it would be necessary to explain in more detail under which conditions these figures have been obtained.
The authors reply:
The first figure stems from an evaluation done at the end of phase 1 of the VERBMOBIL project. System input and output was evaluated by professional translators for correctness. "Approximate correctness" meant that the translation carries across the intended message (preserving all essential facts). 75% of all translations fell into that category. During this evaluation all competing translation tracks were involved, and not just shallow translation.
Accuracy in dialogue act recognition was measured on a corpus of sample dialogues which had been hand-coded. We ran a leave-one-out experiment (i.e. testing each dialogue using all other dialogues as training material, so it was actually more than 1000 test runs) and obtained these numbers as the overall result.
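The leave-one-out setup described above can be sketched as follows; the toy data and the trivial majority-act "classifier" are illustrative stand-ins for the actual recognizer:

```python
# Sketch of a leave-one-out evaluation over dialogues: each dialogue is
# tested with a model trained on all the others, and per-utterance
# accuracy is pooled over all held-out dialogues.
from collections import Counter

def train(dialogues):
    """Toy 'classifier': predict the majority dialogue act seen in training."""
    acts = Counter(act for dialogue in dialogues for _, act in dialogue)
    return acts.most_common(1)[0][0]

def leave_one_out_accuracy(dialogues):
    """Hold out each dialogue in turn; pool correctness over all utterances."""
    correct = total = 0
    for i, held_out in enumerate(dialogues):
        model = train(dialogues[:i] + dialogues[i + 1:])
        for _, gold_act in held_out:
            correct += (model == gold_act)
            total += 1
    return correct / total
```

With roughly as many test runs as there are dialogues in the corpus, every utterance is scored exactly once on material its model never saw, which is why the figure counts as accuracy "on unseen data".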
Joachim Quantz's comments:
Also, it would be interesting to know what kind of consequences such performance figures have. Do they make the components useless in their current state, or would these rates already be useful for some applications? For example, what kind of impact will mistakes have on other components, e.g. does the recognition of the wrong act automatically lead to wrong summaries/translations?
The authors reply:
Of course the wrong act can lead to a (partly) wrong translation, but not necessarily to a wrong summary. More critical is the recognition of propositional content (which depends on speech recognition or extraction). Since the dialogue processor uses the prior context and rules, it is in some respects robust against recognition errors.
Joachim Quantz's comments:
How good is the performance of the other components of the dialogue module, e.g. topic detection, data completion, etc.?
The authors reply:
We haven't tested performance in this regard yet since this requires manual annotation of the data (which is simple for topics but quite a hassle for content objects).
What we've done recently is a kind of small end-to-end evaluation where we looked at the dialogue transcript, made a summary ourselves, and compared our summary objects with the automatically retrieved ones.
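Such a comparison of hand-made summary objects against automatically retrieved ones can be sketched as set precision and recall; the tuple representation of a summary object is assumed for illustration:

```python
# Sketch: score automatically retrieved summary objects against a
# hand-crafted gold summary as set precision/recall.
def compare_summaries(gold, automatic):
    """Return (precision, recall) over the two sets of summary objects."""
    gold, automatic = set(gold), set(automatic)
    hits = gold & automatic
    precision = len(hits) / len(automatic) if automatic else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall
```

Exact set matching is a deliberately strict stand-in: a systematic evaluation would also need partial credit for objects whose slots are only partly correct, which is part of why annotating content objects is "quite a hassle".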
Joachim Quantz's comments:
Finally, it would be nice to get a rough idea of the runtime performance of the components mentioned. Are they (close to) real time?
The authors reply:
The performance of all our components together is well within real time. Producing a summary at the end of a dialogue takes more time (in the single-digit second range, though we didn't measure it). Since this processing is not in the human-to-human dialogue loop, this is no problem.
The overall VERBMOBIL system, however, aims at 4x real time, end-to-end. We, along with Intel, AMD, Sun etc., are working hard towards that goal -- we, the Verbmobil developers, by making our modules faster, and the others by giving us more and more GHz :-)
Michael, Jan and Norbert