Michael Kipp, Jan Alexandersson, Norbert Reithinger: Understanding Spontaneous Negotiation Dialogue
Comment 1 by Joachim Quantz (16.3.2000); answer by Michael Kipp et al. (14.4.2000).
C1. Joachim Quantz (14.3.00):
Let me start by saying that I enjoyed reading the paper - it is very well written and gives a good overview over the current status of Verbmobil's dialogue module.
My main criticism would be that it does not really address issues of scalability, reusability, robustness and performance, but then these are rather fundamental issues of dialogue processing or NLP in general and might be beyond the scope of the paper. In other words, I don't necessarily expect detailed answers to all of my comments.
Domain and Application:
Whereas the domain is clearly described in the paper, the application is less clear to me. In the first paragraph of Section 3, the authors state their assumption "that in a task-oriented dialogue it is sufficient to know the communicative function (dialogue act) and the propositional content of an utterance." Sufficient for what? They continue with "The criterion for a successful shallow translation is the conservation of dialogue act and propositional content [Levinson, 1993]", which seems to imply that the application in mind is shallow translation. However, in Section 5 the authors list three different consumers of the information provided by the dialogue module: the dialogue script generator, the semantic transfer, and the deep analysis. The dialogue script generator is described in most detail and is the most convincing application from my point of view. It would be interesting to have a general taxonomy of applications in which the dialogue module could be used and to see which of the applications could be supported by the solutions presented in the paper.
Reusability/Scalability:
Could mean, for example, use of the dialogue module in different systems or for different domains, applications, languages.
With respect to applications, I can see how the information is used by the dialogue script generator, and that this approach is somewhat generic and could be used for other domains and scenarios. However, the examples provided for the other two consumers are not very convincing. In particular, the Mr. Hallermann example seems to me very idiosyncratic. I fail to see the generic nature of the information provided by the dialogue component to disambiguate the examples. (Of course, this relates to the general problem of using background knowledge in NLP. It seems that every ambiguity example involving background knowledge requires its own set of inference rules...)
One major problem of reuse of NLP components is the lack of standards for interfaces and formats. Does the dialogue module offer an API, or how is information exchanged with the other modules? How easy/difficult would it be to integrate the module into another system (what exactly is the input needed, and what exactly is the output provided)?
How easy/difficult would it be to adapt the system to other domains and languages? What could be reused, and what would have to be redeveloped?
Robustness and Performance:
The authors mention success figures like "approximately 75% correctly translated contributions in the domain of appointment scheduling" (p. 1) or "Performance in dialogue act recognition achieves an accuracy of about 70% on unseen data" (p. 3). Though I think that quantitative evaluation is of rather high importance, it would be necessary to explain in more detail under which conditions these figures have been obtained. Also, it would be interesting to know what kind of consequences such performance figures have. Do they make the components useless in their current state, or would these rates already be useful for some applications? For example, what kind of impact will mistakes have on other components, e.g. does the recognition of the wrong act automatically lead to wrong summaries/translations? How good is the performance of the other components of the dialogue module, e.g. topic detection, data completion, etc.?
Finally it would be nice to get a rough idea of the runtime performance of the components mentioned. Are they (close to) real time?
Regards,
Joachim
A1. Michael Kipp et al. (14.4.00):
Dear Joachim,
thanks for your well-chosen comments. You've managed to put your finger on a
couple of problematic tasks in NLP! We agree that many of these topics in NLP at
large remain to be solved. This paper might suffer a bit from a lack of focus --
our main focus was dialogue processing, but to motivate and better describe our
approach we included related modules/problems/advantages/drawbacks/...
Joachim Quantz's comments:
Let me start by saying that I enjoyed reading the paper - it is very well written and gives a good overview over the current status of Verbmobil's dialogue module.
My main criticism would be that it does not really address issues of
scalability, reusability, robustness and performance, but then these are
rather fundamental issues of dialogue processing or NLP in general and
might be beyond the scope of the paper. In other words, I don't
necessarily expect detailed answers to all of my comments.
The authors reply:
We think that our approach (even where not explicitly pointed out in the paper) is
robust: we needed a way to deal with recognition problems and still be able to
capture the information in an utterance even when the recognized words are not
syntactically well-formed. It is a well-known fact that large-vocabulary,
speaker-independent speech recognizers are far from perfect. Moreover, humans do
not necessarily speak grammatically.
Our solution was twofold: a finite state transducer (FST) to extract the
propositional content, and a hidden Markov model (HMM) to determine the
dialogue act. As long as the users of Verbmobil stick to negotiation and do not
deviate too much, recent evaluation results have proven the success of this
approach.
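As a rough illustration of this two-step idea (not the Verbmobil implementation; the patterns, toy training data, and scoring below are invented for the example), propositional content can be harvested by pattern matching over possibly ungrammatical input, while a simple statistical model guesses the dialogue act:

```python
# Minimal sketch: pattern-based content extraction plus a statistical
# dialogue act guesser. All data and patterns here are illustrative.
import re
from collections import Counter

# "FST-like" extraction: find date expressions anywhere in the input,
# regardless of whether the utterance is syntactically well-formed.
DATE_PATTERN = re.compile(r"\b(monday|tuesday|wednesday|thursday|friday)\b", re.I)

def extract_content(utterance):
    """Return date tokens found in the (possibly ungrammatical) input."""
    return [m.lower() for m in DATE_PATTERN.findall(utterance)]

# Toy stand-in for the HMM: unigram counts per dialogue act.
TRAINING = {
    "SUGGEST": ["how about monday", "shall we meet tuesday"],
    "ACCEPT":  ["yes that is fine", "okay monday works"],
    "REJECT":  ["no that is bad", "monday is impossible"],
}
COUNTS = {act: Counter(w for u in utts for w in u.split())
          for act, utts in TRAINING.items()}

def classify_act(utterance):
    """Pick the act whose training vocabulary best covers the utterance."""
    words = utterance.lower().split()
    def score(act):
        total = sum(COUNTS[act].values())
        return sum(COUNTS[act].get(w, 0) / total for w in words)
    return max(COUNTS, key=score)
```

The point of the sketch is the division of labour: the extractor never needs a full parse, so a disfluent "uh Monday maybe er monday then" still yields its content tokens.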
As to robustness: in the ongoing VM evaluation, the dialogue-act-based
translation that uses this functionality was the only track that worked
at least partially when the recognizers delivered bad results. As to
performance: yes :-) As to scalability and reusability, we think that
in the currently manageable domains of NL applications, the approach
scales quite well and is reusable. When extending the domain from
scheduling to travel planning etc., we could use the existing
knowledge sources as building blocks to be extended -- not, of course, in
the sense of ideal universal knowledge sources, like a universal
grammar that produces nice trees or graphs. The scalability
and reusability of these "universal" solutions is, as we all know,
usually something of a hoax, because the burden and the real hard work are
deferred either to the modules that have to interpret the output of these
things and make sense of it, or to whatever has to make the input to these
modules as clean and correct as one rarely sees in real-life systems.
Joachim Quantz's comments:
Domain and Application: Whereas the domain is clearly described in the paper, the application is less clear to me. In the first paragraph of Section 3, the authors state their assumption "that in a task-oriented dialogue it is sufficient to know the communicative function (dialogue act) and the propositional content of an utterance." Sufficient for what? They continue with "The criterion for a successful shallow translation is the conservation of dialogue act and propositional content [Levinson, 1993]", which seems to imply that the application in mind is shallow translation. However, in Section 5 the authors list three different consumers of the information provided by the dialogue module: the dialogue script generator, the semantic transfer, and the deep analysis. The dialogue script generator is described in most detail and is the most convincing application from my point of view.
The authors reply:
The focus of the paper and of all our recent work in this project group is summarization, and you are right in criticizing its not being clearly stated in the paper. We *also* provide context information for translation (see below) but the focus of our work is summarization.
The task of the overall system, VERBMOBIL, is to translate and we motivate summarization in a translation system thus:
Translation needs context data. This is undisputed. In VERBMOBIL this data is provided by another module, the so-called Kontext module. The Kontext module answers direct disambiguation requests from the semantic transfer module. Now, our main contribution to this kind of context is that we send content and time objects to the Kontext module, which it tries to integrate into its overall representation of dialogue context. Kontext itself extracts content information from the syntactic-semantic representation that is used for transfer. But since Kontext comes very late in the module chain of the deep processing of VERBMOBIL, and has to deal with the cumulative error of the recognizer-syntax-semantics processing pipeline, the information we provide is not always correct, but it augments the information provided by the other processing tracks.
The other examples of "information consumers" stated in the paper are something like workarounds in situations where this method of passing data to the Kontext module is not sufficient for a specific task.
Joachim Quantz's comments:
It would be interesting to have a general taxonomy of applications in which the dialogue module could be used and to see which of the applications could be supported by the solutions presented in the paper.
The authors reply: see the consolidated answer marked (*) below.
Joachim Quantz's comments:
Reusability/Scalability: Could mean, for example, use of the dialogue module in different systems or for different domains, applications, or languages.
The authors reply: see the consolidated answer marked (*) below.
Joachim Quantz's comments:
With respect to applications, I can see how the information is used by the dialogue script generator, and that this approach is somewhat generic and could be used for other domains and scenarios. However, the examples provided for the other two consumers are not very convincing. In particular, the Mr. Hallermann example seems to me very idiosyncratic. I fail to see the generic nature of the information provided by the dialogue component to disambiguate the examples. (Of course, this relates to the general problem of using background knowledge in NLP. It seems that every ambiguity example involving background knowledge requires its own set of inference rules...)
One major problem of reuse of NLP components is the lack of standards
for interfaces and formats. Does the dialogue module offer an API or how
is information exchanged with the other modules. How easy/difficult
would it be to integrate the module into another system (what exactly is
the input needed, what exactly is the output provided).
The authors reply:
In the paper we described three distinct modules. Data flow is from (A) to (B)
to (C): module (A) communicates with (B) by sending direx expressions as a
string (the syntax is only hinted at in the paper) through a channel provided
by the VERBMOBIL system. Module (B) communicates with (C) by granting direct
access to Lisp objects and methods. (C) sends Lisp-like structures to the
final generator, again using a VERBMOBIL channel.
So, as you can see, none of the modules provides a clean API for easy
reuse. They are specifically designed to run together and to fit into the
VERBMOBIL architecture.
There are some reusable tools in the extraction module (A), though: tools
for training a statistical dialogue act recognizer, for building finite state
machines, and for manual dialogue act annotation of written data.
Of course the interface between (A) and (B), the direx expressions,
resembles that of e.g. C-STAR. However, standardization of even simple things
like dialogue acts (see e.g. SIGdial and DRI) is a good, but not easy, task.
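Since the direx syntax is only hinted at in the paper, the following is a purely hypothetical sketch of the general pattern: module (A) serializes a dialogue act plus content slots into a flat Lisp-like string for the channel, and module (B) parses it back. The format and function names are invented for illustration, not the actual direx definition:

```python
# Hypothetical sketch of string-based inter-module exchange.
# The real direx syntax is not specified here; this format is invented.

def encode_direx(act, content):
    """Serialize a (dialogue act, content slots) pair as a Lisp-like string."""
    slots = " ".join(f"({key} {value})" for key, value in content.items())
    return f"({act} {slots})"

def decode_direx(message):
    """Parse the string back into act and slots (assumes the toy format above)."""
    inner = message.strip()[1:-1]
    act, _, rest = inner.partition(" ")
    slots = {}
    if rest:
        slots = dict(pair.strip("()").split(" ", 1) for pair in rest.split(") ("))
    return act, slots
```

The design point being illustrated: a flat string over a channel is easy to log and transport, but without a published grammar for it, reuse by a third party is hard, which is exactly the API problem raised in the comment.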
Joachim Quantz's comments:
How easy/difficult would it be to adapt the system to other domains and languages? What could be reused, and what would have to be redeveloped?
The authors reply:
(*) This question, together with two others above, aims at the same thing: what happens if we want to change (I) the language, (II) the domain, or (III) the application? Here's what happens:
Now, what do we mean by "task", and is there a possible taxonomy of tasks? We like to call our dialogues "negotiation dialogues", and the task this implies is one of proposing objects, possibly explaining or modifying these objects, and commenting on them (accept/reject). You can build your taxonomy by checking whether a certain type of dialogue essentially consists of these operations or not. Positive examples are:
The dialogue structure certainly becomes much more intricate and confusing.
Joachim Quantz's comments:
Robustness and Performance: The authors mention success figures like "approximately 75% correctly translated contributions in the domain of appointment scheduling" (p. 1) or "Performance in dialogue act recognition achieves an accuracy of about 70% on unseen data" (p. 3). Though I think that quantitative evaluation is of rather high importance, it would be necessary to explain in more detail under which conditions these figures have been obtained.
The authors reply:
The first figure stems from an evaluation done at the end of phase 1 of the VERBMOBIL project. System input and output was evaluated by professional translators for correctness. "Approximate correctness" meant that the translation carries across the intended message (preserving all essential facts). 75% of all translations fell into that category. During this evaluation all competing translation tracks were involved, and not just shallow translation.
Accuracy in dialogue act recognition was measured on a corpus of sample dialogues which had been hand-coded. We ran a leave-one-out experiment (i.e. testing each dialogue using all other dialogues as training material, so it was actually more than 1000 test runs) and obtained these numbers as the overall result.
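The leave-one-out setup described above can be sketched as follows; the toy data and the trivial majority-act "classifier" are illustrative stand-ins for the actual recognizer:

```python
# Sketch of a leave-one-out evaluation over dialogues: each dialogue is
# tested with a model trained on all the others, and per-utterance
# accuracy is pooled over all held-out dialogues.
from collections import Counter

def train(dialogues):
    """Toy 'classifier': predict the majority dialogue act seen in training."""
    acts = Counter(act for dialogue in dialogues for _, act in dialogue)
    return acts.most_common(1)[0][0]

def leave_one_out_accuracy(dialogues):
    """Hold out each dialogue in turn; pool correctness over all utterances."""
    correct = total = 0
    for i, held_out in enumerate(dialogues):
        model = train(dialogues[:i] + dialogues[i + 1:])
        for _, gold_act in held_out:
            correct += (model == gold_act)
            total += 1
    return correct / total
```

With roughly as many test runs as there are dialogues in the corpus, every utterance is scored exactly once on material its model never saw, which is why the figure counts as accuracy "on unseen data".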
Joachim Quantz's comments:
Also, it would be interesting to know what kind of consequences such performance figures have. Do they make the components useless in their current state, or would these rates already be useful for some applications? For example, what kind of impact will mistakes have on other components, e.g. does the recognition of the wrong act automatically lead to wrong summaries/translations?
The authors reply:
Of course the wrong act can lead to a (partly) wrong translation, but not necessarily to a wrong summary. More critical is the recognition of propositional content (which depends on speech recognition or extraction). Since the dialogue processor uses the prior context and rules, it is in some respects robust against recognition errors.
Joachim Quantz's comments:
How good is the performance of the other components of the dialogue module, e.g. topic detection, data completion, etc.?
The authors reply:
We haven't tested performance in this regard yet since this requires manual annotation of the data (which is simple for topics but quite a hassle for content objects).
What we've done recently is a kind of small end-to-end evaluation where we looked at the dialogue transcript, made a summary ourselves, and compared our summary objects with the automatically retrieved ones.
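Such a comparison of hand-made summary objects against automatically retrieved ones can be sketched as set precision and recall; the tuple representation of a summary object is assumed for illustration:

```python
# Sketch: score automatically retrieved summary objects against a
# hand-crafted gold summary as set precision/recall.
def compare_summaries(gold, automatic):
    """Return (precision, recall) over the two sets of summary objects."""
    gold, automatic = set(gold), set(automatic)
    hits = gold & automatic
    precision = len(hits) / len(automatic) if automatic else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall
```

Exact set matching is a deliberately strict stand-in: a systematic evaluation would also need partial credit for objects whose slots are only partly correct, which is part of why annotating content objects is "quite a hassle".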
Joachim Quantz's comments:
Finally, it would be nice to get a rough idea of the runtime performance of the components mentioned. Are they (close to) real time?
The authors reply:
The performance of all our components together is well within real time. Producing a summary at the end of a dialogue takes more time (in the single-digit second range, though we didn't measure it). Since this processing is not in the human-to-human dialogue loop, this is no problem.
The overall VERBMOBIL system, however, aims at 4x real time, end-to-end. We, along with Intel, AMD, Sun etc., are working hard towards that goal -- we, the Verbmobil developers, by making our modules faster, and the others by giving us more and more GHz :-)
Michael, Jan and Norbert