Johan Boye, Mats Wirén, Manny Rayner, Ian Lewin, David Carter and Ralph Becket

Language-Processing Strategies and Mixed-Initiative Dialogues

[Full Text]
[send contribution]
[debate procedure]
[copyright]


Overview of interactions

No   Comment(s)                    Answer(s)                Continued discussion
1    Kristiina Jokinen (28.3.00)   Johan Boye (11.4.00)     -

C1. Kristiina Jokinen (28.3.00):

This paper is a description of a working dialogue system which deals with spoken language interaction in the travel domain. It gives a good overview of the system, but I'm afraid it also leaves the reader in a tantalizing state of wondering how exactly the different design decisions contribute to the overall goal of a flexible mixed-initiative dialogue system, and how the system architecture addresses the specific questions of knowledge sources and reasoning in dialogue systems. In particular, it would be good to have a more elaborate discussion of how the three desiderata mentioned as design principles for the system, namely (1) handling mixed-initiative dialogues, (2) deep linguistic analysis, and (3) management of disfluent utterances, are taken care of in the system.

(1)

(2)-(3)

Kristiina Jokinen
FLV Belgium


A1. Johan Boye (11.4.00):

Kristiina Jokinen's comments:

This paper is a description of a working dialogue system which deals with spoken language interaction in the travel domain. It gives a good overview of the system, but I'm afraid it also leaves the reader in a tantalizing state of wondering how exactly the different design decisions contribute to the overall goal of a flexible mixed-initiative dialogue system, and how the system architecture addresses the specific questions of knowledge sources and reasoning in dialogue systems. In particular, it would be good to have a more elaborate discussion of how the three desiderata mentioned as design principles for the system, namely (1) handling mixed-initiative dialogues, (2) deep linguistic analysis, and (3) management of disfluent utterances, are taken care of in the system.

(1)

The authors reply:

The sentence you quote is indeed not very enlightening. Unfortunately, the description of the agenda and its operational interpretation is too brief and too simplified in the article. We take this opportunity to give a more thorough account of the details.

As explained elsewhere in the article, each user utterance is classified as a certain kind of move (from a fixed repertoire of move types). What might not be clear from the article is that each move type T has an associated updating rule, which decides how the agenda and the list of objects should be updated when the user's utterance is classified as a move of type T. As soon as the DM has established that the user's utterance is of type T, that rule is fired. (This is exemplified in Section 4.2 in the paragraph beginning with "Once the DM knows..."). An updating rule will only add new items on top of the agenda, or do nothing; it will never cause items to be removed from the agenda.

The agenda items are the basic building blocks when specifying the behaviour of the system. An item has two arguments: a condition and an action. Typically, a condition could be "The destination of trip number 1 is unknown" (where trip number 1 is an object containing the user's constraints concerning the trip under discussion). The corresponding action would then be "Ask for the destination of trip number 1". Clearly, the condition can either be true or false about the current dialogue state. Declaratively, such a condition can be seen as the negation of a goal the system wants to attain, namely to know the destination of the user's desired trip. Operationally, the condition can be seen as a guard for the corresponding action: we don't want to ask about the destination if it is already known.

The agenda is a stack; i.e. a last-in-first-out data structure. When the system is to decide its response to the user (step 5 in Figure 4 in the paper), it starts by examining the item on top of the agenda. If the condition of that top item is true, the corresponding action is carried out (in the example above, if the destination of trip number 1 is unknown, the system will ask for the destination). If the condition is false (the destination is known), the whole item is removed from the agenda, and the system proceeds to examine the item which is now on top of the agenda. The system will thus continue down the agenda, popping items until it finds an item whose condition evaluates to true. It will then execute the corresponding action.
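
To make the mechanism concrete, here is a minimal sketch in Python of the behaviour described above. All names (Item, dialogue_state, push, decide_response) and data structures are invented for this illustration and are not the system's actual code; only the control flow follows the description: updating rules only push items, and response selection pops items whose conditions have become false until it finds one whose condition is true.

    # Illustrative sketch only: names and data structures are invented here,
    # not taken from the actual system.

    dialogue_state = {"trip1": {"origin": "Stockholm", "destination": None}}

    class Item:
        """An agenda item: a condition (guard) and the action it guards."""
        def __init__(self, condition, action):
            self.condition = condition   # function: state -> bool
            self.action = action         # function: state -> system utterance

    agenda = []                          # the agenda is a stack (LIFO)

    def push(item):                      # updating rules only ever push
        agenda.append(item)

    def decide_response(state):
        """Step 5 in Figure 4: pop items whose conditions are false, and
        execute the action of the first item whose condition is true."""
        while agenda:
            top = agenda[-1]
            if top.condition(state):
                return top.action(state)
            agenda.pop()                 # goal already achieved; discard item
        return None                      # empty agenda (not covered here)

    # Example item: ask for the destination of trip 1 unless it is known.
    push(Item(condition=lambda s: s["trip1"]["destination"] is None,
              action=lambda s: "Where do you want to go?"))

    print(decide_response(dialogue_state))   # -> "Where do you want to go?"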

So to answer the second question first, we do not believe this simple operational mechanism to be more flexible than "traditional action planning", and we hope we haven't given this impression in the article! On the contrary, we have deliberately chosen a simple solution, since we thought this was appropriate for this simple domain. The dialogue designer thus has a simple and effective way of controlling the order in which questions are asked, and of ensuring that unnecessary questions are not asked at all.

What is "flexible" or not is of course a subjective matter. However, we would like to argue that the following properties are adding to the flexibility of the system: (1) When the system asks the user a specific question (e.g. "When do you want to go?"), the user can give more information than required ("To Stockholm on Friday"), or some information other than that required ("Wait a minute... I mean I want to go to Gothenburg!"). In the first case, the way the agenda works guarantees that the system won't ask for the destination, as it is now known. In the second case, the question "When do you want to go?" will be asked again, since its corresponding condition "The departure time is unknown" is still true. (2) When the system suggests a trip alternative, the user can accept or reject it, but also ask side-questions, or ask for alternative suggestions; the appropriate system response is added on top of the agenda (e.g. a user:ask-for-info move leads to the addition of a system:answer-with-info move, and so on). Thus the user is NOT reduced to answer-supplier in a strictly system-controlled dialogue; rather the operational mechanism controlling the system's behaviour allows for some amount of mixed initiative.

Kristiina Jokinen's comments:

How is the addition of elements to the agenda determined (is any reasoning involved, and if so, where)?

The authors reply:

Hopefully this is clarified above.

Kristiina Jokinen's comments:

Can you take hierarchical goals into account? What happens if a goal is already achieved, or if the user introduces goals that are not on the top of the stack?

The authors reply:

There is no explicit concept of a hierarchical goal which can be broken down into subgoals, etc. As explained above, the conditions in the agenda items can be seen as negated goals, but the system has no notion of the relationships between those conditions.

A goal already achieved thus corresponds to a condition being false when examined. What happens in this case is described above.

When the user takes the initiative (e.g. asks an info-seeking side-question like "What airline is that?"), the system will, as a reaction, add an action to the agenda for answering this question.

Kristiina Jokinen's comments:

How does the DM decide on the initiative/response action: what kinds of conditions are used, and can you give an example of a goal that occurs in the agenda?

The authors reply:

Hopefully this is clarified above.

Kristiina Jokinen's comments:

What happens if there are several items in the agenda that can be applied in the situation?

The authors reply:

As explained above, the topmost one is chosen.

Kristiina Jokinen's comments:

What does it mean that "if the Cond is false, the whole item is popped off the agenda, and DM proceeds to the next item"?

The authors reply:

Hopefully this is clarified above.

Kristiina Jokinen's comments:

Could you also tell us a bit more about the relation between the domain-dependent and domain-independent DM code: how do they interact? Also, it would be good to have some discussion concerning the games and moves, which are domain-dependent, and the agenda, which encodes global goals. How easy would it be to port the system to another domain?

The authors reply:

Unfortunately, since we have not been in a position to port the system to another domain, we cannot say for sure how easy that would be. We conjecture that the chosen set of move labels is general enough to cover a non-trivial class of interesting applications (see the last paragraph of Section 4.1). We have strived to obtain a separation between domain-dependent and domain-independent code by using sound software engineering principles; code that directly refers to domain objects like flights and trains has been put in separate procedures.
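
As a purely hypothetical illustration of this kind of separation (the procedure names below are ours, not the system's), the domain-independent part can provide generic operations on slots and agenda items, while only a small domain-specific part mentions trips and flights:

    # Domain-independent DM code: generic treatment of slots and agenda items.
    def ask_for_missing_slot(obj, slot, question):
        """Return an agenda item (condition, action) that asks `question`
        as long as `slot` of object `obj` is unknown."""
        return (lambda state: state[obj][slot] is None,
                lambda state: question)

    # Domain-dependent code: the only place that mentions trips and flights.
    def initial_trip_items():
        return [
            ask_for_missing_slot("trip1", "destination", "Where do you want to go?"),
            ask_for_missing_slot("trip1", "date", "When do you want to go?"),
        ]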

Kristiina Jokinen's comments:

(2)-(3) It's not clear how the RP and the CLE work together to balance the complementary requirements of deep linguistic analysis (for accurate understanding) and robustness (for compensating for inaccuracies in the speech recognizer), nor how their results are manipulated/controlled by the DM.

- What are their respective contributions to the analysis of utterances?

The authors reply:

The RP always produces exactly one flat utterance description (FUD) as a result, while the CLE might produce zero, one, or many FUDs, reflecting alternative interpretations of the utterance.
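
In terms of interfaces, the difference can be pictured roughly as follows (the type and function names are invented for this illustration only):

    from typing import List

    FUD = dict   # a flat utterance description, here represented as a dict

    def robust_parser(words: List[str]) -> FUD:
        """The RP always returns exactly one FUD, however garbled the input."""
        raise NotImplementedError   # interface sketch only

    def cle_analyser(words: List[str]) -> List[FUD]:
        """The CLE returns zero, one or many FUDs (alternative interpretations
        of the same utterance)."""
        raise NotImplementedError   # interface sketch only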

Kristiina Jokinen's comments:

How exactly does the DM select between the outputs of RP and CLE? Can you give an example of the different factors in computing the likelihood score, especially the factors "difference between prop.content and context", and "the presence of keywords"?

The authors reply:

The relationship between the propositional contents and the context, as well as the presence of keywords, are important factors when deciding what move type a given utterance belongs to (or, more correctly, what move type a given FUD belongs to). This categorization is done for each FUD resulting from a given utterance.

The above factors are however not directly used for assessing the quality of the FUDs, i.e. for computing which FUD most likely reflects the actual utterance. We have experimented with several factors in assessing the quality of a FUD (some of which are mentioned in Section 4.2), e.g. the number of previously unknown slot values determined by the FUD, the number of words of the utterance that contributed to the analysis compared to the number of discarded words, a reference resolution score (antecedents from recent utterances give a higher score), the chosen response action the FUD gives rise to, and so on. The likelihood score is computed from all these factors. The weights assigned to the various factors were first set somewhat arbitrarily, and then hand-adjusted until the system worked well on a number of examples.
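
The following sketch shows the general shape of such a hand-tuned weighted score. The factor names, weight values and candidate FUDs are invented examples on our part, not the system's actual ones:

    WEIGHTS = {
        "new_slot_values": 2.0,       # previously unknown slot values filled
        "coverage": 1.0,              # contributing words vs. discarded words
        "reference_resolution": 1.5,  # recent antecedents score higher
        "response_action": 0.5,       # plausibility of the resulting response
    }

    def likelihood_score(factors):
        """Weighted sum of the factor scores; in the system the weights were
        first set somewhat arbitrarily and then hand-adjusted on examples."""
        return sum(WEIGHTS[name] * value for name, value in factors.items())

    def best_fud(candidates):
        """Pick the FUD whose factors give the highest likelihood score."""
        return max(candidates, key=lambda c: likelihood_score(c["factors"]))

    candidates = [
        {"fud": "wh(X, [dest(X, gothenburg)])",
         "factors": {"new_slot_values": 1, "coverage": 0.8,
                     "reference_resolution": 0.0, "response_action": 1.0}},
        {"fud": "wh(X, [])",
         "factors": {"new_slot_values": 0, "coverage": 0.3,
                     "reference_resolution": 0.0, "response_action": 0.0}},
    ]
    print(best_fud(candidates)["fud"])   # -> the more informative FUD wins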

Kristiina Jokinen's comments:

What are the cases in which the CLE helps to get a correct analysis? All your examples seem to favour the RP, and in the cases where the RP fails, the CLE doesn't seem to help either (it produced no FUD).

The authors reply:

The RP in its current version cannot make subtle distinctions e.g. between "Is that a direct flight?" as opposed to "Is there a direct flight?", because the pattern "direct flight" is not enough to discriminate between the two. Of course this could be remedied by adding the whole sentences ("Is that a direct flight", etc.) as syntactic patterns on which the RP can trigger. But this would seem to run counter to the general philosophy behind the RP, which is to attain robustness against disfluent speech and bad speech recognition by triggering on short patterns.

Kristiina Jokinen's comments:

Why is the default of the Robust Parser wh(X, [])?

The authors reply:

The expression 'wh(X, P)', where P is a list of constraints, could be thought of as meaning "Find me X such that P holds", or "I am interested in an X such that P holds". In particular, 'wh(X, [])' means "I am interested in an X", or rather "I am interested in any X" (since there is nothing constraining the possible instantiations of X). Now, one of the explicit goals when writing the Robust Parser was that it should always give some output, no matter how unintelligible the input is. So if the user's utterance does not contain any constraints (as far as the RP can tell), it made sense to us to let the output be "I am interested in any X".

Recall also that the FUDs merely reflect the propositional contents of the utterance, and that the DM also receives information about non-propositional contents, like the presence of certain keywords. If, for instance, the user only says "book", the propositional content is 'wh(X, [])', but the information that the utterance contains the word "book" could make the DM infer that the user's utterance is an 'accept' move.
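
A small invented example may make this concrete. The helper names, the pattern list and the "provide-info" label below are hypothetical; only the fallback wh(X, []) and the role of the keyword "book" come from the discussion above:

    def robust_parse(words, constraint_patterns):
        """Return a FUD; with no extractable constraints, fall back to
        wh(X, []), i.e. "I am interested in any X"."""
        constraints = [c for w in words
                         for (kw, c) in constraint_patterns if w == kw]
        return ("wh", "X", constraints)      # wh(X, P) as a Python tuple

    def classify_move(fud, words):
        """The DM also sees non-propositional information such as keywords."""
        _, _, constraints = fud
        if not constraints and "book" in words:
            return "accept"                  # e.g. the user just says "book"
        if constraints:
            return "provide-info"            # hypothetical move label
        return "unclassified"

    patterns = [("stockholm", ("dest", "trip1", "stockholm"))]
    fud = robust_parse(["book"], patterns)   # -> ("wh", "X", [])
    print(fud, classify_move(fud, ["book"])) # -> classified as 'accept'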

Kristiina Jokinen's comments:

How confident is the preliminary evaluation of the system (you have only two subjects)? Although you say that the results can only be taken as suggestive, can you be more specific about the suggestion? Are the results more diagnostic in nature, or do they provide good evidence that both the RP and the CLE are needed in your system? Could it be that the stated bottlenecks of the system (the speech recognizer and the DM) need to be elaborated more, so that some of the problems, e.g. with the longest fragment selected by the CLE, could thus be compensated for?

The authors reply:

The results of the limited evaluation point in a certain direction, which can be summarized as follows (and which is compatible with our experiences from using and demonstrating the system):

  1. The RP performs better than we had expected: Thus, from what we have seen so far, the linguistic variability in our domain, even in the face of mixed initiative, is sufficiently limited that a shallow-parsing strategy achieves quite good results.
  2. The CLE performs worse than expected. However, the dominating reason for this is bad interaction with the speech recognizer: More specifically, the principle of trying to analyse the longest grammatical fragment from the N-best list is simply not a good one in our case. As we say in the paper (Section 6), we had underestimated the degree to which the output from the speech recognizer requires fragment analysis whose results might themselves require careful selection.
  3. There are clearly types of sentences that are difficult to capture with the RP, but which the CLE is in a position to deal successfully with (compare the examples mentioned above). However, these sentences have so far occurred much less frequently in our domain than we had anticipated. In any case, as long as the problems in (2) dominate, the advantage obtained from being able to analyse these residual sentences seems insignificant. For the current domain and present configuration of the system, we thus cannot claim to have shown that both the CLE and RP are needed. Making the domain more complex (for example, by including tickets and prices) might change this. More generally, however, any step forward on the problems addressed by the paper seems to require that the interaction between the CLE and the speech recognizer be improved.

Kristiina Jokinen's comments:

To be symmetric, it would be good if your conclusion also said something about mixed-initiative dialogues and the role of the DM in the overall system architecture, besides the RP and CLE results.

The authors reply:

The last paragraph in the answer to your very first question above outlines in what ways we consider our dialogue manager to allow a high degree of flexibility, including mixed user and system initiatives.

Finally, we would like to thank Kristiina Jokinen for her interest in our article, and for posing these highly relevant questions.


Additional questions and answers will be added here.
To contribute, please click [send contribution] above and send your question or comment as an E-mail message.
For additional details, please click [debate procedure] above.
This debate is moderated by the guest editors.