D. Morpho-syntactical Schemes


In this section we overview some existing corpus annotation schemes for both morphosyntax and syntax. We consider them insofar as they have something of interest to say about typical problems encountered in dialogue annotation, in connection with the following typology of phenomena:

i) word-level classification issues: adverbs, interjections and interactional markers; pauses and hesitators; word partials and non-standard forms;

ii) segmentation issues: multi-word units; error coding; phrase partials (trailing off, interruption and completion; retrace-and-repair sequences; anacolutha, i.e. syntactic blending).

This typology says nothing about whether the phenomena considered are classifiable as disfluent material or should rather be taken as genuine linguistic phenomena characteristic of speech rather than of writing. The classificatory perspective entertained here lays emphasis on the impact that the listed phenomena are likely to have on annotation: e.g. whether they would simply require the introduction of an extra part-of-speech category, or whether they are bound to have repercussions on syntactic parsing and on segmentation issues in general.

Note that, in some cases, the same phenomenon can be treated under two different headings: interactional markers, for example, pose both a problem of categorial classification (how should they be labelled?) and an issue of segmentation, when they happen to be multi-word units (e.g., is 'I see' in its interactional usage to be treated as a single morphosyntactic unit, or should it rather be treated as a complex syntactic constituent?). Clearly, the two perspectives interact to a large extent.

Not all the annotation schemes overviewed here have explicitly addressed all the problems in our list. Most of them simply developed interesting practices which can usefully be extended to dialogue annotation proper with a view to the treatment of such phenomena. For example, we mention here the Eagles 1996 recommendations on both morphosyntactic and syntactic annotation, although they were initially intended to deal with written material only. As pointed out in Leech et al. 1998, they can in fact be taken as a useful starting point for dialogue annotation too, with the proviso that a certain amount of customization be carried out. Hopefully, this should pave the way to the ultimate integration of practices in the NLP and speech communities.

D.1 Childes

Coding book:

Information about the purpose and domain of the CHAT system as well as instructions for use are described in MacWhinney (1994).

Number of annotators:

The CHAT system is a widespread standard for the transcription and coding of child language in many European and non-European languages. Approximately 60 groups of researchers around the world are currently actively involved in new data collection and transcription using the CHAT system. As a consequence of its widespread use, it is impossible to calculate the exact number of annotators.

Number of annotated dialogues:

A very large number of dialogues has been, and is still being, annotated with the CHAT coding scheme. This number exceeds the number of dialogues in the database, as many child language projects use CHAT without contributing to the overall CHILDES database. The internationally recognized CHILDES database (http://sunger2.uia.ac.be/childes/database.html) includes transcripts from over forty major projects in English and additional data from 19 other languages. The additional languages are Brazilian Portuguese, Chinese (Mandarin), Chinese (Cantonese), Danish, Dutch, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Mambila, Polish, Russian, Spanish, Swedish, Tamil, Turkish, and Ukrainian. The total size of the database is now approximately 160 million characters (160 MB). Full documentation about the database can be found at http://sunger2.uia.ac.be/childes/database.pdf.

Evaluations of scheme:

As a result of its worldwide use, CHAT is continuously evaluated and updated to meet the needs of different languages and different users. We are not aware of statistical/quantitative evaluations of its reliability.

Underlying task:

Since CHAT was first created as a tool for the study of language acquisition, the data collected mainly consist of parent-child or child-child spontaneous conversations and of task-oriented dialogues in play and story-telling situations.

Some of the data coded by CHAT also include second language learners and adults recovering from aphasic disorders.

List of phenomena annotated:

See below.

Examples:

See below.

Mark-up language:

CHAT's own format.

Existence of annotation tools:

The CHILDES system contains several separate, yet integrated, programs which are clustered around two major tools. The first tool is a full-fledged, ASCII-oriented editor (CED, the Childes EDitor), specifically designed to facilitate the editing of CHAT files and to check the accuracy of transcriptions. CED also allows the user to link a full digitized audio recording of the interaction directly to the transcript; this is the system called "sonic CHAT". The CED editor is currently being extended to facilitate its use with videotapes. The plan is to make available a floating window, in the shape of a VCR controller, that can be used to rewind the videotape and to enter time stamps from the videotape into the CHAT file. An alternative way of analyzing video is to record from tape onto QuickTime movies and to link these digitized movies to the transcript.

The second tool is CLAN (Child Language ANalysis), actually a suite of smaller programs serving different analysis purposes. The full system is presented in detail in MacWhinney (1991) and illustrated through practical examples in Sokolov and Snow (1994).

Usability:

CHAT-encoded databases have been set up as a result of nearly a hundred major research projects in 20 languages. New databases are continuously being set up worldwide.

Contact person:

Brian MacWhinney (macw@cmu.edu)

D.1.1 Word-Level Classification Issues

CHAT makes provision for two physically and in part also conceptually distinct ways of encoding morphological information in a corpus: i) morpheme splitting on the 'main line', that is the line of orthographic transcription, ii) morphological categorization on the 'morphology line', that is a separate tier of encoding specifically devised for containing morphological information.

In order to indicate the ways that words on the main line are composed from morphemes, CHAT uses the symbols -, +, #, ~, &, and 0: they are all used as concatenative operators and accordingly placed between two consecutive morphemes. These same six symbols are also used for parallel purposes on the morphology line, where these symbols form a part of a more extensive system.

Morphemicization on the main line is intended mostly for initial morphemic analysis or general quantitative characterization of morphological development. For more thorough analyses the morphology line is strongly recommended, especially for languages other than English.

The basic scheme for coding of words on the morphology line is:

'part-of-speech' |
'pre-clitic' ~
'prefix' #
'stem'
= 'English translation'
& 'fusional suffix'
- 'suffix'
~ 'post-clitic'

where the gloss between quotes indicates the content and position of corresponding encoded information relative to the symbol/operator. For example, part-of-speech information precedes '|', while fusional suffix follows '&'. Furthermore the delimiter '+' is used between words in a compound (see infra).

The order of elements after the | symbol is intended to correspond to the linear order of morphemes within the word, as shown by the following example:

'sing-s' v|sing-3s

There are no spaces between any of these elements. The English translation of the stem is not a part of the morphology, but is included here for convenience in retrieval and data entry. The morphological status of the affix is identified by the type of delimiter.

In particular, '&' is used to signal that the affix is not realized in its usual phonological shape. For example, the form "men" cannot be broken down into a part corresponding to the stem "man" and a part corresponding to the plural marker "s", hence it is coded as n|man&PL. Similarly, the past forms of irregular verbs may undergo ablaut processes, e.g. "came", which is coded v|come&PAST, or they may undergo no phonological change at all, e.g. "hit", which is coded v|hit&PAST. Sometimes there may be several codes indicated with the & after the stem. For example, the form "was" is coded v|be&PAST&13s.

Clitics are marked by a tilde, as in v|parl=speak&IMP:2S~pro|DAT:MASC:SG for Italian "parlagli" and pro|it~v|be&3s for English "it's." Note that part of speech coding is repeated for clitics. Both clitics and contracted elements are coded with the tilde. The use of the tilde for contracted elements extends to forms like "sul" in Italian, "ins" in German, or "rajta" in Hungarian in which prepositions are merged with articles or pronouns.
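To make the notation above concrete, the following Python sketch (not part of the CLAN toolkit; function names are ours) splits a %mor token of the simplified shape illustrated above into its coded parts. Prefixes ('#'), pre-clitics and the '+' compound delimiter are not handled.

def parse_mor_word(piece):
    """Parse one pos|... piece (a word or a clitic) of a %mor token."""
    pos, _, rest = piece.partition("|")
    rest, *suffixes = rest.split("-")       # ordinary suffixes, e.g. sing-3s
    rest, *fusional = rest.split("&")       # fusional suffixes, e.g. man&PL, be&PAST&13s
    stem, _, gloss = rest.partition("=")    # optional English gloss, e.g. parl=speak
    return {"pos": pos, "stem": stem, "gloss": gloss or None,
            "fusional": fusional, "suffixes": suffixes}

def parse_mor_token(token):
    """Split off clitics (joined by '~') and parse each piece separately."""
    return [parse_mor_word(piece) for piece in token.split("~")]

# parse_mor_token("pro|it~v|be&3s") ->
# [{'pos': 'pro', 'stem': 'it', 'gloss': None, 'fusional': [], 'suffixes': []},
#  {'pos': 'v', 'stem': 'be', 'gloss': None, 'fusional': ['3s'], 'suffixes': []}]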

D.1.1.1 Adverbs, Interjections, Interaction Markers

The category 'communicator' is used in CHAT for interactive and communicative forms which fulfill a variety of functions in speech and conversation. Many of these are formulaic expressions such as hello, good+morning, good+bye, please, thank+you. Also included in this category are words used to express emotion, as well as imitative and onomatopoeic forms, such as ah, aw, boom, boom-boom, icky, wow, yuck, yummy.

D.1.1.2 Pauses, Hesitators

Pauses are treated in CHAT on the prosodic annotation tier. Pauses that are marked only by silence are coded on the main line with the symbol #. The number of # symbols represents the length of pauses. Alternatively, a word after the symbol # is added to estimate the pause length, as in #long.

Example:

*SAR: I don't # know -.

*SAR: #long what do you ### think -?

CHAT allows coding of exact length of the pauses, with minutes, seconds, and parts of seconds following the #.

Example:

*SAR: I don't #0_5 know -.

*SAR: #1:13_41 what do you #2 think -?
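As a hedged illustration of these timing codes (the reading of ':' as separating minutes and '_' as introducing parts of a second is our assumption, based on the examples above), a small helper might convert them into seconds:

def pause_seconds(code):
    """Return an approximate pause length in seconds, or None for bare '#'/'#long'."""
    body = code.lstrip("#")
    if not body or not body[0].isdigit():   # '#', '###', '#long', ...
        return None
    minutes = 0
    if ":" in body:                         # minutes precede ':'
        mins, body = body.split(":", 1)
        minutes = int(mins)
    return minutes * 60 + float(body.replace("_", "."))   # '_' introduces parts of seconds

# pause_seconds("#0_5")     -> 0.5
# pause_seconds("#1:13_41") -> 73.41
# pause_seconds("#long")    -> None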

D.1.1.3 Word Partials, Non Standard Forms

When an item on the main line is incorrect in either phonological or semantic terms it is marked by a following '[*]'. The coding of that item on the morphology line should be based on its target, as given in the 'error line'. If there is no clear target, the form should be represented with 'xxx', as in the following example:

*PAT: the catty [*] was on a eaber [*].

%mor: det|the *n|kitty v|be&PAST prep|on

det|a *n|xxx.

%err: catty = kitty $BLE $=cat,kitty ; eaber = [?]

In this example the symbol '*' on the morphology line indicates the presence of an incorrect usage, in this case due to blending two different words into one. The detailed analysis of this error should be conducted on the 'error line'. Errors involving segmentation issues (such as omission of a syntactically obligatory unit etc.) will be treated in the following section.

A non-standard or incorrect form can be encoded directly on the main line by following it with the replacing standard form in square brackets, e.g. gonna [: going to]. The material on the %mor line corresponds to the replacing material in the square brackets, not to the material that is being replaced. For example, if the main line has gonna [: going to], the %mor line will code going to.

Some special characters are intended to give information about, for example, babbling, child-invented forms, dialect forms, family-specific forms, filled pauses, interjections, neologisms, phrasal repetitions, or other general special forms, according to the following conventions. Note that recording of these phenomena is not made at the coding level, but at the transcription level.
Letters  Category                      Example         Meaning        Coded Example
@b       babbling                      abame           -              abame@b
@c       child-invented form           gumma           sticky         gumma@c
@d       dialect form                  younz           you            younz@d
@f       family-specific form          bunko           broken         bunko@f
@fp      filled pause                  huh             -              huh@fp
@i       interjection, interactional   uhhuh           -              uhhuh@i
@l       letter                        b               letter b       b@l
@n       neologism                     breaked         broke          breaked@n
@o       onomatopoeia                  woof woof       dog barking    woof@o
@p       phonol. consistent form       aga             -              aga@p
@pr      phrasal repetition            its a, its a    -              its+a@pr
@s       second-language form          istenem         my God         istenem@s
@sl      sign language                 apple sign      apple          apple@sl
@        general special form          gongga          -              gongga@
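The marker inventory above lends itself to straightforward mechanical processing. The following sketch (a hypothetical helper, not a CLAN program) splits a main-line word into its form and the special-form category signalled by its '@' marker:

SPECIAL_FORM_MARKERS = {
    "b": "babbling",               "c": "child-invented form",
    "d": "dialect form",           "f": "family-specific form",
    "fp": "filled pause",          "i": "interjection, interactional",
    "l": "letter",                 "n": "neologism",
    "o": "onomatopoeia",           "p": "phonologically consistent form",
    "pr": "phrasal repetition",    "s": "second-language form",
    "sl": "sign language",         "": "general special form",
}

def split_special_form(word):
    """Return (form, category) for words like 'gumma@c'; category is None without '@'."""
    if "@" not in word:
        return word, None
    form, marker = word.rsplit("@", 1)
    return form, SPECIAL_FORM_MARKERS.get(marker, "unknown marker")

# split_special_form("gumma@c") -> ('gumma', 'child-invented form')
# split_special_form("gongga@") -> ('gongga', 'general special form')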

D.1.2 Segmentation Issues

D.1.2.1 Multi-Words

Those compounds that are usually written as one word, such as "birthday" or "rainbow," should not be segmented. Those compounds that are generally separated by a hyphen in English orthography are separated by a + symbol in CHAT transcription (e.g., "jack-in-the-box" should be transcribed as "jack+in+the+box"). Rote forms to be counted as a single morpheme may also be joined with a + symbol (e.g., all+right).

Multi-word expressions which are concatenated through a '+' are assigned a unique part-of-speech tag at the level of morphosyntax. For example, the following idiomatic phrases can be coded: qn|a+lot+of, adv|all+of+a+sudden, adv|at+last, co|for+sure, adv:int|kind+of, adv|once+and+for+all, adv|once+upon+a+time, adv|so+far, and qn|lots+of.
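A minimal sketch of this practice (the inventory below is just the illustrative list above, and the function is hypothetical): known multi-word expressions in a token sequence are joined with '+' and receive their single part-of-speech tag.

MULTI_WORDS = {
    ("a", "lot", "of"): "qn",
    ("all", "of", "a", "sudden"): "adv",
    ("kind", "of"): "adv:int",
    ("for", "sure"): "co",
}

def join_multi_words(tokens):
    out, i = [], 0
    while i < len(tokens):
        for span, tag in sorted(MULTI_WORDS.items(), key=lambda kv: -len(kv[0])):
            if tuple(tokens[i:i + len(span)]) == span:    # longest match first
                out.append(tag + "|" + "+".join(span))
                i += len(span)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# join_multi_words("it is kind of nice".split())
# -> ['it', 'is', 'adv:int|kind+of', 'nice']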

D.1.2.2 Error Coding

The symbol *0 is used in CHAT to indicate omission (recall that the symbol * is used to indicate incorrect usage), as in the following examples:

*CHI: dog is eat.

%mor: *0det|the n|dog v:aux|be&PRES v|eat-*0PROG.

*PAT: the dog was eaten [*] the bone.

%mor: det|the n|dog v:aux|be&PAST&3S v|eat-*PERF det|the n|bone.

%err: eaten = eating $MOR $SUB

Here is an example of coding on the morphology line that indicates how the omission of an auxiliary is coded:

*BIL: he going.

%mor: pro|he *0v|be&3S v|go-prog.

Note that the missing auxiliary is not coded on the main line, because this information is available on the morphology line. If a noun is omitted, there is no need to also code a missing article. Similarly, if a verb is omitted, there is no need to also code a missing auxiliary.

The CHAT system for error coding has the following features:

1. it indicates what the speaker actually said, or the erroneous form

2. it indicates that what the speaker actually said was an error

3. it allows the transcriber to indicate the target form

4. it facilitates retrieval of both target forms and actually produced forms

5. it allows the analyst to indicate theoretically interesting aspects of the error by delineating the source of the error, the processes involved, and the type of the error in theoretical terms (on the error line)

D.1.2.3 Phrase Partials

In CHAT, the syntactic role of each word can be notated before its part-of-speech on the morphology line. To capture syntactic groupings, provision is made for coding syntactic structure on the syntactic line. Clauses are enclosed in angle brackets and their type is indicated in square brackets, as in the following example:

*CHI: if I don't get all the cookies you promised to give me, I'll cry.

%syn: <C S X V M M D < S V < R V I > [CP] > [RC] > [CC] < S V > [MC].

In this notation, each word plays some syntactic role. The rules for achieving one-to-one correspondence to words on the main line apply to the syntactic line also. Higher order syntactic groupings are indicated by the bracket notation. The particular syntactic codes used in this example come from the following list. This list is not complete, particularly for languages other than English.
A   Adverbial Adjunct      V    Verb
C   Conjunction            X    Auxiliary
D   Direct Object          AP   Appositive Phrase
I   Indirect Object        CC   Coordinate Clause
M   Modifier               CP   Complement
P   Preposition            MC   Main Clause
R   Relativizer/Inf        PP   Prepositional Phrase
S   Subject                RC   Relative Clause
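The bracketing conventions of the syntactic line can be read mechanically. The following recursive sketch (our reconstruction from the single example above, not a CLAN tool) turns the content of a %syn tier into nested (label, children) pairs:

def parse_syn(syn_line):
    """syn_line: the content of a %syn tier, e.g. '<C S X V ... > [MC].'"""
    tokens = (syn_line.replace("<", " < ").replace(">", " > ")
                      .strip().rstrip(".").split())
    pos = 0

    def parse_clause():
        nonlocal pos
        pos += 1                               # consume '<'
        children = []
        while tokens[pos] != ">":
            if tokens[pos] == "<":
                children.append(parse_clause())
            else:
                children.append(tokens[pos])   # a role code such as S, V, D ...
                pos += 1
        pos += 1                               # consume '>'
        label = tokens[pos].strip("[]")        # clause type, e.g. '[CP]'
        pos += 1
        return (label, children)

    clauses = []
    while pos < len(tokens):
        clauses.append(parse_clause())
    return clauses

# parse_syn("<C S X V M M D < S V < R V I > [CP] > [RC] > [CC] < S V > [MC].")
# -> [('CC', ['C', 'S', 'X', 'V', 'M', 'M', 'D',
#             ('RC', ['S', 'V', ('CP', ['R', 'V', 'I'])])]),
#     ('MC', ['S', 'V'])]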

D.1.2.3.1 Trailing off, Interruption, Completion

An incomplete, but not interrupted, utterance is marked with the "trailing off" '+...' symbol on the main line.

Example:

*SAR: smells good enough for +...

*SAR: what is that?

If the speaker does not really get a chance to trail off before being interrupted by another speaker, then the interruption marker '+/.' is used instead. If the utterance that is being trailed off is a question, then the symbol '+..?' is used.

The symbol '+,' can be used at the beginning of a main tier line to mark the completion of an utterance after an interruption. It is complementary to the trailing off symbol.

Example:

*CHI: so after the tower +...

*EXP: yeah.

*CHI: +, I go straight ahead.

Others' completion is marked through '++'. This symbol can be used at the beginning of a main tier line to mark "latching", that is the completion of another speaker's utterance. It is complementary to the trailing off symbol.

Example:

*HEL: if Bill had known +...

*WIN: ++ he would have come.

D.1.2.3.2 Retrace-and-Repair Sequences

Retracing without correction (simple repetition) [/] takes place when speakers repeat words or whole phrases without change. The retraced material is put in angle brackets.

Example:

*BET: <I wanted> [/] I wanted to invite Margie.

Several repetitions of the same word can be indicated in the following way:

*HAR: It's(/4) like # a um # dog.

Retracing with correction [//] takes place when a speaker starts to say something, stops, repeats the basic phrase, and changes the syntax while maintaining the same idea. Usually the correction moves closer to the standard form, but sometimes it moves away from it. The retraced material is put in angle brackets.

Example:

*BET: <I wanted> [//] uh I thought I wanted to invite Margie.

Retracing with Reformulation [///] takes place when retracings involve full and complete reformulations of the message without any specific corrections.

Example:

*BET: all of my friends had [///] uh we had decided to go home for lunch.

Unclear Retracing Type is marked by [/?].

CHAT distinguishes a false start without retracing [/-] from false starts with correction. False starts with no retracing are dealt with in the following section. The symbols [/] and [//] are used when a false start is followed by a complete repetition or by a partial repetition with correction.
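For illustration only, the retrace markers above can be stripped to obtain a "most fluent" reading of a main line. The regular expression below is a deliberate simplification (it assumes the retraced material is either angle-bracketed or a single word, and ignores repetition counts and other bracketed codes):

import re

RETRACE = re.compile(r"(<[^>]*>|\S+)\s*\[/(?:/|//|\?|-)?\]\s*")

def strip_retraces(main_line):
    """Delete retraced material marked by [/], [//], [///], [/?] or [/-]."""
    return re.sub(r"\s+", " ", RETRACE.sub("", main_line)).strip()

# strip_retraces("<I wanted> [/] I wanted to invite Margie.")
# -> 'I wanted to invite Margie.'
# strip_retraces("<I wanted> [//] uh I thought I wanted to invite Margie.")
# -> 'uh I thought I wanted to invite Margie.'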

D.1.2.3.3 Anacolutha (syntactic blending)

If the speaker terminates an incomplete utterance and starts off on a totally new tangent, this can be coded with the [/-] symbol:

*BET: <I wanted> [/-] uh when is Margie coming?

Note that this coding is not in contrast with the coding of incomplete utterances (either trailed off or interrupted); which one is used depends solely on what a coder decides to count as an utterance.

D.2 CHRISTINE (SUSANNE)

The CHRISTINE corpus is to spoken dialogue what SUSANNE was to written corpora: a carefully annotated collection of real spoken material, covering British English only.

The CHRISTINE project uses the structural annotation scheme defined for the SUSANNE Corpus (probably the most detailed scheme of its kind yet produced). The definition of the SUSANNE scheme can be found in G. Sampson's book, "English for the Computer" (see Sampson, 1995). The EAGLES group asked for a copy of this book when it was in proof, and its contents (Chapter 6 in particular, which deals with extending annotation to spoken material) played a significant part in their decisions (see Section D.3 in this report for further details). In the CHRISTINE project, the annotation rules of Chapter 6 are being refined on the basis of experience in actually applying them to sizeable quantities of spontaneous spoken English. G. Sampson (personal communication) reports that in most respects what is being done only adds to already existing rules rather than changing them. The additional annotation rules are not yet, at the present stage, in a form fit for circulation.

The CHRISTINE project is due to be completed at the end of 1999. There may be a few months' "polishing" after that, but then or soon afterwards the annotated corpus will be made freely available to all comers, in the same way that the SUSANNE Corpus already is.

Some documentation available at:

http://iris1.let.kun.nl/TSpublic/tosca/index.html

D.3 Eagles 1996-98

Coding book:

documentation available at

http://www.ilc.pi.cnr.it/EAGLES96/annotate/annotate.html

Number of annotators:

not applicable

Number of annotated dialogues:

not applicable

Evaluations of scheme:

indirect evaluation through instantiation in many different projects (see usability)

Underlying task:

standard development

List of phenomena annotated:

list of relevant phenomena provided below

Examples:

list of relevant examples provided below

Mark-up language:

not applicable

Existence of annotation tools:

eagles conformant annotation tools developed in other projects

Usability:

schemes adopted in Multext, Sparkle, Parole

EAGLES is the ancestor of a family of standardization efforts for corpus annotation. It is therefore worth looking into the EAGLES methodology in some detail, as this will also offer a key to understanding the design and development of other Eagles-related annotation schemes.

D.3.1 Word-level classification issues

EAGLES provides a list of morphosyntactic (major) categories.
1. N   [noun]
2. V   [verb]
3. AJ  [adjective]
4. PD  [pronoun/determiner]
5. AT  [article]
6. AV  [adverb]
7. AP  [adposition]
8. C   [conjunction]
9. NU  [numeral]
10. I  [interjection]
11. U  [unique/unassigned]
12. R  [residual]
13. PU [punctuation]

They represent the most general and obligatory level of morphosyntactic annotation, in the sense that any set of morphosyntactic tags is expected to convey at least information about morphosyntactic categories.

The set of Eagles category tags is not formally consistent, in that it does not provide a minimal set of mutually exclusive morphosyntactic classes. See, for example, the umbrella-category PD, including both determiners and pronouns, and its coexistence with the overlapping category AT for articles. Accordingly there is no general expectation that the mapping between the EAGLES category tags and a language specific instantiation of it should be one-to-one.

Morphosyntactic categories can further be specified by means of appropriate morphosyntactic features (such as gender, number, case, etc.), expressed as supplementary tags. The combination of a category tag with its morphosyntactic feature specification yields complex tags of considerable length and granularity. As an illustration, we provide below the feature matrix detailed for the category verb.

Verbs (V)

(i)    Person:          1. First        2. Second       3. Third
(ii)   Gender:          1. Masculine    2. Feminine     3. Neuter
(iii)  Number:          1. Singular     2. Plural
(iv)   Finiteness:      1. Finite       2. Non-finite
(v)    Verbform/Mood:   1. Indicative   2. Subjunctive  3. Imperative   4. Conditional
                        5. Infinitive   6. Participle   7. Gerund       8. Supine
(vi)   Tense:           1. Present      2. Imperfect    3. Future       4. Past
(vii)  Voice:           1. Active       2. Passive
(viii) Status:          1. Main         2. Auxiliary

Examples of the use of this matrix are provided for what is called the "Intermediate Tag Set", a specific instantiation of a subset of the list of categories above. A 3rd person, singular, finite, indicative, past tense, active, main, non-phrasal, non-reflexive verb is represented as: V3011141101200

Wherever an attribute is inapplicable to a given word in a given tagset, the value 0 fills that attribute's place in the string of digits. When the 0s occur in final position, without any non-zero digits following, they can be dropped.
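The positional logic of such tags can be sketched as follows. The attribute order is the one given in the verb matrix above; the further positions of the Intermediate Tag Set (phrasality, reflexivity, etc.) are not reproduced here, so this sketch only covers the first eight digit positions of the example tag.

VERB_ATTRIBUTES = [
    ("person",     ["first", "second", "third"]),
    ("gender",     ["masculine", "feminine", "neuter"]),
    ("number",     ["singular", "plural"]),
    ("finiteness", ["finite", "non-finite"]),
    ("mood",       ["indicative", "subjunctive", "imperative", "conditional",
                    "infinitive", "participle", "gerund", "supine"]),
    ("tense",      ["present", "imperfect", "future", "past"]),
    ("voice",      ["active", "passive"]),
    ("status",     ["main", "auxiliary"]),
]

def encode_verb_tag(features):
    """features: dict attribute -> value; inapplicable attributes are encoded as 0."""
    digits = []
    for name, values in VERB_ATTRIBUTES:
        value = features.get(name)
        digits.append(str(values.index(value) + 1) if value else "0")
    return "V" + "".join(digits)

# encode_verb_tag({"person": "third", "number": "singular", "finiteness": "finite",
#                  "mood": "indicative", "tense": "past", "voice": "active",
#                  "status": "main"})
# -> 'V30111411', i.e. the first eight digit positions of V3011141101200 above.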

Eagles makes provision for disjunctive specification of morphosyntactic categories in cases of i) genuine systematic ambiguity in a given language (e.g. present indicative and present subjunctive forms in English, or some past participles and adjectives in Italian), ii) practical demands of fully automatic tagging.

D.3.1.1 Adverbs, Interjections, Interactional Markers

The interjection and adverb categories are much broader and more variegated than usually assumed in traditional grammar. Eagles 98 provides two illustrative lists of the level of granularity at which both categories can be subclassified, taken from Sampson (1995) and the London-Lund tagset respectively. In both cases a fine-grained functional or semantic analysis of the role of each subclass in dialogue interaction is presupposed. This makes both proposals prohibitively demanding for the purposes of automatic annotation. A practical strategy could be to add interjection to the Eagles inventory of part-of-speech categories and to provide a rich feature matrix for subclassification, under the assumption that only the topmost attribute (part of speech) be disambiguated in automatic tagging.

D.3.1.2 Pauses, Hesitators

Eagles 98 recommends treating pauses and hesitators as punctuation marks, eventually to be attached as high in the syntactic tree as possible during parsing.

D.3.1.3 Word Partials, Non Standard Forms

No specific recommendations are provided for word partials; the suggestion is tentatively put forward to use the peripheral part-of-speech category U ('unique' or 'unassigned', see the list above) for their tagging. Non-standard forms (e.g. 'gonna') are recommended to be transcribed with standard spelling. Deviations from this practice should be documented and justified.

D.3.2 Segmentation Issues

D.3.2.1 Multi-Words

Eagles 98 leaves open the question of whether multi-word units should be assigned a single tag or rather a multi-tag. Nor are representation issues addressed in any detail.

D.3.2.2 Error Coding

Coding of mistakes is neither envisaged nor excluded by Eagles 98 recommendations.

D.3.2.3 Phrase Partials

D.3.2.3.1 Trailing Off, Interruption, Completion

Eagles 98 provides a couple of illustrative examples of how syntactic incompleteness could be annotated. In the first one (drawn from the British National Corpus), syntactic incompleteness is annotated by means of a special marker (a slash following the non-terminal constituent label) tagging the incomplete constituent as a whole. In the second example (from Sampson 1995), no new label is introduced to mark the incomplete constituent, but only a place holder, '#', which marks the position of the missing element within the incomplete constituent.

It is emphasized that the examples provided are only indicative and should not be taken as standards in any way.

D.3.2.3.2 Retrace-and-Repair Section

Only one example is provided by way of illustration. Once more, it is drawn from Sampson 1995 and recast into an Eagles-conformant style. Both the retrace and the repair are kept within the minimal superordinate constituent, with the marker '#' used to signal the interruption point:

and that [NPs any bonus [RELCL he ] # money [RELCL he gets over that ]] is a bonus

It is not immediately clear from the example what word stretch the repair is meant to replace.

D.3.2.3.3 Anacolutha (syntactic blending)

Cases of syntactic blending are illustrated by means of a drastically incoherent sentence, annotated with maximal parse brackets enclosing the whole parsable unit and giving no information about its internal structure. This is what the guidelines of the British National Corpus call the 'structure minimization principle':

[and this is what the # the <unclear>] # [ what's name now # now ] # <pause> [ that when it's opened in nineteen ninety-two <pause> the communist block will be able to come through Germany this way in ]

D.4 LE Sparkle

The syntactic annotation schemes developed within SPARKLE are an example of instantiation of Eagles recommendations at the morphosyntactic and syntactic levels, specifically geared towards the completion of two different tasks: i) use of morphosyntactically and syntactically annotated corpora for (semi)automatic acquisition of lexical information from them, and ii) use of annotated material for multi-lingual information retrieval and speech recognition. Both tasks are being carried out on four different languages (namely English, French, German and Italian).

In Sparkle, bootstrapping lexical information from a corpus is modelled as the process of extracting typical contexts of usage of a given lexical item from a shallow-parsed corpus. The acquired information is eventually put to use either by providing a lexicalized version of the shallow parser, or by augmenting the lexicon of another, independent parser. In both cases, the ultimate goal of the lexicalized parser is to provide the analysis of a sentence in terms of functional relations holding between head words. The usefulness of this level of analysis is eventually assessed through industrial demonstrators for multilingual information retrieval and monolingual speech recognition.

Accordingly, Sparkle defines the following three possible levels of syntactic annotation:

i) chunking

ii) phrasal parsing

iii) functional parsing

In the following we review in detail levels i) and iii) only.

Coding book:

documentation available at

http://www.ilc.pi.cnr.it/sparkle.html

Number of annotators:

>5

Amount of annotated material:

600 annotated sentences of English, German and Italian

Evaluation of scheme:

Evaluation of automatic annotation over all levels available at: http://www.ilc.pi.cnr.it/sparkle.html

Underlying Task:

Language modelling for Speech Recognition, Multilingual Information Retrieval

List of phenomena annotated:

List of relevant phenomena provided below.

Examples:

Provided below.

Mark-up language:

SPARKLE's own format.

Existence of annotation tool:

Software available for English, German and Italian.

Usability:

Speech Recognition and Multilingual Information Retrieval.

Contact Person:

Vito Pirrelli (vito@ilc.pi.cnr.it)

D.4.1 Word-level Classification Issues

SPARKLE did not develop a specific set of word-level tags; it simply built on pre-existing Eagles96-conformant part-of-speech encoding schemes. A straightforward extension of these schemes should make provision for the additional tags needed to cover phenomena which are specific to dialogue.

D.4.2 Segmentation Issues

In SPARKLE, segmentation problems are dealt with differently depending on which level of syntactic annotation one is considering. For the specific purposes of the present overview, we limit ourselves to chunking and functional annotation only. This is done for ease of exposition, as these two levels, unlike complete phrase-structure trees, are clearly complementary and exemplify two profoundly different perspectives on syntactic annotation: one based on the linear arrangement of word forms in a sentence and on the internal cohesion of relatively small syntactic islands, the other on an abstract representation of grammatical functions relative to a verb head. Traditionally, complete phrase-structure trees are assumed to convey both types of information simultaneously. For reasons that will become clear in a moment, syntactic annotation of dialogue favours a view whereby the linear adjacency of word forms on the one hand and the encoding of functional information on the other are dealt with separately.

D.4.2.1 Chunking in Sparkle

In what follows, we first exemplify the SPARKLE approach to chunking through detailed illustration of the Italian chunking scheme.

The typology of phrase chunks in the Italian chunking annotation scheme is summarised in the table below.

NAME       TYPE                  POT GOV                 EXAMPLES
ADJ_C      adjectival chunk      adj                     bello 'nice', molto bello 'very nice'
BE_C       predicative chunk     adj, past part          è bello '(it/(s)he) is nice', è caduto '(it/he) fell'
ADV_C      adverbial chunk       adv                     sempre 'always'
SUBORD_C   subordinating chunk   conj                    quando 'when', dove 'where'
N_C        nominal chunk         noun, pron, verb, adj   la mia casa 'my house', io 'I', questo 'this', l'aver fatto 'having done', il bello 'the nice (one)'
P_C        prepositional chunk   noun, pron, verb, adj   di mio figlio 'of my son', di quello 'of that (one)', dell'aver fatto 'of having done', del bello 'of the nice (one)'
FV_C       finite verbal chunk   verb                    sono stati fatti '(they) have been done', rimangono '(they) remain'
G_C        gerundival chunk      verb                    mangiando 'eating'
I_C        infinitival chunk     verb                    per andare 'to go', per aver fatto 'to have done'
PART_C     participial chunk     verb                    finito 'finished'

Table 1: Typology of phrase chunks

The following informal definitions are intended to make the assumptions underlying this schema fully explicit. More on this can be found in SPARKLE WP1 final report (Carroll et al. 1996), and related papers (Federici et al. 1996 and 1998).

ADJ_C

ADJ_Cs are chunks beginning with any premodifying adverbs and intensifiers and ending with a head adjective. This definition provides a necessary but not sufficient condition for identification of ADJ_C. In fact, adjectival phrases occurring in pre-nominal position are not marked as distinct chunks since their relationship to the governing noun is unambiguously identified within the nominal chunk (see example sentence above). The same holds in the case of predicate adjectival phrases governed by the verb essere 'be', which are part of BE_C (see below).

BE_C

BE_Cs consist of a form of the verb essere 'be' and an ensuing adjective/past participle including any intervening adverbial phrase. E.g.:

[BE_C è intelligente BE_C] '(he) is intelligent'

[BE_C è molto bravo BE_C] '(he) is very good'

[BE_C è appena arrivato BE_C] '(he) just arrived'

ADV_C

ADV_Cs extend from any adverbial pre-modifier to the head adverb. Once more, this definition provides a necessary but not sufficient condition for ADV_C. In fact, adverbial phrases that occur between an auxiliary and a past participle form are not identified as distinct chunks due to their unambiguous dependency on the verb. By the same token, adverbs which happen to immediately premodify verbs or adjectives are respectively part of a verbal chunk and an adjectival chunk. Finally, noun phrases used adverbially (e.g. questa mattina 'this morning') are treated as nominal chunks (see below). E.g.:

[FV_C ha sempre camminato FV_C] [ADV_C molto ADV_C] '(he) has always walked a lot'

[FV_C ha finito FV_C] [ADV_C molto rapidamente ADV_C] '(he) has finished very quickly'

SUBORD_C

SUBORD_Cs are chunks which include a subordinating conjunction. Subordinating conjunctions are chunked as independent chunks in their own right only when they are not immediately followed by a verbal group. Compare, for example, the chunk structure of the following sentence

[FV_C non so FV_C] [SUBORD_C quando SUBORD_C] [N_C il direttore N_C] [FV_C mi riceverà FV_C] '(I) do not know when the director will receive me'

with the chunk structure of the following sentence, which differs from the previous one in having the subject of the subordinate clause in postverbal position:

[FV_C non so FV_C] [FV_C quando mi riceverà FV_C] [N_C il direttore N_C].

N_C

N_Cs extend from the beginning of the noun phrase to its head. They include nominal chunks headed by nouns, pronouns, verbs in their infinitival form when preceded by an article (i.e. Italian nominalised infinitival constructions) and proper names. Noun phrases functioning adverbially (e.g. questa mattina 'this morning') are also treated as nominal chunks. All kinds of modifiers and/or specifiers occurring between the beginning of the noun phrase and the head are included in N_Cs. E.g.:

[N_C un bravo bambino N_C] 'a good boy'

[N_C tutte le possibili soluzioni N_C] 'all possible solutions'

[N_C i sempre più frequenti contatti N_C] 'the always more frequent contacts'

[N_C questo N_C] 'this'

[N_C il camminare N_C] 'walking'

[N_C il bello N_C] 'the nice (one)'

In the chunking scheme, nominal chunks cover only a portion of the range of linguistic phenomena normally taken care of by nominal phrases: namely only noun phrases with prenominal complementation.

P_C

P_Cs go from a preposition to the head of the ensuing nominal group. Most of the criteria given for N_Cs also apply to this case. Typical instances of P_Cs are:

[P_C per i prossimi due anni P_C] 'for the next two years'

[P_C fino a un certo punto P_C] 'up to a certain point'

FV_C

FV_Cs include all intervening modals, ordinary and causative auxiliaries as well as medial adverbs and clitic pronouns, up to the head verb. E.g.:

verbal chunk with auxiliary or modal verb and medial adverb:

[FV_C può ancora camminare FV_C] '(he) can still walk'

verbal chunk with pre-modifying adverb:

[FV_C non ha mai fatto FV_C] [ADV_C così ADV_C] '(he) has never done so'

the auxiliary essere 'be' in periphrastic verb forms (whether active or passive) such as sono caduto 'I fell', sono stato colpito 'I was hit', or mi sono accorto 'I realized', is dealt with as part of a finite verb chunk, unless the verb essere is followed by a past participle which the dictionary also categorizes as an adjective; in the latter case it is chunked as a BE_C (see above).

[FV_C è FV_C] [N_C un simpatico ragazzo N_C] '(he) is a nice guy'

fronted auxiliaries constitute separate FV_Cs:

[FV_C può FV_C] [N_C la commissione N_C] [I_C deliberare I_C] [P_C su questa materia P_C]? 'can the Commission deliberate on this topic?'

periphrastic causative constructions:

[FV_C fece studiare FV_C] [N_C il bambino N_C] '(he) let the child study'

clitic pronouns are part of the chunk headed by the immediately adjacent verb:

[FV_C lo ha sempre fatto FV_C] '(he) has always done it'

G_C

G_Cs contain a gerund form. When part of a tensed verb group (e.g. in progressive constructions), the gerundival verb form is not marked independently. G_C also includes gerund forms functioning as noun phrases.

[FV_C sta studiando FV_C] '(he) is studying'

[G_C studiando G_C] [FV_C ho imparato FV_C] [ADV_C molto ADV_C] 'by studying (I) have learned a lot'

I_C

Infinitival chunks (I_Cs) include both bare infinitives and infinitives introduced by a preposition.

[FV_C ha promesso FV_C] [I_C di arrivare I_C] [ADV_C presto ADV_C] '(he) has promised to arrive early'

[FV_C desidera FV_C] [I_C partire I_C] [ADV_C domani ADV_C] '(he) wishes to leave tomorrow'

PART_C

A past participle chunk (PART_C) includes participial constructions such as:

[PART_C finito PART_C] [N_C il lavoro N_C] , [N_C Giovanni N_C] [FV_C andò FV_C] [P_C a casa P_C] '(having) finished the job, John went home'
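As a purely illustrative sketch (the data structure is ours, not a SPARKLE format), chunked output of the bracketed kind used throughout this section can be represented and rendered as follows:

from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    label: str           # one of the chunk names in Table 1, e.g. "N_C", "FV_C"
    tokens: List[str]    # the word forms spanned by the chunk, in surface order

def render(chunks: List[Chunk]) -> str:
    """Render a chunk sequence in the bracketed notation used above."""
    return " ".join(f"[{c.label} {' '.join(c.tokens)} {c.label}]" for c in chunks)

example = [
    Chunk("FV_C", ["non", "so"]),
    Chunk("SUBORD_C", ["quando"]),
    Chunk("N_C", ["il", "direttore"]),
    Chunk("FV_C", ["mi", "riceverà"]),
]
print(render(example))
# [FV_C non so FV_C] [SUBORD_C quando SUBORD_C] [N_C il direttore N_C] [FV_C mi riceverà FV_C]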

D.4.2.2 Examples of usage

In this section we illustrate, by way of exemplification, the chunking of linguistic phenomena which are typical of dialogues. Examples are only indicative and represent an adaptation to English material of the principles underlying the Italian chunking schema outlined above.

multi-words

Chunking presupposes prior identification and marking of multi-word units.

error coding

Chunking presupposes prior identification and marking of errors and non standard forms.

trailing off, interruption, completion

*SAR: [FV_C smells FV_C] [Adj_C good enough Adj_C] [P_C for P_C]

retrace-and-repair sequences

*BET: [FV_C I wanted FV_C] [filler_C uh filler_C][FV_C I thought FV_C] [FV_C I wanted FV_C] [I_C to invite I_C] [N_C Margie N_C].

anacolutha (syntactic blending)

*BET: [FV_C I wanted FV_C] [filler_C uh filler_C] [WH_C when WH_C] [FV_C is Margie coming FV_C] [Punct_C ? Punct_C]

D.4.2.3 Sparkle Functional Annotation

In EAGLES, a three-layered approach to the specification of grammatical dependencies for verbal arguments was followed (Sanfilippo et al., 1996). The first layer identifies the subject/complement and predicative distinctions as the most general specifications; this layer is regarded as encoding mandatory information. The second layer provides a further partition of complements into direct and indirect, as recommended specifications. Finally, a more fine-grained distinction, qualified as useful, is envisaged, introducing further labels for clausal complements and second objects.

The first step in tailoring the EAGLES standards to the needs of SPARKLE has been to make provision for modifiers. These were not treated in EAGLES, since only subcategorizable functions were taken into consideration. Secondly, the relationship among layers of grammatical dependency specifications has been interpreted in terms of hierarchical links.

In general, grammatical relations (GRs) are viewed as specifying the syntactic dependency which holds between a head and a dependent. In the event of morphosyntactic processes modifying head-dependent links (e.g. the passive, dative shift and causative-inchoative diatheses), two kinds of GRs can be expressed:

  1. the initial GR, i.e. before the GR-changing process occurs
  2. the final GR, i.e. after the GR-changing process occurs

For example, Paul in Paul was employed by Microsoft is the final subject and initial object of employ. The hierarchical organisation of GRs is shown graphically in Figure 2 below.


Figure 2: GR Hierarchy

Each GR in the current version of the scheme is described individually below.

mod(type,head,dependent)

The relation between a head and its modifier; where appropriate, type indicates the word introducing the dependent; e.g.

mod(_,flag,red)

a red flag

mod(_,walk,slowly)

walk slowly

mod(with,walk,John)

walk with John

mod(while,walk,talk)

walk while talking

mod(_,Picasso,painter)

Picasso the painter

mod is also used to encode the relation between an event noun (including deverbal nouns) and its participants; e.g.

mod(of,gift,book)

the gift of a book

mod(by,gift,Peter)

the gift of a book by Peter

mod(of,examination,patient)

the examination of the patient

mod('s,doctor,examination)

the doctor's examination of the patient

cmod,xmod,ncmod

Clausal and non-clausal modifiers may (optionally) be distinguished by the use of cmod / xmod, and ncmod respectively, each with slots the same as mod. The GR cmod is for when the adjunct is controlled from within, and xmod for control from without. E.g.

cmod(because,eat,be)

he ate the cake because he was hungry

xmod(without,eat,ask)

he ate the cake without asking

arg_mod(type,head,dependent,initial_gr)

The relation between a head and a semantic argument which is syntactically realised as a modifier; thus a by-phrase can be analysed as a `thematically bound adjunct'. The type slot indicates the word introducing the dependent: e.g.

arg_mod(by,kill,Brutus,subj)

killed by Brutus

subj(head,dependent,initial_gr)

The relation between a predicate and its subject; where appropriate, the initial_gr indicates the syntactic link between the predicate and subject before any GR-changing process:

subj(arrive,John,_)

John arrived in Paris

subj(employ,Microsoft,_)

Microsoft employed 10 C programmers

subj(employ,Paul,obj)

Paul was employed by Microsoft

With pro-drop languages such as Italian, when the subject is not overtly realised the annotation is, for example, as follows:

subj(arrivare,Pro,_)

arrivai in ritardo '(I) arrived late'

where the dependent slot is filled by the abstract filler Pro, which indicates that person and number of the subject can be recovered from the inflection of the head verb form.

csubj,xsubj,ncsubj

The GRs csubj and xsubj may be used for clausal subjects, controlled from within, or without, respectively. ncsubj is a non-clausal subject. E.g.

csubj(leave,mean,_)

that Nellie left without saying good-bye meant she was still angry

xsubj(win,require,_)

to win the America's Cup requires heaps of cash

dobj(head,dependent,initial_gf)

The relation between a predicate and its direct object--the first non-clausal complement following the predicate which is not introduced by a preposition (for English and German); initial_gf is iobj after dative shift; e.g.

dobj(read,book,_)

read books

dobj(mail,Mary,iobj)

mail Mary the contract

iobj(type,head,dependent)

The relation between a predicate and a non-clausal complement introduced by a preposition; type indicates the preposition introducing the dependent; e.g.

iobj(in,arrive,Spain)

arrive in Spain

iobj(into,put,box)

put the tools into the box

iobj(to,give,poor)

give to the poor

obj2(head,dependent)

The relation between a predicate and the second non-clausal complement in ditransitive constructions; e.g.

obj2(give,present)

give Mary a present

obj2(mail,contract)

mail Paul the contract

ccomp(type,head,dependent)

The relation between a predicate and a clausal complement which does have an overt subject; type indicates the complementiser / preposition, if any, introducing the clausal XP. E.g.

ccomp(that,say,accept)

Paul said that he will accept Microsoft's offer

ccomp(that,say,leave)

I said that he left

xcomp(type,head,dependent)

The relation between a predicate and a clausal complement which has no overt subject (for example a VP or predicative XP). The type slot is the same as for ccomp above.

E.g.

xcomp(to,intend,leave)

Paul intends to leave IBM

xcomp(_,be,easy)

Swimming is easy

xcomp(in,be,Paris)

Mary is in Paris

xcomp(_,be,manager)

Paul is the manager

Control of VPs and predicative XPs is expressed in terms of GRs. For example, the unexpressed subject of the clausal complement of a subject-control predicate is specified by saying that the subject of the main and subordinate verbs is the same:

Paul intends to leave IBM

subj(intend,Paul,_)

xcomp(to,intend,leave)

subj(leave,Paul,_)

dobj(leave,IBM,_)

arg(head,dependent)

The hierarchical organisation of GRs makes it possible to use underspecified GRs where no reliable bias is available for disambiguation. For example, both Gianni and Mario can be subject or object in the Italian sentence

Mario, non l'ha ancora visto, Gianni

'Mario has not seen Gianni yet' / 'Gianni has not seen Mario yet'

In this case, the parser could avoid having to try to resolve the ambiguity by using the underspecified GR arg, e.g.

arg(vedere,Mario)

arg(vedere,Gianni)

dependent(introducer,head,dependent)

The most generic relation between a head and a dependent (i.e. it does not specify whether the dependent is an argument or a modifier). E.g.

dependent(in,live,Rome)

Marisa lives in Rome

D.4.2.4 Examples of usage

It can be argued quite convincingly that the level of functional annotation (or any other syntactic representation which abstracts away dramatically from the surface ordering of syntactic units in a sentence) is relatively independent of the specific utterance through which grammatical functions happen to be concretely realized. For example, given the following orthographic transcription

i)

I I I go away

where the pronoun "I" is uttered thrice, it still makes sense to say that the subject of "go away" is one (namely the pronoun "I"), and that it just happens to be repeated more than once, owing to some extra-grammatical factors. The neat separation between chunked representations (where concretely realized syntactic units matter) on the one hand and the level of functional representation on the other hand, allows the annotator to get around somewhat puzzling issues such as "which one of the three overtly realized instances of 'I' is the subject of this utterance?". In fact it makes comparatively little sense to associate the label "subject" with any particular token of "I" in i) above. A level of annotation which abstracts away from the level of linear representation embodied in i) achieves this purpose:

subj(go, I,_)

Still, linking the functionally annotated material with elements of i) can be useful. This could be achieved as follows: a) first, the three pronouns in a row are signalled as a repetition at some level of "edited" orthographic transcription; b) a target form ("I") is then added to the surface representation; c) finally, the target form is linked to the functionally annotated material.
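A minimal sketch of steps a)-c) above (the data layout is an assumption made for illustration, not a SPARKLE format):

surface = ["I", "I", "I", "go", "away"]             # i) the orthographic transcription

edited = {
    "targets": {"t1": "I", "t2": "go", "t3": "away"},
    "links":   {0: "t1", 1: "t1", 2: "t1",           # the three 'I' tokens share one target
                3: "t2", 4: "t3"},
}

grs = [("subj", "t2", "t1", None)]                   # subj(go, I, _), stated over targets

def surface_tokens_for(target_id):
    """Recover which surface positions realize a given target form."""
    return [i for i, t in edited["links"].items() if t == target_id]

# surface_tokens_for("t1") -> [0, 1, 2]: all three realizations of the subject "I"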

D.5 OVIS

Coding book:

No coding book is publicly available. References can be found at http://grid.let.rug.nl:4321

See also Bod and Scha (1997).

Number of annotators:

missing information

Number of annotated dialogues:

21000 sentences, Dutch

Evaluation of scheme:

missing information

Underlying task:

Information-seeking, telephone-mediated human-machine dialogues for travel/transport domain.

Examples:

no examples available

Mark-up Language:

missing information

Existence of annotation tools:

Annotation was done semi-automatically, using a tool called SEMTAGS.

Usability:

Used in the OVIS interactive spoken language system, which provides travel information to users of public transport in the Netherlands.

Contact person:

Rens Bod (Rens.Bod@let.uva.nl)

List of phenomena annotated:

The OVIS system aims at large-vocabulary, speaker-independent continuous speech recognition technology, combined with natural language processing based on a probabilistic partial parsing approach. The OVIS NLP component is a statistically based language processing system, built on the 'Data-Oriented Parsing' system developed and implemented at the Department of Computational Linguistics of the University of Amsterdam.

Hesitations, false starts, and additional noises produced by speakers are annotated at the morpho-syntactic level. The following is a slightly more detailed description of information represented at the syntactic and semantic levels of analysis.

1. Syntactic annotation

Syntactic annotation starts from a minimum level consisting in the bracketing of constituents. Sentences are annotated with labelled constituent trees, as in the ATIS corpus. The syntactic categories have been reconsidered to fit the needs of the application. The original, linguistically inspired annotation convention has received considerable revision: in particular, certain rather broad categories were introduced that are non-standard in linguistic theories, for instance a notion of 'modifier-phrase' which includes adverbs, PPs, and various kinds of conjunctions and other combinations of such constituents. Other ad hoc categories have been introduced to deal with peculiarities of Dutch word order which do not fit well in a purely surface-based syntactic description without features.

The grammar covers most of the common verbal subcategorization types (intransitives, transitives, verbs selecting app, and modal and auxiliary verbs), np-syntax (including pre- and postnominal modification, with the exception of relative clauses), pp-syntax, the distribution of vp-modifiers, various clausal types (declaratives, yes/no and wh-questions, and subordinate clauses), all temporal expressions and locative phrases relevant to the domain, and various typical spoken language constructs.

2. Semantic/pragmatic annotation

Every meaningful node is annotated with a formula expressing that meaning; if the meaning of a node depends on its daughter nodes, this formula contains variables referring to those daughter node meanings. When a new tree is constructed out of subtrees with such annotations, it is obvious how to compute the meaning of this tree.
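The OVIS semantic formalism itself is not reproduced here, so the following toy sketch only illustrates the compositional mechanism just described: node formulas are plain strings in which d1, d2, ... stand for daughter meanings, and the meaning of a tree is obtained by bottom-up substitution. The node annotations are hypothetical, for illustration only.

def meaning(node):
    """node = (formula, [daughters]); leaves have an empty daughter list."""
    formula, daughters = node
    for i, daughter in enumerate(daughters, start=1):
        formula = formula.replace(f"d{i}", meaning(daughter))
    return formula

leaf_from = ("origin(d1)", [("amsterdam", [])])
leaf_to   = ("destination(d1)", [("rotterdam", [])])
top       = ("d1 & d2", [leaf_from, leaf_to])

print(meaning(top))   # origin(amsterdam) & destination(rotterdam)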

D.6 The Lancaster/IBM Spoken English Corpus (SEC)

Annotation for the Spoken English Corpus (SEC) is based on the LOB Corpus tag-set. Almost every SEC tag is identical to its LOB equivalent. The major difference between the tag-sets is that LOB differentiates between relative and interrogative WH-pronouns whereas SEC does not. For example, the LOB tags WP (WH-pronoun, interrogative, nominative or accusative) and WPR (WH-pronoun, relative, nominative or accusative) are covered by a single SEC tag. Confusingly, this tag is also called WP but, unlike its LOB namesake, does not imply that the WH-pronoun is interrogative. The following table details the major differences between LOB and SEC with regard to WH-pronouns:
Tag     Description in SEC                      Description in LOB
WP      WH-pronoun, nominative or accusative    WH-pronoun, interrogative, nominative or accusative
WPR     not used in SEC (WP used instead)       WH-pronoun, relative, nominative or accusative
WP$     WH-pronoun, genitive                    WH-pronoun, relative, genitive
WP$R    not used in SEC (WP$ used instead)      WH-pronoun, relative, genitive
WPO     WH-pronoun, accusative                  WH-pronoun, interrogative, accusative
WPOR    not used in SEC (WPO used instead)      WH-pronoun, relative, accusative

As its name implies, the Spoken English Corpus is composed of transcriptions of spoken English. This inherently means that there will be differences between it and the LOB corpus, which comprises written texts only. Phenomena that occur primarily in written English will not be found in SEC. A good example is written abbreviations: these were marked in LOB in a pre-automatic-tagging phase by adding the sequence '\0' to the start of the abbreviated token, whereas this is not required in SEC.

Some of the LOB tags do not appear in SEC even though, in theory, they would have been allowable. This is because, at just over 52 thousand words, SEC is much smaller than LOB which has over a million words. Naturally, in such a small corpus the coverage of rare parts-of-speech was reduced. This can also explain why annotation of SEC did not call for a significant extension of the LOB tagset.

Further information on the SEC can be found in Taylor and Knowles (1988) and at the International Computer Archive of Modern English (ICAME) corpus collection (http://nora.hd.uib.no/corpora.html).

D.7 SWITCHBOARD

Coding book:

Marie Meteer et al. 1995. Disfluency Annotation Stylebook for the Switchboard Corpus.

(ftp://ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL-book.ps)

Number of annotators:

missing information

Number of annotated dialogues:

2430 conversations, more than 240 hours, 3 million words

Evaluations of scheme:

missing information

Underlying task:

missing information

List of phenomena annotated:

list of relevant phenomena provided below

Examples:

list of relevant phenomena provided below

Mark-up language:

missing information

Existence of annotation tools:

missing information

Usability:

missing information

Contact person:

Linguistic Data Consortium (ldc@ldc.upenn.edu)

D.7.1 Word-Level Classification Issues

D.7.1.1 Adverbs, Interjections, Interactional Markers

Explicit editing terms (such as 'I mean') and discourse markers (such as 'well') are annotated as '{E...}' and '{D...}' respectively. The use of curly brackets allows a sequence of words to be annotated as a unit, by simply enclosing it within the brackets.

Example:

{E I would say}

D.7.1.2 Pauses, Hesitators

Only filled pauses (hesitators) are marked, by means of '{F...}'.

D.7.1.3 Word Partials, Non Standard Forms

Fragmented or incomplete words are marked in the transcription with '-'.

Example:

you kn-

D.7.2 Segmentation Issues

Transcribed texts are subdivided primarily into so-called "slash units". A slash unit is maximally a sentence but can be a smaller unit. Slash units below the sentence level correspond to those parts of the narrative which are not sentential but which the annotator interprets as complete.

D.7.2.1 Multi-Words

Annotation makes provision for marking sequences of more than one word with one label only by encompassing them between curly brackets.

D.7.2.2 Error Coding

No specific marker is envisaged for this purpose.

D.7.2.3 Phrase Partials

D.7.2.3.1 Trailing Off, Interruption, Completion

When a turn does not constitute a complete constituent, it is marked as incomplete with the symbol '-/'. It is possible for the speaker to continue over more than one turn. In this case, the annotation guidelines make provision for use of the symbol '- -'. Combination of the two symbols means the following:

'- - -/' interruption with constituent left incomplete and following completion

Example:

A: I'll do it if - - - /

B: Yeah/

A: - - you wish/

'- - /' interruption with complete slash unit and following completion

Example:

A: I'll do it - - /

B: Yeah/

A: - - if you wish/

'- -' interruption with neither incomplete constituent nor complete slash unit, and following completion

Example:

A: If you wish - -

B: Yeah/

A: - - I'll do it/

D.7.2.3.2 Retrace-and-Repair Sequences

The entire restart with its repair is contained in square brackets. The Interruption Point is marked by a '+'.

Example:

[ we're + at the same time we're ] real scared
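The bracket-plus-'+' notation can be unpacked mechanically, as in the rough sketch below (it assumes exactly one interruption point per bracketed restart and no nesting):

import re

RESTART = re.compile(r"\[\s*(.*?)\s*\+\s*(.*?)\s*\]")

def restarts(text):
    """Return (reparandum, repair) pairs for each bracketed restart."""
    return RESTART.findall(text)

def repaired_reading(text):
    """Keep only the repair of each restart."""
    return re.sub(r"\s+", " ", RESTART.sub(lambda m: m.group(2), text)).strip()

# restarts("[ we're + at the same time we're ] real scared")
# -> [("we're", "at the same time we're")]
# repaired_reading("[ we're + at the same time we're ] real scared")
# -> "at the same time we're real scared"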

D.7.2.3.3 Anacolutha (syntactic blending)

Syntactic blending is treated as a kind of incomplete slash unit, if the speaker continues speaking but has obviously begun a new slash-unit.

Example:

when it comes to being alone -/ now if you give him the freedom to walk around, he likes that/

D.8 TRAINS

The TRAINS project at the University of Rochester Department of Computer Science is a long-term effort to develop an intelligent planning assistant that is conversationally proficient in natural language. The goal is a fully integrated system involving on-line spoken and typed natural language together with graphical displays and GUI-based interaction. The primary application has been a planning and scheduling domain involving a railroad freight system, where the human manager and the system must co-operate to develop and execute plans.

The current system prototype, named TRIPS (The Rochester Interactive Planning System), involves a more realistic domain and more complicated planning problems, while continuing the emphasis on dialogue-based, mixed-initiative interaction.

Coding book:

No coding book is available, but information can be found in Core and Schubert (1997).

Number of annotators:

missing information

Amount of annotated material:

Altogether, the Trains-93 corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5900 speaker turns, and 55,000 transcribed words. The collection and transcription of the dialogues is documented in the technical note "The Trains 93 Dialogues"

(ftp://ftp.cs.rochester.edu/pub/papers/ai/94.tn2.Trains_93_dialogues.ps.gz)

The transcriptions themselves are available at http://www.cs.rochester.edu/research/speech/93dialogs

Evaluations of scheme:

missing information

Underlying task:

Task-driven, application-oriented problem solving dialogues. The dialogues involve two participants: one who plays the role of a user and has a certain task to accomplish, and another who plays the role of the system by acting as a planning assistant.

List of phenomena annotated and examples:

For some of the phenomena annotated at the morpho-syntactic level, see the general description below.

Mark-up language:

missing information

Existence of annotation tools

For collecting and annotating "The Trains 93 Dialogues", a set of tools has been developed for converting a DAT recording into a fully segmented and annotated dialogue. These tools allow the user to progress stepwise through this process: creating the initial dialogue audio file, breaking up the dialogue into a sequence of single-speaker utterance files that preserve the sequentiality of the dialogue, annotating the utterance files, printing the contents of the dialogue, and updating the breakup of the dialogue. These tools are described in the Trains technical note "Dialogue Transcription Tools" (ftp://ftp.cs.rochester.edu/pub/papers/ai/94.tn1.Dialogue_transcription_tools.ps.Z)

and are available through ftp, as well as on the CD-ROM. The toolset itself is available in a tar file at ftp://ftp.cs.rochester.edu/pub/packages/dialog-tools/toolset.tar.gz.

Usability:

Used in the TRAINS system.

The collected dialogues have played an integral part in the Trains project. They have also been used to train a parser that uses statistical preferences, and to train a part-of-speech tagger that models speech repairs (cf. Heeman and Allen, 1994) (ftp://ftp.cs.rochester.edu/pub/papers/ai/94.heeman.ARPA_HLT.ps.Z)

Contact person:

James Allen (james@cs.rochester.edu)

A short description

The TRAINS project is worth mentioning as an example of how the exigencies of spoken language can be accommodated in software development. In particular, the TRAINS project is especially relevant for our purposes in that it adopts an integration rather than a normalization strategy (see Section 5.1.1 of this report).

The traditional approach consists in removing disfluencies before they reach the parser or in having the parser skip over such material. However reasonable, this approach not only abstracts from real data but also neglects the important roles such segments can play in the dialogue structure. Repairs, for example, can contain referents that are needed to interpret subsequent text (e.g., Take the oranges to Elmira, uh, I mean, take them to Corning).

In contrast to the above strategy, the alternative adopted in TRAINS is a parser-level approach that includes in phrase structure those disfluencies (such as repairs, hesitations and overlapping backchannel acknowledgments) that constitute a common problem for parsers for mixed-initiative dialogues.

To handle the disfluencies in mixed-initiative dialogues caused by repairs, hesitations and acknowledgments, the dialogue parser uses metarules that allow the chart of a dialogue parser to contain parallel syntactic structures (what was first said and its correction) in the case of repairs, and interleaved syntactic structures in the case of interruptions.

The editing term metarule allows constituents to skip over words signaling turn keeping (um, ah) and repairs (I mean).

In the structure allowed by the metarule, a constituent may be interrupted between two subconstituents by one or more editing terms, and a constituent can be interrupted in more than one location.

In the case of overlapping acknowledgments and continuation prompts, such as 'okay', 'right' etc. uttered by the second speaker in overlap with the 'main' talk, the continuation metarule allows a constituent to overlap or be embedded inside another constituent to which it is unconnected. In this way, a constituent can be built across tracks.

An interruption metarule is used to deal with interjected corrections, questions, and comments separately from any repair that may follow. An example of interruption is the following:

u: then e1 will have

s: oh e1

u: right

two boxcars of oranges

In the case of repairs, a repair metarule operates on what is being corrected (or reparandum) and the correction (or the alteration), to build parallel phrase structure trees: one with the reparandum and one with the alteration. For example, for an utterance such as "Take the ban- um the oranges", the repair metarule would build two VPs: take the ban- and take the oranges.
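A toy illustration of what the repair metarule licenses (not the TRAINS chart parser itself): given the reparandum, editing terms and alteration of the utterance above, it yields the two parallel word sequences over which phrase structure is built.

def parallel_readings(prefix, reparandum, editing_terms, alteration, suffix=()):
    """Editing terms appear in neither reading: constituents skip over them."""
    with_reparandum = list(prefix) + list(reparandum) + list(suffix)
    with_alteration = list(prefix) + list(alteration) + list(suffix)
    return with_reparandum, with_alteration

r1, r2 = parallel_readings(
    prefix=["Take"],
    reparandum=["the", "ban-"],
    editing_terms=["um"],          # skipped by the editing term metarule
    alteration=["the", "oranges"],
)
# r1 -> ['Take', 'the', 'ban-']     the VP over what was first said
# r2 -> ['Take', 'the', 'oranges']  the VP over the correction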

This parsing framework has two relevant consequences. First, it allows the parser to accommodate disfluency phenomena, thus leaving important aspects of dialogue structure untouched. Second, in this way the parser has information about the syntactic structure of the utterance and about the range of allowed structures; these sources of information are absent from preprocessing and normalizing routines, while the dialogue parser can still use acoustic cues, pattern matching, and the other sources of information exploited by preprocessing techniques.