Evaluation of NLP technology for AAC using logged data
Ann Copestake and Dan Flickinger
Center for the Study of Language and Information (CSLI), Stanford University
We discuss our experiences with evaluation of NLP techniques in an AAC system using data logged from a user as the primary source for comparison. We show that logged data can be used in simulations to quickly compare a range of prediction algorithms on a realistic data-set. It can also be used to compare the actual performance of an algorithm with its theoretical performance. We argue that development of novel AAC techniques can be guided by examination of collected material, and that some initial evaluation is possible in advance of testing with a user. There are, however, several potential problems with the use of logged data, which we will discuss. We will also describe how audiotaping reveals extra facets of the interaction with an AAC user and thus to some extent complements the use of logged data.
The aim of this paper is to illustrate an approach to evaluating NLP techniques for use in research on AAC systems. Evaluation of NLP technology in general is a notoriously thorny issue (for a summary of approaches see Sparck Jones and Galliers, 1995). However, our immediate goals are relatively restricted. Firstly, we are currently primarily interested in evaluation up to the level of research prototypes: thus rather than looking at complete systems, we are using formal and informal evaluation techniques to compare the performance of different NLP algorithms on the same task, and also to screen possible NLP techniques for their likely utility in tackling problems in AAC. Evaluation of techniques in isolation is much simpler than evaluation of complete systems, since it involves looking at specific criteria relevant to particular aspects of performance. Secondly, our current aims in evaluation are primarily internal (that is, directed at development within a project) rather than external (allowing comparison between sites). We will return to some of the issues involved in extending this work to external evaluation at the end of the paper.
In what follows, we will first describe the methodology we adopted for data-logging, then discuss how the logged data was used to evaluate prediction algorithms. The techniques used in the prediction experiments and the results obtained have been reported previously (Copestake, 1996, 1997), so our purpose here is to discuss the use of logged data in more detail and to evaluate its advantages and disadvantages. We then describe more informal evaluation using logged data as a way of looking at recurrent conversational situations and some additional information revealed by audio-taping experiments. We will conclude with a discussion of how our data-logging methodology might be improved, and the role data-logging plays within a wider evaluation context.
The experiments described make use of data collected from JL, who has been using a prototype AAC system that was developed at CSLI as his main communication aid. JL has lost speech due to amyotrophic lateral sclerosis (ALS or Lou Gehrig's disease). The prototype system comprises software designed to aid text input to a text-to-speech generator (JL uses an external DecTalk). The system runs on a standard laptop while still allowing the use of other software (email, Web browser etc.). It incorporates word prediction, based on word frequencies trained dynamically on the user's input, and also a small number of user-defined fixed phrases, accessible via dedicated keys. A much larger set of fixed phrases is available via a menu interface, but we will not discuss this here, since JL does not use this facility. More details about the system are given in Copestake (1996, 1997). We have logged JL's data for about three years. During most of that time, his condition was relatively stable, and he was able to enter text by operating a keyboard with one finger, with numeric keys used to select prediction menu options.
Initially we simply logged the text that was passed to the speech synthesizer for output, using a format of one utterance per line. We also logged commands to the synthesizer, such as changes in rate and volume. It became apparent that we also needed information about the timing of each utterance, so later versions of the logger included time-stamps. The most recent version of the logging software also records data from the prediction engine at the points at which a menu item was selected. For example:
16:41:07::{8,w,ill }{8,,it }{5,,to }{2,tr,ansfer }{6,the,n }{5,the,se }{2,fi,les }{5,,to }{7,n,ew }{0,co,mputer }16:42:31::
i will use it to transfer these files to my new computer
This indicates that the utterance was begun at 16:41:07 and finished at 16:42:31 (i.e., the rate was about 8.5 words per minute, although we cannot tell whether JL was typing continually or not). Some words were typed in full (I, use, my). In other cases the prediction menus were used: for instance w was typed but will was then selected from the menu (key 8), it was selected before any letters were input, and so on. Selecting then was presumably an error, corrected to these. The menu ordering was set up so that the preference order on selections was 0,9,8 etc., rather than 1,2,3, because JL was operating the keyboard with his right hand, and the higher digits were therefore more accessible. JL had the option to turn off data-logging if a conversation was particularly sensitive; however, he very rarely used this.
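The log format just described can be read back mechanically. The following is a minimal sketch of such a reader, assuming that every line has exactly the shape of the example above; the function names and the regular expressions are illustrative and are not part of the CSLI logger.

```python
import re
from datetime import datetime

# Assumed shapes, inferred from the single example line in the text:
# a selection is {menu key, letters typed, completion} and a time-stamp is HH:MM:SS::
ENTRY = re.compile(r"\{(\d),([^,}]*),([^}]*)\}")
STAMP = re.compile(r"(\d{2}:\d{2}:\d{2})::")


def parse_log_line(line):
    """Return (start, end, selections) for one logged utterance."""
    stamps = STAMP.findall(line)
    start = datetime.strptime(stamps[0], "%H:%M:%S")
    end = datetime.strptime(stamps[-1], "%H:%M:%S")
    selections = [
        {"key": int(key), "typed": typed, "word": (typed + rest).strip()}
        for key, typed, rest in ENTRY.findall(line)
    ]
    return start, end, selections


def words_per_minute(start, end, utterance):
    """Rate over the whole interval; pauses cannot be separated out."""
    seconds = (end - start).total_seconds()
    return len(utterance.split()) * 60.0 / seconds


LOG = ("16:41:07::{8,w,ill }{8,,it }{5,,to }{2,tr,ansfer }{6,the,n }"
       "{5,the,se }{2,fi,les }{5,,to }{7,n,ew }{0,co,mputer }16:42:31::")
TEXT = "i will use it to transfer these files to my new computer"

start, end, selections = parse_log_line(LOG)
rate = words_per_minute(start, end, TEXT)
```

On the example line this yields ten menu selections and 12 words in 84 seconds, which agrees with the rate of roughly 8.5 words per minute quoted above.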
As we will discuss below, this data-logging was not perfect: for instance, it would have been useful to collect data about every keystroke (although this would have been difficult to implement). However, as we will describe in the next sections, it gave us a good basis to compare prediction algorithms and also to decide which NLP techniques appeared to be the most promising avenues for research for this type of AAC.
Though prediction is only ever a small component of an AAC system, evaluating prediction algorithms is non-trivial. There are multiple dimensions on which we might measure performance of algorithms. The primary one concerns the accuracy of the prediction - this is normally expressed in terms of keystrokes saved compared to typing the full utterance. Of course, this is only an approximation to the ultimate criteria, which are saving of time and effort on the part of the AAC user. Even if we assume that all aspects of the user interface layout etc. are kept constant when comparing prediction algorithms, keystroke saving will not correlate well with speed/effort saving if the overhead of making the selection differs significantly between algorithms. Prediction is only useful in circumstances where text entry is relatively slow or where the input is somehow artificial and therefore difficult to reproduce correctly, for instance when entering lengthy commands to a computer (see, e.g., Darragh and Witten, 1992).
For the case of someone like JL with advanced ALS for whom movement is very restricted and very tiring, it is reasonable to suppose that the cognitive overhead of using a menu is relatively low compared to the effort of entering a keystroke. As we will discuss in more detail below, JL usually took the option of using the prediction menu when it gave the desired word. Therefore keystroke saving is a useful criterion and we use the following metric:
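One way to write this metric, consistent with the worked example that follows and assuming that a menu selection counts as a single keystroke, is given here. The symbols are illustrative names: K_unaided is the number of keystrokes needed to type the text in full, including spaces, and K_aided is the number needed when prediction is used.

```latex
\text{keystroke saving} \;=\; \frac{K_{\text{unaided}} - K_{\text{aided}}}{K_{\text{unaided}}} \times 100\%
```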
For example, choosing table after inputting `t' `a' would give a score of 50%, since a space is automatically output after the word.
The prediction testbed uses logged data to train a statistical prediction algorithm and then tests its performance on a fresh set of data. Following standard practice in NLP, we used a 90% / 10% split between training and test data. We should emphasize the importance of keeping the training and test data completely distinct. It is invalid to assume, for instance, that a prediction engine with a sufficiently large lexicon would cover all words in a test set and that it is therefore legitimate to acquire the word-list from the test data, since no matter how large the lexicon, there will still be unseen tokens in any reasonably sized unrestricted test set. For instance, Brown et al. (1992) report that an almost 300,000 word vocabulary contained fewer than 90% of the distinct tokens in the Brown corpus. Most prediction systems will operate with a much smaller vocabulary, so the unknown word problem is proportionally more significant.
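The separation of training and test data, and the size of the unknown word problem, can be illustrated with a short sketch. It assumes that the log is held as a list of utterances in the order they were produced and that the final tenth is held out; the text does not say how the split was drawn, so the chronological choice and both function names are assumptions made for illustration.

```python
def split_train_test(utterances, test_parts=10):
    """Hold out the final 1/test_parts of the log (at least one utterance)."""
    n_test = max(1, len(utterances) // test_parts)
    return utterances[:-n_test], utterances[-n_test:]


def unseen_token_rate(train, test):
    """Fraction of test tokens that never occur in the training data."""
    vocabulary = {word for utterance in train for word in utterance.split()}
    tokens = [word for utterance in test for word in utterance.split()]
    unseen = sum(1 for word in tokens if word not in vocabulary)
    return unseen / len(tokens)


# Invented toy log, used only to exercise the functions.
toy_log = [
    "please open the door", "thanks", "please get the mail",
    "please close the door", "thanks", "please mail the letter",
    "hall table", "please get the letter", "thanks",
    "please open the window",
]
train, test = split_train_test(toy_log)
rate = unseen_token_rate(train, test)
```

Because the vocabulary is built from the training portion alone, any word that first appears in the held-out utterances counts against the predictor, as it would for a real user.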
The basic assumptions behind the use of logged data in training and evaluating prediction algorithms are:
Clearly this is not true of all possible uses of prediction: if the system is designed to help a child with spelling or language difficulties, for example, it should correct errors rather than reinforcing them. However, for a user without language problems, prescriptivism is generally inappropriate.
When using the testbed, we make the initial assumption that we have a `perfect' user: one who always chooses the prediction menu item when it is available. Discrepancies between predicted and actual results are discussed in the next section. One major advantage in using the testbed methodology is that we can implement algorithms very quickly, without too much concern for robustness or for issues such as memory usage. A system that is to be used daily by someone with ALS has to be very reliable, even if it is only a prototype, and this greatly increases the implementation time needed.
In the prediction experiments we reported in detail in Copestake (1996, 1997), we used 26,000 words of text (around three months of JL's data). We did not use an external lexicon as a source of words, although in one experiment we did use external corpus data in order to provide part-of-speech tags for the collected word list. An algorithm based simply on word frequencies constructed from the training set and dynamically updated during processing of the test-set gave 45.3% keystroke saving with a menu size of 10. Adding the best performing adjustment for recency increased this to 46.5%. We also experimented with taking context into account using ngrams: part-of-speech bigrams based on JL's data gave 49.0% keystroke saving. However, there was a decrease in performance when using part-of-speech bigrams derived from a text corpus. Thus the text corpus did not provide a good model for the AAC data even at this very abstract level.
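The simplest of these algorithms, together with the `perfect' user assumption, can be sketched as follows. This is not the testbed code: it is a minimal illustration that assumes the menu lists the most frequent known words matching the letters typed so far, that a selection costs one keystroke and outputs the following space, and that ties are broken alphabetically. All names are hypothetical, and the recency adjustment is omitted.

```python
from collections import Counter


def predict(freqs, prefix, menu_size=10):
    """Most frequent known words starting with prefix."""
    candidates = [w for w in freqs if w.startswith(prefix)]
    candidates.sort(key=lambda w: (-freqs[w], w))  # ties: alphabetical (assumption)
    return candidates[:menu_size]


def keystrokes_for_word(freqs, word, menu_size=10):
    """Keystrokes used by a user who selects the word as soon as it is offered."""
    for typed in range(len(word)):
        if word in predict(freqs, word[:typed], menu_size):
            return typed + 1      # letters typed so far plus one menu key
    return len(word) + 1          # never offered: typed in full plus a space


def simulate(train_words, test_words, menu_size=10):
    """Percentage keystroke saving on test_words for a `perfect' user."""
    freqs = Counter(train_words)
    unaided = aided = 0
    for word in test_words:
        unaided += len(word) + 1  # each word is followed by a space
        aided += keystrokes_for_word(freqs, word, menu_size)
        freqs[word] += 1          # frequencies are updated during testing
    return 100.0 * (unaided - aided) / unaided


# Invented toy data, used only to exercise the functions.
train_words = "please get the mail please open the door".split()
test_words = "please open the mail".split()
saving = simulate(train_words, test_words, menu_size=1)
```

Since the simulation needs only a word list and a frequency table, variants such as a different menu size can be compared on the same held-out data without touching the system that the user depends on.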
Based on these evaluations, we changed the CSLI prototype to incorporate the recency metric. We did not however add the syntactic bigram technique. Although it might have produced some improvement in actual performance, the testbed experiments showed that this would not be very large, and the increase in code complexity meant that it would have been relatively costly to implement robustly and to maintain. It therefore did not seem to be the most promising place in which to expend implementation effort.
The figures given above are for theoretical performance of prediction with a `perfect' user. Clearly, even with someone like JL, who regards prediction as an essential component of an AAC system, actual results will not be as good. A user may miss a predicted choice, or inadvertently select the wrong menu item. Comparing prediction algorithms for their theoretical performance is only valid if their ranking with respect to actual performance is the same. This will not be universally true: in particular, techniques which result in changing menus are not directly comparable with systems with static menus, and systems with large lexicons which give more menu choices may be more confusing than systems with smaller lexicons.
To some extent, we can use JL's logged data to compare actual and theoretical performance of the algorithm implemented in the CSLI system. Consider the following example, which was logged using the version of the CSLI prototype that adjusted predictions based on recency:
11:55:48::{0,ta,ke }{9,tr,ay }{7,o,ff }{0,a,nd }{8,pl,ug }{1,,me }{4,w,ith }{7,,the }{8,ext,ension }11:57:53::
take tray off and plug me in with the red extension cord
Here 56 keystrokes would have been needed without prediction. JL actually took 32 keystrokes, compared with the theoretical performance for the algorithm of 26 (with the frequencies that were in effect at the time that this sentence was logged). Based on a manual evaluation of a small fraction of the logged data, this is a reasonably representative example: on average the actual keystroke saving was around 80% of the theoretical saving. Obviously this figure will be very dependent on the user and the user interface, and will also be affected by changes in the algorithm. For instance, even though the prototype's algorithm and the more complex algorithm mentioned above used the same word list, and both involved menus which could change, they are not completely comparable, since the menus are less predictable in a more context-dependent system.
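The count of actual keystrokes in this example can be recovered from the log line itself. The sketch below assumes a counting convention inferred from the figures in the text rather than stated there: a menu selection costs the letters typed plus one key, a word typed in full costs its letters plus a space, and no space is counted after the final word. It also assumes that the selections appear in utterance order with no corrections, so it would not handle a line containing a mis-selection. The function name is hypothetical.

```python
import re

# Assumed entry shape: {menu key, letters typed, completion}
ENTRY = re.compile(r"\{(\d),([^,}]*),([^}]*)\}")


def actual_keystrokes(log_line, utterance):
    """Keystrokes the user made, under the counting convention assumed above."""
    selections = [(typed, (typed + rest).strip())
                  for _key, typed, rest in ENTRY.findall(log_line)]
    words = utterance.split()
    total = 0
    pending = 0  # index of the next selection not yet matched to a word
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if pending < len(selections) and selections[pending][1] == word:
            total += len(selections[pending][0]) + 1  # typed letters + menu key
            pending += 1
        else:
            total += len(word) + (0 if is_last else 1)  # typed in full
    return total


LOG = ("11:55:48::{0,ta,ke }{9,tr,ay }{7,o,ff }{0,a,nd }{8,pl,ug }"
       "{1,,me }{4,w,ith }{7,,the }{8,ext,ension }11:57:53::")
TEXT = "take tray off and plug me in with the red extension cord"

unaided = len(TEXT)                     # keystrokes without prediction
actual = actual_keystrokes(LOG, TEXT)   # keystrokes JL made
theoretical = 26                        # figure quoted in the text, not recomputed
ratio = (unaided - actual) / (unaided - theoretical)
```

Under these assumptions the sketch reproduces the 56 and 32 keystrokes given above. Taking the theoretical figure of 26 from the text, the actual saving of 24 keystrokes is 0.8 of the theoretical saving of 30.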
In principle, it should have been possible to automatically check the actual performance against the theoretical performance, by running the prediction algorithm in the testbed on the logged data so that it was in the same state as the actual system, and comparing the `perfect user' simulation against the actual choices. In practice, too many variables crept in for this to give accurate results. The prediction order in the simulation did not always exactly match that recorded for the running system, because the testbed did not include all the data which was going into the running predictor. This occurred when JL used the predictor to compose email (he did not wish us to log his email). Since the frequencies were adjusted dynamically, and also affected by recency, the scoring of words that had been used in email messages was higher in the logged data than it was in the simulation. (The manually calculated figures above were obtained by looking at data collected from close to a point where we had downloaded the predictor frequencies, so should not be too affected by this source of error.) Furthermore, JL sometimes used cut and paste to input text, and this was not recorded. Comparing actual and theoretical performance was not our primary aim in carrying out data-logging, but if we had been attempting to evaluate user interface designs, for example, this sort of problem would have been very serious.
To summarize, based on our experiments, we see the following as the main advantages and disadvantages of using logged data to evaluate prediction algorithms:
Advantages
Disadvantages
In contrast to the formal evaluation of variations on prediction algorithms discussed in the previous section, we have also used the logged data informally, as a way of prescreening possible NLP techniques for further investigation. Designing prototypes by introspection into the nature of the sort of conversations that an AAC user might want to have has obvious flaws as a research methodology. While the ideal situation would be to construct prototypes, which could be evaluated by representative users, this can be very expensive. By using logged data from an AAC user, we can get some idea of how important particular phenomena are and whether an NLP technique might be worth applying. We have also been looking at various corpora of spontaneous speech, since the logged data is limited by the capabilities of the AAC prototype.
For instance, one possible technique for improving prediction might be to classify words by subject area so that once a topic is identified (automatically or by the user), those words can be preferred. However, when we tried to simulate this technique, using an existing manual classification of some of JL's vocabulary, we found no improvement in performance (actually there was a slight degradation). Examining the collected data more closely, it became apparent that it was relatively unusual for there to be a clear classification of sentences in terms of the concepts we were using. In the cases where there were clear topics, the sentences only used a tiny proportion of the topic-specific vocabulary, and most of the words in the sentences were not topic-specific. For instance, a conversation might be about Wimbledon, which could be classified under the topic `sport'. But the great majority of the sports vocabulary would not be used in the sentences about Wimbledon, and most of the words in a sentence about Wimbledon would not be specific to the sports vocabulary.
We therefore decided that identification of topic was not a likely route for substantial improvements in word prediction. Furthermore, this experience suggested to us that constructing topic-based templates or partial utterances was not very promising either, at least as a technique for facilitating a major proportion of the conversation. The logged data, and experience with speech corpora, suggests that when a conversation is `about something' the topic will be too specific for it to be likely that useful phrases can be constructed in advance, by someone who does not know the AAC user.
Semantic concept classification is clearly sometimes useful in AAC, for instance as a way of aiding retrieval of pre-stored messages (e.g., Langer and Hickey, 1997) or of creating customized vocabulary for particular users in special situations (e.g., for school use, when the topic of a lesson is known in advance, Sinteff, 1998). Also some situations are sufficiently important that a topic-specific set of words and phrases would be useful, even if those situations only account for a small proportion of the AAC user's time. Interacting with a doctor might be one such case. But even in this situation it appeared to us that it would be more promising to provide tools that aid the AAC user or a helper to prepare for the conversation, rather than to build specific topics into the system.
However, examination of the logged data does reveal some patterns which we believe we can usefully exploit. For instance, a large proportion of the collected data consists of requests that someone perform some action on JL's behalf. A large proportion of these requests is concerned with routine needs and can be dealt with by user-defined fixed phrases. But there are also many cases where the request concerns a non-routine action: mending something, helping with a computer problem, asking some third party about something, etc. The vocabulary concerning the topic of the request is not predictable, but there are conventional ways of making requests, which we can exploit, and also likely ways in which the dialogue will proceed. Since we only have one side of the conversation, and are lacking information about the context, there is no way of being sure what is going on, but we can assume that a sequence in the data such as:
please ask jim to come over tomorrow
thanks
indicates that the interlocutor agreed to the request. In contrast, in:
please mail the letter
hall table
the second utterance is likely to be a clarification, probably about the location of the letter. We discuss how we are attempting to exploit this sort of convention in dialogues in Copestake (1997).
One thing that becomes apparent when looking at logged data is how repetitive it can appear in tone and style. In JL's data, for example, requests for action are usually expressed as imperatives, normally preceded by please. This is, of course, also something that people interacting with AAC users and AAC users themselves comment on. We believe there is considerable scope for varying phrasing according to parameters such as politeness, without very much extra work on the part of the AAC user, again because of the high degree of conventionality.
Because of the one-sided nature of the logged data, we tried audiotaping some of JL's conversations. This did not provide us with nearly as much data as we would have liked, because at the point we carried out the taping, JL was contributing far fewer words to conversations than he had been. However, taping did make evident many aspects of the interaction that we could not have discovered from data-logging alone. One of the most significant was that it was frequently very difficult for JL to break into a multi-participant conversation, even after he had composed some text. The problem was partly the volume of the DecTalk synthesizer: although this can be set to be quite loud, changing volume requires some intervention on the part of the user, and a volume loud enough to be reliably heard over a loud conversation would be unpleasant under other circumstances. The other problem is timing: breaking into a conversation effectively requires that the timing of the interruption be very precise, which is problematic if the utterance is being made by a text-to-speech synthesizer. We hope to investigate ways in which we might alleviate these problems by using speaker identification technology in order to allow the volume and precise timing of an utterance to be controlled automatically. Another interesting observation was the extent to which JL was using his computer resources and the Internet as part of his interaction with visitors, particularly by playing music from Web sites and by showing people Web sites that he had found. A promising line of investigation would be to see if this sort of interaction could be integrated into an AAC device (see Copestake and Flickinger, 1998).
As we discussed in the introduction, the evaluations reported here are purely internal. One of the problems that we have found with work on prediction is that there is no accurate way of comparing our results with those reported by other groups, partly because of the lack of a common test corpus. Ideally, data logged from a user of an AAC device would form a component of such a corpus but confidentiality is clearly a problem. One area for future work is to see whether some of the recently distributed speech corpora (e.g., CALLHOME, CALLFRIEND) have sufficiently similar properties to AAC data that they could be used as a realistic common testbed.
In addition to the functionality of the current version of the data-logger, there are at least two facilities that we would aim to incorporate into any future system:
The techniques we have described, which are based on data logging, are, of course, only part of a full evaluation process. The collected data has proved invaluable to us as a resource in this early stage of our research on AAC, both as a way to allow us to make more informed choices as to the sort of NLP techniques which might be most promising, and as a way of evaluating slight variations on the well-established techniques of word and phrase prediction. There are other aspects that we have not discussed in detail here. For instance, our proposed cogeneration technique makes use of an extensive grammar of English, and the logged data provides us with one source of information from which to construct test suites to allow us to ensure that the grammar has adequate coverage. However, the ultimate test has to be user evaluation of a full prototype system.
Acknowledgments
We are very grateful to Greg Edwards who implemented the CSLI AAC system and the data logger described here. We would also like to thank Ai Kato, who conducted the audiotaping experiments and transcribed the results. Mistakes etc. are the responsibility of the authors. This material is based upon work supported by the National Science Foundation under grant number IRI-9612682.
References
Brown, P. F., S. A. DellaPietra, V. J. DellaPietra, J. C. Lai and R. L. Mercer (1992). `An estimate of an upper bound for the entropy of English'. Computational Linguistics, 18(1), pages 31-40.
Copestake, A. (1996). `Applying natural language processing techniques to speech prostheses'. Proceedings of the AAAI Fall Symposium on developing assistive technology for people with disabilities, MIT, Cambridge, MA.
Copestake, A. (1997). `Augmented and alternative NLP techniques for augmentative and alternative communication'. Proceedings of the ACL workshop on Natural Language Processing for Communication Aids, Madrid, pages 37-42.
Copestake, A. and D. Flickinger (1998). `Enriched language models for flexible generation in AAC systems'. Proceedings of the Technology and Persons with Disabilities Conference (CSUN-98), Los Angeles, CA.
Darragh, J. J. and I. H. Witten (1992). `The reactive keyboard'. Cambridge University Press.
Langer, S. and M. Hickey (1997). `Automatic message indexing and full text retrieval for a communication aid'. Proceedings of the ACL workshop on Natural Language Processing for Communication Aids, Madrid, pages 9-16.
Sinteff, B. (1998). `Applying vocabulary search strategies in augmentative communication: Dynavox and Dynamyte products'. Proceedings of the Technology and Persons with Disabilities Conference (CSUN-98), Los Angeles, CA.
Sparck Jones, K. and J. R. Galliers (1995). `Evaluating Natural Language Processing systems: an analysis and review'. Springer-Verlag.