Proceedings of HF 2002, Nov. 25-27, 2002, Melbourne, Australia
Aesthetic Appeal versus Usability: Implications for User Satisfaction
Gitte Lindgaard & Cathy Dudek
Human Oriented Technology Lab (HOT Lab), Carleton University
gitte_lindgaard@carleton.ca, cdudek@chat.carleton.ca
Keywords: Satisfaction, Usability, Aesthetics, Appeal
Abstract
People judge incoming sensory stimuli immediately by how pleasant or unpleasant these ‘feel’. When judging a web site seen for the first time, this judgment is based on visual appearance. At the same time, people tend to be reluctant to revise a judgment once it is made, resulting in a so-called confirmation bias. In this study we investigated the existence and the robustness of this bias by requiring subjects to complete a usability test containing serious usability problems after exploring a high- or a low-usability site. Both sites were high in aesthetic appeal. Results suggest that subjects are sensitive to different levels of usability and that they do revise their original satisfaction judgment after completing the test. They also suggest that aesthetics is judged independently of usability.
1. Introduction
When you meet a person for the first time, you know instantly whether you like her or whether she makes you feel uncomfortable. The immediacy of this decision to like or dislike is striking, but what is perhaps even more interesting is our subsequent resistance to revising that first impression, formed within milliseconds of greeting someone (LeDoux, 1996). It seems to linger over time even as we gather more evidence, some of which is likely to disconfirm our first impression. This tendency towards a ‘confirmation bias’ (Mynatt, Doherty & Tweney, 1977) has long been known in the human decision making and judgment literature, and it holds for many types of judgment, even for expert professional ones such as medical diagnosis (Eddy & Clanton, 1982; Lindgaard, 1985) as well as for social decisions made in everyday life (Anderson, 1981; Nisbett & Ross, 1980). A confirmation bias is a tendency to seek information that confirms our initial hypothesis and to ignore disconfirmatory evidence. So, if you don’t like what you see, hear, smell or feel right away, it takes a great deal more contradictory evidence for you to change your mind than if your first impression were positive or even neutral.
Traditionally, usability has concentrated on effectiveness and efficiency, even though user satisfaction is also mentioned in the ISO 9241-11 usability standard (ISO, 1997). Researchers have, however, begun to question the importance of the link between hedonic and other qualities of products. Jordan (1998), for example, investigated the feelings products evoke in their owners over time in a series of semi-structured interviews with product users who were asked to report their feelings towards specific products of their choice that were either pleasurable or unpleasurable to use. In another study, Tractinsky, Katz & Ikar (2000) investigated the link between aesthetic appeal, perceived and actual usability in a range of ATM prototype interfaces presented on a screen. They found that subjects were influenced more by aesthetics than by usability. Interfaces perceived to be high in aesthetics were seen also to be highly usable regardless of the actual usability levels, and judgments of interfaces perceived to be low in aesthetics before the usability test improved after the test in the high-usability condition. Thus, judgments were predominantly driven by the aesthetic appeal.
In the ISO 9241 standards (ISO, 1997) user satisfaction is referred to as ‘comfort and acceptability’ of product use, which emphasises attitudinal rather than experiential aspects of satisfaction. This attitudinal component of usability is thus concerned with avoiding negative feelings and is usually measured on 5-point or 7-point rating scales. It is not surprising, then, that the few satisfaction measurement instruments available to the HCI community (Chin, Diehl & Norman, 1988; Kirakowski, 1996; Kirakowski, Claridge & Whitehead, 1998) overwhelmingly focus on such attitudinal aspects of the user experience.
Our interest is in the experiential, rather than in the attitudinal, aspects of user satisfaction. Based on content analyses of the stories subjects tell us about an interactive experience with e-commerce web sites (Dudek & Lindgaard, 2002; Lindgaard & Dudek, 2002a; 2002b), our research into user satisfaction so far suggests that it is a complex construct comprising several dimensions including aesthetics and perceived usability. Details of these dimensions are reported in Lindgaard & Dudek (2001, 2002a). In one of these studies we showed that the interactive experience was very positive for two web sites that were both very high in aesthetic appeal, as reflected in subjects’ high satisfaction scores for both sites. At the same time, subjects were also sensitive to differences in usability levels, as reflected in their perceived usability scores, which were higher for the high-usability than for the low-usability web site. These results suggested that user satisfaction with e-commerce web sites might be driven predominantly by the immediate impression of aesthetic appeal.
In those early experiments, subjects were not required to complete a usability test, so their impression of usability as well as their satisfaction scores were based entirely on a reasonably brief, informal interaction with the web sites. In the research reported here, one half of the subjects completed a usability test as well, in order to address three specific issues. First, we questioned the robustness of the confirmation bias by investigating whether subjects revise their initial opinion after completing a set of tasks designed specifically to expose serious usability problems. If user satisfaction after prolonged interaction is entirely driven by the immediate impression, then satisfaction should not change as a function of exposure to severe usability problems. That is, user satisfaction should be equally as high after as before completing a usability test. This would confirm the robustness of the confirmation bias. Along similar lines, two sites that are both high in aesthetic appeal should give rise to equally satisfying experiences even though they varied in usability.
By contrast, to the extent that satisfaction arising from prolonged interactive experience is driven, at least in part, by usability, it should change after exposure to serious usability problems, and it should be lower for the low-usability than for the high-usability site. That is, satisfaction scores derived before a usability test should be higher than after the test, and they should also differ between the two sites varying in usability levels. This would suggest that the confirmation bias is not as robust as earlier research suggests.
The second question addressed here asked whether perceived usability and aesthetic appeal change as a result of subjects encountering severe usability problems. In the face of a very robust confirmation bias, subjects may be very lenient in their judgment of perceived usability even after experiencing usability problems. Thus, their initial usability judgments should remain unchanged. Likewise, one would not expect aesthetic appeal to change. If, however, subjects revise their initial judgments after a usability test, one would expect perceived usability scores to be lower after than before the usability test. If aesthetics is judged independently of usability, subjects may or may not change their mind on this dimension. If, however, the two covary, one would expect judgments of aesthetics also to change after a usability test.
The third issue concerned task demands. We asked whether subjects anticipating a usability test are more aware of usability problems than subjects who are browsing a web site without such expectations. It is feasible to argue that the former will work through the site systematically to ensure they see as many pages of the site as possible before embarking on the usability test and thereby pay distinct attention to usability problems, whereas the latter will explore less systematically and pay less attention to such problems. These differences should be reflected in perceived usability scores: the more usability problems subjects note, the lower the perceived usability scores, and the fewer usability problems they note, the higher one would expect the perceived usability scores to be. Thus, perceived usability scores should be lower for subjects expecting a usability test than for those who do not.
2. Method
Two commercial web sites were used, both of which had been shown earlier to be high in aesthetic appeal and to vary in perceived usability (Lindgaard & Dudek, 2002a). An heuristic evaluation was subsequently performed on both sites, resulting in 157 unique instances of moderate (n = 45) to severe (n = 112) usability problems in the low-usability site, and 27 problems of which 11 were moderate and seven were severe in the high-usability site. By our definition, severe problems prevent users entirely from completing a task, and moderate ones hinder user performance substantially but do not stop users completely from accomplishing their task goal. Based on a selection of these problems, eight usability tasks representing typical tasks one would expect users to perform were designed for each web site.
2.1 Design
Subjects completed one of two experimental conditions. In one they browsed the site for 10 minutes before being interviewed and filling in the WAMMI (Web site Analysis MeasureMent Inventory). The WAMMI is a standardized measurement tool comprising 20 questions using 5-point rating scales to measure the perceived usability of web sites (Kirakowski, Claridge & Whitehead, 1998). The second condition resembled the first, but these subjects also completed a set of eight usability tasks after which they were interviewed and asked to fill in the WAMMI again. The order of presentation of the WAMMI and the interview was counterbalanced to avoid serial position effects. The procedure was identical for the two web sites.
User satisfaction scores were derived from the interviews by counting all positive and negative statements made about the web site and summing them, counting repeated statements within an interview profile only once. The proportion of positive statements determined the level of satisfaction [satisfaction = positive/(positive + negative)]. A score of 0.5 meant the satisfaction level was neutral, whereas scores above 0.5 were positive, and below 0.5 were negative, and the further removed a score was from 0.5, the more extreme it was, either positive (close to 1.0) or negative (close to 0).
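The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in the study: statements were coded by hand from the interviews, and the function name `satisfaction_score`, the 'pos'/'neg' labels, and the simple text-matching rule used to detect repeated statements are all assumptions introduced here.

```python
from typing import Iterable, Optional, Tuple


def satisfaction_score(statements: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Return positive / (positive + negative) for one interview profile.

    `statements` holds (text, valence) pairs, where valence is 'pos' or 'neg'.
    Repeated statements within the profile are counted only once. Matching
    repeats by normalised text is an assumption made for this sketch; in
    practice a human coder decides whether two statements are the same.
    """
    unique = set()
    for text, valence in statements:
        if valence not in ("pos", "neg"):
            raise ValueError(f"unknown valence: {valence!r}")
        unique.add((" ".join(text.lower().split()), valence))

    positive = sum(1 for _, valence in unique if valence == "pos")
    negative = sum(1 for _, valence in unique if valence == "neg")
    total = positive + negative
    if total == 0:
        # No evaluative statements: the proportion is undefined.
        return None
    return positive / total
```

With two distinct positive statements and one negative statement the score is 2/3, that is, on the positive side of the neutral 0.5 point. A profile with no evaluative statements yields no score at all rather than a neutral one.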
2.2 Subjects
A total of 80 subjects participated. Of these, 40 were assigned at random to the first, and 40 to the second experimental condition. All were native English speakers, none had previous knowledge of usability, and none had seen the web sites before. There was an equal number of males and females, but age was not controlled for. Subjects were paid $10 or $15 for their time depending on the experimental condition to which they were assigned.
2.3 Procedure
Subjects were asked to browse the site for 10 minutes to ‘find any suitable gift to give a friend as an apology’, and were informed that this would be followed by an interview about their experiences as well as a rating scale. Subjects who were also given a usability test were told so at the outset. No questions were asked during browsing. The experimenter sat behind the subject taking notes, and the entire session was audiotaped. At the end of the 10 minutes browsing time subjects were interviewed and given the WAMMI. Subjects in condition (1) were then debriefed, paid and excused. Subjects in condition (2) were given the usability tasks, all performed in the same sequence. They were then interviewed again and asked to fill in the WAMMI before being debriefed, paid, and excused.
3. Results
Overall, the results confirmed that the two sites differed in actual usability as assessed by subjects’ performance in the usability test. On average, subjects completed 6.1 tasks successfully on the high-usability, and 3.85 on the low-usability site. Thus, the assertion that the two sites differed along the usability dimension was justified.
3.1 Confirmation bias robustness
3.1.1 User satisfaction before and after the usability test
Satisfaction scores obtained from the interviews on both web sites before and after the usability test are shown in Figure 1. These data include only scores from subjects who completed the usability tests.

Figure 1: Mean satisfaction scores for both web sites (‘Low’-usability, ‘High’-usability) before (BUT) and after (AUT) the usability test
The Figure suggests that satisfaction scores were higher before than after the usability test for both web sites. The 2 x (2) mixed-design ANOVA comparing before/after scores and web site supports this with a main effect for the before/after comparison, F(1,38) = 39.53, p < 0.001. Satisfaction scores were higher overall for the high-usability than for the low-usability site, as evidenced by a main effect for web site, F(1,38) = 5.05, p < 0.05. There was no interaction. Apparently, subjects revised their opinion after encountering serious usability problems, and site usability also influenced user satisfaction.
3.1.2 Perceived usability before and after the usability test

Figure 2: Mean perceived usability scores for two web sites (‘Low’-usability, ‘High’-usability) before (BUT) and after (AUT) the usability test
Perceived usability, measured by the mean WAMMI scores, is shown in Figure 2. As the Figure shows, perceived usability mirrored the satisfaction scores by differing between web sites (F(1,38) = 15.95, p < 0.001) and in the before/after comparison (F(1,38) = 21.32, p < 0.001). There was no interaction.
3.1.3 Aesthetics judgments before and after the usability test

Figure 3: Mean aesthetics scores for the two web sites (‘Low’-usability, ‘High’-usability) before (BUT) and after (AUT) the usability test
To calculate the before/after score on the aesthetics dimension, all statements relating to aesthetics were analysed separately. Aesthetics statements were those that referred to what the user was seeing on the page, for example “the graphics are great” and “it was beautiful”. As was done for the satisfaction scores, the proportion of positive aesthetics statements was calculated. The outcome of this analysis is shown in Figure 3.
The Figure suggests that aesthetics scores remained unchanged after the usability test. Unfortunately, it was not possible to analyse the data using an ANOVA because many subjects made no aesthetics statements in the second interview. Subjects who failed to make any aesthetics statements in the second interview were excluded from the calculation, resulting in unequal sample sizes. However, the available data enabled us to perform a t-test for related samples, for each site, comparing the before and after conditions. Neither of these was significant (low-usability site, t(13) = -0.13, p > 0.05; high-usability site t(9) = 0.60, p > 0.05). A t-test for independent samples, comparing aesthetics scores before the test, showed that there was no difference between the sites (t(38) = 1.93, p > 0.05). Clearly, more data are required to test the assertion that aesthetics judgments do not change when subjects encounter serious usability problems, but the above results suggest that this is the case.
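The related-samples comparison described above, including the exclusion of subjects without a second aesthetics score, can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the authors' analysis: the function name `paired_t`, the use of `None` to mark a missing score, and the direction of the difference (before minus after) are choices made here for illustration.

```python
import math
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def paired_t(before: Sequence[Optional[float]],
             after: Sequence[Optional[float]]) -> Tuple[float, int]:
    """Related-samples t statistic and its degrees of freedom.

    Subjects with a missing score (None) in either interview are dropped,
    so the degrees of freedom are (number of complete pairs) - 1.
    """
    if len(before) != len(after):
        raise ValueError("before and after must list the same subjects")

    # Keep only subjects who have a score in both interviews.
    diffs = [b - a for b, a in zip(before, after)
             if b is not None and a is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")

    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Because the degrees of freedom equal the number of complete pairs minus one, the reported t(13) and t(9) correspond to 14 and 10 subjects with aesthetics scores in both interviews.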
3.2 Do task demands determine exploration patterns?
3.2.1 Satisfaction scores

Figure 4: Mean satisfaction scores for both web sites (‘Low’-usability, ‘High’-usability) for ‘browsing only’ (BRO) and ‘before usability test’ (BUT) subject-groups
Satisfaction scores were derived by pooling positive and negative interview statements from the first interview and calculating the proportion of positive statements. These data, including both web sites and both subject-groups (usability test to come/no usability test to come), are shown in Figure 4.
The Figure suggests that the satisfaction scores were higher for the high-usability than for the low-usability site. A 2 x 2 ANOVA performed for subject-group (‘browsing only’ and ‘before usability test’) and web site confirmed this observation with a main effect for web site, F(1,76) = 6.40, p < 0.05. Thus, usability did influence satisfaction scores when none of the subjects had as yet performed the usability test. The main effect for subject-group was marginally significant, F(1,76) = 3.89, p < 0.052, suggesting that task demands may have influenced awareness of usability problems. There was no interaction.
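The error degrees of freedom reported for the ANOVAs can be checked against the sample sizes given in the Method section. The sketch below uses the standard formulas for a between-subjects factorial design and for a mixed design with one repeated factor; the function names are introduced here for illustration and are not part of the original analysis.

```python
from typing import Tuple


def between_error_df(n_subjects: int, n_groups: int) -> int:
    """Error df for a fully between-subjects design: N minus number of cells."""
    return n_subjects - n_groups


def mixed_error_df(n_subjects: int, n_groups: int,
                   n_repeated_levels: int) -> Tuple[int, int]:
    """Error df for a mixed design with one repeated factor.

    Returns (between-subjects error df, within-subjects error df).
    """
    between = n_subjects - n_groups
    within = between * (n_repeated_levels - 1)
    return between, within


# 80 subjects in a 2 (subject-group) x 2 (web site) design: 4 cells.
browsing_vs_anticipating = between_error_df(80, 4)

# 40 usability-test subjects, 2 web sites, scored before and after the test.
before_vs_after = mixed_error_df(40, 2, 2)
```

The first value is 76, matching F(1,76) in the 2 x 2 analyses of all 80 subjects; the second is (38, 38), matching F(1,38) for both the web site and the before/after effects in the 2 x (2) mixed design.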
3.2.2 Perceived usability scores

Figure 5: Mean perceived usability scores for the two web sites (‘Low’-usability, ‘High’-usability) for ‘browsing only’ (BRO) and the ‘before usability test’ (BUT) subject-groups
If task demands determine awareness of usability, leading subjects who anticipate a usability test to explore the site more systematically than subjects who do not, the former should detect more usability problems than the latter. This should, in turn, result in lower perceived usability (measured by the WAMMI) following the exploration of the site. Perceived usability scores are shown in Figure 5 for both web sites and for subjects who did and did not anticipate a usability test.
Figure 5 shows that the WAMMI scores mirrored the satisfaction scores shown above in Figure 4. The 2 x 2 ANOVA for web site and subject-group resulted in a main effect for site, F(1,76) = 42.17, p < 0.001, with scores being higher for the high-usability than for the low-usability site. The main effect for subject-group did not approach significance, and there was no interaction.
4. Discussion
The web sites employed in this study had earlier been evaluated as being aesthetically very appealing (Lindgaard & Dudek, 2002a), so we assumed the initial impression would be very positive. Scores on the aesthetics dimension supported this contention, with scores for both sites lying well above the neutral 0.5 point. Those same results also suggest that aesthetic appeal did not change after subjects encountered severe usability problems, although more data need to be collected to strengthen this evidence. Taken on their own, these data would support the existence of a confirmation bias.
Contrary to the aesthetics judgments, the satisfaction scores were lower for the low-usability than for the high-usability site, suggesting that usability does impact overall satisfaction even when subjects are not anticipating a usability test. This effect of usability on satisfaction was exacerbated in the comparison of satisfaction scores obtained before and after the usability test; after the test, scores from both web sites dropped well below the 0.5 neutral point, indicating negative satisfaction. Experience, it would seem, did encourage subjects to revise their original judgment of satisfaction, which disconfirms the presence of a confirmation bias.
Perceived usability scores reflected the high and low usability levels of the two sites. The mean score on the low-usability site was below neutral even before the usability test. This suggests that subjects were quite aware of usability problems throughout the experiment. However, these had a greater impact on perceived usability judgments after the test than before. This is contrary to Tractinsky and his colleagues (2000), who found that highly aesthetic interfaces were perceived to be high in usability regardless of actual usability levels. We believe that the difference may lie in the seriousness of the usability problems in the two studies. In Tractinsky et al.’s (2000) study, users were not prevented from completing their tasks, whereas in ours they were.
Taken together, these results suggest that satisfaction is at least in part driven by perceived usability, and that the confirmation bias, if it exists in data such as these, may be less pronounced than early research led us to expect. However, while usability affects satisfaction, subjects appear to assess aesthetic appeal quite independently of usability, as these scores did not change with usability experience. Given that both the sites investigated here were high in aesthetics, it was not possible to test whether low levels of aesthetics would improve after encountering a high-usability interface, as in Tractinsky et al.’s (2000) findings. Clearly, this needs to be tested. We are currently isolating web sites that are low in aesthetics and variable in usability to test this possibility.
Regarding the question as to whether or not task requirements affect awareness of usability problems, the above data provide only very weak confirmatory evidence. It is thus unlikely that usability problems receive a greater share of subjects’ attention while exploring a site as a function of task demands, although this should be tested in a range of different web sites, tasks, and with different subjects.
In conclusion, it appears that satisfaction is partly usability-driven and that aesthetic appeal is judged independently from perceived usability, at least on occasion. Since this does not quite match other researchers’ results, it is important also to test web sites that are low in aesthetics.
5. References
· Anderson, N.H. (1981). Foundations of information integration theory, Academic Press, London.
· Chin, J. P., Diehl, A. D., & Norman, K. L. (1988). Development of an instrument measuring user satisfaction of the human-computer interface, In Proceedings of the CHI’88 Conference on Human Factors in Computing Systems, ACM, New York.
· Dudek, C. & Lindgaard, G. (2002). Measuring User Satisfaction on the Web: The Stories People Tell, Proceedings, International Conference on Design and Emotion, Loughborough, UK, July 1-3.
· Eddy, D.M. & Clanton, C.H. (1982). The art of diagnosis: Solving the clinicopathological exercise, New England Journal of Medicine, 306, 1263-1268.
· ISO (1997). ISO/DIS 9241-11. Ergonomic requirements for office work with visual display terminals (VDTs): Guidance on usability.
· Jordan, P. W., (1998). Human factors for pleasure in product use, Applied Ergonomics, 29, 1, p.25-33.
· Kirakowski, J., (1996). The software usability measurement inventory: Background and usage, in P. Jordan, B. Thomas & B. Weerdmeester (Eds), Usability evaluation in industry, Taylor & Francis, London.
· Kirakowski, J., Claridge, N. & Whitehead, R., (1998). Human centred measures of success in web site design, Proceedings Our Global Community, http://www.research.att.com/conf/hfweb/proceedings/
· LeDoux, J. (1996). The Emotional Brain: The Mysterious Underpinnings of Emotional Life, Simon & Schuster, New York.
· Lindgaard, G. (1985). Weighting of individuating information elements and base rate in a nursing decision making task involving non-diagnostic case information, Unpublished Masters Thesis, Department of Psychology, Monash University, Clayton, Australia.
· Lindgaard, G. & Dudek, C. (2001). Is a great experience merely satisfying and does appeal equate high subjective usability?, Proceedings, Affective Human Factors Design, Singapore, 27-29 June, pp. 373.
· Lindgaard, G. & Dudek, C. (2002a). What is this evasive beast we call user satisfaction?, Interacting with Computers, in press.
· Lindgaard, G. & Dudek, C. (2002b). User Satisfaction, Aesthetics and Usability: Beyond Reductionism, Proceedings, International Federation of Information Processing, (IFIP2002), Montreal, 25-30 August.
· Mynatt, C.R., Doherty, M.E. & Tweney, R.D. (1977). Confirmation bias in a simulated research environment: An experimental study of scientific inference, Quarterly Journal of Experimental Psychology, 29, 85-95.
· Nisbett, R.E., & Ross, L. (1980). Human Inference: Strategies and Shortcomings of Social Judgment, Prentice-Hall, Englewood Cliffs, New Jersey.
· Tractinsky, N., Katz, A. & Ikar, D. (2000). What is Beautiful is Usable, Interacting with Computers, 13, p.127-145.
6. Acknowledgments
We would like to thank Jurek Kirakowski for allowing us to use the WAMMI scale.