Adjusting to a Messy World: Donald Broadbent Lecture 2016

Just posted at http://www.HFEinPractice.wordpress.com. Human Factors and Ergonomics in Practice: Adjusting to a Messy World: Donald Broadbent Lecture 2016 with Claire Williams at CIEHF Ergonomics and Human Factors 2016.

Human Factors and Ergonomics in Practice

On 21 April 2016, we co-presented the Donald Broadbent lecture at Ergonomics and Human Factors 2016 (Daventry, UK) summarising some of the themes in ‘Human Factors and Ergonomics in Practice’. In this post, we summarise aspects of the book, slide by slide, with a grateful acknowledgement to every author who collaborated.


Slide 1 – Welcome

This lecture is about the messy world in which we live, and how this affects human factors and ergonomics in practice. The lecture material is derived from a book that we have edited called Human Factors and Ergonomics in Practice: Improving Performance and Wellbeing in the Real World.



Slide 2 – How do practitioners really work?

We met about ten years ago, at this conference, when we were presenting papers on issues of practice. We had both been practitioners for about ten years at that time, and were beginning to realise HF/E professionals did not talk…



Never/zero thinking


“God save us from people who mean well.”
― Vikram Seth, A Suitable Boy

There has been much talk in recent years about ‘never events’ and ‘zero harm’, similar to talk in the safety community about ‘zero accidents’. ‘Never events’, as defined by NHS England, are “serious incidents that are wholly preventable as guidance or safety recommendations that provide strong systemic protective barriers are available at a national level and should have been implemented by all healthcare providers”. The zero accident vision, on the other hand, is a philosophy that states that nobody should be injured due to an accident, that all accidents can be prevented (OSHA). It sounds obvious: no one would want an accident. And we all wish that serious harm would not result from accidents. But as expressed and implemented top-down, never/zero is problematic for many reasons. In this post, I shall outline just a few, as I see them.

1. Never/zero is not SMART

We all know that objectives should be SMART </sarcasm>:

  • Specific – target a specific area for improvement.
  • Measurable – quantify or at least suggest an indicator of progress.
  • Assignable – specify who will do it.
  • Realistic – state what results can realistically be achieved, given available resources.
  • Time-related – specify when the result(s) can be achieved. (Wikipedia)

Never/zero fails on more than one SMART criterion. You could say ‘harm’ and ‘accidents’ are specific. ‘Never events’ are so specific that there are lists – long and ever-changing lists (in NHS England, apparently beginning at eight, growing to 25, and recently shrinking to 14, with an original target of two or three). This in itself may become a problem. Someone can always think of another. So what about the ones not on the list? When thinking about zero harm, what about the harm that professionals might need to do in the short term to improve outcomes in the longer term?

You could say ‘harm’ or ‘accidents’ are measurable, and that never/zero is the target. There are of course problems with measures that become goals. One problem is expressed in Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” A measure-as-goal ceases to be a good measure for a variety of reasons, explained elsewhere, but (among other factors) targets encourage blame and bullying, distort reporting and encourage under-reporting, and sub-optimise the system, introducing competition and conflict within the system. There is much evidence for each of these claims. Even if never/zero is seen by some as a way of thinking, it is inevitably treated as a numerical goal, and inevitably generates a bureaucratic burden.

As for assignability, well, you could assign the never/zero goal to a safety/quality department, or the CEO, or the front-line staff…or everyone (but is that really assigning?). But what are we assigning exactly? Are we assigning not having an accident to individuals, or never/zero to the organisation as a whole? Or perhaps the putting in place of specified safeguards? (And if so, can they always be implemented as specified?) What are the consequences for those to whom never/zero is assigned when an accident does occur, aside from the immediate physical and emotional consequences (see Point 6)?

By now we can see that never/zero is not realistic given available resources (it has probably never been achieved in any safety-related industry), and probably not realistic given any resources, unless all activity were to stop (e.g. no flying, no surgical procedures). But then other harms result, as we saw following 9/11 with increased road deaths associated with reduced flying. If never/zero is unrealistic, then the time factor is neither here nor there, but knowing that it is unrealistic, people usually avoid specifying when never/zero must be achieved. And if they do, it is demotivating when it does not happen (see Point 8 below).

2. Never/zero is unachievable

This is obvious from the above but it is worth repeating because it is not all that obvious to those removed from the front-line. There will never be never. There are, at present, several ‘never events’ a week in English hospitals. The chance of zero (by <no time>) is zero. It is a dream, a wish. For some, it is a utopia, but perhaps it is lost on those people that utopia comes from the Greek: οὐ (“not”) and τόπος (“place”) and means “no-place”. Never/zero is nowhere. In no place does it exist.

3. Never/zero is avoidant

Leaving aside the counterfactual inherent in never/zero definitions, never/zero focuses our attention on an anti-goal (harm, accidents, ‘avoidable deaths’). We may wish for a never/zero utopia, but with a focus on anti-goals the strategy obviously becomes avoidance. The anti-goal itself gives no information on how to go about this, and a focus on avoidance may, paradoxically, lead you into the path of another anti-goal as you run up against another constraint or complication with a limited focus of attention.

There are many potential ways to avoid an anti-goal, which may take you in slightly different directions and perhaps toward different things, which may or may not be desirable. In air traffic control, controllers do not train primarily by thinking of all the things not to do, and do not work by practising avoiding all the things that should be avoided. The focus of training is to learn to think about what to do (goal), and how to do it (strategy and tactics). It is well known, for instance, that thinking of a flight level (e.g. FL270 – 27,000ft) that is occupied by another aircraft or otherwise unavailable can lead you to issue that very flight level in an instruction. Thoughts lead to actions, even thoughts about what not to do. To part-quote Gandhi: “Your thoughts become your words, Your words become your actions”. It does not necessarily follow that focusing on not doing something will result in that thing not being done.

4. Never/zero is someone else’s agenda

No-one wants to have an accident, by definition. If they did, it wouldn’t be an accident. But the sum of individual wishes does not equal consensus on an agenda. Staff have usually not come together and decided on a never/zero agenda. It is usually decided from another place.

There are a variety of goals, there are complications inherent in every goal, and there are difficulties in balancing conflicting goals, especially in real-time, and at the sharp-end of operations. Compromises and trade-offs have to be made, strategically and tactically. None of these can be simplified to never/zero.

5. Never/zero ignores ‘always conditions’

All human work activity is characterised by patterns of interactions between system elements (people, tools, machinery, software, materials, procedures, and so on). These patterns of interactions achieve some purpose under certain conditions in a particular environment over a particular period of time. Most interactions involving human agency are intentional but some are not, or else the consequences are not intended. At the sharp-end, in the minutes or seconds of an adverse event as it unfolds, things do not always go as planned or intended. But nobody ever intended for things to go wrong.

We tend to use labels such as ‘human error’ (and various synonyms) for these sorts of system interactions, but there is nearly always more to it than just a human. For instance, there may be confusing and incompatible interfaces, similar or hard-to-read labels, unserviceable equipment, missing tools, time pressure, a lack of staff, fatiguing hours of work, high levels of stress, variable levels of competence, different professional cultures, and so on. In other words, operating conditions are nearly always degraded. We ask for never/zero, and yet we ask for this in degraded ‘always conditions’. Perhaps a new vision of ‘never conditions’ (never degraded) or ‘always conditions’ (always optimal) would focus the minds of policy-makers closer to home, since it would bring the trade-offs and compromises closer to their own doorstep.

It makes sense to detect and understand patterns in unwanted events, and to examine, test and implement ways to prevent such events (the basic idea behind never events), with the field experts who do the work. The problem comes with a never/zero expression and all of the implications of that.

6. Never/zero leads to blaming and shaming

It is inevitable. As soon as you label something ‘never/zero’ – as soon as you specify never/zero outcomes that are closely tied in time or space to front-line professionals – those professionals will be blamed and shamed, either directly or indirectly, by individuals, teams, the organisation, the judiciary, the media, or the public. The shame may be systematised; someone will have the bright idea to publish de-contextualised data, create a league table of never/zero failures, ‘out’ individuals, etc. The associated unintended consequences of these sorts of interventions are now well-known. So we have to acknowledge that simultaneous talk of never/zero and ‘just culture’ is naive at best. It is at odds with our understanding of systems thinking, human factors and social science. This understanding is lacking among the public, and this is sadly evident in the language of the media and, for instance, the Patients’ Association’s latest press release, which attached terms such as “disgrace”, “utter carelessness”, “unforgivable” to never events. Never/zero adds to the psychology of fear in organisations (see here for a good overview). Nobody goes to work to have an accident, but never/zero treats people as if they do.

7. Never/zero makes safety language even more negative

These emotive words illustrate how words matter, especially when lives are involved. Never/zero adds to an already negative safety nomenclature, which limits our thinking about work, and our ability to learn. Inevitably, this language, even if intended in a technical sense, is used in the media and judiciary in a very different sense: ‘human error’ is used, then abused. Error becomes inattention. Inattention becomes carelessness. Carelessness becomes recklessness. Recklessness becomes negligence. Negligence becomes gross negligence. Gross negligence becomes manslaughter. If that sounds dramatic, this is more or less the semantic sequence that has ensnared Spanish train driver Francisco José Garzón Amo, who – over two years on – is still facing 80 charges of manslaughter by professional recklessness after the accident at Santiago de Compostela, in July 2013.

8. Never/zero cultivates cynicism

It is obvious to those on the front-line of services that never/zero is unachievable, and sadly it inspires cynicism. There are probably a few reasons for this. Aside from ignorance of ‘always conditions’ (Point 5), it illustrates a profound misunderstanding of human motivation. Never/zero is the worst kind of safety poster message (along with ‘Safety is our primary goal’), not only because it is unrealistic or unachievable, but because it assumes that people’s hearts and minds are not in the job, so they need to be reminded to ‘be careful’. Yet any ‘accident’ would almost inevitably harm the front-line workers who were there, at least emotionally, and at least for a time (which is why some organisations have implemented critical incident stress management, or CISM).

I know an organisation that set a zero/never goal for a certain type of safety incident. It was widely publicised, and the incident occurred in the first few weeks. So then what? Is the goal null and void, or do we reset the clock? Never/zero can confirm what front-line staff always knew, that never/zero is unachievable (Point 2).

9. Never/zero will probably lead to burnout

It’s tiring, chasing rainbows. And because never/zero is unachievable, because it is a negative, because it cultivates blame, shame and cynicism, because it is someone else’s agenda, it is more likely to lead to burnout of professionals. If not the chronic stress variant, then the burning out of one’s capacity, willingness and motivation to take the goal seriously and to pursue the goal any more. Try never/zero thinking as a public health practitioner (they already did). Burnout is inevitable.

What then for safety? Is safety just about never/zero? And if never/zero is unachievable, then is safety worth pursuing at all? There is precious little enthusiasm for traditional safety management (outside of those whose salary depends on it), so is it wise to extinguish the flame altogether with a  never/zero blanket?

10. Never/zero does not equal good safety

What’s the difference between a near miss and a mid-air collision? A little piece of blue sky, and there’s a lot of blue sky out there. So, what if an organisation has lots of near misses but zero collisions? Never/zero focuses on outcomes that can be counted and measured instead of the messy ‘always conditions’ that shape performance, but cannot be measured. Never/zero, then, is a trade-off after all, but a blunt-end trade-off. Because it is easier to set a never/zero goal than to understand how things really work.

If not never/zero, then what?

No-one wants an accident or never event. That’s obvious. It’s not a useful goal though, and it’s not a useful way of thinking either. Never/zero is the stuff of never-never land. You can’t swear off accidents.

There are alternative ways of thinking. There is of course harm reduction, long preferred in public health. There is as low as reasonably achievable (ALARA) or practicable (ALARP) in safety-critical industries where there are major accident hazards.

And then there is Safety-II and resilience (e.g. resilient healthcare). Rather than thinking only about counterfactuals and seeking only to prevent things from going wrong, Safety-II involves thinking about work-as-actually-done, and how to ensure that things go right. This means we have to ask “what is right for us (at this time)?”, i.e. what matters to us and what are our goals? Goals, especially when not imposed externally, promote attraction instead of simply avoidance, and imply trade-offs, since goals are obviously not all compatible. A focus on goals means that we must think about effectiveness, which includes safety (safe operations) among other things such as demand, capacity, flow, sustainability, and so on. Focus on a goal makes us think of ways toward the goal, not just ways to avoid an anti-goal.

So perhaps instead of a never/zero focus, we should think of goals that we would like to achieve, the conditions and opportunities that are necessary to achieve those goals, and the assets that we have and that may help us to achieve the goals. ‘Always’ is probably as unachievable as ‘never’, but we can always try, knowing that we will not always achieve.


Alarm design: From nuclear power to WebOps

Me, myself and TMI

Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.

But the problem was never solved. A familiar pattern has resurfaced many times in the 35 years post-TMI in several industries, from oil and gas (Texaco Milford Haven, 1994) to aviation (Qantas Flight 32, 2010). And now, WebOps is starting to see similar patterns emerging, as e-commerce experiences its own ‘disasters’. In web engineering and operations, things don’t explode, leak, melt down, crash, derail, or sink in the same sense that they do in nuclear power or transportation, but the business implications of a loss of availability of systems that support the stock market or major e-commerce sites are enormous.  As Allspaw (in press) noted, “outages or degraded performance can occur which affect business continuity at a cost of millions of dollars“.

In such events, the alarm scenario is likely to be highly variable between companies, because alarm design is not subject to the same kind of formal process as is more commonplace in (other) safety-critical industries, which often have alarm philosophies and design processes and criteria. Unlike system users in the nuclear and aviation domains, in web engineering and operations, the users are also designers (Allspaw, in press); they can design their own alarms. This is a double-edged sword. As systems become increasingly automated and complicated, the number of monitoring points and associated alarms tends to proliferate. And ‘common sense’ design solutions often spell trouble. At this point, it’s worth going back to basics.

Alarm design 101

First, what are alarms for? The purpose of alarms is to direct the user’s attention towards significant aspects of the operation or equipment that require timely attention. This is important to keep in mind. Second, what does a good alarm look like? The Engineering Equipment and Materials Users Association (EEMUA) (1999) summarise the characteristics of a good alarm as follows:

  • Relevant – not spurious or of low operational value.
  • Unique – not duplicating another alarm.
  • Timely – not long before any response is required or too late to do anything.
  • Prioritised – indicating the importance of the operator dealing with the problem.
  • Understandable – having a message that is clear and easy to understand.
  • Diagnostic – identifying the problem that has occurred.
  • Advisory – indicative of the action to be taken.
  • Focusing – drawing attention to the most important issues.

These characteristics are not always evident in alarm systems. Even when individual alarms may seem ‘well-designed’, they may not work in the context of the system as a whole and the users’ activities.
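For readers in web engineering and operations who build and tune their own alerting, here is a minimal sketch of how a few of these characteristics – unique, prioritised, diagnostic, advisory – might be encoded in a home-grown alarm list. It is illustrative only: the class names, fields and priority levels are assumptions of mine, not taken from EEMUA (1999) or from any particular monitoring tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Priority(IntEnum):
    # Lower value = more urgent (prioritised).
    CRITICAL = 1
    HIGH = 2
    LOW = 3


@dataclass
class Alarm:
    source: str        # which subsystem raised it
    condition: str     # what is wrong (diagnostic)
    advice: str        # what to do about it (advisory)
    priority: Priority
    raised_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    @property
    def dedup_key(self) -> tuple:
        # Alarms with the same source and condition are treated as one
        # alarm (unique - not duplicating another alarm).
        return (self.source, self.condition)


class AlarmList:
    """A toy alarm list: drops duplicates and orders entries by priority,
    then by time raised."""

    def __init__(self) -> None:
        self._active = {}

    def raise_alarm(self, alarm: Alarm) -> bool:
        """Return True if added, False if it duplicated an active alarm."""
        if alarm.dedup_key in self._active:
            return False
        self._active[alarm.dedup_key] = alarm
        return True

    def display_order(self) -> list:
        # Prioritised first, then chronological.
        return sorted(self._active.values(),
                      key=lambda a: (a.priority, a.raised_at))


# Example: the second, duplicate alarm is not added again.
alarms = AlarmList()
alarms.raise_alarm(Alarm("web", "error rate > 5%", "check recent deploys",
                         Priority.CRITICAL))
alarms.raise_alarm(Alarm("web", "error rate > 5%", "check recent deploys",
                         Priority.CRITICAL))
print([a.condition for a in alarms.display_order()])
```

Even a small sketch like this surfaces design questions – what counts as ‘the same’ alarm, how many priority levels are really needed – which is exactly the kind of discussion an alarm philosophy should capture.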

This post raises a number of questions for consideration in the design of alarm systems, framed in a model of alarm-handling activity. The questions may help in the development of an alarm philosophy (one of the first steps in alarm management), or in discussion of an existing system. The principles were originally derived from evaluations of two different control and monitoring systems for two ATC centres (see Shorrock and Scaife, 2001; Shorrock et al., 2002, p. 178) – environments which share much in common with web engineering and operations. The resultant principles are included in this article as questions for readers as designers and co-designers, structured around seven alarm-handling activities (Observe, Accept, Analyse, Investigate, Correct, Monitor, and Co-ordinate). This is illustrated and outlined below. In reality, alarm handling is not so linear; there are several feedback loops. But the general activities work for a discussion of some of the detailed design implications.


Model of alarm initiated activities (adapted from Stanton, 1994).

Design questions are raised for each stage of alarm handling. The questions may be useful in discussions involving users, designers and other relevant stakeholders. They may help to inform an alarm philosophy or an informal exploration of an alarm system from the viewpoint of user activity. In most cases, the questions are applicable primarily in one stage of alarm handling, but also have a bearing on other stages, depending on the system in question. It should be possible to answer ‘Yes’ to most questions that are applicable, but a discussion around the questions is a useful exercise, and may prompt the need for more HF/E design expertise.


Observation is the detection of an abnormal condition or state within the system (i.e., a raised alarm). At this stage, care must be taken to ensure that coding methods (colour and flash/blink, in particular) support alarm monitoring and searching. Excessive use of highly saturated colours and blinking can de-sensitise the user and reduce the attention-getting value of alarms. Any use of auditory alarms should further support observation without causing frustration due to the need to accept alarms in order to silence the auditory alert, which can change the ‘alarm handling’ task to an ‘alarm silencing’ task.

1. Is the purpose and relevance of each alarm clear to the user?
2. Do alarms signal the need for action?
3. Are alarms presented in chronological order, and recorded in a log (e.g. time stamped) in the same order?
4. Are alarms relevant and worthy of attention in all the operating conditions and equipment states?
5. Can alarms be detected rapidly in all operating (including environmental) conditions?
6. Is it possible to distinguish alarms immediately (i.e., different alarms, different operators, alarm priority)?
7. Is the rate at which alarm lists are populated manageable by the user(s)?
8. Do auditory alarms contain enough information for observation and initial analysis, and no more?
9. Are alarms designed to avoid annoyance or startle?
10. Does an indication of the alarm remain until the user is aware of the condition?
11. Does the user have control over automatically updated information, so that information important to them at any specific time does not disappear from view?
12. Is it possible to switch off an auditory alarm independent of acceptance, while ensuring that it repeats after an appropriate period if the problem is not resolved?
13. Is failure of an element of the alarm system made obvious to the user?


Acceptance is the act of acknowledging the receipt and awareness of an alarm. At this stage, user acceptance should be reflected in other elements of the system that are providing alarm information. Alarm systems should aim to reduce user workload to manageable levels; excessive demands for acknowledgement increase workload and unwanted interactions. For instance, careful consideration is required to determine whether cleared alarms really need to be acknowledged. Group acknowledgement of several alarms (e.g. using click-and-drag or a Shift key) may lead to unrelated alarms being masked in a block of related alarms. Single acknowledgement of each alarm, however, can increase workload and frustration, and an efficiency-thoroughness trade-off can lead to alarms being acknowledged unintentionally as task demands increase. It can be preferable to allow group acknowledgement of alarms belonging to the same system (see the sketch after the questions below).

14. Has the number of alarms that require acceptance been reduced as far as is practicable?
15. Is multiple selection of alarm entries in alarm lists designed to avoid unintended selection?
16. Is it possible to view the first unaccepted alarm with a minimum of action?
17. In multi-user systems, is only one user able to accept and/or clear alarms displayed at multiple workstations?
18. Is it only possible to accept an alarm from where sufficient alarm information is available?
19. Is it possible to accept alarms with a minimum of action (e.g., double click), from the alarm list or mimic?
20. Is alarm acceptance reflected by a change on the visual display (e.g. visual marker and the cancellation of attention-getting mechanisms), which prevails until the system state changes?
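As a purely illustrative sketch of the compromise described above (the function and field names are my assumptions, not from the original principles), a home-grown tool might permit block acknowledgement only when the selected alarms belong to the same system, so that an unrelated alarm cannot be masked inside a group acknowledgement (question 15):

```python
from typing import Iterable, List


def acknowledge_group(selected: Iterable[dict], user: str) -> List[dict]:
    """Acknowledge a user-selected block of alarms, but only if they all
    come from the same system. Each alarm here is simply a dict with a
    'system' key - purely illustrative."""
    selected = list(selected)
    systems = {alarm["system"] for alarm in selected}
    if len(systems) > 1:
        # Refuse the block operation; fall back to single acknowledgement.
        raise ValueError("Selected alarms span several systems; "
                         "acknowledge them individually.")
    for alarm in selected:
        alarm["accepted_by"] = user
    return selected
```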


Analysis is the assessment of the alarm within the task and system context, leading to the prioritisation of that alarm. Alarm lists can be problematic, but, if properly designed, they can support the user’s preference for serial fault or issue management. Effective prioritisation of alarm list entries can help users at this stage. Single ‘all alarm’ lists can make it difficult to handle alarms by shifting the processing burden to the user. However, a limited number of separate alarm lists (e.g., by system, function, priority, acknowledgement, etc.) can help users to decide whether to ignore, monitor, correct or investigate the alarm (a small sketch of filtering and shelving follows the questions below).

21. Does alarm presentation, including conspicuity, reflect alarm priority with respect to the severity of consequences of delay in recognising the problem?
22. When the number of alarms is large, is there a means to filter the alarm display by appropriate means (e.g. sub-system or priority)?
23. Are users able to suppress or shelve certain alarms according to system mode and state, and see which alarms have been suppressed or shelved? Are there means to document the reason for suppression or shelving?
24. Are users prevented from changing alarm priorities?
25. Does the highest priority signal always over-ride, automatically?
26. Is the coding strategy (colour, shape, blinking/flashing, etc) the same for all display elements?
27. Are users given the means to recall the position of a particular alarm (e.g. periodic divider lines)?
28. Is alarm information (terms, abbreviations, message structure, etc) familiar to users and consistent when applied to alarm lists, mimics and message/event logs?
29. Is the number of coding techniques at the required minimum? (Dual coding [e.g., symbols and colours] may be needed to indicate alarm status and improve analysis.)
30. Can alarm information be read easily from the normal operating position?
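Again purely as a sketch (the field names and priority scheme are illustrative assumptions, not from any particular tool), filtering by sub-system or priority and shelving with a documented reason – questions 21 to 23 – might look something like this in a simple WebOps setting:

```python
from datetime import datetime, timezone

# A toy alarm list: each alarm is a dict with 'system', 'priority'
# (1 = highest) and 'text'.
alarms = [
    {"system": "db", "priority": 1, "text": "replica lag > 60 s"},
    {"system": "web", "priority": 3, "text": "disk 80% full"},
    {"system": "web", "priority": 1, "text": "error rate > 5%"},
]

shelved = []  # shelved alarms stay visible here, with a reason (question 23)


def filter_alarms(alarms, system=None, max_priority=None):
    """Filter the displayed list by sub-system and/or priority (question 22),
    showing the highest priority first (question 21)."""
    shown = alarms
    if system is not None:
        shown = [a for a in shown if a["system"] == system]
    if max_priority is not None:
        shown = [a for a in shown if a["priority"] <= max_priority]
    return sorted(shown, key=lambda a: a["priority"])


def shelve(alarm, reason, user):
    """Shelve an alarm, recording who shelved it and why, so that shelved
    alarms can be reviewed rather than silently lost (question 23)."""
    shelved.append({"alarm": alarm, "reason": reason, "by": user,
                    "at": datetime.now(timezone.utc)})
    alarms.remove(alarm)


# Example: show only 'web' alarms, then shelve the disk alarm with a reason.
print(filter_alarms(alarms, system="web"))
shelve(alarms[1], "known issue, cleanup scheduled", "sre-on-call")
```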


Investigation is any activity that aims to discover the underlying factors in order to deal with the fault or problem. At this stage, system schematics or other such diagrams can be helpful. Coding techniques (e.g., group, colour, shape) again need to be considered fully to ensure that they support this stage without detracting from their usefulness elsewhere. Displays of system performance need to be designed carefully in terms of information presentation, ease of update, etc.

31. Is relevant information (e.g. operational status, equipment setting and reference) available with a minimum of action?
32. Is information on the likely cause of an alarm available?
33. Is a usable graphical display concerning a displayed alarm available with a single action?
34. When multiple display elements are used, are individual elements visible (not obscured)?
35. Are visual mimics spatially and logically arranged to reflect functional or naturally occurring relationships?
36. Is navigation between screens, windows, etc, quick and easy, requiring a minimum of user action?


Correction is the application of the results of the previous stages to address the problem(s) identified by the alarm(s). At this stage, the HMI must allow timely and error-tolerant command entry, if the fault can be fixed remotely. For instance, any command windows should be easily called up, user memory demands for commands should be minimised, help or instructions should be clear, upper and lower case characters should be treated equivalently, and positive feedback should be presented to show command acceptance.

37. Does every alarm have a defined response and provide guidance or indication of what response is required?
38. If two alarms for the same system have the same response, has consideration been given to grouping them?
39. Is it possible to view status information during fault correction?
40. Are cautions used for operations that might have detrimental effects?
41. Is alarm clearance indicated on the visual display, both for accepted and unaccepted alarms?
42. Are local controls positioned within reach of the normal operating position?


Monitoring is the assessment of the outcome of the Correction stage. At this stage, the HMI (including schematics, alarm clears, performance data and message/event logs) needs to be designed to reduce memory demand and the possibility of interpretation problems (e.g., the ‘confirmation bias’).

43. Is the outcome of the Correction stage clear to the user? (A number of questions primarily associated with observation become relevant to monitoring.)


Co-ordination between operators is required for collaborative working. This may involve delegating authority for specific issues to colleagues, or co-ordinating efforts for problems that permeate several different parts of the overall system.

44. Are shared displays available to show the location of operators in the system, areas of responsibility, etc?


Alarm design and you

The trouble with alarm design is that it seems obvious. It’s not. Control rooms around the world are littered with alarms designed via ‘common sense’, which do not meet users’ needs. Even alarms designed by you for you may not meet your needs, or others’ needs. There are many reasons for this. For example:

  • We are not always sure what we need. Our needs can be hard to access and they can conflict with other needs and wants. What we might want, such as a spectrum of colours for all possible system states, is not what we need in order to see and understand what’s going on.
  • We find it hard to understand and reconcile the various needs of different people in the system. There is an old joke among air traffic controllers that if you ask 10 controllers you will get 10 different answers…minimum. Going up and out to other stakeholders, the variety amplifies.
  • We find it hard to understand system behaviour, including its boundaries, cascades and hidden interactions. We all tend to define system boundaries, priorities and interactions from a particular perspective, typically our own.
  • We are not all natural designers. Many alarm systems show a lack of understanding of coding (colour, shape, size, rate…), control, chunking, consistency, constancy, constraints, and compatibility…and that’s just the c’s! What looks good is not necessarily what works good…

This is where the delicate interplay of design thinking, systems thinking and humanistic thinking comes in. Understanding the nature of alarm handling, and the associated design issues, can help you – the expert in your work – to be a more informed user, helping to bring about the best alarm systems to support your work.

References

Allspaw, J. (in prep). HF/E Practice in Web Engineering and Operations. In Shorrock, S.T. and Williams, C.A.W. (Eds). (in prep). Human Factors and Ergonomics in Practice. Ashgate.

Kemeny, J.G. (Chairman) (1979). President’s Commission on the Accident at Three Mile Island. Washington DC.

Shorrock, S.T., MacKendrick, H., Hook, M., Cummings, C. and Lamoureux, T. (2001). The development of human factors guidelines with automation support. Proceedings of People in Control: An International Conference on Human Interfaces in Control Rooms, Cockpits and Command Centres, UMIST, Manchester, UK, 18–21 June 2001.

Shorrock, S.T. and Scaife, R. (2001). Evaluation of an alarm management system for an ATC Centre. In D. Harris (Ed.) Engineering Psychology and Cognitive Ergonomics: Volume Five – Aerospace and Transportation Systems. Aldershot: Ashgate, UK.

Shorrock, S.T., Scaife, R. and Cousins, A. (2002). Model-based principles for human-centred alarm systems from theory and practice. Proceedings of the 21st European Annual Conference on Human Decision Making and Control, 15th and 16th July 2002, The Senate Room, University of Glasgow.

Stanton, N. (1994). Alarm initiated activities. In N. Stanton (Ed.), Human Factors in Alarm Design. Taylor and Francis: London, pp. 93-118.

This post is adapted from an article due to appear in Hindsight 22 on Safety Nets in December 2015 (see here). Some of the article is adapted from Shorrock et al, 2002. Many thanks to Daniel Schauenberg (@mrtazz) and Mathias Meyer for super helpful comments on an earlier draft of this post.


Toad’s Checklist

Sometimes lessons about work come about from the most wonderful places. Arnold Lobel’s ‘Frog and Toad’ books for children from the 1970s are one such place. Frog and Toad are friends, and share many everyday adventures. One of their adventures is recounted in a story called ‘A List’. The story starts with Toad in bed. Toad had many things to do and decided to make a list so that he would remember them. He started the list with “Wake up”, which he had already done and so he crossed it out. He then wrote other things to do. He followed his list and crossed each off, one by one.

After he got dressed (and crossed this off the list), Toad put the list in his pocket and walked to see Frog (and crossed this off the list too). The two then went for a walk together, in accordance with the list. During the walk, Toad took the list from his pocket and crossed out “Take walk with Frog”.

At that moment, a strong wind blew the list out of Toad’s hand, high up into the air.


“Help!” cried Toad.

“My list is blowing away.

What will I do without my list?”

“Hurry!” said Frog.

“We will run and catch it.”

“No!” shouted Toad.

“I cannot do that.”

“Why not?” asked Frog.

“Because,” wailed Toad,

“running after my list

is not one of the things

that I wrote

on my list of things to do!”

So Frog ran after the list, over the hills and swamps, but the list blew on and on.

Frog returned without the list to a despondent Toad.

“I cannot remember any of the things

that were in my list of things to do.

I will just have to sit here

and do nothing,” said Toad.

Toad sat and did nothing.

Frog sat with him.

Eventually, Frog said that it was getting dark and they should go to sleep, and Toad remembered the last item on the list:

Go to sleep.

Checklists are common in both everyday life and in complex and hazardous industries, such as aviation and medicine. They are sometimes seen as a panacea. They are certainly helpful, for many reasons: aiding memory, encouraging thoroughness and consistency, incorporating mitigations from risk assessments and investigations, and co-ordinating teamwork. But checklists cannot account for the total variety of situations that may arise, especially rare problem situations that perhaps have never been thought possible. In such cases, checklists may encourage an unhelpful dependency when fundamental knowledge and experience, pattern recognition, and indeed creativity may be required. Such events are sometimes referred to as ‘black swans’ (Taleb, 2010). Two of the characteristics of black swan events, according to Taleb, are:

“First, it is an outlier, as it lies outside the realm of regular expectations, because nothing in the past can convincingly point to its possibility. Second, it carries an extreme ‘impact’.”

One such black swan event occurred on 4 November 2010. Just four minutes after take off, climbing through 7,000ft from Singapore Changi Airport, an explosion occurred in one of the engines of QF32, a Qantas Airbus A380. Debris tore through the wing and fuselage, resulting in structural and systems damage. The crew tried to sort through a flood of computer-generated cockpit alerts on the electronic centralized aircraft monitor (ECAM), which monitors aircraft functions and relays them to the pilots. It also produces messages detailing failures and in certain cases, lists procedures to undertake to correct the problem. Nearly all the non-normal procedures are on the ECAM.

The crew recalled an “avalanche” of (sometimes contradictory) warnings relating to engines, hydraulic systems, flight controls, landing gear controls, and brake systems. David Evans, a Senior Check Captain at Qantas with 32 years of experience and 17,000hrs of flight time, was in an observer’s seat during the incident. Interviewed afterwards, he said, “We had a number of checklists to deal with and 43 ECAM messages in the first 60 seconds after the explosion and probably another ten after that. So it was nearly a two-hour process to go through those items and action each one (or not action them) depending on what the circumstances were” (Robinson, 8 December 2010).

The Pilot in Command, Captain Richard de Crespigny (15,000hrs) wrote, “The explosion followed by the frenetic and confusing alerts had put us in a flurry of activity, but Matt [Matt Hicks, First Officer, 11,000hrs] and I kept our focus on our assigned tasks while I notified air traffic control … ‘PAN PAN PAN, Qantas 32, engine failure, maintaining 7400 and current heading’”… “We had to deal with continual alarms sounding, a sea of red lights and seemingly never-ending ECAM checklists. We were all in a state of disbelief that this could actually be happening.” (July 21 2012).

In an article in the Wall Street Journal, Andy Pasztor wrote that “Capt. Richard de Crespigny switched tactics. Rather than trying to decipher the dozens of alerts to identify precisely which systems were damaged, as called for by the manufacturer’s manuals and his own airline’s emergency procedures, he turned that logic on its head—shifting his focus to what was still working”. “We basically took control,” said de Crespigny. “Symbolically, it was like going back to the image of flying a Cessna”. This strategy is reminiscent of Safety-II and appreciative inquiry: rather than focusing only on what is going wrong, focus on what is working, and why, and try to build on that.

When he was asked if he had any recommendations for Qantas or Airbus concerning training for ECAM messages in the simulator, Captain David Evans noted, “We didn’t blindly follow the ECAMs. We looked at each one individually, analysed it, and either rejected it or actioned it as we thought we should. From a training point of view it doesn’t matter what aeroplane you are flying airmanship has to take over. In fact, Airbus has some golden rules which we all adhered to on the day – aviate, navigate and communicate – in that order”. Similarly, Captain Richard de Crespigny noted, “I don’t trust any checklist naively.” Rather than getting lost in checklists that no longer worked, he made sure that his team were focusing on what systems were working. On the Air Crash Investigations programme Qantas Flight 32 – Emergency In The Sky, he said, “We sucked the brains from all pilots in cockpit to make one massive brain and we used that intelligence to resolve problems on the fly because they were unexpected events, unthinkable events.”

Checklists have been an enormous benefit in a number of sectors, especially aviation and now medicine, but we must remember that they can never represent all possibilities, as work-as-imagined will tend to deviate from work-as-done – through ineffective consultation, design, and testing, through incremental changes and adjustments, or through rare surprises and emergent system behaviour. That being the case, we must not forget about the importance of supporting and maintaining our human capacities for anticipation, insight, sensemaking, flexibility, creativity, adjustment and adaptation. In practical terms, this might mean:

  • refresher training that is meaningful and challenging, and allows for experimentation in a (psychologically and physically) safe environment,
  • information designed such that it meets our needs and does not exceed our capacity to process it properly,
  • procedures that reflect the way that work is actually done, having been developed with users and tested in a range of conditions and re-checked over time – acknowledging that there will be a need for flexibility,
  • a collective way of being that is fair and acknowledges the need to respond in unthought-of ways, and
  • some basic principles of work that everyone can agree on and fall back on.

With this in mind, we can make the most of checklists without being blindsided by them, as Frog and Toad were. Instead, we can stay in control, or at least retain the ultimate fallback mode: human ingenuity.

Human ingenuity saved QF32 when the black swan flew by.


‘Human error’ in the headlines: Press reporting on Virgin Galactic

“No situation is too complex to be reduced to this simple, pernicious notion. ‘Human error’ has become a shapeshifting persona that can morph into an explanation of almost any unwanted event” Image: Steven Shorrock https://flic.kr/p/woKmc5 CC BY-NC-SA 2.0

Again, a familiar smoke pattern has emerged from the ashes of a high-profile accident. The National Transportation Safety Board held a hearing in Washington D.C. on 28 July 2015 on the Virgin Galactic crash over California on October 31, 2014. VSS Enterprise, an experimental spaceflight test vehicle, crashed in the Mojave Desert following a catastrophic in-flight breakup during a test flight. Michael Alsbury, the co-pilot, was killed and Peter Siebold, the pilot, was seriously injured.

The NTSB’s press release was headlined “Lack of Consideration for Human Factors Led to In-Flight Breakup of SpaceShipTwo” and opened with “The National Transportation Safety Board determined the cause of the Oct. 31, 2014 in-flight breakup of SpaceShipTwo, was Scaled Composite’s failure to consider and protect against human error and the co-pilot’s premature unlocking of the spaceship’s feather system as a result of time pressure and vibration and loads that he had not recently experienced.” (07/28/2015).

In the NTSB’s Human Performance presentation, various “Stressors Contributing to Copilot’s Error” were cited, including memorisation of tasks, time pressure (to complete tasks within 26 seconds and abort at 1.8 Mach if feather not unlocked), and a lack of recent experience with SS2 vibration and loads. And concerning “Lack of Consideration for Human Error”: the system was not designed with safeguards to prevent unlocking the feather; manuals/procedures did not have warnings about unlocking the feather early; the simulator training did not fully replicate the operational environment; and hazard analysis did not consider ‘pilot-induced hazards’.

Board member Robert Sumwalt stated that “A single-point human failure has to be anticipated,” and that Scaled Composites, an aerospace company, “put all their eggs in the basket of the pilots doing it correctly.” The NTSB also criticised the FAA, which approved the experimental test flights, for failing to pay enough attention to human factors or to provide necessary guidance to the commercial space flight industry on the topic. NTSB chairman Christopher Hart said, “Many of the safety issues that we will hear about today arose not from the novelty of a space launch test flight, but from human factors that were already known elsewhere in transportation.” Pressure from some FAA managers to approve experimental flight permits quickly, without a thorough understanding of technical issues or the details of the spacecraft – a consistent feature in accidents – was also cited: an efficiency-thoroughness trade-off (Hollnagel, 2009).

As is so often the case, it’s a complex and messy picture. But most people are fairly unlikely to view PowerPoint presentations by the NTSB, listen to public hearings or read official press releases. Instead, they refer to the news media, who nearly never go into the same sort of detail, and instead point to ‘human error’. No situation is too complex to be reduced to this simple, pernicious notion. “‘Human error’ has become a shapeshifting persona that can morph into an explanation of almost any unwanted event” (Shorrock, 2013).

Journalists tend to write in a particular way in order to catch and keep reader attention. Unlike other stories (such as novels), news articles usually use an inverted pyramid format, where the most important moment is right at the start in the headline and ‘lede’ – the first and most important paragraph. The headline attracts the reader’s attention and the lede gives them the main points. It must do this in as few words as possible, typically a sentence or two – less than 40 words. In today’s 140-character economy, the headline and lede are probably more important than ever, because it is very easy not only to turn the page, but to click to a different news outlet altogether – losing the reader.

From a psychological point of view, this approach makes a lot of sense. Very early psychological research on human memory found that the first few items in a list of words are recalled more frequently than the middle items. This phenomenon is known as the primacy effect. It has been repeated in several fields (including clicking URLs in a list – Murphy, Hofacker and Mizerski, 2006). In journalism this is more important than the recency effect – the tendency to recall items at the end of the list best – since people less often get to the end of a news story.

Additionally, initial reports of news stories are more likely to be read and remembered. These ‘first stories’ (Cook, Woods and Miller, 1998) are also likely to focus on surface features of events, such as ‘human error’. In advertising and public communications, the law of primacy in persuasion holds that the side of an issue presented first will have greater effectiveness than the side presented subsequently. It makes sense, then, to look at some early reports of the Virgin Galactic spacecraft’s deadly crash, focusing on the headline and lede (including the first paragraph). So consider the following, which cite ‘pilot error’ as the ‘cause’ of the Virgin Galactic crash (or say that pilot error ‘led to’ the crash, or that the crash was ‘due to’ or ‘traced to’ pilot error). Some of the ledes do not mention any other factors.


CBS

NTSB Confirms Co-Pilot Error As Cause Of 2014 Virgin Galactic Tragedy – LOS ANGELES (CBSLA.com) — The National Transportation Safety Board on Tuesday confirmed co-pilot error as the cause of the deadly failure of Virgin Galactic’s SpaceShipTwo spacecraft over the Mojave Desert in October. 

BBC

Virgin Galactic spacecraft crash ‘due to braking error’ – Investigators say a Virgin Galactic spaceship crash was caused by structural failure after the co-pilot unlocked a braking system early.

USA Today

NTSB finds pilot error in Virgin Galactic spaceship crash – The Virgin Galactic spaceship crash last year occurred after a co-pilot prematurely unlocked a flap assembly that’s only supposed to be deployed at almost twice the speed, well past the speed of sound, the National Transportation Safety Board reported Tuesday.

New York Times

Virgin Galactic SpaceShipTwo Crash Traced to Co-Pilot Error – A single mistake by the co-pilot led to the fatal disintegration of a Virgin Galactic space plane during a test flight in October, the National Transportation Safety Board concluded Tuesday, and the board strongly criticized the company that designed and manufactured the plane for not building safeguards into the controls and procedures.

LA Times

Virgin Galactic: ‘Single human error’ led to catastrophic crash, NTSB says – Pilot error — and a lack of safeguards in place to prevent it — was the main cause of the deadly crash of Virgin Galactic’s SpaceShipTwo in a test flight last fall, federal safety officials said Tuesday.

The Register

Virgin Galactic SpaceShipTwo crackup verdict: PILOT ERROR – Feathercock unlock fingered as cause of Mach 1 midair breakup The National Transport Safety Board has today given its verdict on last year’s fatal Virgin Galactic crash, suggesting the cause of the tragic accident was human error and a lack of safety training.

The Guardian

Virgin Galactic crash: co-pilot unlocked braking system too early, inquiry finds – A nine-month investigation by the National Transportation Safety Board has found human error and inadequate safety procedures caused the violent crash


Other reports used the word ‘blamed’, which may have been used innocently, but conveys a more emotive message.


Sky News

Braking Error Blamed In Virgin Galactic Crash – Investigators find the co-pilot prematurely unlocked a system which repositions the spacecraft’s wings to slow its re-entry. An official investigation into a Virgin Galactic spacecraft’s deadly crash over California last October has found a co-pilot unlocked a braking system too early.

NBC

NTSB Blames Co-Pilot Error in Virgin Galactic SpaceShipTwo Crash – A co-pilot error caused Virgin Galactic’s SpaceShipTwo to crash last year, federal transportation officials said on Tuesday, adding that the mishap might have been prevented with better training and technical safeguards.

CNN

Virgin Galactic crash blamed on human error – Human error was responsible for the catastrophic 2014 crash of an experimental Virgin Galactic rocket ship that left the co-pilot dead. Specifically, the “probable cause” of the crash was the “pilot’s premature unlocking of the SpaceShipTwo feather system,” the National Transportation Safety Board said Tuesday. The feather locks are essentially a braking system designed to allow the rocket to safely descend from space.


Some media channels chose a more mysterious click-bait headline, revealing the ‘human error’ cause in the lede.


Business Insider

We finally know why Virgin Galactic’s spacecraft disintegrated in mid-air – The National Transportation Safety Board has finally determined what caused Virgin Galactic’s SpaceShipTwo to crash into the Mojave Desert last October. Pilot error.

CBS

Cause of Virgin Galactic spaceplane crash identified – The fatal in-flight breakup of Virgin Galactic’s futuristic SpaceShipTwo rocket plane during a test flight last October was the result of pilot error, possibly triggered by a high workload, unfamiliar vibration and rapid acceleration, the National Transportation Safety Board concluded Tuesday.


Other headlines did not mention ‘human error’, but this dominated the lede.


The Times

Safety flaws that doomed Branson’s spaceship – Pilot error and a failure to protect against it by the company making Virgin Galactic’s SpaceShipTwo rocket ship led to its catastrophic break-up over the Mojave Desert, investigators concluded last night.

Wired

Blame a Catastrophic Blindspot for the Fatal Virgin Galactic Crash – The deadly crash of the spaceship designed to carry wealthy Virgin Galactic customers into space can be pinned on pilot error, the National Transportation Safety Board said this morning.


In human factors, we tend not to think so simplistically, and concern ourselves instead with the second story. And it is fair to say that the study of ‘human error’ (at least according to some definitions) has helped the development of the understanding of human performance. But with ever increasing system complexity and obscure causation, might our focus on ‘human error’ as an anchoring object in our analyses and narratives have unintended consequences, perhaps serving to reinforce a populist view of ‘human error’?

This is one way in which a focus on ‘human error’ might be seen as the ‘handicap of human factors’. The argument here is not an ontological one concerning the reality of ‘human error’ as a separate and clearly definable phenomenon. That is a different issue, and there is much written about that. For the moment, there is a need to reflect on how our language and way of thinking are conveyed, translated and perceived by the press, by others in industry and by society. Our subtle interpretations and arguments are of little interest to the media, who tend to strip the concept down to a person view (e.g. carelessness or a lapse of some kind), devoid of systemic narrative – at least in the all-important headline and lede of initial reports.

So how can we encourage a different narrative? Here are three things that those of us with an interest in human factors and ergonomics (HF/E) can do:

  • Consider your mindset: Reflect on how you react internally to failure, and how you subsequently make sense of situations. You are human after all! It is natural to have some very human reactions. But how do these translate into judgements? Explore systems thinking. HF/E is a systems discipline, but writings by Deming, Ackoff, Meadows and other systems thinkers are rarely part of HF/E syllabi. Consider alternative or complementary perspectives such as Safety-II (Hollnagel, 2014), but remain aware of your worldview – mind your mindset.
  • Mind your language: Reflect on how you talk or write about failure events. Enrich your vocabulary when writing reports or using social media. You don’t have to avoid using the term ‘error’ or variants, but consider others which may describe the situation: performance variability, ordinary variability,  adjustment, adaptation, trade-off, compromise – or just ‘decision’ or ‘action’. If you use the term ‘human error’ (or ‘pilot error’), use scare quotes/inverted commas. This gives a signal that there is more to it. Avoid phrases such as ‘the pilot failed to’ or ‘the surgeon neglected to’, since these words have heavy emotive connotations. Bear in mind that some words translate quite differently to other languages, sometimes with more personal connotations.
  • Take action: Contact journalists and reporters via social media, or write to the letters sections, to challenge simplistic narratives. Look for influential journalists who report on matters concerning safety, psychology and human factors. Cite simplistic reporting or language to highlight the problem within your industry, and on social media. If all else fails, consider installing Jon Cowie‘s Chrome Plug-in, which replaces instances of ‘human error‘ on webpages with ‘a complex interaction of events and factors‘. Mostly, that is what we are talking about…

We can all play a part in encouraging better reporting of major accidents and other adverse events. This can improve the awareness of many stakeholders (e.g. public, policy makers, regulators) by exposing real system weaknesses and help to improve safety by refocusing attention on the system. It is only fair to those caught up in systems accidents, whose natural and to-be-expected (and required) performance variability is so often inadequately considered in system design and operation. And often, they are the victims, as was the case with Virgin Galactic.


Reducing ‘the human factor’

“We need not reduce ourselves to bad outcomes, human errors, cognitive abstractions, or emotional aberrations.” Image: Pascal https://flic.kr/p/99GyCR

If you work in an industry such as transportation or healthcare – where human involvement is critical – you have probably heard people talk about ‘the human factor’. This elusive term is rarely defined, but people often refer to reducing it, or perhaps mitigating it.

This is a common frustration in human factors and ergonomics. Perhaps it is time that we discuss it further. The question is, when we talk about reducing ‘the human factor’, what are we actually reducing?

Are we reducing the human to bad outcomes? The ‘human factor’ in safety nearly always seems to be a negative thing. So perhaps we associate people with bad outcomes. After all, when accidents happen, there are people there. That is a consistent finding! But when accidents don’t happen, there are also people there. People don’t go to work to have accidents. And people are associated with both normal, uneventful work (most of the time), and abnormal situations – accidents and notable or exceptional successes (rarely). The principle of equivalence posits that success and failure come from the same source – ordinary work: “When wanted or unwanted events occur in complex systems, people are often doing the same sorts of things that they usually do – ordinary work. What differs is the particular set of circumstances, interactions and patterns of variability in performance. Variability, however, is normal and necessary, and enables things to work most of the time.” Reducing unwanted outcomes is obviously a good thing to do, as is increasing wanted outcomes (which achieves the same effect, and more besides). But it is not straightforward, and means looking at the socio-technical system as a whole, not just ‘the human factor’.

Are we reducing the human to ‘human error’? ‘Human error’ might be defined as “Someone did (or did not do) something that they were not (or were) supposed to do according to someone.” Reducing ‘the human factor’ by focusing myopically on ‘human errors’ may constrain and reduce the opportunity for necessary performance adjustments and variability to such a degree that there is no room for flexibility when it is needed. Providing the ability to prevent and recover from unintended interactions is a good thing, but only as part of a wider systems view that addresses other important human needs.

Are we reducing the human to a faulty information processing machine? Information processing models have been popular in cognitive psychology and human factors for decades. They have helped us to make sense of experimental data and get a handle on aspects of human functioning and performance, such as attention, perception, memory, and (arguably to a lesser extent) decision making. They have helped us to understand the limits and quirks of our cognitive abilities. But a person is not a model. These seductively simple models – boxes, lines, arrows and labels – are engineering approximations of human experience. Their geometric orderliness and linearity can never hope to capture our lived experience. A person is a unique individual, in a constantly changing context. What we see and hear, what we pay attention to and remember, what we decide and do, are dynamically and subjectively enmeshed in that context. And so is what we feel.

Are we reducing the human to emotional aberrations? Human emotion is not given nearly as much attention as cognition in either safety or psychology, though there is plenty of evidence that thoughts and feelings are interdependent. But where emotion is thought to influence human behaviour at work, such as in emergencies or in very boring work situations, emotion-as-aberration comes into focus as something to be removed or reduced. But emotions cannot be cleanly sorted into ‘good’ or ‘bad’ piles, and humans cannot be reduced to their emotions.

Are we reducing human involvement in socio-technical systems? With such thinking about people at work, we risk squashing them out of the system. There has long been a mindset that people are a source of trouble in industrial environments – that if only it weren’t for the people, the world would be an engineer’s (or manager’s) paradise. Some would wish to automate people out, and it is happening at an ever faster pace. Aside from the societal cost, this often just changes the nature of human involvement – often for the worse. While some jobs are so dangerous that human involvement should be reduced, in most cases, the people are what make the imperfect system work as a whole.

We are human. We need not reduce ourselves to bad outcomes, human errors, cognitive abstractions, or emotional aberrations, else we are on a road to reducing our involvement altogether. In doing that, we reduce our choice and control, our responsibility and accountability, our capacity and potential, and our meaning and value.

Posted in Human Factors/Ergonomics, Humanistic Psychology, Safety, systems thinking

Systems Thinking for Safety: From A&E to ATC

This article summarises a EUROCONTROL Network Manager White Paper called Systems Thinking for Safety: Ten Principles. The White Paper was a collaboration of EUROCONTROL, DFS, nine other air navigation service providers and three pilot and controller associations. The purpose is to encourage a systems thinking approach among all system stakeholders to help make sense of – and improve – system performance. This article was originally published in Safety Letter, a magazine of the German air traffic services provider Deutsche Flugsicherung (DFS) and can be downloaded here.

Only human

Imagine you are a health professional – a doctor or a nurse – working in a busy accident and emergency (A&E) department. You are highly trained and experienced in your job, and you are motivated to do your best. But you work within a hospital A&E department that faces many challenges. Demand is very high, and it can vary significantly and quickly depending on the time of day, the day of the week, and the time of the year. At the busiest times, there are many patients waiting to receive care. There have been government funding cuts to social services and it can take days to get an appointment with family doctors/general practitioners, so patients come to A&E when they feel it necessary, often when their sickness has advanced significantly. There are often many older people, with a complicated range of conditions and medications.

You are constrained by a lack of capacity in the hospital. You have to transfer patients to other parts of the hospital, but there are not enough beds, so they have to stay in A&E for longer, leading to a build-up of demand. You only have 10 beds in the A&E department, and occasionally ambulances have to wait outside, as there are no beds available for the patients. Despite this, you have a four-hour target to meet – 95% of patients must be seen, treated, admitted or discharged in under four hours. The pressure is very high. Sometimes, there are not enough nurses or doctors.

You often have to work very long hours with little sleep and few breaks. You feel fatigued, too much of the time. You interact with a wide variety of equipment to diagnose, test and monitor, and some of it can be tricky to use. The manufacturers vary and the user interfaces can be confusing, or just subtly different. You have to prescribe a huge range of medicines, and some of these have similar labels, so that two very different doses or two different medicines actually look very similar. You are under time pressure most of the time, but each patient and nurse needs some of your time. Since there are only 10 beds, trade-offs have to be made. For instance, patients are sometimes kept in corridors temporarily. It is a difficult job, but everyone does their best to make things work.

This is not a fictional scenario. It is a scenario faced daily by many hospital A&E departments. It serves here to remind us that we may be highly skilled, experienced and motivated, but we are only human, and we work in a system that has a powerful effect on our possibilities – for good or bad.
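As an aside, the build-up described above is easy to reproduce with a toy model. The sketch below is purely illustrative, with invented hourly arrival and discharge numbers (not real A&E data): when demand persistently exceeds capacity even slightly – for instance because beds are blocked by delayed transfers – the backlog grows hour by hour.

```typescript
// Purely illustrative: a toy discrete-time queue with invented numbers.
// Each hour some patients arrive; only a limited number can be treated,
// admitted or discharged (the hourly capacity). The backlog is what remains.
function simulateBacklog(hours: number, arrivalsPerHour: number, capacityPerHour: number): number[] {
  const backlog: number[] = [];
  let waiting = 0;
  for (let h = 0; h < hours; h++) {
    waiting += arrivalsPerHour;                    // new demand this hour
    waiting -= Math.min(waiting, capacityPerHour); // limited throughput
    backlog.push(waiting);
  }
  return backlog;
}

// Demand only slightly exceeds capacity, yet the queue climbs steadily.
console.log(simulateBacklog(12, 6, 5)); // [1, 2, 3, ..., 11, 12]
```

No individual in this toy model is doing anything wrong; the pressure is a property of the relationship between demand and capacity – which is the point of the scenario.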

The Fundamental Attribution Error

When we take the time to think about work situations in this kind of way, the difficulties and compromises faced by people in highly dynamic and complex systems become clearer. Yet we often don’t think in this way; we are steered by our first impressions and reactions. These can lead us astray when we seek to explain situations, events or occurrences that involve other people. Research has shown that we tend to emphasise internal or personal factors to explain someone else’s behaviour, while ignoring or minimising external, contextual or system factors. When things go wrong in life – such as a tragic major accident, a near miss or an annoying minor occurrence – this bias seems to kick in. We tend to focus automatically on the person at the sharp end. We seem to think that they should really have done better, perhaps been more careful, and then things would have been OK. The phenomenon is known as the ‘fundamental attribution error’, and it explains why we so often blame other people for things over which they had little control, or which were influenced by powerful systemic factors that affect most people.

Part of the reason for this is that individuals are ‘obvious’ – we can see them, and they are relatively unchanging. The context or system, on the other hand, is not so obvious and it changes quickly – it seems to be ‘background’. So when we try to explain events, we focus on what we can see – the person. This is our automatic or gut reaction – it is easy, and takes little effort to explain away an event as being something to do with an individual. (Additionally, it seems easier to direct our frustration, anger or outrage toward a person.) But considering the situational constraints is a complex activity which requires deliberate inquiry and conscious effort. Indeed, we slip into internal explanations when we are under greater ‘cognitive load’ or lack the energy or motivation to consider situational factors. We may also think, without even realising, that the world is ‘just’ and we all have control over our own lives, so people get what they deserve. Unsurprisingly, when we judge our own behaviour, we are more likely to explain negative outcomes in terms of the context or situation (and positive outcomes in terms of our own disposition) – sometimes known as the self-serving bias.

In organisations, there are a few problems with focusing on the person, especially when things go wrong. One is a problem of fairness. Contrary to our first impression, the context and ‘system’ in which we work has an enormous effect on behaviour. This is why the celebrated management thinker, statistician and quality guru W. Edwards Deming wrote that “most troubles and most possibilities for improvement…belong to the system”, and are the “responsibility of management”. When we dismiss the influence of the system – including the system goals, resources, constraints, incentives, flow of information, etc. – we are unfairly blaming an individual for a much bigger issue. The second, and perhaps more important, problem is practical. When we focus our attention on the individual, we miss opportunities to address future problems and improve the system. Put another way, “We spend too much of our time ‘fixing’ people who are not broken, and not enough time fixing organization systems that are broken” (Rummler and Brache, 1995). What is surprising about the accident and emergency scenario above is not that there are accidents and incidents. What is surprising is that there are so few. Pitted against a system with such high demand, with limited resources, and with such difficult constraints, people somehow manage to make it work most of the time. But if we want to make improvements, we cannot simply rely on front-line staff to always compensate in this way. If we want to improve how things work – for safety, but also for productivity, efficiency and human wellbeing – we must think about the system and how it works. This relates to the idea of ‘systems thinking’.

It’s the system, stupid!

To make sense of situations, problems and possibilities for improvement, we need to make sense of systems. ‘System’ is a word we sometimes use to describe a technical system, but here we use the word much more generally, especially to refer to systems in which humans are an integral part. An A&E department is a system, and is part of a bigger system (a hospital), which is part of a bigger system (e.g. a private or government health service), and interacts with other systems (e.g. the police service, the transport system). In ATM/ANS, one might consider the working position or sector, an Ops room or equipment room, a centre, an organisation, the airspace, the ATM/ANS system, the aviation system or the transport system. Systems exist within other systems, and exist within and across organisational boundaries. In practice, the boundaries of a system are where we choose to draw them for a purpose.

A system can also be characterised in terms of its purpose, its components, and the patterns of interactions between the components (which produce characteristic behaviour). The purpose or goal of the system is critical, yet we often take this for granted, or recite some ‘official’ purpose (“Safety First!”) without thinking too much about it. In practice, there are several interdependent goals, but we can say that these relate to the customers or system users and their needs. These needs are not simple, however, and they often conflict. For instance, as passengers, we need to arrive at our destination safely, but we also feel a need to arrive on time.

A system comprises a number of components. Some components are obvious and visible, for instance people, equipment and tools, buildings, infrastructure, and the like. Other system components are less visible, typically organisational components (such as goals, rosters, competency schemes, incentives, rules, norms) and political and economic components (such as pressures relating to runway occupancy, noise abatement, and performance targets). All of these have a powerful influence on how the system works – on what the people within the system do – and yet because we cannot readily ‘see’ them, we sometimes don’t realise this. These components – both obvious and less obvious – interact in characteristic ways or patterns, within an environment which may be more or less changeable, in more or less predictable ways.

‘Complex systems’

The term ‘complex system’ is often used in aviation (and other industries), and it is important to consider what is meant by this. According to Snowden and Boone (2007), complex systems involve large numbers of interacting elements and are typically highly dynamic and constantly changing with changes in conditions. Their cause-effect relations are non-linear; small changes can produce disproportionately large effects. Effects usually have multiple causes, though causes may not be traceable and are socially constructed. This means that we jointly construct and negotiate an understanding of reality, its significance and meaning, through our models of the social world and through language. In a complex system, the whole is greater than the sum of its parts and system behaviour emerges from a collection of circumstances and interactions. Complex systems also have a history and have evolved irreversibly over time with the environment. They may appear to be ordered and tractable when looking back with hindsight. But in fact, they are increasingly unordered and intractable. It is therefore difficult or impossible to decompose complex systems objectively, to predict exactly how they will work with confidence, or to prescribe what should be done in detail. This state of affairs differs from, say, an aircraft engine, which we might describe as ‘complex’ but is actually ordered, decomposable and predictable (with specialist knowledge). Some therefore term such systems ‘complicated’ instead of complex (though the distinction is not straightforward). While machines are deterministic systems, organisations and their various units are purposeful ‘sociotechnical systems’.
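To make the non-linearity point tangible, here is a toy illustration (not from Snowden and Boone, and not a model of any real system): the logistic map, a standard textbook example of non-linear dynamics, where two starting points that differ only in the sixth decimal place end up far apart within a few dozen steps.

```typescript
// Toy illustration of non-linearity (not a model of any real system).
// Logistic map: x(t+1) = r * x(t) * (1 - x(t)). Two almost identical
// starting points diverge until they bear no resemblance to each other.
function logisticTrajectory(x0: number, r: number, steps: number): number[] {
  const xs = [x0];
  for (let i = 0; i < steps; i++) {
    xs.push(r * xs[i] * (1 - xs[i]));
  }
  return xs;
}

const a = logisticTrajectory(0.400000, 3.9, 40);
const b = logisticTrajectory(0.400001, 3.9, 40); // differs only in the sixth decimal place
console.log(Math.abs(a[40] - b[40]).toFixed(3)); // typically a large gap, despite the tiny initial difference
```

This is only an analogy for the ‘small changes, large effects’ property; real sociotechnical systems are far messier, which is precisely why they resist decomposition and confident prediction.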

If only everybody just did their job!

Despite these complex interactions and the nature of the wider system, the way that we try to understand and manage sociotechnical system performance (i.e. people and technology) is as a collection of components – a bunch of stuff that will all work together so long as each part does its job. This focus on ‘components’ – a person or a piece of equipment – is common in many standard organisational practices. At an individual level, it includes incident investigations that focus only on the controller’s performance, behavioural safety schemes that observe individual compliance with rules, individual performance reviews/appraisals, individual incentive schemes, individual performance targets, etc. This focus is a management trade-off, since it is easier to focus on individuals than on complex system interactions!

You may be wondering, “how is this a problem?” Surely, if everybody does his job, it will be all right! A counter-intuitive truth seems to defy this idea. Deming observed that “It is a mistake to assume that if everybody does his job, it will be all right. The whole system may be in trouble”. That may seem like a curious statement. How can it be so? The famous strategy of industrial action known as ‘work to rule’ gives one answer. In his book ‘Images of Organization’, Gareth Morgan wrote: “Take the case of the old state-owned British Rail. Rather than going on strike to further a claim or address a grievance, a process that is costly to employees because they forfeit their pay, the union acquired a habit of declaring a “work to rule”, whereby employees did exactly what was required by the regulations developed by the railway authorities. The result was that hardly any train left on time. Schedules went haywire, and the whole railway system quickly slowed to a snail’s pace, if not a halt, because normal functioning required that employees find shortcuts or at least streamline procedures.” Fancy that! If everyone does only his or her job, exactly and only as officially prescribed, then the system comes to a grinding halt! As well as this, it is not possible to design exactly the interactions between all system components (people, procedures, equipment), so people – as the only flexible component – have to ‘take up the slack’; they adapt to the needs of the situation (the same situation that we often miss when looking from the outside, with hindsight). People are always needed to fill in the gaps in work-as-prescribed; to adjust and adapt to changing, unforeseen and unforeseeable conditions. These adjustments and adaptations are the reason why systems work.

Yet sometimes, such adjustments and adaptations are defeated by changing system conditions. While we accept the adjustments so long as the system is working, we decry them when the system fails. Outside of the organisation, this is reinforced by the justice system, which seeks a person to put on trial when accidents occur (e.g. the 2013 train crash at Santiago de Compostela). The unwritten assumption is that if the person had tried harder, paid closer attention, or done exactly what was prescribed, then things would have gone well. Ironically, even when the individual and their sharp-end performance has been found to be the ‘primary cause’, the recommendation of major investigations is rarely to remove the person from the system. The recommendation is increasingly to change the system itself (as with the train crash at Santiago de Compostela). This is of course very unfair to the person, but while we cannot yet seem to shake off our simplistic causal attributions, we seem more capable of understanding that possibilities for improvement belong to the system. We seem to know, at some level, that simply replacing the individual with another individual is not a solution.

How to suboptimise a system

A second counter-intuitive truth comes from organisational theorist Russell Ackoff, who wrote that “it is possible to improve the performance of each part or aspect of a system taken separately and simultaneously reduce the performance of the whole” (1999, p. 36). A practical example can be seen when each department or division of an organisation seeks to meet its own goals or improve at the expense of another, for instance creating internal competition for resources. In such cases, the emphasis may be on the function instead of the interactions and flow of work to satisfy customer needs. Ackoff quipped that “Installing a Rolls Royce engine in a Hyundai can make it inoperable”.
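As a toy numerical illustration of Ackoff’s point (invented numbers, not from Ackoff): imagine two departments in series, where each item of work needs an hour in Department A and then an hour in Department B. If A ‘improves’ its own efficiency metric by reducing changeovers – handing work over only in batches of four – every item’s end-to-end lead time gets worse, even though A’s local numbers look better. The function name and figures below are made up for the example.

```typescript
// Invented illustration of suboptimisation: two departments in series.
// Dept A releases work only in complete batches (to "optimise" its own
// changeover efficiency); Dept B then processes one item per hour.
function endToEndFinishTimes(batchSize: number, items: number): number[] {
  const finish: number[] = [];
  for (let i = 0; i < items; i++) {
    const releasedByA = Math.ceil((i + 1) / batchSize) * batchSize; // hour A hands the item over
    const startB = Math.max(releasedByA, i === 0 ? 0 : finish[i - 1]);
    finish.push(startB + 1); // one hour of work in B
  }
  return finish;
}

console.log(endToEndFinishTimes(1, 4)); // one-piece flow: items finish at hours [2, 3, 4, 5]
console.log(endToEndFinishTimes(4, 4)); // batching in A:  items finish at hours [5, 6, 7, 8]
```

Whether batching actually helps A depends on details not modelled here; the point is only that a component-level ‘improvement’ can degrade the flow of the whole – the organisational version of the Rolls Royce engine in the Hyundai.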

Organisations are not cars, obviously. Yet we often manage these complex social systems as if they were complicated machines. As well as making changes at the component level, we also have a habit of:

  • assuming fixed and universal goals – the purpose of a car engine is fixed. Not so with organisations.
  • using reductionist methods – we can break an engine down into parts, analyse each and model and test the result, with fairly reliable results. Not so with organisations.
  • thinking in a linear and short-term way – we can think linearly for simple and complicated machines. Not so with organisations.
  • judging against arbitrary standards, performance targets, and league-tables – it might make sense to have a league table of engines, and to have standards and performance targets for each part. Not so with organisations.
  • managing by numbers and outcome data – everything in a car engine can be meaningfully quantified in some way. Not so with organisations.

We also tend to lose sight of the fact that our world is changing at great speed, and accelerating. This means that the ways we have responded to date will become less effective. Ackoff (1999) noted that organisations, institutions and societies are increasingly interconnected and interdependent, with changes in communication and transportation, and that our environments have become larger, more complex and less predictable. We must therefore find ways to understand and adapt to this changing environment.

Treating a complex sociotechnical system as if it were a complicated machine, and ignoring the rapidly changing world, can distort the system in several ways. First, it focuses attention on the performance of components (staff, departments, etc.), and not the performance of the system as a whole. We tend to settle for fragmented data that are easy to collect. Second, a mechanical perspective encourages internal competition, gaming, and blaming. Purposeful components (e.g. departments) compete against other components, ‘game the system’ and compete against the common purpose. When things go wrong, people retreat into their roles, and components (usually individuals) are blamed. Third, as a consequence, this perspective takes the focus away from the customers/service-users and their needs, which can only be addressed by an end-to-end focus. Fourth, it makes the system more unstable, requiring larger adjustments and reactions to unwanted events rather than continual adjustments to developments.

A systems viewpoint

A systems viewpoint means seeing the system as a purposeful whole – as holistic, and not simply as a collection of parts. We try to “optimise (or at least satisfice) the interactions involved with the integration of human, technical, information, social, political, economic and organisational components” (Wilson, 2014, p. 8). Improving system performance – both safety and productivity – therefore means acting on the system, as opposed to ‘managing the people’ (see Seddon, 2005).

As design and management become more inclusive and participatory, roles change and people span different roles. Managers, for instance, become system designers who create the right conditions for system performance to be as effective as possible. System actors, such as front-line staff, also become system experts, providing crucial information on how the system works, helping to make sense of it, and providing the necessary adjustments.

The ten principles that follow give a summary of some of the key tenets and applications of systems thinking for safety that have been found useful to support practice. The principles are, however, integrative, derived from emerging themes in the systems thinking, systems ergonomics, resilience engineering, social science and safety literature.

The principles concern system effectiveness, but are written in the context of safety to help move toward Safety-II (see EUROCONTROL, 2013; Hollnagel 2014). Safety-II aims to ‘ensure that as many things as possible go right’, with a focus on all outcomes (not just accidents). It takes a proactive approach to safety management, continuously anticipating developments and events. It views the human as a resource necessary for system flexibility and resilience. Such a shift is necessary in the longer term, but there is a transition, and different perspectives and paradigms are needed for different purposes (see Meadows, 2009).

Each principle is described along with some practical advice and questions for reflection that apply to various safety-related activities. What follows is a summary of the principles from the EUROCONTROL White Paper, Systems Thinking for Safety: Ten Principles. Each is described in much more detail in the White Paper, and in a set of Learning Cards. You can download both at SKYbrary, via http://www.bit.ly/ST4SAFETY.

The Foundation: System Focus

Most problems and most possibilities for improvement belong to the system. Seek to understand the system holistically, and consider interactions between elements of the system.

Principle 1. Field Expert Involvement

The people who do the work are the specialists in their work and are critical for system improvement. To understand work-as-done and improve how things really work, involve those who do the work.

Q. When trying to make sense of situations and systems, who do we need to involve as co-investigators, co-designers, co-decision makers and co-learners? How can we enable better access and interaction between system actors, system experts/designers, system decision makers and system influencers?

Principle 2. Local Rationality

People do things that make sense to them given their goals, understanding of the situation and focus of attention at that time. Work needs to be understood from the local perspectives of those doing the work.

Q. How can we appreciate a person’s situation and world from their point of view, both in terms of the context and their moment-to-moment experience? How can we understand how things made sense to those involved in the context of the flow of work, and the system implications? How can we get different perspectives on events, situations, problems and opportunities, from different field experts?

Principle 3. Just Culture

People usually set out to do their best and achieve a good outcome. Adopt a mindset of openness, trust and fairness. Understand actions in context, and adopt systems language that is non-judgmental and non-blaming.

Q. How can we move toward a mindset of openness, trust and fairness, understanding actions in context using non-judgmental and non-blaming language?

Principle 4. Demand and Pressure

Demands and pressures relating to efficiency and capacity have a fundamental effect on performance. Performance needs to be understood in terms of demand on the system and the resulting pressures.

Q. How can we understand demand and pressure over time from the perspectives of the relevant field experts, and how these affect their expectations and the system’s ability to respond?

Principle 5. Resources and Constraints

Success depends on adequate resources and appropriate constraints. Consider the adequacy of staffing, information, competency, equipment, procedures and other resources, and the appropriateness of rules and other constraints.

Q. How can we make sense of the effects of resources and constraints, on people and the system, including the ability to meet demand, the flow of work and system performance as a whole?

Principle 6. Interactions and Flows

Work progresses in flows of inter-related and interacting activities. Understand system performance in the context of the flows of activities and functions, as well as the interactions that comprise these flows.

Q. How can we map the flows of work from end to end through the system, and the interactions between the human, technical, information, social, political, economic and organisational elements?

Principle 7. Trade-offs

People have to apply trade-offs in order to resolve goal conflicts and to cope with the complexity of the system and the uncertainty of the environment. Consider how people make trade-offs from their point of view and try to understand how they balance efficiency and thoroughness in light of system conditions.

Q. How can we best understand the trade-offs that all system stakeholders make with changes in demands, pressure, resources and constraints?

Principle 8. Performance Variability

Continual adjustments are necessary to cope with variability in demands and conditions. Performance of the same task or activity will vary. Understand the variability of system conditions and behaviour. Identify wanted and unwanted variability in light of the system’s need and tolerance for variability.

Q. How can we get an understanding of performance adjustments and variability in normal operations as well as in unusual situations? How can we detect when the system is drifting into an unwanted state over the longer term?

Principle 9. Emergence

System behaviour in complex systems is often emergent; it cannot be reduced to the behaviour of components and is often not as expected. Consider how systems operate and interact in ways that were not expected or planned for during design and implementation.

Q. How can we ensure that we look at the system more widely to consider the system conditions and interactions, instead of always looking to identify the ‘cause’? How can we get a picture of how our systems operate and interact in ways not expected or planned for during design and implementation, including surprises related to automation in use and how disturbances cascade through the system? How can we make visible the patterns of system behaviour over time, which emerge from the various flows of work?

Principle 10. Equivalence

Success and failure come from the same source – ordinary work. Focus not only on failure, but also how everyday performance varies, and how the system anticipates, recognises and responds to developments and events.

Q. How can we best observe, discuss and model how ordinary work – ‘normal work’ – is actually done? Can we use a safety occurrence as an opportunity to understand how the work works and how the system behaves?

The principles do not operate in isolation; they interrelate and interact in different ways, in different situations. For instance, imagine an engineering control and monitoring position. There is variability in the way that alarms are handled, and some important alarms are occasionally missed. This must be understood in the context of the overall ATM/CNS system (Foundation: System Focus). Since success and failure come from the same source – everyday work – it is necessary to understand the system and day-to-day work in a range of conditions over time (Principle 10: Equivalence). This can only be understood with the engineers and technicians who do the work (Principle 1: Field Expert Involvement). They will view their work from their own (multiple) perspectives, in light of their experience and knowledge, their goals and focus of attention, and how they make sense of the work (Principle 2: Local Rationality).

In particular, it is necessary to understand how performance varies over time and in different situations (Principle 8: Performance Variability). For this, we must understand demand over time (e.g. the number, pattern and predictability of alarms) and the pressure that this creates in the system (time pressure; pressure for resources) (Principle 4: Demand and Pressure). Through observation and discussion, it is possible to understand the adequacy of resources (e.g. alarm displays, competency, staffing, procedures), and the effect of constraints and controls (e.g. alarm system design) (Principle 5: Resources and Constraints) on interactions and the end-to-end flow of work (Principle 6: Interactions and Flow) – from demand (alarm) to resolution in the field.

It will likely become apparent that engineers must make trade-offs (Principle 7: Trade-offs) when handling alarms. Under high pressure, with limited resources and particular constraints, performance must adapt. In the case of alarm handling, engineers may need to be more reactive (tactical or opportunistic), trading off thoroughness for efficiency as the focus shifts toward short-term goals.

Through systems methods (see http://bit.ly/1DG1odH), observation, discussion, and data review, it may become apparent that the alarm flooding emerges from particular patterns of interactions and performance variability in the system at the time (Principle 9: Emergence), and cannot be traced to individuals or components. While the alarm floods may be relatively unpredictable, the resources, constraints and demand are system levers that can be pulled to enable the system to be more resilient – anticipating, recognising and responding to developments and events.

A systems perspective, and the ten principles outlined above, encourage a different way of thinking about work, systems, events and situations. Anyone can use the principles in some way, and you may be able to use them in different aspects of your work. We encourage you to do so.

References

Ackoff, R. (1999). Ackoff’s best: His classic writings on management. John Wiley.

Deming, W.E. (2000). Out of the crisis. MIT Press.

EUROCONTROL (2013). From Safety-I to Safety-II (A white paper). EUROCONTROL.

EUROCONTROL (2014). Systems thinking for safety: Ten principles (A white paper). EUROCONTROL.

Hollnagel, E. (2014). Safety-I and Safety-II: The past and future of safety management. Ashgate.

Meadows, D. (2009). Thinking in systems: A primer (D. Wright, Ed.). Routledge.

Rummler, G. A., & Brache, A. P. (1995). Improving performance: How to manage the white space in the organization chart. Jossey-Bass.

Seddon, J. (2005). Freedom from command and control (Second edition). Vanguard.

Skybrary (2014). Toolkit:Systems Thinking for Safety. http://www.bit.ly/ST4SAFETY

Snowden, D. J., & Boone, M. E. (2007). A leader’s framework for decision making. Harvard Business Review, November, pp. 76–79.

Authors

Steven Shorrock is Project Leader, Safety Development at EUROCONTROL and the European Safety Culture Programme Leader. He has a Bachelor’s degree in psychology, a Master’s degree in work design and ergonomics, and a PhD in incident analysis and performance prediction in air traffic control. He is a Chartered Ergonomist and Human Factors Specialist, and a Chartered Psychologist, with a background in practice and research in safety-critical industries. Steve is also Adjunct Senior Lecturer at the University of New South Wales, School of Aviation.

Jörg Leonhardt is Head of Human Factors in the Safety Management Department at DFS – Deutsche Flugsicherung – the German air navigation service provider. He holds a Master’s degree in Human Factors and Aviation Safety from Lund University, Sweden. He co-chairs the EUROCONTROL Safety Human Performance Sub-Group and is the project leader of the DFS–EUROCONTROL “Weak Signals” project.

Tony Licu is Head of the Safety Unit within the Network Manager Directorate of EUROCONTROL. He leads the support of safety management and human factors deployment programmes of EUROCONTROL. He has an extensive ATC operational and engineering background and holds a Master’s degree in avionics. Tony co-chairs the EUROCONTROL Safety Team and the EUROCONTROL Safety Human Performance Sub-Group.

Christoph Peters spends half his time as an Air Traffic Controller for Düsseldorf Approach and the other half as a Senior Expert in Human Factors for the Corporate Safety Management Department at DFS – Deutsche Flugsicherung. He completed a degree in psychology and is a member of the EUROCONTROL Safety Human Performance Sub-Group and the DFS–EUROCONTROL “Weak Signals” project.

Posted in systems thinking