Fit for purpose.

Introduction.

This post discusses fit criteria – a key element of the Volere process: what they are, why they are useful and what makes a good fit criterion. It starts off with some background; this is not essential reading, so if you are only interested in the details of fit criteria you may want to skip directly to that section. The final section (the devil in the detail) looks at some examples to illustrate the difficulties of writing really good fit criteria.

Background.

Some years ago at a seminar I heard Professor David Parnas discuss the question of mathematical certainty in software. Professor Parnas, always an excellent and interesting speaker, pointed out that a software program consists of a series of mathematical transformations, so that if you express the requirements for the system formally and unambiguously – i.e. in effect mathematically – you can in principle prove whether or not the designed, and eventually the developed, system will fulfil those requirements. Professor Parnas, throughout his writings, often comes back to the theme that a major problem with software-based systems is that the requirements are very rarely expressed formally and unambiguously.

As an example, he cited an experience with the safety-critical software in the Darlington, Ontario nuclear power plant. (The story he recounted is more or less identical to the report given in (Petersen, 1991).) He, along with his team, was asked to validate this extremely important part of the control system. He was reluctant to get involved, especially as it seemed to involve stepping into an argument between the operator and the regulator, but finally agreed to look at a sample of the requirements.

He concentrated initially on a feature requiring an automatic shutdown of the pumps if the water level exceeded a certain figure. The specification, to some, would appear clear:

Shut off the pumps if the water level remains above 100 metres for more than 4 seconds.

This seems, on the surface, quite precise: there is a specific and measurable requirement. But what does it actually mean? Should it be the average water level taken over any four-second period? If so, which average? It could be:

  • The mean level: say measurements are polled every 0.1 s; the water levels over each window of 40 samplings are summed and divided by 40, and if this figure ever exceeds 100 m then the shut-off is triggered.
  • The median level: if 20 out of any 40 consecutive samplings are over 100 m then the shut-off is triggered.

There are other possibilities as well. Parnas suggested three to Ontario Hydro, the operating company that specified the software; it turned out that the implementers had interpreted the specification in a fourth way – that the level remained above 100m for all of the samplings in a 4s period. If you think about it, by the time that condition is met, depending on the variability, the mean level might be considerably higher!
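
To make the ambiguity concrete, here is a rough sketch in Python of three of the competing interpretations. The 0.1 s polling rate and the 40-sample window come from the discussion above; the function names, the sampling mechanism and the example readings are purely illustrative and are not taken from Parnas or from the Darlington specification.

    from statistics import mean, median

    THRESHOLD_M = 100.0  # shut-off threshold in metres
    WINDOW = 40          # 4 s of readings at one sample every 0.1 s

    def mean_exceeds(window: list[float]) -> bool:
        """Interpretation 1: the mean level over the 4 s window exceeds 100 m."""
        return mean(window) > THRESHOLD_M

    def median_exceeds(window: list[float]) -> bool:
        """Interpretation 2: the median level over the window exceeds 100 m
        (i.e. at least half of the 40 samplings are above 100 m)."""
        return median(window) > THRESHOLD_M

    def all_exceed(window: list[float]) -> bool:
        """The implementers' interpretation: every sample in the window is above 100 m."""
        return all(level > THRESHOLD_M for level in window)

    # The same 4 s of readings can trigger one interpretation but not another:
    samples = [99.0] * 5 + [120.0] * 35  # briefly dips below 100 m, then well above it
    print(mean_exceeds(samples))    # True  - the mean is about 117 m
    print(median_exceeds(samples))  # True  - 35 of the 40 samples exceed 100 m
    print(all_exceed(samples))      # False - the brief dip defeats the "all samples" reading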

Ontario Hydro, keen to get the review over with, thanked David Parnas for “finding the last bug”: this was something of a red rag to a bull and Parnas was persuaded that he really should do a thorough review – the fact that he lived close to such a plant may have influenced his decision!

(Petersen, 1991) sums up Professor Parnas’s overall view in his own words:

“These two facts [that computer programs are both complex and error-prone] are clearly related,” says Parnas. “Errors in software are not caused by a fundamental lack of knowledge on our part. In principle, we know all there is to know about the effect of each instruction that is executed. Software errors are blunders caused by our inability to fully understand the intricacies of these complex products.”

Practically no one expects a computer system to work the way it should the first time out. “A new chair collapses, and we’re surprised,” Parnas says. In contrast, “we accept as normal that when a computer system is first installed, it will fail frequently and will only become reliable after a long sequence of revisions.”

Few professionals would deny that software is pretty much the only area where such sloppiness is tolerated, but this may be at least in part because we can change software relatively easily, tweaking it until it does what we want. Companies – particularly those that produce consumer software – may be less worried about reliability than time to market, and a heuristic approach – keep tweaking until it’s acceptable – may have evolved as a result. This philosophy is in many ways the basis of Agile methodologies, and it is powerful not only because it is usually relatively cheap to change software but also because a product that fulfils every requirement, exactly as specified, is still only as good as the requirements themselves. Often, when presented with this apparently perfect product, our customer will tell us that while it may be what they asked for, it is not actually what they want. Under such circumstances an iterative approach with regular customer feedback will often work better (part of the genius of the agile movement is that it made a virtue out of what was happening anyway, formalising it and making sure it happened much earlier in the lifecycle: rather than rejecting V1.0 as inadequate after a year, the customer got to make constructive criticisms after a couple of weeks).

Safety-critical software is not an area where such an approach would be acceptable: it will, we hope, be used very rarely and it must work first time. But there are other, more graded areas, where the approach is (broadly) a trade-off between the cost of failure and the cost of development: I say broadly because at this point I don’t propose to get into other issues, such as the opportunity cost of not having your product out as quickly as your rival. The higher the cost of failure, the better the payback for a more formal approach to specification – always provided we specify the right product.

But we can turn this around: the lower the “cost” of a more precise specification, the cheaper it will be to produce a better product and reduce eventual maintenance costs. This is where the concept of fit criterion comes in.

Fit criteria.

The Volere process is a requirements engineering process developed by Suzanne and James Robertson and described in their book Mastering the Requirements Process (Robertson, et al., 2006). Here I am interested in one specific innovation that they introduce (and that I would consider a major Unique Selling Proposition of their process), the Fit Criterion.

The fit criterion acknowledges the usefulness of a plain language specification: despite its potential ambiguity this is a statement of a need that is readily, if sometimes fuzzily, understood by all stakeholders without the need for any special training.

But, it says, maybe we can take that further. While a simple plain language description is easy to understand, it is also easy to misunderstand (as the Darlington Power Station example shows). So can we enhance it with a formalised, quantifiable statement?

Yes, say the Robertsons, and this formalised, quantifiable and unambiguous enhancement to the requirement text is called the fit criterion: as they say in the book, once you measure the requirement – that is, express it using numbers – there is very little room for misunderstanding. This fits extremely well with the concept of test-first development (double pun not actually intended), albeit at a slightly different level. It provides the measure by which a test may be considered to pass or fail, while not actually describing the test case in detail.

The devil in the detail.

So far, so good. The reader may have gathered that I regard this as a really useful feature of the Volere process. But as with many good ideas the devil resides deep in the details, and many of the examples of fit criteria that I have seen – not least in the Robertsons’ book – fail to achieve the goal they have set themselves. Here I would like to briefly discuss some examples: why they might be ambiguous or inadequate, and how we might improve them further. The examples are taken from (Robertson, et al., 2006) and from (Open University, 2008). I have no desire to be over-critical in either case: just to illustrate that the derivation of a genuinely unambiguous and useful fit criterion is not a trivial task even for experts – indeed, even for those who came up with such a good idea in the first place!

(Robertson, et al., 2006) identify two different approaches to fit criteria: a binary approach for functional requirements – the requirement is either fulfilled or not – and a more graded approach for non-functional requirements.

Fit criteria for functional requirements.

(Robertson, et al., 2006) use a case study called IceBreaker: a product that predicts when and where ice will form on roads and that schedules trucks to treat the roads with de-icer. This is where the example requirements – and so the fit criteria – come from; the details are not particularly important here.

Given a requirement – the product shall record the weather station readings is an example given – we can see a number of “sub-” criteria. Firstly, the product needs to record a set of values. Secondly, this set of values must agree with an authority – the external system responsible for the values, in this case the weather station. So if, at every point in time, the list of values held by the weather station and that held by the product coincide, then the requirement is fulfilled.

This leads to a fit criterion (again in (Robertson, et al., 2006)) that reads:

The recorded weather station readings shall be identical to the readings as recorded by the transmitting weather station.

So far so good. But this statement itself also includes a number of ambiguities and assumptions. There is an assumption (that may be covered in a separate, linked, requirement) that the weather station transmits values: that is, the product does not poll for them. And at what point is the “measurement” – the check as to whether they agree or not – taken? If it is at specific times, is there a risk that a communication break might lead to a delay in the update?

If we are being really picky we can say, well, what do we mean by “identical”? They must represent the same temperature, but if one is dealing in Celsius and the other in Kelvin we will not want the numerical values to be the same. I say this is being picky, but this is the sort of problem that is easily forgotten and has destroyed more than one product (including the Mars Climate Orbiter).
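
To illustrate (and this is purely my own sketch, not part of the Volere example), here is what “represent the same temperature” might look like as a check in Python. The unit conversions are standard; the function names and the tolerance value are hypothetical choices.

    def to_kelvin(value: float, unit: str) -> float:
        """Convert a temperature reading to kelvin."""
        conversions = {
            "K": lambda v: v,
            "C": lambda v: v + 273.15,
            "F": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
        }
        return conversions[unit](value)

    def same_temperature(a: float, unit_a: str, b: float, unit_b: str,
                         tolerance: float = 0.05) -> bool:
        """True if the two readings represent the same temperature,
        regardless of the unit each system happens to record."""
        return abs(to_kelvin(a, unit_a) - to_kelvin(b, unit_b)) <= tolerance

    print(same_temperature(2.0, "C", 275.15, "K"))  # True: the same physical reading
    print(same_temperature(2.0, "C", 2.0, "K"))     # False: numerically equal, not "identical"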

Nor does it tell us the required reliability. While we hope the product will be 100% reliable, we can never guarantee this in advance through testing (because a single failure, any time in the life of the product, would falsify it!). We can test it to a specified level of reliability close to 100% but never to 100%.

So a refined fit criterion might say:

The recorded readings for a specified weather station shall represent the same temperatures as the readings recorded for the same polling period (as identified by the UTC time stamp) by the appropriate transmitting weather station and this shall be verified to an accuracy of 99.9% with a better than 95% confidence.

Even this is not watertight: I haven’t mentioned boundary conditions or intrinsic accuracy and I am sure there are other factors that I have missed. (In a future post I plan to come back to ways of analysing the requirement text to avoid such circumstances).
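
As an aside, the “99.9% with a better than 95% confidence” clause is itself something we can make concrete. A standard zero-failure argument tells us how many consecutive matching readings we would need to observe before we could support such a claim; the sketch below is my own illustration of that arithmetic, not anything taken from the Volere material.

    import math

    def checks_needed(reliability: float, confidence: float) -> int:
        """Smallest n such that observing n matches with zero mismatches lets us
        claim the true match rate is at least `reliability` with `confidence`.

        Reasoning: if the true match rate were below `reliability`, the chance of
        seeing n matches in a row would be at most reliability**n; we need that
        chance to fall below (1 - confidence).
        """
        return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

    # The figures from the refined fit criterion above: 99.9% accuracy, 95% confidence.
    print(checks_needed(0.999, 0.95))  # 2995 consecutive failure-free comparisons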

Fit criteria for non-functional requirements.

Many non-functional requirements may be quite subjective in nature: the product shall be easy to use or the product shall respond quickly. The fit criterion makes this objective, e.g.

90/100 users who fit the typical user profile shall report that it was “easy” or “very easy” to use the product to perform a set of standard tasks. [Assumption: both the “typical user profile” and one or more “sets of standard tasks” have been defined.]

Again, this is undoubtedly better than just saying the product should be easy to use. But again, there is always room for improvement. We need to know who decides that a given test user fits the typical user profile, and the “set of standard tasks” needs to be carefully defined so that some rare but extremely important task (for example, resetting a password) cannot be made horrendously difficult without anyone noticing, simply because the fit criterion only covers a part of the overall requirement!
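
For what it is worth, even this subjective-sounding criterion reduces to a mechanical pass/fail check once the measurements are in. The sketch below is a hypothetical illustration: the rating scale and the 90-out-of-100 threshold come from the criterion above, while the data structures and function names are my own.

    from collections import Counter

    ACCEPTED_RATINGS = {"easy", "very easy"}
    REQUIRED = 90
    SAMPLE_SIZE = 100

    def criterion_met(ratings: list[str]) -> bool:
        """True if at least 90 of 100 profiled users rated the standard tasks
        'easy' or 'very easy'."""
        if len(ratings) != SAMPLE_SIZE:
            raise ValueError("the criterion is defined over exactly 100 profiled users")
        counts = Counter(ratings)
        return sum(counts[r] for r in ACCEPTED_RATINGS) >= REQUIRED

    # Example: 88 "easy", 5 "very easy", 7 "hard" -> criterion met (93 >= 90)
    ratings = ["easy"] * 88 + ["very easy"] * 5 + ["hard"] * 7
    print(criterion_met(ratings))  # True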

Most importantly, we need to quantify the actual measure we need and not rely on the absence of its perceived opposite. I’m going to take another example from (Robertson, et al., 2006):

Requirement description: only engineers using category A logins shall be able to make any additions, updates or deletions to any weather station data.

Fit criterion: of 1000 additions, updates or deletions to any weather station data, none shall be from other than category A engineer logins.

At first sight, this may seem reasonable: after all, we have a clear and testable measure. Naturally some of the terms need formal definition, but apart from that it’s alright, isn’t it?

Think about it. What proportion of attempts to update the data are going to be from non-authorised users? Do we have a guarantee of even one in those 1000 updates? No. The fact is that this fit criterion can be fulfilled without a single attempted unauthorised access being blocked!

The number of authorised accesses, in fact, is unlikely to bear any relationship to the number of unauthorised attempts: these may be accidental (users of other categories) or malicious (external hackers) – which of these the requirement is aimed at may be better identified from the rationale – and they are likely to be rare. So really we need to turn this around:

Fit criterion: of 1000 attempts by a user other than a category A user to make additions, updates or deletions to the data relating to any weather station, none shall succeed.

Or, if we are more worried about external threats:

Fit criterion: of 1000 attempts by a group of professional, external, security testers to force access to the system and make additions, updates or deletions to the data relating to any weather station, none shall succeed.

These fit criteria are better because they measure what we actually want to know – the number of unauthorised accesses, which we want to be zero – rather than a measure that, though it appears reasonable, is actually unrelated: the number of authorised accesses.
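
As a final illustration, here is a rough sketch of what exercising the reversed fit criterion might look like. The attempt_update function and the login categories are hypothetical stand-ins for whatever interface the real product would expose; nothing here comes from the IceBreaker case study itself.

    import random

    CATEGORIES = ["A", "B", "C", "external"]

    def attempt_update(login_category: str) -> bool:
        """Hypothetical product call: returns True if the update was accepted.
        A correct product accepts updates only from category A logins."""
        return login_category == "A"  # stand-in for the real access check

    def reversed_criterion_met(attempts: int = 1000) -> bool:
        """Fit criterion: of 1000 attempts by non-category-A users to modify
        weather station data, none shall succeed."""
        non_a = [c for c in CATEGORIES if c != "A"]
        return not any(attempt_update(random.choice(non_a)) for _ in range(attempts))

    print(reversed_criterion_met())  # True for the stand-in product above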

Conclusion.

Fit criteria are a powerful tool, but the old adage still applies: a fool with a tool is a fast fool. There are pitfalls, and the fit criterion is not a silver bullet: we need to make sure it is a genuine quantification and not just a restatement of the requirement in different words, that it covers the whole requirement, and that it doesn’t lull us into a false sense of security. But it does provide a mechanism by which we can navigate between the ease of use of a natural language specification and the mathematically precise nature of the software – indeed, there is no reason why a fit criterion could not be rendered mathematically. Without using maths, a fit criterion will possibly never remove all the ambiguity, but it can go a long way towards it.

Bibliography

Open University. 2008. M883: Software requirements for business systems, Study Guides for Chapters 4 to 9. Milton Keynes: Open University Press, 2008. ISBN 978-0-7492-4857-4.

Petersen, Ivars. 1991. Finding fault: the formidable task of eradicating software bugs. Science News, February 1991.

Robertson, Suzanne and Robertson, James. 2006. Mastering the Requirements Process. Westford, Massachusetts: Pearson Education, Inc, 2006. ISBN 0-321-41949-1.
