Reliability and Validity of Exams

The Trainer’s Handbook

Chapter 5 - Evaluating Your Effectiveness

Examinations. Exams test for “book” learning. They cannot really test practical experience. In fact, because most colleges rely very heavily on exams as a means of evaluation, a common criticism of recent college graduates is that they have a good academic background but no hands-on experience. In a business environment, case histories and projects, assessment sessions, and, of course, on-the-job training are much better means of evaluating actual performance.
What exams can tell you is the extent to which the trainees have learned and can recall what they have been taught; used for this level-one purpose, exams are effective measuring tools. Where content, vocabulary, formats, formulas, and the like need to be mastered, the exam is a vital evaluation mechanism for both trainer and trainee.

Reliability and Validity of Exams. It is important to consider two defining factors about exams: reliability and validity. Reliability means that the test you’ve created gets consistent results over a period of time, with similar groups of trainees. It means the results are probably quite accurate. The more times a good test is given, the more reliable it becomes because each administration increases the database against which an individual’s or group’s performance is measured. Reliability is a statistical function.
To test for reliability, record all of the raw (actual) scores for each group of trainees who take the examination. Figure 5-2 [in the book] is a graph showing the scores on a scale from 1 to 100. Each x represents a trainee’s score. That is, for each trainee who scores a 99 on the test, make an x above 99 on the graph. For each score of 98, place an x above 98, and so on. Stack each x for a particular score on top of the previous x. When the tops of the columns of x’s are connected, you have what statisticians call a normal distribution, or bell curve. In any large population, a small percentage of the people will score very high, a slightly larger percentage will score quite high, and still more will score high; most will score in the middle range; fewer will score below average, fewer still well below average, and fewer yet poorly; a small percentage (about equal to the group that scored highest) will score very poorly.
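The charting step described above can be sketched in a few lines of Python. This is only an illustration, not the book's Figure 5-2: the function name `x_chart`, the grouping of scores into bands of ten, and the sample scores are all assumptions made for the example, and the chart is turned on its side (one row per band rather than one column per score) so that it prints as plain text.

```python
from collections import Counter

def x_chart(scores, band=10):
    """Return one text row per score band, with one x per trainee in that band.

    Scores are assumed to be whole numbers from 1 to 100. The band width is an
    assumption for readability; use band=1 to chart every score separately.
    """
    # Map each score to its band index: 1-10 -> 0, 11-20 -> 1, and so on.
    counts = Counter((s - 1) // band for s in scores)
    rows = []
    for b in sorted(counts, reverse=True):  # highest band first
        low, high = b * band + 1, (b + 1) * band
        rows.append(f"{low:3d}-{high:3d} | " + "x" * counts[b])
    return rows

# Hypothetical raw scores for one group of trainees.
sample_scores = [52, 61, 64, 68, 70, 71, 73, 75, 75, 77, 78, 80, 82, 85, 88, 93]
for row in x_chart(sample_scores):
    print(row)
```

With these made-up scores, the longest row falls in the middle band and the rows shorten toward both ends, which is the rough bell shape the paragraph describes.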
Remember, you are not being graded yourself. No one will see these results but you. You need only use a rule-of-thumb measurement to keep on track and establish consistency. It is possible, even desirable if you are a statistician, to make a detailed analysis that incorporates means, norms, standard deviations, corrections for possible errors, and so forth. (See Chapter 4 for a detailed description of how to do this analysis.) But it is not necessary. I have informally tracked hundreds of exams over the years and found them as reliable as those tested by sophisticated analyses. If you like statistics and work for a boss who thrives on them, compile the information. If you don’t, and your boss cares only that the results be accurate, you will find my simple system more than adequate.
Record the results each time you give the test. Once you have a dozen or so instances, chart them on a master bell curve. Each time you give the exam, compare the results with your master curve. The same test should get approximately the same spread of scores, forming the same basic curve. If so, you have a sufficiently reliable test.
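For trainers who keep their scores in a spreadsheet or script, the comparison with the master curve can be roughed out as follows. This is a minimal sketch under stated assumptions, not the author's method: it reduces "the same basic curve" to two numbers (the average score and the spread), and the function name `looks_consistent` and the five-point tolerances are hypothetical choices that you would set by your own judgment.

```python
from statistics import mean, pstdev

def looks_consistent(master_scores, new_scores, mean_tol=5.0, spread_tol=5.0):
    """Rule-of-thumb check that a new group's scores resemble the master curve.

    master_scores: raw scores pooled from earlier administrations of the test.
    new_scores: raw scores from the latest administration.
    The tolerances are assumed values, not figures from the handbook.
    """
    mean_gap = abs(mean(master_scores) - mean(new_scores))
    spread_gap = abs(pstdev(master_scores) - pstdev(new_scores))
    return mean_gap <= mean_tol and spread_gap <= spread_tol

# Hypothetical data: a pooled master record and two later groups.
master = [60, 70, 70, 75, 75, 75, 80, 80, 90]
similar_group = [62, 68, 72, 74, 76, 78, 80, 82, 88]
low_group = [s - 20 for s in master]  # same spread, much lower average

print(looks_consistent(master, similar_group))
print(looks_consistent(master, low_group))
```

A result of `False` does not say what went wrong; it only flags that the latest group's scores have drifted from the record and deserve a closer look.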
If you fail to get consistent results, either adjust your teaching or change the exam. A single aberrant score for an otherwise reliable test indicates either an exceptional group or a change in content emphasis. Such a score can act as a signal to you that you are changing your emphasis in the training and need to get back to basics in order to maintain your consistency. If you have a number of trainers teaching the same class material, such a score can indicate the need for one or more of your trainers to get back on track. On the other hand, if the new emphasis is desired and intentional, you will need to change the test to accommodate the new thrust. To fine-tune a test that scores outside the normal range you expect, make the questions more difficult if you want to lower the scores and easier if you want to raise them.
Reliability refers to whether you can depend on test results to accurately reflect performance. Validity, on the other hand, indicates whether what you are testing is directly related to the material you have taught. For example, standard intelligence quotient (IQ) tests are among the most reliable ever devised. Thousands upon thousands of people have scored in classic distribution curves, and each score is ranked in relation to other scores. However, no one has yet proved that an IQ test is a valid indication of intelligence. There is only marginal evidence that a high IQ score indicates a tendency to earn high grades in middle-class American schools. This is why the U.S. Equal Employment Opportunity Commission has long regarded such tests with suspicion and requires prospective employees to be tested directly and legitimately for the job for which they are applying.
Validity establishes what is being tested. The process of validating a test can be complex and time-consuming, or relatively easy. All you really need is what is called surface validity, meaning that the questions you ask are directly related to the material you have taught.
To establish acceptable validity for a test, make each question relate to one of your written objectives. The best way I know to accomplish this is to write the test before you create the lesson plan. Set your objectives, create a test that challenges learning for each of them, then flesh out the lesson plan so that you are teaching to pass the test. Academics have always frowned on this practice, but their goals are different. An academic test is designed, among other things, to separate the best from the less than best, to grade responses on a scale. By contrast, the purpose of training tests is to let learners and instructors know how well they are doing. In training, ideally everyone should get 100 percent. That is your goal, so teach to fulfill the rigors of the test.
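If you keep your objectives and questions in a list, the link between them can be checked mechanically. The sketch below is one possible way to do that, assuming each question is recorded alongside the single objective it is meant to test; the function name `check_surface_validity` and the sample objectives and questions are invented for illustration.

```python
def check_surface_validity(objectives, question_to_objective):
    """Report questions with no written objective and objectives with no question.

    objectives: list of written objectives (as short strings).
    question_to_objective: dict mapping a question label to the objective it tests.
    Returns (unlinked_questions, untested_objectives).
    """
    known = set(objectives)
    # Questions that point at something other than a written objective.
    unlinked = sorted(q for q, obj in question_to_objective.items() if obj not in known)
    # Objectives that no question on the test addresses.
    covered = set(question_to_objective.values())
    untested = [obj for obj in objectives if obj not in covered]
    return unlinked, untested

# Hypothetical objectives and test questions.
objectives = ["define reliability", "define validity", "chart raw scores"]
questions = {
    "Q1": "define reliability",
    "Q2": "chart raw scores",
    "Q3": "explain the history of IQ testing",
}
unlinked, untested = check_surface_validity(objectives, questions)
print("Questions without an objective:", unlinked)
print("Objectives without a question:", untested)
```

Two empty lists mean every question traces back to an objective and every objective is tested at least once; it says nothing about whether the questions themselves are fair or well worded.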
If you are teaching from a manual, write each test question so that it relates to a specific statement in that manual. If you are working with a particularly detailed manual, include at the end of each question the page number where the answer can be found. This learning and study aid facilitates self-correction.
Be prepared to accept discussion from trainees (in fact, you might want to solicit it) on how fair or difficult your test was. Another approach is to run the test by several area experts or other trainers for their feedback.
Maintaining both reliability and validity records will provide you with (1) hard data on the effectiveness of your training, and (2) documented evidence of your fairness in evaluating trainees, should a dispute or EEO lawsuit arise.



© 1998, 1993, 1987 AMACOM, a division of
American Management Association, New York.
All rights reserved.
Published by AMACOM Books
https://www.amacombooks.org
Division of American Management Association
1601 Broadway,
New York, NY 10019
Customer Service: 1-800-262-9699