These are crucial aspects of testing. There is a constant dynamic tension and balance between validity and reliability, but validity is perhaps the central quality. Reliability is a necessary but not sufficient condition for validity: valid tests are necessarily reliable, but reliable tests are not necessarily valid.
A test is said to be valid when it measures what it is supposed to measure and no other unintended abilities. There are several types of validity.
A test has content validity when its content is representative of the language and skills it is designed to test. For example, a test of grammar should test grammar, and not, for example, knowledge of specialist [or technical] vocabulary. The content should also reflect, in proportional terms, what is to be tested. This is linked closely to the issue of direct versus indirect testing. Clearly, in order to be able to say something about a candidate’s ability to use the target language, the tasks we set should reflect the candidate’s use of language under conditions that are as authentic as possible.
Criterion-Related Validity concerns the extent to which the results of the test correlate with results provided by other, independent measures.
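In practice, this correlation is usually quantified with a coefficient such as Pearson’s r. The sketch below is an illustration only [the candidates and scores are invented]: it compares results on a new test with results from an established independent measure of the same ability.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores for six candidates: the new test against an
# established independent measure of the same ability.
new_test  = [55, 62, 70, 48, 80, 66]
criterion = [58, 60, 73, 50, 78, 70]

r = pearson(new_test, criterion)  # a value near 1.0 suggests strong agreement
```

A correlation close to 1 would support a claim of criterion-related validity; a low correlation would suggest the two instruments are not measuring the same thing.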
Construct Validity refers to whether the underlying skill [the construct] that is being sampled in the test is actually representative of the ability we wish to test. This does not normally pose too much of a problem with direct testing, but severe difficulties may arise with indirect testing. Hughes [1989: 15] illustrates this problem with an example from the TOEFL writing test.
Identify which of the following is erroneous or inappropriate in formal standard English:
At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.
The candidate is expected to identify that my friend and I is incorrect in standard written English…
Since the TOEFL test is all multiple-choice [that is, pre-iBT], it is clearly not possible for writing to be tested directly. The theory of writing indicates that there are a number of sub-abilities involved in writing [such as control of punctuation, grammatical accuracy, accuracy of spelling, sensitivity to style, and so on]. We would then, as TOEFL does, construct items to measure these discrete sub-abilities. Our test would only have construct validity if we were able to state with confidence that the test was in fact testing whether a candidate could write well or not.
Weir [1990: 22] identifies construct validity as superordinate to all other types of validity. In a sense, it is the most important, because if we cannot be sure how our test relates to the target skill being measured, then we are wasting our time writing or giving the test.
Face Validity is concerned with whether or not a test looks like a proper test in the eyes of the users [the teachers and the students]. If it does, [if it is seen to look like a test] it will be treated with respect and cooperation. If it does not, the results may reflect the candidates’ lack of willingness to take the test seriously.
All the above types of validity contrast with what has been termed faith validity, or the intuitive belief by the test writer that the test is a good one.
It is clearly not possible for a test to be a proper measure if the results vary every time the test is administered. Therefore a test is said to be reliable if it measures consistently. Results that are not very reliable affect the degree of validity of a test. Reliability represents the degree of confidence that can be placed in the results of a test and the consistency with which a test gives the results expected.
For a test to be reliable, the results of the test should not vary [significantly] when:
- the test is marked by two different markers [inter-marker reliability]
- the test is marked by the same marker at different times [intra-marker reliability]
- the time when or the place where the test is administered are changed [test-retest reliability]
Reliability is an indication of the degree of dependability, consistency, or stability of scores on a test. It is the consistency with which a test measures a given variable. It concerns three main areas:
- The consistency of test results over time and space.
- The consistency of test results over a number of markers.
- The consistency of test results over a number of questions.
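The third kind of consistency [over a number of questions] is commonly estimated with an internal-consistency statistic such as Cronbach’s alpha. The sketch below, using invented item scores, shows the standard calculation: alpha compares the variance of each question’s scores with the variance of the candidates’ total scores.

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """Cronbach's alpha. item_scores holds one list per question,
    with the inner lists aligned by candidate."""
    k = len(item_scores)                      # number of questions
    n = len(item_scores[0])                   # number of candidates
    sum_item_vars = sum(variance(item) for item in item_scores)
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / variance(totals))

# Invented data: four questions, five candidates, each marked on a 1-5 scale.
items = [
    [3, 4, 5, 2, 5],
    [2, 4, 5, 3, 4],
    [3, 5, 4, 2, 5],
    [2, 3, 5, 2, 4],
]
alpha = cronbach_alpha(items)  # values near 1 indicate consistent questions
```

An alpha near 1 suggests the questions are measuring the same underlying ability consistently; a low alpha would point to questions that pull in different directions and so reduce reliability.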
Sources of unreliability
What factors might affect the reliability of a test? There are various ways in which a test may be made more reliable.
- What recommendations could be made to test designers and administrators in order for reliability to be maximized?
- What do you think is the relationship between reliability and validity?
- Are they mutually exclusive, or can the two concepts be reconciled in the same test, and if so, under what conditions?
A further consideration in test design is efficiency, or practicality. It is clearly pointless to design a well-principled test if the resources to administer it are lacking. For example, in some parts of Africa the common practice of giving listening comprehension tests on audio cassette recorders [ACRs] would not be feasible, as many schools lack ACRs and have, at best, an unreliable electricity supply. Other considerations apply to teacher time. How much information does the test provide about the candidates in relation to the amount of time spent writing, administering and processing the results of the test? In this respect, oral testing is notoriously problematic: the time and cost of organizing oral tests for large numbers of candidates is often more than many administrations are willing to bear.