THE A-Z OF TESTING

 

R

 

RANGE

 

The spread between the highest and lowest scores on a test.

 

RAW SCORE

 

The score on a test before it has been translated into a STANDARD SCORE. Two examples are:

 

  • on an ability test, where there are right and wrong answers, your raw score would simply be the number of questions you got right.
  • on a personality test, where there are no right or wrong answers, you may be asked to react to a statement by saying whether it’s

 

VERY LIKE YOU,

QUITE LIKE YOU,

NOT SURE,

QUITE UNLIKE YOU, or

VERY UNLIKE YOU.

 

Each of these responses will be given a numerical value and, at the end, these values will be added up, either for the whole test, or more likely for subscales within the test. This is your raw score.
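
As a rough illustration of that adding-up, here is a minimal sketch in Python. The numerical values (5 down to 1), the item names and the subscale groupings are all assumptions made for the example; a real test defines its own scoring key.

```python
# Hypothetical scoring key: the values 5..1 are an assumption, not a published key.
RESPONSE_VALUES = {
    "VERY LIKE YOU": 5,
    "QUITE LIKE YOU": 4,
    "NOT SURE": 3,
    "QUITE UNLIKE YOU": 2,
    "VERY UNLIKE YOU": 1,
}

# Hypothetical mapping of subscales to the items that feed them.
SUBSCALES = {
    "sociability": ["q1", "q2"],
    "empathy": ["q3", "q4"],
}


def raw_scores(answers):
    """Add up the numerical values of the responses for each subscale."""
    return {
        scale: sum(RESPONSE_VALUES[answers[item]] for item in items)
        for scale, items in SUBSCALES.items()
    }


# Invented answers from one test taker.
answers = {
    "q1": "VERY LIKE YOU",
    "q2": "QUITE LIKE YOU",
    "q3": "NOT SURE",
    "q4": "QUITE UNLIKE YOU",
}
scores = raw_scores(answers)
print(scores)
```

The resulting totals are raw scores only: they say nothing yet about how this person compares with anyone else.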

 

Obviously you can use raw scores to compare people’s performance on a measure. If you’re simply trying to find out whether the person knows the answers to the specific questions in the test, a mark of 10/10 is better than 9/10. But beyond this simple judgement, raw scores don’t tell you much.

 

STANDARD SCORES are used because we know what they mean, and a comparison between them tells you something useful about a person’s performance.

  

 

RELIABILITY

 

Do you get the same results if you measure the same thing in the same way two or more times?

 

If you do, then the measure is said to be reliable and is giving consistent scores. If not, then the measure is contaminated with error.

 

REPORTS

 

See “narratives”: the interpretation of test scores. Increasingly these are generated by rule bases in computer systems, but expert test users can adapt them to reflect their more immediate knowledge of individual test-takers.

 

S

 

SCALE

 

Tests are often made up of scales. Thus a general ability test might have separate scales for Numerical, Verbal, Abstract and Spatial Reasoning, on each of which any candidate can score high or low.

 

A personality test might present a huge number of questions. When scored, these load onto separate scales for different FACTORS such as AGGRESSION or OPEN-MINDEDNESS. In this case, the test user will not know which items are contributing to which scale.

 

SELF-REPORT

 

Self-report measures ask candidates to answer questions about themselves. They are thus open to lying or manipulation (though there are techniques for highlighting where this may have happened). They also reflect – obviously – the test taker’s view of themselves.

 

SPEED TEST

 

A timed test which has fairly easy items, so it looks at the number of questions attempted within the time limit.

 

SPLIT-HALF RELIABILITY

 

A technique used by test constructors to assess whether a measure is RELIABLE. The items from a test in development are split in half at random and scores on each half are correlated.
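
The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item marks are invented (1 = right, 0 = wrong), and the function names and the fixed random seed are choices made for the example rather than part of any published method.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_correlation(item_scores, seed=0):
    """item_scores: one list of item marks per person, items in the same order."""
    n_items = len(item_scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # split the items in half at random
    first, second = order[: n_items // 2], order[n_items // 2:]
    half_a = [sum(person[i] for i in first) for person in item_scores]
    half_b = [sum(person[i] for i in second) for person in item_scores]
    return pearson(half_a, half_b)  # correlate scores on the two halves


# Invented marks for six people on a six-item test.
data = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r = split_half_correlation(data)
print(round(r, 2))
```

A correlation close to 1 suggests the two halves are ranking people in much the same order; a low value suggests the items are not measuring consistently.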

 

STANDARD DEVIATION

 

A measure of how much spread there is in the scores of a test. This is an extremely important concept in psychometrics because it allows you to measure people’s characteristics in a way which is universal and which allows you to compare a person’s performance with the mean or average.

 

So, you’ll find statements like “X scored two standard deviations above the mean on this test”. This contains useful information which is not contained in the statement “X scored 5 out of 10 on this test”: it tells us how X’s performance compares with the average on the test and how well he did compared with other people who took it. In a personality test it tells us how strong a typical form of behaviour is – empathy or sociability, for instance.
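
To make the “two standard deviations” statement concrete, here is a small worked sketch in Python. The group scores are invented, and the choice of the population standard deviation (rather than the sample version) is an assumption made to keep the arithmetic tidy.

```python
from statistics import mean, pstdev

# Invented raw scores for a group of eight people.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

group_mean = mean(scores)   # the average of the group
group_sd = pstdev(scores)   # population standard deviation: the spread


def sds_from_mean(raw):
    """How many standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw - group_mean) / group_sd


print(group_mean, group_sd, sds_from_mean(9))
```

With these figures the mean is 5 and the standard deviation is 2, so a raw score of 9 sits two standard deviations above the mean, while a raw score of 5 is exactly average.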

 

STANDARD SCORE

 

We use the statistical techniques of psychometrics, including the normal distribution curve, to turn RAW SCORES into STANDARD SCORES such as Z SCORES, T SCORES and STENS.

 

We’re getting into some of the statistical areas of testing here and it isn’t possible in a short glossary to give a detailed explanation of these concepts. Certain types of training can give you this knowledge.

 

However, the reasons we use STANDARD scores go right to the heart of testing’s usefulness.

 

As we mentioned in the entry on RAW SCORE, if you just stick down a set of 10 questions and get lots of people to answer them, and they get different scores, you’re really none the wiser! Unless you know about the characteristics of the questions (how difficult they are; what they’re ACTUALLY measuring) you can’t really tell if 8 right answers are a lot better than 6; or if the difference between 4 and 6 is equivalent to the difference between 2 and 4; or, indeed, whether you’re measuring what you think you’re measuring.

 

With standard scores you can start making useful inferences; you know what differences in scores mean, what the “average” is and whether someone is above or below it. This is crucially important in comparing people with each other on a test.
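
For readers who want to see the arithmetic, the conversion from a raw score to a Z SCORE, T SCORE and STEN can be sketched as follows. The norm-group mean and standard deviation are invented figures, and the sketch uses the usual conventions (Z: mean 0, SD 1; T: mean 50, SD 10; sten: mean 5.5, SD 2, reported as a whole number from 1 to 10). It assumes the raw scores are roughly normally distributed.

```python
import math


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in standard deviations."""
    return (raw - norm_mean) / norm_sd


def t_score(z):
    """T score: mean 50, standard deviation 10."""
    return 50 + 10 * z


def sten(z):
    """Sten: mean 5.5, standard deviation 2, rounded and limited to 1..10."""
    value = math.floor(5.5 + 2 * z + 0.5)  # round to the nearest whole number
    return max(1, min(10, value))


# Invented norm group: mean raw score 20, standard deviation 4.
z = z_score(25, norm_mean=20, norm_sd=4)
t = t_score(z)
s = sten(z)
print(z, t, s)
```

Here a raw score of 25 becomes a z score of 1.25, a T score of 62.5 and a sten of 8: three ways of saying the same thing about where the person stands relative to the norm group.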

 

STANDARD ERROR OF MEASUREMENT

Let’s say a candidate took the same test repeatedly, with no change in his level of knowledge and preparation (this actually happens to people applying to a number of jobs and in the graduate milk round). Some of the resulting scores would be slightly higher or slightly lower than the score that precisely reflects the candidate’s actual level of knowledge and ability (as we’ve repeated, perhaps ad nauseam, ALL measurements contain error). The typical size of the difference between these hypothetical scores and the candidate’s actual score, expressed as the standard deviation of the repeated scores, is known as the standard error of measurement.

Again, this helps us understand how far we can rely on a particular score, avoid over-interpretation, and highlight those scores and differences that are significant in crucial decisions.
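
In practice the standard error of measurement is usually estimated from a test’s STANDARD DEVIATION and its RELIABILITY coefficient, using the classical formula SEM = SD × √(1 − reliability). The sketch below applies that formula; the figures (SD of 10, reliability of 0.91, observed score of 60) are invented, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sd, reliability, n_sems=1.0):
    """Return (low, high): the observed score plus or minus n_sems SEMs."""
    margin = n_sems * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin


# Invented figures: a test with SD 10 and reliability 0.91.
sem = standard_error_of_measurement(sd=10, reliability=0.91)
low, high = score_band(observed=60, sd=10, reliability=0.91)
print(round(sem, 2), round(low, 2), round(high, 2))
```

On these figures the SEM is about 3 points, so a score of 60 is better read as “somewhere around 57 to 63”, and a two-point gap between two candidates sits well within measurement error.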

 

STANDARDISATION

 

You will sometimes see particular tests referred to as STANDARDISED INSTRUMENTS.

 

STANDARDISATION is the process developers go through to ensure that a test is measuring what it says it’s measuring, that it’s measuring it accurately, and that they understand how it’s operating. It also defines the conditions under which the test should be used.

 

Comparing testing to a classic scientific experiment helps here. If a physicist or chemist does an experiment, he or she tries to standardise all the conditions under which the experiment takes place apart from the particular issue he or she is studying. The scientist doesn’t want the temperature of the room, the time of day, the make-up of the test tube or the different characteristics of measuring devices to add an area of variability into the findings.

 

So a test is standardised by defining how it should be administered. The User Manual will then give instructions to the test user as to how the test should be administered to replicate “laboratory conditions” and get an accurate score. That is why the administration instructions in a manual are important.

 

Norms are gathered in these conditions and scores are converted into STANDARD SCORES, whose characteristics we know.

 

STANINE

 

A STANDARD SCORE with a MEAN of 5 and a STANDARD DEVIATION of 2.
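
Because a stanine is reported as a whole number from 1 to 9, each stanine covers a band half a standard deviation wide, with stanine 5 centred on the mean. One way to sketch the lookup from a Z SCORE is shown below; it assumes roughly normally distributed scores, and the function name is hypothetical.

```python
from bisect import bisect_right

# Cut points in z-score units, half a standard deviation apart and centred on 0.
# Assumption: scores are roughly normally distributed.
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Map a z score to a stanine from 1 to 9 by counting the cut points at or below it."""
    return bisect_right(STANINE_CUTS, z) + 1


bands = [stanine(z) for z in (-2.0, -0.5, 0.0, 0.6, 2.5)]
print(bands)
```

An exactly average z score of 0 lands in stanine 5, and anything more than 1.75 standard deviations from the mean falls into the end bands, 1 or 9.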

 

STATES

 

See also MOODS as an example.

 

STATES, as opposed to TRAITS, are about feelings or reactions to a particular set of circumstances at a particular time. They tend to change, whereas TRAITS are underlying and reasonably stable typical feelings.

 

States obviously affect what we do: if we’re exhausted we make more mistakes and may become irritable.

 

Distinguishing STATES from TRAITS is an unacknowledged, sometimes unconscious management task. If someone is basically INTROVERTED – that’s part of their personality – it may have an impact on what sort of job they should be doing. If someone is avoiding company because of a particular situation – a relationship break-up, financial problems or the fact that their football team isn’t winning – you will manage them in a different way.

 

STENS

 

A type of STANDARD SCORE with a MEAN of 5.5 and a STANDARD DEVIATION of 2.