Validity and Reliability
A test is valid when it measures what it’s supposed to. How valid a test is depends on its purpose: a ruler may be a valid device for measuring length, but it isn’t valid for measuring volume. If a test is reliable, it yields consistent results. A test can be reliable without being valid, both reliable and valid, or neither. It cannot be valid without being reliable, because reliability is a prerequisite for measurement validity.
[Figure: three panels labeled “reliable, but not valid”, “not reliable, not valid”, and “reliable and valid”.]
Types of Measurement Validity:
Face validity: Does it appear to measure what it’s supposed to measure? Face validity is low when the researcher is disguising the purpose of the measure.
Content Validity: Is the full content of a concept’s definition included in the measure? A measure with content validity includes a broad sample of what is being tested, emphasizes important material, and requires appropriate skills. A conceptual definition can be thought of as the “space” that contains ideas and concepts.
Criterion Validity: Is the measure consistent with what we already know and what we expect? Two subcategories: predictive and concurrent.
Predictive validity: Predicts a known association between the construct you’re measuring and something else.
Concurrent validity: Associated with pre-existing indicators; something that already measures the same concept.
Construct Validity: Shows that the measure relates to a variety of other measures as specified in a theory. For example, if we’re using an Alcohol Abuse Inventory, even if there’s no way to measure “abuse” itself, we can predict that serious abuse correlates with health, family, and legal problems. Subcategory: discriminant validity
Discriminant Validity: Doesn’t associate with constructs that shouldn’t be related.
Note: Sometimes, construct and criterion validity seem to overlap. This isn’t a big deal. The important thing is that the comparison of scores on your measure works like expected in relation to the other measures.
Reliability
If a test is reliable, it yields consistent results.
Inter-observer: Testers or coders who rate the same information produce consistent results. A common rule of thumb uses the inter-observer reliability coefficient: if (total agreements) / (total observations) > .80, the data have inter-observer reliability.
Test-retest: A measure given at two different times, with no treatment in between, yields the same results.
Parallel-forms: Two tests of different forms that supposedly test the same material will give the same results.
Split-half reliability: If the items are divided in half (e.g., odd vs. even questions), the two halves give the same results.
For all forms of reliability, a quantitative measurement of reliability can be used, applied much like the inter-observer reliability coefficient. It should be .80 or higher. However, the coefficient can be lower for averages in a group because individual scores vary.
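As a minimal sketch of the two calculations described above, the Python below computes the agreement ratio for two raters and a split-half coefficient. The function names and the sample data are hypothetical, made up for illustration. Using a Pearson correlation between the odd-question and even-question totals as the split-half coefficient is an assumption; the text only says that a quantitative coefficient is used.

```python
from math import sqrt


def percent_agreement(rater_a, rater_b):
    """(Total agreements) / (Total observations) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no variation")
    return cov / sqrt(var_x * var_y)


def split_half_totals(item_scores):
    """Split each participant's items into odd and even questions and total each half."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # questions 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # questions 2, 4, 6, ...
    return odd_totals, even_totals


# Hypothetical codes from two observers rating the same ten events.
rater_a = ["on", "off", "on", "on", "off", "on", "on", "off", "on", "on"]
rater_b = ["on", "off", "on", "on", "off", "on", "on", "off", "on", "off"]
agreement = percent_agreement(rater_a, rater_b)
print(agreement, agreement > 0.80)  # 0.9 True

# Hypothetical right/wrong (1/0) scores: one row per participant, six questions.
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
odd_totals, even_totals = split_half_totals(item_scores)
split_half_r = pearson_r(odd_totals, even_totals)
print(round(split_half_r, 2), split_half_r >= 0.80)
```

The same `pearson_r` helper could be applied to test-retest or parallel-forms data by passing the two sets of scores for the same participants.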
Threats to internal validity
These are factors that can affect how valid your test is by itself.
History: Outside events occurring during the course of the experiment or between repeated measures of the dependent variable may have an influence on the results. This does not make the test itself any less accurate.
Maturation: Change due to aging or development, either between or within groups.
Instrumentation: The reliability of the instrument may change because of a shift in calibration (if using a measuring device) or a change in the human observer’s ability to measure differences (due to fatigue, experience, etc.).
Testing: The experience of taking the test influences the results. The experience can produce mental or physical changes: a participant’s attitude towards a topic may change because of a survey, which could affect results, or a participant’s physiological response to a test may change after repeated measures.
Statistical regression: Scores tend to regress towards the mean, which makes them higher or lower on a repeated measure. If a measure is not extremely reliable, there will be some variation between repeated measures, and extreme scores are more likely to move towards the mean than further towards the extremes.
Selection: The participants in groups may be unlike in some way, so they will respond in different ways to the independent variable. This is mostly a risk for quasi-experimental designs, in which non-random assignment is used.
Mortality: Participants drop out of the test, making the groups nonequivalent. It also matters who drops out and why. (Often it is the people who did most poorly on the test to begin with.)
Interaction: Two or more threats can interact. For example, a Selection-Maturation interaction: difference between ages of groups could cause groups to change at different rates. A group of young people may show more improvement in a test than a group of older people, but that could be because their brains are developing faster relative to their age.
Experimenter bias: Expectations of an outcome may inadvertently influence participants or cause the experimenter to view data in a different way.
Placebo Effect: Improvement due to expectation rather than the treatment itself; can occur when participants receive a treatment that they consider likely to be beneficial.
Hawthorne Effect: Members of the treatment group change on the dependent variable because participating in the study makes them feel special, so they act differently regardless of the treatment.
Contamination: The comparison group is in some way affected by, or affects, the treatment group, which leads the comparison group to increase its efforts. Also known as compensatory rivalry or the John Henry effect.
Threats can be compensated for by using a true experimental design.
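The statistical regression threat listed above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real test: every participant has a fixed "true" score, each administration adds independent random error, and no treatment is given between the two administrations. The function name, the score distribution, and all parameter values are hypothetical.

```python
import random


def simulate_regression_to_mean(n=2000, noise_sd=10.0, top_fraction=0.1, seed=0):
    """Return the top scorers' mean on test 1 and their mean on an untreated retest."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_scores = [rng.gauss(100, 10) for _ in range(n)]
    # Two administrations of an imperfectly reliable test: same true score, new error.
    test1 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    test2 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    # Select the participants with the most extreme (highest) first scores.
    group_size = max(1, int(n * top_fraction))
    top = sorted(range(n), key=lambda i: test1[i], reverse=True)[:group_size]
    mean_first = sum(test1[i] for i in top) / group_size
    mean_retest = sum(test2[i] for i in top) / group_size
    return mean_first, mean_retest


mean_first, mean_retest = simulate_regression_to_mean()
print(f"top group, first test: {mean_first:.1f}")
print(f"top group, retest:     {mean_retest:.1f}")
```

In this sketch the group selected for its extreme first scores scores closer to the overall mean on the retest even though nothing was done to it, which is why a change in an extreme group cannot be attributed to a treatment without a comparison group.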
Threats to external validity:
These are factors that can affect how well your results apply to the target population.
Can we generalize with confidence that this is true for the target population?
Selection bias: The sample is not demographically representative of the population.
Reactive effects of experimental arrangements: Results could be because of the experimental setting, not the treatment. Therefore, the results might not be true for the target population. This is something to consider when controlling confounding variables—there is always a tradeoff between control and external validity. When designing an experiment, you should always explicitly think of what is most important; this varies for every design.
Reactive effects of testing / pretest sensitization: The sample gets a pre-test to establish a baseline of behavior, but the target population doesn’t, so it might respond differently to the treatment.
Multiple treatment interference: When participants receive more than one treatment, an earlier treatment can change how they respond to a later one. Even if the second treatment is effective, it might only be because of the interaction with the first treatment. This can be accounted for by using a Latin square, where all the groups get each treatment, but in different orders.
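One simple way to build the treatment orders for a Latin square is to rotate the treatment list by one position for each group. The sketch below assumes that cyclic construction; the function names and the treatment labels are hypothetical.

```python
def latin_square_orders(treatments):
    """Build one treatment order per group by cyclically rotating the list."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(orders):
    """Check that each treatment appears once per group and once per position."""
    expected = set(orders[0])
    rows_ok = all(len(row) == len(expected) and set(row) == expected for row in orders)
    cols_ok = all(set(col) == expected for col in zip(*orders))
    return rows_ok and cols_ok


orders = latin_square_orders(["A", "B", "C"])
for group_number, order in enumerate(orders, start=1):
    print(f"Group {group_number}: {' -> '.join(order)}")
# Group 1: A -> B -> C
# Group 2: B -> C -> A
# Group 3: C -> A -> B
print(is_latin_square(orders))  # True
```

Every group receives every treatment, and each treatment appears once in each position. Note that in this cyclic layout a given treatment is always preceded by the same treatment (B always follows A here), so it balances position but not which treatment comes immediately before.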




