On the Quest for Good Measurement

by Fernanda Gándara
RM&E Director, Global Girls’ Education & Gender Equality
Room to Read

What do we mean by good educational measurement? As a psychometrician, this is one of the essential questions guiding my work. Of course, good measurement serves an intended purpose, yields consistent scores, and conforms to theory. But there is a lot more to it. Here, I offer three characteristics of good measurement that are often missed.

First, good measurement acknowledges consequences. Measurement does not happen in a vacuum, and the act of creating scores is not an innocent one. Every test leads to interpretations, uses and consequences, so validity must consider social values (Messick, 1980, 1989). We cannot consider tests as isolated artifacts, but rather as situated in a complex network of social relations. As a result, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) define validity as a property of the interpretations and uses of test scores, rather than of tests themselves. Assessing potential social consequences is a critical step in validation (Sireci & Benítez, 2023). Unfortunately, these ideas have not radically transformed practice. Some practitioners insist that validity is a property of a test, and others simply ignore the literature.

Test consequences are difficult to grasp. Tests provide information about who is considered “adequate enough” and who is not, potentially influencing self and society. A useful way to unpack this complexity is through decolonial theory. Decolonial theory questions the underlying assumptions, motivations and values which inform educational practices (Smith, 2021), providing a good framework to identify and judge test consequences: Why does this test exist? Whose viewpoints are favored in this test? Posing these questions may help uncover the depth and nature of tests’ consequences. Decolonial theory also reminds us that there is no unique way to do this. Riyad Shahjahan, Annabelle Estera and Kirsten Edwards describe the multifaceted nature of decolonial “doing”. Their reflections make me think that “decolonizing” psychometric practice looks like a messy non-linear dialogue with multiple stakeholders, in which we constantly redefine tests to best serve the society we strive for.

Second, good measurement considers context as construct-relevant. A construct is what the test is designed to measure. Psychometricians refer to construct-relevant and construct-irrelevant variance to distinguish variation in scores that is connected to the construct from variation that is driven by exogenous factors. The distinction matters, because test developers try to maximize the former and minimize the latter. The distinction matters even more when one examines the history of educational measurement, which has largely treated the sociocultural identities of minorities as “barriers of inferiority to be mitigated outside of the assessment design process” (Randall et al., 2022, p. 171). Fairness problems in measurement partly stem from glorifying a “purified construct representation” (Randall, 2021), which excludes epistemological diversity. To improve fairness, we should dismantle the illusion of context-free measurement (Randall, 2021).

Context needs to be included in the early stages of test development (Randall, 2021). You Yun describes ontological differences regarding selfhood and emotion between East and West. She implies that one cannot simply develop a socio-emotional learning (SEL) test that works across cultures. In psychometrics, the ability to use test scores across cultures is typically evaluated through cross-cultural equivalence. However, equivalence analyses are conducted after trans-adaptation of tools, which You Yun describes as insufficient to address more profound differences. Context cannot be an afterthought.

Third, good measurement focuses on the experiences of test takers. Practitioners must consider the multiple emotions involved in the process of taking a test. These emotions matter to the sense-making of the examinees who are supposedly benefiting from the test. As Audrey Bryan describes, understanding emotions may help unravel complex experiences and create new possibilities; this matters to teaching and measurement. An interesting conceptual framework to study examinees’ experiences is provided by Araneda (2022). The bottom line is that it is radically different to take a test that you care about, that engages you emotionally, and that speaks to your life than to take a generic assessment. To design tests that are emotionally evocative, we must involve examinees in their design.

These ideas are interconnected. For instance, the focus on emotions must consider context and consequences. Kirsi Yliniva and Audrey Bryan suggest that education’s neuro-affective turn could bring negative consequences. If one disregards power dynamics, focusing on emotions via biodata may lead to greater reliance on big tech providers, potentially changing the incentives around measurement. The authors also describe how current SEL frameworks may reward students who endure the complexities of life, rather than those who are trying to transform them; without critically questioning underlying ideologies, SEL assessments may not fulfill their promise of improving lives. While good measurement involves more than just focusing on consequences, context and experiences, these perspectives matter enormously to the quality of our work.

September 1, 2024