Science vs. Purpose: The College Board’s New Adversity Index
Recently, the College Board announced it was piloting a new Environmental Context Dashboard (referred to in the media as an “Adversity Index”), which tries to quantify the social challenges faced by students applying to college. As described here, the index uses publicly available data about a student’s neighborhood and high school, such as median income, unemployment levels, and crime rates, to generate a score between 0 and 100, with higher numbers indicating greater adversity faced by the students it indexes.
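The College Board has not published the formula behind the dashboard, but a minimal sketch can make the idea concrete. The indicator names, value ranges, min-max normalization, and equal weighting below are all assumptions for illustration, not a description of the actual method:

```python
# Hypothetical sketch of a composite adversity score. Indicator names,
# ranges, normalization, and equal weighting are illustrative assumptions;
# the College Board has not published its actual method.

# Each hypothetical indicator maps to the (min, max) range used to
# normalize its raw value into [0, 1].
INDICATORS = {
    "unemployment_rate": (0.0, 0.25),
    "crime_rate_per_1k": (0.0, 120.0),
    "median_income": (15_000, 150_000),  # inverted below: lower income -> more adversity
}

def normalize(value, lo, hi):
    """Min-max scale a raw value into [0, 1], clamping outliers."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def adversity_index(profile):
    """Average the normalized indicators and scale the result to 0-100."""
    scores = []
    for name, (lo, hi) in INDICATORS.items():
        s = normalize(profile[name], lo, hi)
        if name == "median_income":
            s = 1.0 - s  # higher income should mean lower adversity
        scores.append(s)
    return round(100 * sum(scores) / len(scores))

print(adversity_index({
    "unemployment_rate": 0.08,
    "crime_rate_per_1k": 45.0,
    "median_income": 38_000,
}))  # -> 51 on the 0-100 scale
```

Even this toy version makes visible the design choices (which indicators, what ranges, what weights) that any such index must make, choices the final number itself conceals.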
Debate over the move has polarized around issues of fairness (why not give college admissions officers data they can use to contextualize grades and test scores?) and autonomy (who is the College Board to lump me and everyone else in my neighborhood and school into a cohort summarized by a single number?). Like arguments over the role of standardized testing, discussion of the Adversity Index is muddied by conflating the scientific nature of assessment with the social purposes assessments are supposed to serve.
Professional test developers, whether creating tests of student academic achievement, college entrance exams (like the SAT and ACT), or exams for certification and licensure, use a variety of design processes and statistical methods to provide evidence that these instruments measure what they purport to measure. The science of test design and validation has evolved considerably over the last hundred years, and our education system would benefit if more educators (especially teachers) were familiar with those well-honed techniques and could use them to create better, more effective assessments.
Claims over what is being measured by a standardized assessment (referred to as a test’s construct) are where claims of scientific rigor can distort understanding of the social goals associated with a particular test. For example, tests of well-defined academic standards (like Common Core assessments or tests created by states to measure their own language and math standards) have uncontroversial constructs: attainment of the knowledge, skills, and abilities specified in those standards. One can argue over whether those standards correctly define what it means to be proficient in a subject, or challenge the priority our educational system places on standardized testing generally. But such legitimate arguments do not require challenging whether a professionally developed test designed to measure mastery of specific content actually does so.
Things begin to get muddier once construct claims broaden, especially into complex aspects of the human condition. For example, college entrance exams like the SAT and ACT primarily measure language and math skills, yet their developers claim that measuring those skills reliably predicts future college success. They have decades of research to support this correlation, but any number of factors beyond the strength of the test itself could produce it. For example, there might be a third factor (such as family income or parents’ education level), or a combination of factors, that correlates directly with both SAT/ACT performance and college success, meaning college entrance exams are actually measuring this common variable. Such an argument would seem to support the College Board’s new Adversity Index, albeit at the expense of claims of objectivity in college entrance exams generally.
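A minimal simulation shows how such confounding can work (the variable names, effect sizes, and the assumption that the test has no direct causal link to college success are all illustrative, not empirical claims):

```python
# Illustrative confounding: a shared background factor drives both test
# scores and college outcomes, producing a strong correlation between
# them even though neither causes the other. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

background = rng.normal(size=n)  # e.g. family income / parental education
sat_score = background + rng.normal(scale=0.5, size=n)    # driven by background
college_gpa = background + rng.normal(scale=0.5, size=n)  # also driven by background

# The test "predicts" college success, purely through the common cause.
print(np.corrcoef(sat_score, college_gpa)[0, 1])  # roughly 0.8
```

Predictive validity, in other words, is consistent with the test being a good measure of the intended skills, a proxy for background advantage, or some mixture of the two; correlation alone cannot separate these stories.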
This is where the social purpose of assessment must be considered, especially since such purposes create pragmatic reasons for accepting quantitative measures built to evaluate constructs (like college success) that can never be determined with certainty. For instance, the SAT, and later the ACT, came into existence to give a growing number of colleges and universities, accepting growing numbers of students from diverse backgrounds, some standardized measure to weigh alongside subjective data such as grades (with grades themselves only standardizing around the A-F marking system relatively recently). Given the importance of that goal, perfection in measurement (whatever that means) is not required, merely acceptance that one imperfect measure (standardized test scores), taken alongside other imperfect ones (like grades), provides human actors (like college admissions officers) the data they need to make reasonable and fair choices most of the time.
Trouble starts, however, when the scientific rigor of the test development process is used to strengthen claims about the measurement and predictive power of an assessment beyond what is warranted.
Supporters of standardized tests understandably want to make strong claims about their accuracy and value, especially since weaker or more qualified claims could harm acceptance of particular tests in the marketplace. But highlighting the virtue of an assessment, without discussing why the social goal behind it might justify its limitations, can lead to unintended and potentially damaging consequences.
An important example of this phenomenon is the trajectory of the Tests of General Educational Development, or GED. University of Maryland historian Ethan Hutt, one of the authors of the previously linked history of the A-F grading system, describes how in 1942 (a year after America entered World War II and three years before the war ended) a group of educators gathered in Baltimore to try to solve anticipated post-war demobilization problems. Economic recession and political upheaval had accompanied the disbanding of the large US standing army after World War I (just as large-scale demobilizations have been sources of unrest throughout history), and those educators sought to avoid a repeat of that pattern by giving veterans maximum access to higher education.
This foresight culminated in landmark legislation like the G.I. Bill, which not only headed off post-war economic and social unrest but transformed American education, and America, for the better. For veterans to enter college, however, they would have to complete high school, and those planners in Baltimore recognized that asking 20-something combat vets to return to their seats in high-school classrooms (or enter them for the first time) to finish the work needed to earn a diploma was unrealistic, and potentially unnecessary given everything soldiers might have learned during their years of military service.
One part of the solution was to assign credit for specific military training programs, as well as for the academic correspondence courses taken by over a million soldiers throughout the war. But a more lasting reform was the creation of a new assessment—the GED—that would not measure specific high-school content (which varied enormously from state to state) but would, as its creator described, “provide a measure of a general educational development which results from . . . all of the possibilities for informal self-education which military service involves, as well as the general educational growth incidental to military training and experience as such.”
Hutt refers to this as “contextless” assessment, built on a radical construct: that, regardless of any particular high-school curriculum or life experience, there exists a set of essential elements that add up to being a high-school graduate, elements that can be measured through an instrument like the GED.
Had the GED been accepted as a one-time expedient to benefit those who had sacrificed so much during the war, its impact on American education might have been limited. But because the rhetoric surrounding the test claimed that measuring these essential, contextless elements was as good as, if not better than, meeting the graduation requirements of a particular school system, it was only a matter of time before the GED was used to give non-veterans the benefit of a high-school diploma without completing high school. One can track the spread of the test’s radical construct as more and more states accepted the GED and the minimum age to take it dropped from 22 to 17.
If one accepts that certain contextless characteristics are the genuine outcome of schooling, why not remake the curriculum to develop those characteristics, rather than teach specific subject matter? As Hutt points out, a similar dynamic has played out as international tests like PISA (used to rank nations by educational achievement) rely on a similar, this time global, contextless construct to generate results that have inspired nations to transform their educational systems to compete more effectively internationally (with success measured by growth in PISA scores).
Like the SAT and the GED, PISA started with worthy, if humbler, goals; globalization has since imbued the test with new meaning and influence. Might something similar happen if a measure of personal adversity gains “cash value” in the educational marketplace?
The College Board has been careful to draw its data from public sources and to explain to stakeholders how it developed the new Environmental Context Dashboard and how its results will be communicated. But what is to prevent others from constructing profiles based on personal information at the individual level, or students from putting together their own profiles, especially if doing so generates an even more accurate version of something whose value has already been established: a number that defines the struggles you face in life?
Might such privacy trade-offs be worth it to achieve the vital purpose behind the Adversity Index: educational equity? And does that noble goal justify other choices, such as supplementing test scores (which individual students have a role in generating) with a numerical value created not by the individual being measured but by an expert assigning him or her to a category?
Perhaps we should talk more about these issues before we start measuring.