Texas’ annual reading test adjusted its difficulty every year, masking whether students are improving

Texas children’s performance on an annual reading test was basically flat from 2012 to 2021, even as the state spent billions of additional dollars on K-12 education.

I recently did a peer-reviewed deep dive into the test’s design documentation to figure out why the reported results weren’t showing improvement. I found the flat scores were at least in part by design. According to policies buried in the documentation, the agency administering the tests adjusted their difficulty level every year. As a result, roughly the same share of students failed the test over that decade, no matter how much better they objectively performed than students in previous years.

From 2008 to 2014, I was a bilingual teacher in Texas. Most of my students’ families hailed from Mexico and Central America and were learning English as a new language. I loved seeing my students’ progress.

Yet, no matter how much they learned, many failed the end-of-year tests in reading, writing and math. My hunch was that these tests were unfair, but I could not explain why. This, among other things, prompted me to pursue a Ph.D. in education to better understand large-scale educational assessment.

Ten years later, in 2024, I completed a detailed exploration of Texas’ exam, currently known as the State of Texas Assessments of Academic Readiness, or STAAR. I found an unexpected trend: The share of students who correctly answered each test question was extraordinarily steady across years. Where we would expect to see fluctuation from year to year, performance instead appears artificially flat.

The STAAR’s technical documents reveal that the test is designed much like a norm-referenced test – that is, one that assesses students relative to their peers rather than against a fixed standard. In other words, a norm-referenced test cannot tell us whether students meet the fixed criteria or grade-level standards set by the state.

In addition, norm-referenced tests are designed so that a certain share of students always fail, because success is gauged by a student’s position on the “bell curve” relative to other test-takers. Following this logic, STAAR developers use practices such as omitting easier questions and adjusting scores in ways that cancel out gains from better teaching.
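To see how this plays out, consider a simplified illustration. The short Python sketch below is not STAAR’s actual scoring procedure, and every number in it is made up – the average scores, the spread and the assumption that the bottom 25% fail are all hypothetical. It simply shows how resetting a passing cut score to the same percentile of each year’s results keeps the failure rate flat, even as raw performance rises.

# A minimal sketch, not STAAR's actual scoring code. All figures are
# invented for illustration: raw scores improve year over year, but the
# norm-referenced cut score is reset to the same percentile each year.
import numpy as np

rng = np.random.default_rng(seed=0)
fail_percentile = 25  # hypothetical: the bottom 25% fail every year

for year, mean_raw_score in [(2012, 60), (2016, 65), (2021, 70)]:
    scores = rng.normal(loc=mean_raw_score, scale=10, size=10_000)

    # Criterion-referenced view: a fixed standard (say, 62 points) shows gains.
    fixed_cut = 62
    pass_rate_fixed = (scores >= fixed_cut).mean()

    # Norm-referenced view: the cut score moves with the cohort, so roughly
    # the same share of students fails no matter how much scores improve.
    moving_cut = np.percentile(scores, fail_percentile)
    pass_rate_norm = (scores >= moving_cut).mean()

    print(f"{year}: fixed-standard pass rate {pass_rate_fixed:.0%}, "
          f"norm-referenced pass rate {pass_rate_norm:.0%}")

Under these made-up numbers, the fixed-standard pass rate climbs from roughly 40% to nearly 80% as raw scores improve, while the norm-referenced pass rate stays pinned at about 75% every year.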

Ultimately, the STAAR tests over this time frame – taken by students every year from grades 3 through 8 in language arts and math, and less frequently in science and social studies – were not designed to show improvement. Because the test is designed to keep scores flat, it is impossible to know for sure whether the lack of expected learning gains following big increases in per-student spending occurred because the extra funds failed to improve teaching and learning, or simply because the test hid the improvements.

Why it matters

Ever since the federal education policy known as No Child Left Behind went into effect in 2002 and tied students’ test performance to rewards and sanctions for schools, achievement testing has been a primary driver of public education in the United States.

Texas’ educational accountability system has been in place since 1980, and it is well known in the state that the stakes and difficulty of Texas’ academic readiness tests increase with each new version, which typically comes out every five to 10 years. What the Texas public may not know is that the tests have been adjusted each and every year – at the expense of really knowing who should “pass” or “fail.”

The test’s design affects not just students but also schools and communities. High-stakes test scores determine school resources, the state’s takeover of school districts and accreditation of teacher education programs. Home values are even driven by local schools’ performance on high-stakes tests.

Students who are marginalized by racism, poverty or language have historically tended to underperform on standardized tests. STAAR’s design makes this problem worse.

What still isn’t known

I plan to investigate if other states or the federal government use similarly designed tests to evaluate students.

My deep dive into Texas’ test focused on STAAR before its 2022 redevelopment. The latest iteration has changed the test format and question types, but there appears to be little change to the way the test is scored. Without substantive revisions to the scoring calculations “under the hood” of the STAAR test, it is likely Texas will continue to see flat performance.

The Texas Education Agency, which administers the STAAR tests, didn’t respond to a request for comment.

The Research Brief is a short take on interesting academic work.