Essay 05

What is this system actually measuring?

The evaluation gap in higher education’s AI moment.

Chaitanya Ramineni, PhD | May 31, 2026 | 8 min read

The question that gets skipped

There is a gap I keep noticing. Universities have become literate in two questions about AI: should we use it, and what are the rules for using it. Those are the questions a policy answers and a committee debates, and they are necessary. But they are not the question that determines whether a given AI system (the deployed model plus the workflow it is sitting inside) is doing its job. That question is narrower and harder: does this specific system do what we claim it does?

It is an easy question to skip. A tool gets adopted because it is plausible, because a vendor demonstrated it well, because a respected peer institution uses it, because a pilot felt successful. None of those is evidence that the system measures or predicts what it claims to. Adoption and policy have outrun evaluation. We have built the scaffolding for governing AI and left out a load-bearing pillar.

What seven years of scoring engines taught me

I spent seven years at the Educational Testing Service evaluating AI-driven scoring systems, the engines that score essays and spoken responses on large-scale assessments. That work taught me something I have never been able to un-see, and it is the reason this gap worries me.

When you build an automated scoring model, the obvious way to judge it is agreement: how often does the machine’s score match a trained human rater’s score? It is a clean number, and it is reassuring. It is also not sufficient. A model can agree with human raters at a high rate and still be measuring only part of what it claims to measure. It can learn that longer essays tend to score higher, and quietly come to lean on length. It can rest on vocabulary, sentence count, surface fluency: features that correlate with writing quality without constituting it. The scores look right. The agreement statistics look right. And underneath, the system is measuring something narrower than its label suggests.

The discipline of measurement science exists, in large part, to catch exactly that. The question it trains you to ask is the one I have carried into every kind of data work since: what is this system actually measuring, and does that match what we say it measures? Not whether the output looks plausible, but whether the thing being measured is the thing we intended. An automated scoring engine that earns its agreement primarily through length is not measuring writing quality directly. It is measuring a feature that correlates with writing quality. The difference looks small in aggregate. It is decisive for the writers the correlation doesn’t hold for.
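A minimal sketch of that check, on invented numbers, may make the distinction concrete. It compares raw machine-human agreement with the agreement that remains once essay length is held constant (a partial correlation). The eight essays, their scores, and the variable names are hypothetical, and plain correlation stands in for whatever agreement statistic a scoring program actually reports.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_corr(x, y, control):
    """Correlation of x and y after removing the linear effect of control."""
    r_xy = pearson(x, y)
    r_xc = pearson(x, control)
    r_yc = pearson(y, control)
    return (r_xy - r_xc * r_yc) / sqrt((1 - r_xc ** 2) * (1 - r_yc ** 2))

# Hypothetical data: eight essays, invented for illustration only.
word_count = [150, 220, 300, 380, 450, 520, 600, 680]
human_score = [2, 2, 3, 3, 4, 3, 5, 4]
machine_score = [1, 2, 3, 3, 4, 4, 5, 5]

# The reassuring number: how closely the machine tracks the human rater.
agreement = pearson(machine_score, human_score)

# The uncomfortable number: how closely the machine tracks essay length.
length_dependence = pearson(machine_score, word_count)

# Agreement that survives once length is held constant.
agreement_beyond_length = partial_corr(machine_score, human_score, word_count)

print(f"machine vs human:              {agreement:.2f}")
print(f"machine vs word count:         {length_dependence:.2f}")
print(f"machine vs human, length held: {agreement_beyond_length:.2f}")
```

On these made-up numbers the machine tracks word count more closely than it tracks the human rater, and its agreement with the rater drops once length is controlled for. A real analysis would use far more essays and a richer set of candidate proxy features than length alone.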

The harder question

The methodological alternative is older than machine learning, and it is what measurement science was built on. A test like GRE Writing is not, at its core, asking whether an AI can match a human rater on a 30-minute timed essay. The GRE Writing test asks students to write two essays under timed pressure, an Issue essay and an Argument essay. It is making a claim about the relationship between performance on those timed tasks and performance on something quite different in shape: the longer, drafted-and-revised writing students produce over weeks in a first-year graduate course. Two different formats. Different rubrics. Different human evaluators. Two ways of capturing the same underlying writing construct, with the test asserting a relationship between them.

Validating an AI scoring engine against that relationship is a different question than validating it against immediate rater agreement on the timed essay itself. The relationship question is whether the AI’s score on the timed essay predicts instructor evaluations of the student’s actual coursework writing. Both questions involve human judgment. The difference is where it sits: at the immediate output, where the human is the rater the AI is trained to match, or at the downstream construct expression, where the human is the instructor evaluating what the test was built to predict. The first is reliability, often pursued because it is faster and cheaper. The second is validity, and it is what the test claims to do in the first place. The AI-scoring conversation has mostly been running on the easier question.
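The two validation targets can be put side by side in a short sketch. The data below is invented: eight hypothetical students, each with a machine score and a rater score on the timed essay, plus an instructor grade on later coursework writing. The names and scales are assumptions made for illustration, not a description of any real testing program.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data for eight students, invented for illustration only.
machine_timed = [2, 3, 3, 4, 4, 5, 5, 6]  # AI score on the timed essay
rater_timed = [2, 3, 4, 4, 4, 5, 6, 6]    # trained rater, same timed essay
coursework = [3.0, 2.5, 3.5, 3.0, 4.0, 3.0, 3.5, 4.0]  # instructor grade, later writing

# Easier question: does the machine match the rater on the same essay?
rater_agreement = pearson(machine_timed, rater_timed)

# Harder question: does the machine's score predict the downstream writing?
criterion_validity = pearson(machine_timed, coursework)

print(f"agreement with rater on timed essay: {rater_agreement:.2f}")
print(f"prediction of coursework writing:    {criterion_validity:.2f}")
```

In this toy case the first number is high and the second is only moderate, which is the pattern the argument warns about: an engine can match its raters closely and still say much less about the writing the test was built to predict.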

There is a second reason it has stayed there, and it is honest to name. The harder validity work was historically expensive. It required gathering downstream outcomes, running instructor evaluations of subsequent coursework, and tracking students longitudinally. Reliability against human raters was what could be done at scale, so the economics favored the easier question. Those economics have changed. The same developments that made faster scoring possible, the cheap compute, cheap storage, and cheap data integration of the last decade, have also lowered the cost of asking the harder question. The validity work that was once prohibitively expensive is newly affordable. The methodology was built for an older cost structure. The cost structure has moved. The methodology hasn’t.

This is not an argument against keeping humans in the loop. It is an argument against confusing two different roles humans play in that loop. Humans as decision-makers are the people who act on a score, who decide what an early-alert flag means in a specific student’s life, who weigh the AI’s output against the rest of what they know. They should stay, and should stay clearly in charge. Humans as the immediate-output validation target, the rater the AI is trained to match, are a different matter. That rater is always a proxy for the construct, not the construct itself. Validating against the downstream criterion still involves human judgment, but it is a human judgment anchored at what the test is built to predict, not at the score itself. Keep humans deciding. Anchor the validation at the prediction target (the coursework), not at the score.

Every system makes a claim

Every AI system a university adopts carries a label of its own, a claim about what it measures, and most of those claims are never written down. An early-alert model claims to identify students at academic risk. An advising assistant claims to surface the guidance a student needs. An admissions-support tool claims to predict yield, or fit, or success. A staff-facing assistant claims to produce work accurate enough to act on. Each is a statement about an intended outcome. And each can be wrong in the specific, quiet way an automated scoring engine can be wrong, tracking a surface signal while missing the substance, because the claim was implicit and no one was assigned to check it.

The early-alert model is the cleanest example. Built without care, it can learn that the strongest predictor of risk in the historical record is a demographic pattern, or a single missed assignment, or enrollment in one difficult course. It will flag students, and the flags will even be partly accurate. But a model that flags students by proxy is not measuring academic risk; it is measuring the proxy, and routing the institution’s attention and resources accordingly. No one set out to build that system. It is what results when a tool is adopted on plausibility and never asked the intended-outcome question.

Generative and agentic tools make the problem harder, not easier. A predictive model at least produces a score that can be tested against an outcome. A generative assistant produces fluent, confident prose whose quality is difficult to assess at a glance, and fluency is itself a proxy the human eye is inclined to reward. The 2026 enterprise-AI research is consistent on this point: only a small share of organizations report a mature model for governing autonomous AI agents, and the real constraints on scaling AI are rarely the technology itself. They are data quality, security, and the absence of evaluation discipline. The newer the system, the more easily plausibility substitutes for proof.

The discipline already exists

This is the missing discipline inside AI governance. EDUCAUSE’s 2026 priorities name the human edge of AI, and data analytics for institutional decision-making, among the issues that matter most. University technology leaders have been clear that the next phase of AI work is operational, moving from written policy to running practice. Evaluation is the part of that practice most easily skipped, because it is invisible when it is working and expensive to do well. It is also the part that decides whether everything else is real.

Applying the discipline does not mean slowing adoption, and it does not mean another layer of bureaucracy. It means a small set of hard questions, asked consistently: before a system is trusted, and periodically after. Is the system measuring the intended construct, or a proxy for it? When it is wrong, what happens downstream, and to whom? Does it perform consistently across the different groups of people it touches, or does its accuracy concentrate where the training data was richest? What human decision is the system meant to support, and does its output improve that decision? None of these questions is esoteric. They are the ordinary questions of measurement. A university that has an institutional research office and an assessment culture already employs people who know how to ask them. Those people have not yet been pointed at the AI systems moving into administrative use.
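The subgroup question in particular lends itself to a small check. The sketch below computes how often an early-alert flag matched what actually happened, overall and within each group. The records, group labels, and function name are hypothetical, and a real review would use larger samples and look at more than simple accuracy, for example missed students and false alarms separately.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only:
# (group label, model flagged the student?, student actually struggled?)
records = [
    ("group_a", True, True),
    ("group_a", False, False),
    ("group_a", True, True),
    ("group_a", False, False),
    ("group_a", False, True),
    ("group_b", True, False),
    ("group_b", False, True),
    ("group_b", True, True),
    ("group_b", False, True),
    ("group_b", False, False),
]

def accuracy_by_group(rows):
    """Share of students in each group whose flag matched the outcome."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, flagged, struggled in rows:
        total[group] += 1
        correct[group] += int(flagged == struggled)
    return {group: correct[group] / total[group] for group in total}

# The single headline number that a pilot report might quote.
overall = sum(int(flagged == struggled) for _, flagged, struggled in records) / len(records)

# The same number broken out by the groups the system touches.
accuracy = accuracy_by_group(records)

print(f"overall accuracy: {overall:.2f}")
for group, value in sorted(accuracy.items()):
    print(f"{group}: {value:.2f}")
```

The overall figure hides the gap here: the flags are right far more often for one group than for the other, which is the kind of uneven performance a single headline number cannot reveal.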

Seeing the student whole

There is a deeper version of the intended-outcome question, and in a university it is the one that matters most. When we ask what a system is actually measuring, we are often asking whether it sees a person whole. An early-alert model that optimizes a retention number is not the same as one that helps an institution understand and support a student. The first reduces the student to the outcome the institution wants to protect. The second treats the number as a signal that points back toward a person, one with a context, a trajectory, and reasons. Asked seriously, the intended-outcome question is a check against the quiet drift toward measuring students as proxies for the metrics we happen to collect. A university, of all institutions, should want its systems to see students whole. That is not a sentiment. It is an evaluation standard, it is answerable, and it is the standard worth holding AI to.

An old discipline, a new set of systems

The institutions that handle this moment well will not be the ones with the most AI, or the fastest adoption, or the longest policy. They will be the ones that can tell the difference between AI that works and AI that only looks like it works, and tell it on purpose, through a discipline, rather than discovering it after a system has been shaping decisions, unnoticed, for two years.

That discipline does not need to be invented. Higher education has spent decades building the science of measuring hard things well and holding the measurements accountable to what they claim. The same rigor that asks whether an essay score reflects writing or length can ask whether an early-alert flag reflects risk or a proxy for it. The question travels intact; it is the same question. Higher education’s AI moment does not need a new framework so much as it needs to turn an old and well-tested one toward a new set of systems, and to ask, of every system it adopts, the plain and demanding question: what is this actually measuring, and is that what we meant?

Written May 2026 for the Analytic Bytes Library. The argument draws on measurement-science practice and is intended to outlast specific AI products and platforms.


Questions, pushback, or a problem that looks like this one? Write to chai@analyticbytes.systems.