What is this system actually measuring?

In brief

By the start of 2026, most universities had done the visible work of putting AI on the institutional agenda. They had written policies on student and faculty use, stood up AI committees and working groups, and run pilots: assistants for student services, drafting tools for administrative staff, models that flag students who might be slipping. The scaffolding went up quickly, and under genuine pressure: contracting enrollment as the demographic cliff arrives, public skepticism about the return on a degree, tightened federal funding and tax conditions, and a sector that Deloitte’s 2026 higher education outlook describes as moving from a long period of growth into one of disciplined focus, with the business model under scrutiny and risk management demanding tighter coordination across offices that once operated in silos. AI arrived in the middle of all of it, as both another pressure and a promised relief.

The role of the technology executive has shifted with it. In Deloitte’s 2026 Global Technology Leadership Study, based on responses from more than 660 tech leaders, the large majority of CIOs described their primary job as implementing AI or evangelizing for it across the institution, a shift the report frames as moving from keeping the lights on to lighting the way forward. That shift is real and, on balance, healthy. But it carries a quiet cost: when the mandate becomes adoption, evaluation tends to be assumed rather than performed.

The question that gets skipped

There is a gap I keep noticing. Universities have become fluent in two questions about AI: should we use it, and what are the rules for using it. Those are the questions a policy answers and a committee debates, and they are necessary. But they are not the question that determines whether a given AI system (the deployed model plus the workflow it is sitting inside) is doing its job. That question is narrower and harder: does this specific system do what we claim it does?

It is an easy question to skip. A tool gets adopted because it is plausible, because a vendor demonstrated it well, because a respected peer institution uses it, because a pilot felt successful. None of those is evidence that the system measures or predicts what it purports to. Adoption and policy have outrun evaluation. We have built the scaffolding for governing AI and left out a load-bearing beam.

What seven years of scoring engines taught me

I spent seven years at the Educational Testing Service evaluating AI-driven scoring systems — the engines that score essays and spoken responses on large-scale assessments. That work taught me something I have never been able to un-see, and it is the reason this gap worries me.

When you build an automated scoring model, the obvious way to judge it is agreement: how often does the machine’s score match a trained human rater’s score? It is a clean number, and it is reassuring. It is also not sufficient. A model can agree with human raters at a high rate and still be measuring the wrong thing. It can learn that longer essays tend to score higher, and quietly reward length. It can lean on vocabulary, sentence count, surface fluency: features that correlate with quality without being quality. The scores look right. The agreement statistics look right. And underneath, the system is measuring something other than what its label claims.

The discipline of measurement science exists, in large part, to catch exactly that. The question it trains you to ask is the one I have carried into every kind of data work since: what is this system actually measuring, and does that match what we say it measures? Not whether the output looks plausible, but whether the thing being measured is the thing we intended. An automated scoring engine that earns its agreement by rewarding length is not a writing-quality measure. It is a length measure with a writing-quality label. The difference looks small in aggregate. It is decisive for the writers the correlation doesn’t hold for.

The validity gap: what the system claims to measure vs. what it actually measures. A system can perform beautifully against the proxy and still be wrong for the decision it was deployed to inform. Validity is what tells you when the two have drifted apart.
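The length problem can be made concrete with a toy sketch. Everything in it is invented for illustration: the essays, the 1-6 scale, and the `length_only_score` rule are assumptions, not taken from any real scoring engine or assessment. It shows only that an engine scoring on length alone can post high overall agreement with a rater while failing every strong writer who happens to write short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Essay:
    words: int        # essay length in words
    human_score: int  # trained rater's score on a hypothetical 1-6 scale
    concise: bool     # True for strong writers who write short


def length_only_score(words: int) -> int:
    """A deliberately naive 'engine': the score depends on length alone."""
    return max(1, min(6, words // 100))


def exact_agreement(essays) -> float:
    """Share of essays where the engine's score equals the rater's score."""
    hits = sum(length_only_score(e.words) == e.human_score for e in essays)
    return hits / len(essays)


# Made-up data: for ten essays length tracks quality; for two it does not.
ESSAYS = [
    Essay(120, 1, False), Essay(210, 2, False), Essay(230, 2, False),
    Essay(310, 3, False), Essay(330, 3, False), Essay(350, 3, False),
    Essay(420, 4, False), Essay(460, 4, False), Essay(520, 5, False),
    Essay(610, 6, False),
    Essay(280, 5, True), Essay(310, 6, True),
]

overall = exact_agreement(ESSAYS)
concise_only = exact_agreement([e for e in ESSAYS if e.concise])

print(f"overall agreement: {overall:.2f}")
print(f"agreement for concise writers: {concise_only:.2f}")
```

With these invented numbers the engine matches the rater on 10 of 12 essays overall and on 0 of 2 for the concise writers. The aggregate statistic looks healthy; the subgroup is where the mislabeled measure shows.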

The harder question

The methodological alternative is older than machine learning, and it is what measurement science was built on. A test like the GRE Analytical Writing measure is not, at its core, asking whether an AI can match a human rater on a 30-minute timed essay. The test asks students to produce a single timed analytical essay — the Issue task — and is making a claim about the relationship between performance on that timed task and performance on something quite different in shape: the longer, drafted-and-revised writing students produce over weeks in a first-year graduate course. Two different formats. Different rubrics. Different human evaluators. Two ways of capturing the same underlying writing construct, with the test asserting a relationship between them.

Validating an AI scoring engine against that relationship is a different question than validating it against immediate rater agreement on the timed essay itself. The relationship question is whether the AI’s score on the timed essay predicts instructor evaluations of the student’s actual coursework writing. Both questions involve human judgment. The difference is where it sits: at the immediate output, where the human is the rater the AI is trained to match, or at the downstream construct expression, where the human is the instructor evaluating what the test was built to predict. The first is reliability, often pursued because it is faster and cheaper. The second is validity, and it is what the test claims to do in the first place. The AI-scoring conversation has mostly been running on the easier question.
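The two questions can be put side by side in a short sketch. It assumes a 1-6 score scale and uses quadratic weighted kappa for the agreement check and a Pearson correlation for the downstream check; both are common choices I am swapping in for illustration, not a statement of how any particular testing program does it. The three score lists are invented.

```python
from collections import Counter
from math import sqrt


def quadratic_weighted_kappa(a, b, min_score=1, max_score=6):
    """Agreement between two sets of integer scores, penalizing large gaps more.

    Undefined (division by zero) if either list contains a single repeated score.
    """
    n = len(a)
    span = (max_score - min_score) ** 2
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / span
    hist_a, hist_b = Counter(a), Counter(b)
    expected = sum(
        (i - j) ** 2 * hist_a[i] * hist_b[j] for i in hist_a for j in hist_b
    ) / (n * span)
    return 1.0 - observed / expected


def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sqrt(sum((p - mx) ** 2 for p in x))
    sy = sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


# Hypothetical data for eight students.
rater_scores = [2, 3, 3, 4, 4, 5, 5, 6]       # human rater, timed essay
engine_scores = [2, 3, 3, 4, 4, 5, 5, 5]      # AI engine, same timed essay
coursework_ratings = [4, 3, 5, 2, 5, 3, 4, 4]  # instructor, later coursework

agreement = quadratic_weighted_kappa(rater_scores, engine_scores)
validity = pearson(engine_scores, coursework_ratings)

print(f"agreement with rater (QWK): {agreement:.2f}")
print(f"correlation with coursework: {validity:.2f}")
```

In this made-up case the engine agrees with the rater almost perfectly (kappa of roughly 0.95) while its scores carry almost no information about the coursework ratings (a correlation close to zero). The first number answers the easier question; the second answers the one the test was built for.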

There is a second reason it has stayed there, and it is honest to name. The harder validity work was historically expensive. It required gathering downstream outcomes, running instructor evaluations of subsequent coursework, and tracking students longitudinally. Reliability against human raters was what could be done at scale. The economics favored the easier question. Those economics have changed. The same developments that made faster scoring possible (the cheap compute, cheap storage, and cheap data integration of the last decade) have also lowered the cost of asking the harder question. The validity work that was once prohibitively expensive is newly affordable. The methodology was built for an older cost structure. The cost structure has moved. The methodology hasn’t.

This is not an argument against keeping humans in the loop. It is an argument against confusing two different roles humans play in that loop. Humans as decision-makers are the people who act on a score, who decide what an early-alert flag means in a specific student’s life, who weigh the AI’s output against the rest of what they know. They should stay, and should stay clearly in charge. Humans as the immediate-output validation target, the rater the AI is trained to match, are a different matter. That rater is always a proxy for the construct, not the construct itself. Validating against the downstream criterion still involves human judgment, but it is a judgment anchored at what the test is built to predict, not at the score itself. Keep humans deciding. Anchor the validation at the prediction target (the coursework), not at the score.

Every system makes a claim

Every AI system a university adopts makes a claim of the same kind as a scoring engine’s writing-quality label, and most of the claims are never written down. An early-alert model claims to identify students at academic risk. An advising assistant claims to surface the guidance a student needs. An admissions-support tool claims to predict yield, or fit, or success. A staff-facing assistant claims to produce work accurate enough to act on. Each is a statement about an intended outcome. And each can be wrong in the specific, quiet way an automated scoring engine can be wrong, tracking a surface signal while missing the substance, because the claim was implicit and no one was assigned to check it.
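Writing the claim down can be as light as one record per system. The sketch below is a hypothetical structure, not a standard: the field names, the `unchecked` helper, and the two register entries are all assumptions made for illustration. It records what each system claims to measure and lists the systems whose claim has no validation criterion or no one assigned to check it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemClaim:
    system: str
    claims_to_measure: str
    decision_supported: str
    # Downstream evidence the claim is checked against, if any has been named.
    validation_criterion: Optional[str] = None
    # Who is assigned to check the claim, if anyone.
    owner: Optional[str] = None


def unchecked(claims):
    """Names of systems whose claim lacks a criterion or an owner."""
    return [
        c.system
        for c in claims
        if c.validation_criterion is None or c.owner is None
    ]


# Hypothetical register entries.
REGISTER = [
    SystemClaim(
        system="early-alert model",
        claims_to_measure="academic risk",
        decision_supported="which students advisors contact first",
        validation_criterion="end-of-term course outcomes",
        owner="institutional research",
    ),
    SystemClaim(
        system="advising assistant",
        claims_to_measure="the guidance a student needs",
        decision_supported="what an advisor recommends",
    ),
]

print(unchecked(REGISTER))
```

The value is less in the code than in the blank fields: an entry with no criterion and no owner is an implicit claim made visible.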

The early-alert model is the cleanest example. Built without care, it can learn that the strongest predictor of risk in the historical record is a demographic pattern, or a single missed assignment, or enrollment in one difficult course. It will flag students, and the flags will even be partly accurate. But a model that flags students by proxy is not measuring academic risk; it is measuring the proxy, and routing the institution’s attention and resources accordingly. No one set out to build that system. It is what results when a tool is adopted on plausibility and never asked the intended-outcome question.
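One rough way to ask whether a flag is measuring risk or a proxy is to check two things: how many flags a single feature reproduces on its own, and whether the flag still separates students who struggled from those who did not once that feature is held fixed. The sketch below assumes a hypothetical `in_gateway_course` feature and invented records; it is a starting check, not a full validity study.

```python
from typing import NamedTuple


class Student(NamedTuple):
    in_gateway_course: bool  # hypothetical proxy feature
    flagged: bool            # early-alert model output
    struggled: bool          # outcome observed later in the term


def share_of_flags_explained(students) -> float:
    """Share of flagged students who carry the proxy feature."""
    flagged = [s for s in students if s.flagged]
    return sum(s.in_gateway_course for s in flagged) / len(flagged)


def flag_rate(students, *, struggled: bool) -> float:
    """Share of students with the given outcome who were flagged."""
    group = [s for s in students if s.struggled == struggled]
    return sum(s.flagged for s in group) / len(group)


# Invented records: six students in the course, six outside it.
STUDENTS = (
    [Student(True, True, True)] * 2
    + [Student(True, True, False)] * 4
    + [Student(False, True, True), Student(False, False, True)]
    + [Student(False, False, False)] * 4
)

overlap = share_of_flags_explained(STUDENTS)
in_course = [s for s in STUDENTS if s.in_gateway_course]

print(f"flags reproduced by the proxy alone: {overlap:.2f}")
print(f"overall flag rate, struggled: {flag_rate(STUDENTS, struggled=True):.2f}")
print(f"overall flag rate, did not: {flag_rate(STUDENTS, struggled=False):.2f}")
print(f"in-course flag rate, struggled: {flag_rate(in_course, struggled=True):.2f}")
print(f"in-course flag rate, did not: {flag_rate(in_course, struggled=False):.2f}")
```

In this toy data the flags look partly accurate overall (students who struggled are flagged more often than those who did not), yet six of the seven flags are just course enrollment, and inside the course the flag does not distinguish the students who struggled from the ones who did not.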

Generative and agentic tools make the problem harder, not easier. A predictive model at least produces a score that can be tested against an outcome. A generative assistant produces fluent, confident prose whose quality is difficult to assess at a glance, and fluency is itself a proxy the human eye is inclined to reward. The 2026 enterprise-AI research is consistent on this point: only a small share of organizations report a mature model for governing autonomous AI agents, and the real constraints on scaling AI are rarely the technology itself. They are data quality, security, and the absence of evaluation discipline. The newer the system, the more easily plausibility substitutes for proof.

The discipline already exists

This is the missing discipline inside AI governance. EDUCAUSE’s 2026 Top 10 IT Issues name the human edge of AI, and data analytics for institutional decision-making, among the issues that matter most. University technology leaders have been clear that the next phase of AI work is operational, moving from written policy to running practice. Evaluation is the part of that practice most easily skipped, because it is invisible when it is working and expensive to do well. It is also the part that decides whether everything else is real.

Applying the discipline does not mean slowing adoption, and it does not mean another layer of bureaucracy. It means a small set of hard questions, asked consistently: before a system is trusted, and periodically after. Is the system measuring the intended construct, or a proxy for it? When it is wrong, what happens downstream, and to whom? Does it perform consistently across the different groups of people it touches, or does its accuracy concentrate where the training data was richest? What human decision is the system meant to support, and does its output improve that decision? None of these questions is exotic. They are the ordinary questions of measurement. A university that has an institutional research office and an assessment culture already employs people who know how to ask them. Those people have simply not yet been pointed at the AI systems moving into administrative use.
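The consistency question is the easiest of these to turn into a routine check. A minimal sketch, assuming each record carries a group label, the system's prediction, and the outcome later observed; the group names and the records are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Map each group label to the share of its records predicted correctly."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        hits[group] += int(predicted == actual)
    return {group: hits[group] / totals[group] for group in totals}


# Invented records: (group, predicted_at_risk, actually_struggled).
RECORDS = (
    [("full_time", True, True)] * 3
    + [("full_time", False, False)] * 4
    + [("full_time", True, False)]
    + [("part_time", True, True)]
    + [("part_time", False, False)]
    + [("part_time", False, True)] * 2
)

overall_accuracy = sum(p == a for _, p, a in RECORDS) / len(RECORDS)
per_group = accuracy_by_group(RECORDS)

print(f"overall accuracy: {overall_accuracy:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.3f}")
```

Here the overall figure (9 of 12 correct) hides a gap: 7 of 8 for the better-represented group and 2 of 4 for the other. Accuracy is only one lens; the same grouping works for any metric an institution prefers, such as missed at-risk students per group.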

Seeing the student whole

There is a deeper version of the intended-outcome question, and in a university it is the one that matters most. When we ask what a system is actually measuring, we are often really asking whether it sees a person whole. An early-alert model that optimizes a retention number is not the same as one that helps an institution understand and support a student. The first reduces the student to the outcome the institution wants to protect. The second treats the number as a signal that points back toward a person, one with a context, a trajectory, and reasons. Asked seriously, the intended-outcome question is a check against the quiet drift toward measuring students as proxies for the metrics we happen to collect. A university, of all institutions, should want its systems to see students whole. That is not a sentiment. It is an evaluation standard, it is answerable, and it is the standard worth holding AI to.

One clarification, because the easiest misread of this argument is that it’s anti-proxy. It is not. Institutional modeling at scale has to use proxies; that is how the work runs. The discipline being asked for is not the abandonment of proxies but the validity work underneath them — knowing which construct each proxy stands in for, which part of the construct it actually captures, and where the proxy quietly substitutes itself for the construct it was supposed to serve. Pro-proxy, with the validity work done out loud. That is the standard.

An old discipline, a new set of systems

The institutions that handle this moment well will not be the ones with the most AI, or the fastest adoption, or the longest policy. They will be the ones that can tell the difference between AI that works and AI that only looks like it works, and tell it on purpose, through a discipline, rather than discovering it after a system has been shaping decisions, unnoticed, for two years.

That discipline does not need to be invented. Higher education has spent decades building the science of measuring hard things well and holding the measurements accountable to what they claim. The same rigor that asks whether an essay score reflects writing or length can ask whether an early-alert flag reflects risk or a proxy for it. The question travels intact; it is the same question. Higher education’s AI moment does not need a new framework so much as it needs to turn an old and well-tested one toward a new set of systems, and to ask, of every system it adopts, the plain and demanding question: what is this actually measuring, and is that what we meant?

Operating kit

The AI Evaluation Kit

Three documents that turn the argument above into a leadership-meeting move. The full kit names twelve evaluation questions across signals, intelligence, and execution. The one-page diagnostic is the scorecard a leadership team can run through in a single meeting. The 90-day cadence is the wrapper that turns the diagnostic into a quarterly operating practice. Free, no gate.

The Kit

Twelve questions across the three pillars. The main asset.

The Diagnostic

One page. Twelve questions as a leadership-meeting scorecard.

The 90-Day Cadence

Implementation wrapper. Turns the diagnostic into operating practice.

If this changes how you evaluate AI in your context, I’d love to hear about it — hello@analyticbytes.systems.

Written May 2026 for the Analytic Bytes Library. The argument draws on measurement-science practice and is intended to outlast specific AI products and platforms. The downloadable operating kit (above) is the Q3 2026 v3 release, refreshed for procurement-grade vendor stress-testing and tighter operator voice.

Questions, pushback, or a problem that looks like this one? Write to chai@analyticbytes.systems.