When Data Meets the Literary Mind

AI systems are now being benchmarked on their ability to score student essays, and the scores they produce come with a frank admission attached. On the Learning Agency’s ASAP 2.0 benchmark for automated scoring of student-written argumentative essays, the top-performing model achieves a Quadratic Weighted Kappa (QWK) of 0.585, a level the benchmark’s own documentation describes as only approaching acceptability for non-high-stakes use. The field’s current ceiling, in other words, sits just short of the threshold of adequate.

By 2025, according to a RAND survey, 54 percent of students and 53 percent of English language arts, math, and science teachers reported using AI for schoolwork, up 15 percentage points in just one to two years. That pace makes data-driven tools in literary education a present-tense problem, not a future-scenario exercise. Whether they help depends almost entirely on how they’re deployed: used diagnostically, to locate specific patterns across attempts and guide targeted revision, or evaluatively, as a single verdict that collapses the full picture into a number. That distinction matters most in English study, where the capabilities examiners reward most are precisely the ones that develop through risk, experimentation, and phases that look nothing like a smooth upward curve.

The Problem With Progress Curves

Three dimensions of literary capability sit at the center of advanced assessment and at the outer edge of what progress curves reliably track. Interpretive confidence—the willingness to commit to a nuanced reading of a demanding text and defend it under time pressure—almost always develops through a phase of deliberate risk. A student pushing beyond safe, obvious readings will initially produce essays that are less controlled than earlier, cautious work. A metric that registers only aggregate output reads this as regression: scores dip at precisely the moment the most ambitious interpretive work begins.

Analytical voice follows a similarly jagged path. Students don’t arrive at a distinctive critical voice by linear refinement of a single style; they get there by experimenting across essays and testing bolder claims, many of which don’t work. At the same time, the integration of vocabulary, argumentative structure, and evidential precision that examiners reward comes only after sustained engagement with difficult texts—work that generates visible struggle long before it produces fluency. Compressing all of this into a performance percentage that tracks only end products erases the internal structure of the learning curve.

These are not objections to data-informed literary education; they are design requirements for it. Analytics that claim to support advanced literary development have to respect that interpretive confidence, analytical voice, and precise evidence use emerge through messy, non-linear cycles of trial, error, and revision. When systems are instead built around smooth, monotone progress curves, they don’t just miss the phases that matter most. They misread them as failure.

Measuring the Surface, Missing the Depth

Automated essay scoring research shows how easily surface-level proxies for quality can diverge from deeper meaning. In an ETS “e-rater” challenge study reported by Powers and colleagues, invited writers composed essays specifically designed to mislead the system, either by eliciting scores higher than the writing deserved or lower than it deserved. Overall, they were more successful at producing over-scored essays than under-scored ones. In that study, the system’s errors ran mostly in the direction that flatters the writer. A system more easily gamed toward inflation than deflation is telling you exactly what it’s measuring: surface regularity, not interpretive substance.

That proxy gap matters most when the scoring logic is embedded directly in the environments where literary essays are written. Grammarly, Inc.’s AI-powered writing assistant reaches roughly 40 million daily users and operates as an ambient presence in the very documents where many students draft literary analysis. Its core capabilities—automated correctness checks, fluency enhancement, paraphrasing suggestions, and tone optimization—act on surface features of language that are tractable to detect and adjust. The feedback it provides is organized around sentence-level form, not around whether a student’s interpretive argument has become more original, better evidenced, or more conceptually ambitious.

In a literary classroom, that asymmetry becomes visible when a student stretching toward more complex interpretations produces prose that grows temporarily messier—arguments less tightly organized, phrasing more experimental—while Grammarly continues to register surface improvement through smoother constructions and cleaner syntax. Nora K. Rivera, Assistant Professor of technical communication and rhetoric at Texas Tech University, describes the dynamic precisely in a professional roundtable on AI writing tools: “I tell them that I probably accept 25 percent of the changes that Grammarly wants to make… But I want to sound like me! And I want them to sound like them.”

Grammarly isn’t defective as a writing tool; its feedback channels do what they were built to do, which is evaluate sentence-level form. A polished paragraph and a penetrating one can look identical to a system that reads syntax but not meaning.

From Verdict to Compass

The hazards of collapsing complex performance into a single number are well-documented beyond literary education. US teacher evaluation systems that relied on test-based value-added models (VAMs) offer the most instructive case at policy scale. The American Statistical Association, in a formal statement on these practices, cautioned that “Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates…” The same statement notes that most VAM studies find teachers account for only about 1 to 14 percent of the variability in student test scores. That’s a narrow slice of a complex outcome. Ranking teachers on it anyway, with compensation, tenure, and dismissal consequences attached, produced exactly the instability the ASA warned against. A number that shifts materially with technical choices isn’t a verdict. It’s a prompt.

Revision Village is an online revision platform for IB Diploma and IGCSE subjects; more than 350,000 students from over 1,500 schools across more than 135 countries use it for high-stakes exam preparation. Its performance analytics dashboards track progress, highlight strengths, and flag topics needing further work as students move through their IB English preparation. The operative design choice is granularity: knowing which specific topics need attention is actionable in a way that a declining aggregate score is not. The data is positioned to direct next effort rather than deliver a final accounting, which keeps it oriented toward diagnosis rather than verdict.

For a student navigating IB English preparation, that means reading a flagged topic as a prompt for the next session’s work, not as a judgment on current ability.

What the Scoring Data Actually Tells Us

Precision and limitation tend to arrive together when AI scoring data is examined carefully, and the Learning Agency’s AI and Education Leaderboard is transparent about both. The leaderboard benchmarks large language models on classroom-relevant tasks, including the ASAP 2.0 essay-scoring benchmark, using private test sets so that models cannot have memorized the test items during training. On ASAP 2.0, Gemini 2.5 Pro achieves the highest agreement score among tested models, while Gemini 2.0 Flash records a QWK of 0.562 at a cost of $0.25 per 1,000 essays and a latency of 0.73 seconds. Severe errors, meaning predictions off by two or more points, are identified as a key differentiator across model performance tiers. The top score is described as approaching adequate performance for non-high-stakes contexts. That framing is precise and intentional: for now, adequate-for-low-stakes is a ceiling, not a floor.
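
For readers who want to see what these two figures measure, the following is a minimal Python sketch of the standard Quadratic Weighted Kappa formula alongside a severe-error rate. It is not the leaderboard’s own code. The function names, the sample scores, and the 1-to-6 score range are illustrative assumptions, and the two-point threshold simply mirrors the definition of a severe error given above.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two lists of integer scores, penalizing
    disagreements by the square of their distance.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("need at least two score categories")

    observed = Counter(zip(human, model))  # joint counts of (human, model)
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Count expected if the two raters scored independently
            weighted_expected += weight * human_hist[i] * model_hist[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all scores are identical")
    return 1.0 - weighted_observed / weighted_expected


def severe_error_rate(human, model, threshold=2):
    """Share of essays where the model misses the human score by
    `threshold` points or more (threshold=2 is an assumption here)."""
    misses = sum(abs(h - m) >= threshold for h, m in zip(human, model))
    return misses / len(human)


if __name__ == "__main__":
    # Hypothetical scores for eight essays on an assumed 1-6 scale
    human_scores = [2, 3, 4, 4, 5, 3, 2, 6]
    model_scores = [2, 3, 3, 4, 3, 3, 4, 5]
    print(quadratic_weighted_kappa(human_scores, model_scores, 1, 6))
    print(severe_error_rate(human_scores, model_scores))
```

The squared weight is the reason a two-point miss costs four times as much as a one-point miss, which is why severe errors separate models even when their overall agreement looks similar.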

Institutions that reckon honestly with what single scores cannot do tend to reach the same structural answer: build a longer record instead. At Michigan Technological University, Maria Bergstrom, Associate Dean for Undergraduate Education in the College of Sciences and Arts, describes requiring every course to contribute at least one artifact and reflection so students can “build a file cabinet of things they have done.” Capturing work and reflection longitudinally across many attempts makes developmental patterns visible in ways that any single session score—whether generated by an AI model or a human grader—cannot replicate.

What the Tension Produces

Graduates of literary programs will increasingly enter professional environments where quantitative feedback on their own writing is simply present—in collaborative documents, email tools, and workplace platforms. That ambient condition is already the daily reality for most IB English candidates and their peers. The professional world doesn’t resolve the tension between surface metrics and interpretive depth; it just relocates it, and asks people to navigate it without a teacher standing nearby.

For students still working through IB English and other literary courses, learning to navigate that tension shows up in practical choices. Reading a performance dashboard as a compass rather than a verdict means treating flat rubric scores as prompts to ask where interpretive ambition has outpaced control. A Grammarly-polished paragraph is a smoother paragraph. That’s different from a stronger one.

The opening QWK figure—barely approaching adequate for low-stakes essay scoring—is less an indictment of AI than a precise statement of what literary judgment involves. A student who knows what that score can and can’t claim has already practiced the most transferable skill literary education produces: deciding how much to trust a number.