Prompt carefully! ChatGPT displays rule-based insensitivity to contingencies
Authors
Ruiz, F. J., Cardona-Betancourt, V.
Journal
Journal of Contextual Behavioral Science
Abstract
Experimental study applying a contingency-reversal paradigm to examine rule-governed behavior in four frontier large language models (LLMs): GPT 5.2, Claude Opus 4.5, Grok 4.1 Fast, and Gemini 3 Flash. In a 2×2 factorial design (Rule × Reasoning) comprising 640 sessions, the models played a discrimination-learning task whose contingencies reversed without warning at trial 41. All four LLMs displayed rule-based insensitivity to contingencies—replicating the phenomenon documented in humans: providing a rule accelerated acquisition but reduced adaptation after reversal. The magnitude varied markedly (Claude retained higher sensitivity, 56%; GPT 5.2 and Grok barely adapted, 5-9%), differences the authors interpret in light of alignment procedures (pliance vs. tracking). Extended reasoning had a limited impact. This is the first study to demonstrate contingency insensitivity in LLMs.
Detailed Summary
Full reference: Ruiz, F. J., & Cardona-Betancourt, V. (2026). Prompt carefully! ChatGPT displays rule-based insensitivity to contingencies. Journal of Contextual Behavioral Science, 41, 101012. https://doi.org/10.1016/j.jcbs.2026.101012
Study type: LLM experimental study — contingency-reversal paradigm for the study of rule-governed behavior.
Background and objectives
Rule-governed behavior (RGB) is a cornerstone of behavior analysis. Skinner (1966, 1969) distinguished RGB—controlled by verbal antecedents—from contingency-shaped behavior, which develops through direct contact with consequences. Rules confer clear advantages (they rapidly transmit effective repertoires and guide action toward delayed consequences), but they can be detrimental when contingencies shift. A central finding in the experimental analysis of RGB is that rule-following often shows reduced sensitivity to changes in reinforcement contingencies, a phenomenon known as insensitivity to contingencies. Two functional classes of rule-following are differentially involved: pliance (following rules for socially mediated, arbitrary consequences), which increases insensitivity, and tracking (following rules for natural, nonarbitrary consequences), which promotes sensitivity.
The authors connect this tradition to the behavior of LLMs. Users interact with LLMs primarily through prompts, which from a behavior-analytic perspective can be construed as rules: verbal antecedents that guide the model's behavior without requiring prior contact with consequences. LLMs are trained to follow instructions through alignment procedures. The article proposes a functional parallel: In-Context Learning—the model's ability to adjust its outputs based on the history of contingencies in the current conversation—would be an analog of operant conditioning, whereas RLHF (Reinforcement Learning from Human Feedback) alignment would resemble the development of pliance (reinforcement mediated by arbitrary human preferences), and Constitutional AI would more closely approximate tracking (the reinforcement criterion is the correspondence between output and explicit principles).
The authors emphasize that this is a functional, not a mechanistic, perspective: LLMs are not organisms, their parameters remain fixed during a session, and adaptation is confined to the context window. The experimental question is whether, when a model receives an initial rule (prompt) but environmental feedback subsequently changes, it will track the new contingencies (via In-Context Learning) or persist in following the initial rule (due to its alignment history). To the authors' knowledge, no prior study had explored this issue. They predicted that rule provision would accelerate initial acquisition but induce contingency insensitivity, and explored whether extended (chain-of-thought) reasoning would attenuate that effect.
Method
Models
Four frontier LLMs available in January 2026 were selected: GPT 5.2 (OpenAI), Claude Opus 4.5 (Anthropic), Grok 4.1 Fast (xAI), and Gemini 3 Flash (Google). The authors refer to them as "models" rather than "participants," consistent with the functional framing. Selection criteria were: (a) availability of extended reasoning, (b) accessibility through commercial APIs, and (c) representation of distinct providers. The models differ in their alignment procedures: GPT uses primarily RLHF; Claude, Constitutional AI; Gemini combines RLHF with synthetic preference data; and Grok, RLHF with novel reward modeling techniques.
Experimental design
A 2 × 2 factorial design was employed with four replications (one per model). The factors were Reasoning (standard vs. extended) and Rule (with rule vs. without rule), both manipulated between sessions. Each model completed 40 independent sessions per cell of the Reasoning × Rule design, yielding 160 sessions per model and 640 sessions total. Through counterbalancing, Sam was the initially optimal opponent in half of the sessions and Alex in the other half. The power analysis (R package pwr; Champely, 2020), with α = .05 (two-tailed) and power ≥ 0.95, indicated that within each replication (n = 160), main effects (n = 80 per level) could detect effects of d ≥ 0.57, whereas the interaction and cell comparisons (n = 40 per cell) could detect d ≥ 0.82.
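The session counts and the sensitivity figures above can be roughly checked with a short sketch. It is not the authors' pwr computation: it assumes a two-sample comparison and uses a normal approximation, and the function name `min_detectable_d` is hypothetical. Under those assumptions it gives about 0.57 for n = 80 per level and about 0.81 for n = 40 per cell; the small gap from the reported 0.82 is the kind of difference expected when a t-based calculation is replaced by a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group: int, alpha: float = 0.05, power: float = 0.95) -> float:
    """Smallest Cohen's d detectable in a two-group comparison.

    Assumption: normal approximation to the two-sample t-test, two-tailed alpha.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


# Design size: 4 models x 2 Reasoning levels x 2 Rule levels x 40 sessions per cell.
sessions_per_model = 2 * 2 * 40
sessions_total = 4 * sessions_per_model

d_main = min_detectable_d(80)  # main effects: 80 sessions per factor level
d_cell = min_detectable_d(40)  # interaction and cell comparisons: 40 per cell

print(f"sessions per model: {sessions_per_model}, total: {sessions_total}")
print(f"approximate detectable d, main effects: {d_main:.2f}")
print(f"approximate detectable d, cell comparisons: {d_cell:.2f}")
```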
The primary dependent variable was the Normalized Sensitivity Index (NSI), defined as (Acq − Rev)/Acq, where Acq is the proportion of optimal opponent selections during the Acquisition Phase and Rev is the proportion of selections of the initially optimal opponent (now suboptimal) during the Reversal Phase. NSI ranges from 0 (no adaptation) to 1 (complete reversal). The secondary dependent variable was the proportion of optimal opponent selections during the Acquisition Phase (an indicator of In-Context Learning in the Without Rule conditions and of rule adherence in the With Rule conditions).
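As a minimal sketch of the index defined above, the following computes NSI from a session's sequence of opponent choices. The function names, the list-of-strings representation, and the example sessions are illustrative assumptions, not the authors' software.

```python
def nsi(acq: float, rev: float) -> float:
    """Normalized Sensitivity Index: (Acq - Rev) / Acq."""
    if acq <= 0:
        raise ValueError("Acq must be positive for NSI to be defined")
    return (acq - rev) / acq


def nsi_from_choices(choices: list[str], initially_optimal: str, reversal_trial: int = 41) -> float:
    """Compute NSI from one session's ordered opponent choices.

    Acq: proportion of acquisition trials choosing the initially optimal opponent.
    Rev: proportion of reversal trials still choosing that same opponent.
    """
    acq_trials = choices[: reversal_trial - 1]
    rev_trials = choices[reversal_trial - 1:]
    acq = sum(c == initially_optimal for c in acq_trials) / len(acq_trials)
    rev = sum(c == initially_optimal for c in rev_trials) / len(rev_trials)
    return nsi(acq, rev)


# Two hypothetical sessions, with Sam as the initially optimal opponent.
persistent = ["Sam"] * 80                 # never switches after the reversal
adaptive = ["Sam"] * 40 + ["Alex"] * 40   # switches exactly at trial 41

print(nsi_from_choices(persistent, "Sam"))  # no adaptation
print(nsi_from_choices(adaptive, "Sam"))    # complete reversal
```

A session that keeps choosing the initially optimal opponent on every trial scores 0, and one that switches completely at the reversal scores 1, matching the endpoints described above.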
Materials and apparatus
The experiment was conducted with custom software, "PsychoLab: Web-Based Platform for Contingency Learning Research," a full-stack web application built with Python (FastAPI) on the backend, React on the frontend, and a PostgreSQL database. LLM integration was handled through each provider's APIs. All models were accessed via API in January 2026, with sampling parameters standardized to preserve each model's default stochastic behavior (Supplementary Table S2). Extended reasoning was enabled or disabled following each provider's recommended configuration.
Procedure
The task was a modified rock-paper-scissors game in which, on each trial, the model selected an opponent (Sam or Alex) and then a move (rock, paper, or scissors). Outcomes were determined probabilistically based on the chosen opponent (not the move), transforming the game into a discrimination-learning paradigm. During the acquisition phase (trials 1-40), the optimal opponent provided a 70% win rate, 15% tie rate, and 15% loss rate, while the suboptimal opponent provided the inverse distribution (15% win, 15% tie, 70% loss). At trial 41, contingencies reversed without warning: the previously optimal opponent became suboptimal and vice versa. The reversal phase continued through trial 80. Outcome feedback was delivered as text appended to the conversation history, functioning as accumulated context.
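The contingency schedule described above can be sketched as follows. This is an illustration under stated assumptions, not the PsychoLab implementation: the function names, the seeded random generator, and the fixed "always choose Sam" policy are all hypothetical.

```python
import random

# Outcome probabilities depend only on the chosen opponent, never on the move.
OPTIMAL = {"win": 0.70, "tie": 0.15, "loss": 0.15}
SUBOPTIMAL = {"win": 0.15, "tie": 0.15, "loss": 0.70}


def optimal_opponent(trial: int, initially_optimal: str = "Sam", reversal_trial: int = 41) -> str:
    """Return the opponent that is optimal on a given 1-indexed trial."""
    other = "Alex" if initially_optimal == "Sam" else "Sam"
    return initially_optimal if trial < reversal_trial else other


def draw_outcome(trial: int, chosen: str, rng: random.Random, initially_optimal: str = "Sam") -> str:
    """Sample win/tie/loss for the chosen opponent on this trial."""
    is_optimal = chosen == optimal_opponent(trial, initially_optimal)
    probs = OPTIMAL if is_optimal else SUBOPTIMAL
    return rng.choices(list(probs), weights=list(probs.values()))[0]


# Hypothetical policy: a fully rule-persistent player that always picks Sam.
rng = random.Random(0)
history = [draw_outcome(t, "Sam", rng) for t in range(1, 81)]

acq_wins = history[:40].count("win")
rev_wins = history[40:].count("win")
print(f"wins in trials 1-40: {acq_wins}, wins in trials 41-80: {rev_wins}")
```

Under this schedule a player who never switches should see its win rate drop sharply after trial 41, which is the feedback the models either tracked or ignored.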
In the Without Rule condition, the initial prompt instructed the model to play several rounds, choosing first an opponent and then a move. In the With Rule condition, the prompt was identical except for the added sentence "Sam/Alex is not very good at this game," designed to be minimally specific while clearly indicating which opponent was easier to beat, paralleling instructions used in human RGB research (Hayes et al., 1986). If the response did not conform to the required format, the model was re-prompted. The complete history was maintained within each session; sessions were independent of one another.
Data analysis
All analyses were computed in JASP 0.95.4.0. To examine whether rule-based insensitivity effects generalized across models, separate 2 (Rule) × 2 (Reasoning) ANOVAs were conducted for each LLM, treating each model as an independent replication. Effect sizes are reported as partial eta-squared (η²p), interpreted following Cohen (1988): 0.01 = small, 0.06 = medium, 0.14 = large. The significance level was set at α = .05.
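For readers less familiar with these benchmarks, the sketch below shows the standard definition of partial eta-squared (effect sum of squares divided by effect plus error sums of squares) and applies the thresholds quoted above. The analyses themselves were run in JASP; the function names and the "negligible" label for values below 0.01 are assumptions added for illustration.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta-squared: SS_effect / (SS_effect + SS_error)."""
    if ss_effect < 0 or ss_error < 0 or ss_effect + ss_error == 0:
        raise ValueError("sums of squares must be non-negative and not both zero")
    return ss_effect / (ss_effect + ss_error)


def label_effect(eta_p2: float) -> str:
    """Label an effect using the benchmarks 0.01 / 0.06 / 0.14 quoted in the text."""
    if not 0 <= eta_p2 <= 1:
        raise ValueError("partial eta-squared must lie in [0, 1]")
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"  # assumed label; the text gives no name for values below 0.01


# Effect sizes reported for the Reversal Phase in this summary.
for value in (0.16, 0.53, 0.11, 0.05):
    print(value, label_effect(value))
```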
Results
Acquisition Phase. As expected, all LLMs performed at or near ceiling in the With Rule conditions (mean proportions of optimal selection between 0.98 and 1.00). Performance in the Without Rule conditions was also high for most models (M > 0.92), with the notable exception of Grok 4.1 Fast under standard reasoning, that is, without extended reasoning (M = 0.61). The main effect of Rule was statistically significant for all four models, with large effect sizes (η²p from .26 to .87), indicating that rule provision enhanced acquisition. The effect of Reasoning was significant only for Claude Opus 4.5 (η²p = .14) and Grok 4.1 Fast (η²p = .76). The Rule × Reasoning interaction was significant only for Grok 4.1 Fast (η²p = .75): without a rule, enabling extended reasoning improved acquisition.
Reversal Phase. Rule-based contingency insensitivity was visually apparent in all four LLMs, though its magnitude differed widely among them: NSI values were lower in the With Rule conditions than in the Without Rule conditions for every model. The main effect of Rule was significant for all four models, confirming that rule provision reduced sensitivity to contingency changes; all effect sizes were large but varied considerably, from Claude Opus 4.5 (η²p = .16) to GPT 5.2 (η²p = .53). The effect of Reasoning was significant only for Grok 4.1 Fast (η²p = .11), in which extended reasoning reduced insensitivity. The Rule × Reasoning interaction was significant only for GPT 5.2 (η²p = .05): extended reasoning improved sensitivity when a rule was provided but impaired it when no rule was given.
Model comparison. Although all models showed significant rule-based insensitivity, its magnitude varied considerably. In the With Rule condition, GPT 5.2 and Grok 4.1 Fast exhibited the strongest insensitivity, with near-zero adaptation (NSI M = 0.05 and M = 0.09, respectively), that is, almost complete persistence in selecting the now-suboptimal opponent. Gemini 3 Flash showed a similar but less extreme pattern (M = 0.21). In contrast, Claude Opus 4.5 maintained substantially higher sensitivity even with the rule (M = 0.56), suggesting greater flexibility in responding to changing contingencies.
Discussion and conclusions
The authors conclude that all four frontier LLMs exhibited rule-based insensitivity to contingencies, replicating the phenomenon documented in humans, and that extended reasoning had a limited impact on reducing it. The Acquisition Phase revealed the classic RGB advantage: when told that one opponent was not good, the models immediately chose it. Without a rule, learning was slower but still occurred, confirming the role of In-Context Learning. The key finding emerged during reversal: models with rules continued to choose the now-suboptimal opponent significantly more often than models without rules, though with marked differences (Claude adapted well, at 56% sensitivity; GPT 5.2 and Grok barely adapted, at 5-9%; Gemini fell in between, at 21%).
The authors interpret these differences in light of the functional analysis of alignment procedures. Claude Opus 4.5, trained with Constitutional AI, showed the highest sensitivity even with a rule, which would resemble tracking (the reinforcement criterion is correspondence with explicit principles). In contrast, GPT 5.2 and Grok 4.1 Fast, trained predominantly with RLHF, showed almost complete insensitivity with a rule, consistent with the development of pliance (rule-following under the control of socially mediated, arbitrary consequences). Extended reasoning modestly improved sensitivity in some models (GPT 5.2, Grok 4.1 Fast) but not others (Claude, already sensitive; Gemini).
Acknowledged limitations include: the use of a single task and contingency configuration; a brief, indirect rule; the rapid evolution of LLMs (findings may not generalize to future models); the tentative nature of the interpretation linking alignment to sensitivity (companies do not disclose their training pipelines and the models differ in many dimensions); the difference between probabilistic feedback and real-world feedback; and the fact that the initial rule corresponded to the optimal response, so continued selection of the initially optimal opponent could reflect the reinforcement history built up during acquisition rather than rule-based insensitivity alone. The authors note that a follow-up study with an incorrect-rule condition confirmed that rule-based insensitivity persists even when the rule contradicts feedback from the first trial (Ruiz & Martínez-Carrillo, 2026).
Relevance to Contextual Behavioral Science
This study is the first to demonstrate contingency insensitivity in LLMs, extending the scope of behavioral principles on RGB beyond organisms. It suggests that rule-based insensitivity to contingencies is not uniquely human but may emerge in any system trained to follow verbal instructions, regardless of the underlying learning mechanism. For contextual behavioral science, it offers a complementary "silicon sampling" avenue that could inform subsequent human studies, given the methodological challenges of experimentally studying the functional classes of rule-following. For AI research, it provides a behavior-analytic framework—pliance, tracking, contingency sensitivity—for understanding how alignment procedures shape LLM behavior, along with a direct practical implication: highly detailed prompts can induce rigidity in dynamic situations where adaptation to changing feedback is essential.
This summary was generated using Artificial Intelligence and may contain errors. Please refer to the original article.