AI paper index

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

2026-06-03 · arXiv: 2606.04915

One-line summary

An AI research paper on Caliper: Probing Lexical Anchors versus Causal Structure in LLMs.

Engineering notes

Engineering notes will be added by the aipentium editorial team.

Chinese explanation / 中文解读

中文解读待补充:本站会优先为大语言模型、生成式AI、ChatGPT相关技术、计算机视觉、深度学习等高价值论文补充中文说明。

Original abstract

Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.

5.0Engineering value
7.0Research novelty
4.0Business relevance

Links and sources

Need this topic turned into a technical roadmap?

aipentium can prepare a custom AI literature review, code map, dataset map, and B2B technology assessment.

Request B2B AI research

Comments

No comments yet. Be the first to share your thoughts on this paper.
Login or register to leave a comment