LLM 理论进展最新论文与文章 - 第 5 页

LLM 理论进展

Curiosity as Linguistic Intervention: Using LLM Tutoring Dialogues to Influence Exploratory Learning Behavior

arXiv 2026-06-21

Large Language Models (LLMs) provide a new opportunity to study how language shapes exploratory cognition because conversational strategies can be systematically manipulated at inference time. We introduce CURIOBOT, a framework that operationalizes Berlyne's collative variables, novelty, complexity, conflict, and uncertainty, as adaptive linguistic interventions for conversational tutoring. Across 270 tutoring conversations spanning multiple model families, domains, and topic complexity levels, curiosity-oriented interventions consistently increased exploratory learner behaviors, producing up to 2.4x more conversational turns under fixed time budgets. To measure these effects, we further introduce a learner-centered evaluation framework capturing exploratory questioning, conversational agency, productive struggle, and observable curiosity. Learner-side gains persisted even when tutor-side instructional quality remained unchanged, suggesting that curiosity functions as a partially independent interaction-level mechanism. More broadly, our results demonstrate that LLM-mediated dialogue can serve as a scalable experimental framework for studying how language shapes exploratory learning behavior.

Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

arXiv 2026-06-21

In the rapidly evolving field of software development, code stylometry analyzing unique stylistic signatures of programmers plays a crit-ical role in authorship attribution and cybersecurity. Recent advancements in artificial intelligence, particularly Large Language Models (LLMs) like GPT-3.5 and GPT-4, have introduced new dimensions to this field, challenging traditional stylometry techniques. This study investigates the effectiveness of LLMs in altering code stylometry while preserving functionality and evaluates the impact of various prompt engineering strategies. Through comprehensive experiments, we assess how well these models can obscure stylistic signatures to avoid detection by a Random Forest classifier trained for authorship attribution. The results reveal significant differences in effectiveness between single-shot and multi-shot methods and highlight the importance of detailed, structured prompts. Additionally, functionality preservation checks demonstrate the challenges in maintaining code integrity post-modification. This research provides critical insights into the robustness of authorship attribution techniques against advanced AI capabilities, informing future cybersecurity and software engineering developments

Asymptotic Signal Subspace Recovery in Softmax Attention Models

arXiv 2026-06-21

Attention mechanisms have demonstrated remarkable empirical success in identifying relevant information from large collections of tokens, yet the theoretical principles underlying this behavior remain poorly understood. We study a stylized softmax-attention model in which a query vector is learned by stochastic gradient ascent from a collection of informative and nuisance tokens. Exploiting the symmetry of the model, we derive a population objective and characterize the limiting ordinary differential equation governing the learning dynamics. Using tools from stochastic approximation and dynamical systems theory, we establish a rigorous connection between the stochastic learning algorithm and its deterministic limit. Our main result shows that, under suitable high-dimensional scaling assumptions and standard step-size conditions, the learned query converges almost surely to the one-dimensional signal subspace spanned by the latent informative direction. Equivalently, the query asymptotically recovers the latent signal up to the intrinsic sign ambiguity. These results provide a rigorous theoretical foundation for understanding attention mechanisms as signal extraction procedures in high-dimensional noisy environments and offer a dynamical-systems perspective on how attention discovers relevant information in the presence of substantial noise.

Words as Difference Makers: How Large Language Models Determine Causal Structure in Text

arXiv 2026-06-21

Because large language models (LLMs) are impressively successful in predicting text, it appears that they must have access to a 'world model' representing causal and definitional structure. However, the dominant formalisms of modern causal inference -- Judea Pearl's interventionist approach and the Neyman-Rubin potential outcomes framework -- struggle to illuminate how LLMs learn causal structure. I resolve this puzzle by arguing that LLMs employ a specific inductive approach based on a difference-making logic -- sometimes called variational induction. I demonstrate how central aspects of this logic are realized during training, where LLMs require enormous amounts of text data from a wide range of contexts to identify difference- and indifference-makers within word sequences. Furthermore, I analyze specific architectural features of LLMs -- such as token embeddings and self-attention -- to determine their roles in variational induction. The difference-making logic of LLMs fundamentally parallels the experimental method, where causal relations are derived by systematically varying individual circumstances to determine their influence on a phenomenon.

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

arXiv 2026-06-21

In open-ended generation, LLMs frequently fall into the "likelihood trap", marked by repetitive degeneration and vocabulary dullness, creating a discrepancy between machine-generated and human-written text. While post-hoc tail truncation (e.g., Top-\(p\), Min-\(p\)) avoids sampling from the unreliable tail, it can over-sample from the uncalibrated head and misalign generation with human lexical preferences; fixed scalar repetition penalties likewise ignore variation in logit scale across inference steps, potentially disrupting semantic coherence. To address both limitations, we propose Variance-Calibrated Modulation (VCM), a training-free pre-decoding intervention that reshapes the probability distribution before truncation through two dynamic mechanisms: (1) Contextual Searchlight via PMI, which suppresses global stopwords while elevating context-evoked tokens, and (2) Adaptive Self-Debiasing, which uses real-time logit standard deviation for scale-invariant penalization. Across open-ended generation, factual QA, and mathematical reasoning, VCM consistently mitigates the likelihood trap. With negligible computational overhead, VCM integrates with existing decoding strategies, improving diversity, coherence, and, particularly at higher decoding temperatures, reasoning accuracy.

Can LLMs Control Readability? A Multi-Dimensional Evaluation Framework for CEFR-Controlled Arabic Generation

arXiv 2026-06-20

While Large Language Models (LLMs) can generate fluent Arabic text, their ability to reliably control readability levels remains unclear. We propose a multi-dimensional evaluation framework for Common European Framework of Reference for Language (CEFR)-controlled Arabic text generation, assessing whether instruction-following LLMs can serve as reliable generators for adaptive language learning. Our framework integrates controlled prompting, automatic readability prediction using a validated Taha-19 model, lexical constraint validation, and syntactic complexity profiling. Results show that structured prompting substantially improves CEFR alignment. In particular, CEFR-guided prompting with lexical constraints achieves the highest conformity to reference linguistic profiles (0.91 cosine similarity) and near-perfect agreement with predicted readability levels (0.99), while unconstrained prompting exhibits weak control. These findings establish an empirical foundation for integrating readability-aware Arabic text generation into adaptive educational systems.

Can Reasoning Models Detect Changes to their Chains of Thought?

arXiv 2026-06-20

There are many reasons one may want to edit a model's chain of thought (CoT) -- e.g., to prefill it with reasoning from a stronger model or to remove steps that may yield unsafe outputs. The success of these interventions plausibly depends on a model's inability to notice them, as the model may alter its behavior if it suspects tampering. In this work, we study whether recent reasoning models are able to detect such interventions on their CoTs under a variety of conditions: both during reasoning and after it, and when prefilled both with their own CoTs and with those of other models. Broadly, we find that (i) models exhibit only very modest detection accuracy; (ii) models struggle to identify how their CoT was modified; and (iii) models are about as good at detecting changes to their own CoTs as to those of other models.

Local Causal Attribution of Chain-of-Thought Reasoning

arXiv 2026-06-20

Understanding the causal structure of a language model's thought process is a problem of significant importance for both transparency and safety. In this work, we take a local approach toward this goal by analyzing the causal relationships among individual components, termed units, of a given, specific chain-of-thought trace. We construct a structural causal model on these units and relate each unit to the log probability of generating (subsequent) output units. Our algorithm, termed AttriCoT, is a black-box method that performs attribution by estimating importance parameters in the structural causal model using \(O(U)\) forward passes through the model, where \(U\) is the number of units. Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to the model's behavior than alternative methods. The attribution results also reveal notable differences in thought structure between models and domains.

Natural Language-Focused Software Engineering via Code-Documentation Equivalence

arXiv 2026-06-20

Source code documentation is an integral part of software development and maintenance, as it helps in understanding the code and facilitates communication among developers. However, existing documentation is often incomplete, outdated, or inaccurate, which can lead to misunderstandings and errors. In the era of large language models (LLMs), which are being extensively used for software engineering tasks, the quality of documentation becomes even more critical, as documentation provides important context for the models. In this paper, we introduce the notion of documentation-to-code equivalence, a novel property that captures whether documentation accurately and completely describes the code it documents. We present a novel approach, called Documentary, to automatically generate equivalent documentation for a given code snippet. Our evaluation shows that Documentary can generate equivalent documentation for 53.4% of the evaluated function-level code snippets. To show the benefits of documentation-to-code equivalence, we describe and evaluate two software engineering tasks: code understanding and code editing. Our results show that documentation-to-code equivalence allows an LLM to predict the output of a function with 12.8--24.5% higher accuracy, when compared to human-written documentation and documentation generated by a baseline approach. Furthermore, human developers consider documentation generated by Documentary to be more useful for understanding and editing code than the original human-written documentation.

ForEx: A Formal Verification Framework for Explainable Reasoning in Logical Fallacy Detection and Annotation

arXiv 2026-06-20

Current evaluations of Large Language Models (LLMs) on logical fallacy detection focus on predicted labels, but do not establish whether those labels are supported by the reasoning the models provide. We propose ForEx (Formal Verification for Explainable Reasoning), a framework that translates LLM-generated explanations into Lean4 and verifies whether the translated rationale is derivable under encoded premises, not the logical validity of the original natural language argument. To distinguish prediction outcomes from the formal status of the supporting reasoning, we introduce the LLM Argument Verification Matrix, which separates label consistency from formal verification status. Experiments on LOGIC-Climate show that over 90% of LLM outputs can be translated into formal reasoning chains that pass verification, while agreement with human annotations remains around 20%. These results expose a systematic gap between formal derivability and label agreement, a distinction invisible to prediction-based metrics. ForEx moves LLM evaluation beyond label correctness toward machine-checkable analysis of formalized reasoning chains.

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

arXiv 2026-06-20

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.

Scaling Performance and Low-Resource Annotation with Many-Shot In-Context Learning for Named Entity Recognition

arXiv 2026-06-20

In-context learning (ICL) with large language models (LLMs) has emerged as a powerful alternative to fine-tuning for Named Entity Recognition (NER), achieving strong performance with minimal annotation and no additional training. However, prior work has shown that despite their adaptability, LLMs still lag behind fully supervised models such as fine-tuned BERT in structured tasks like NER. While existing studies on ICL for NER have mainly explored few-shot settings, the potential of scaling to hundreds of demonstrations has not been thoroughly investigated. To address this gap, we conduct a comprehensive investigation of many-shot ICL for NER and further explore its effectiveness in annotating and refining data for low-resource NER tasks. Specifically, we evaluate various LLMs across multiple domains using hundreds of ICL examples and then assess the feasibility of using many-shot ICL as a data annotation framework. Our experiments demonstrate that: (1) scaling to hundreds of in-context examples enables LLMs to match or even surpass the performance of fully supervised BERT models; and (2) using about one hundred human-labeled examples as demonstrations, many-shot in-context annotation can generate high-quality labeled data, leading to approximately 10% absolute F1 improvement over existing state-of-the-art approaches when used to fine-tune BERT on low-resource NER.

CNnotator: LLM-Guided Memory Safety Annotation Synthesis

arXiv 2026-06-20

Memory safety errors account for a large proportion of security bugs in systems written in C; modern languages such as Java and Rust prevent such bugs because they are memory-safe by design. To migrate systems to safer languages or identify memory errors, we must first determine how legacy code manipulates memory. This information is only represented implicitly in such code. In many cases, memory usage patterns are merely tedious for humans to figure out, rather than truly difficult. In this work, we ask if large language models (LLMs) can perform this task by having them synthesize annotations representing memory usage as specifications in CN, a hybrid testing/verification tool. Our tool, CNnotator, uses LLMs to automatically generate and test CN specifications. We find that current models are able to generate CN specifications for small-to-medium C programs, with the OpenAI o3 reasoning model achieving a 90% success rate on first attempts and 97% overall success, while the chat model GPT-4o correctly annotates 65% of first attempts. These results suggest AI-assisted annotation is becoming practical for real-world C codebases.

Pre-Generation Hallucination Detection in Large Language Models via Soft-Target Attention Probing

arXiv 2026-06-20

Detecting hallucination risk before generation enables abstention, retrieval augmentation, and routing decisions without incurring the cost of decoding. While prior work has shown that such risk can be estimated from a model's internal representations, existing approaches treat this as binary classification over a single decoded output. We instead formulate it as a risk-estimation problem. Under this formulation, we introduce soft-target supervision based on the empirical answer error rate over stochastically sampled outputs - an estimator we prove to be the unique unbiased minimum-variance estimator of the model's per-prompt error probability under its sampling distribution. We further adapt attention probing to the pre-generation setting, enabling the detector to selectively aggregate hallucination-relevant prompt representations. Across three question-answering benchmarks and five models, attention probing outperforms linear probing on short-answer tasks. Replacing binary labels with soft-target supervision further and consistently improves detection quality.

Evaluating Large Language Models for Hausa and Fongbe Machine Translation: Benchmarks, Failures, and Metric Reliability

arXiv 2026-06-20

We investigate the translation quality of current large language models (LLMs) for English-to-Hausa and English-to-Fongbe - two typologically distinct West African languages from the Afroasiatic and Niger-Congo families respectively - and evaluate whether standard automatic metrics reliably reflect human judgment for these low-resource languages. We evaluate four models (GPT-4o Mini, Claude Sonnet 4, Gemini 2.5 Flash, and Qwen2.5-7B) at progressive scales (500 to 10,000 sentences) using automatic metrics (BLEU, chrF++, TER, COMET, BERTScore) validated against native-speaker judgment. Our results reveal three key findings. First, translation quality varies substantially by language: Hausa achieves acceptable quality (human scores 4.0-4.5/5) while Fongbe achieves poor quality (1.0-2.2/5), with a consistent 3x BLEU gap across all systems. Second, model rankings differ by language - Gemini leads for Fongbe while GPT-4o leads for Hausa by human evaluation - indicating that performance on one low-resource African language does not predict performance on another. Third, metric-human correlation varies dramatically: perfect rank correlation for Fongbe (rho=1.0) but weak correlation for Hausa (rho=0.5), where human evaluators preferred GPT-4o despite all automatic metrics ranking Claude first. We further show that neural metrics like BERTScore exhibit embedding collapse (within-language similarity >0.99) for both languages, limiting their ability to differentiate translation quality. Based on these findings, we recommend multi-metric evaluation for low-resource African languages, with particular caution when interpreting neural metrics. We establish that minimum sample sizes of n=2,500 sentences are required for stable system rankings, as smaller samples produced artifact findings that reversed at scale.

Benchmarking Large Language Models for Grapheme-to-Phoneme Conversion: A Japanese Case Study

arXiv 2026-06-20

Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech, and large language models (LLMs), with broad linguistic knowledge, offer a promising approach. We benchmarked over 30 LLMs on Japanese G2P, comparing them with conventional morphological analyzers on 3000 manually annotated sentences. We evaluated two prompting strategies: a parse mode, where the LLM performs morphological analysis followed by rule-based kana conversion, and a direct mode, where the LLM directly predicts kana readings. The results show that model size, version, and Japanese-specialized training are key factors, with the best LLMs achieving kana character error rate below 0.52% vs. the best conventional tool (1.03%). Parse mode outperforms direct mode for most models, as rule-based post-processing relieves the LLM of handling complex pronunciation rules. We also show that feeding LLM-predicted kana into a kana-input TTS yields better pronunciation than end-to-end TTS.

The Alignment Problem in Constrained Code Generation

arXiv 2026-06-19

Large Language Models (LLMs) have demonstrated strong capabilities in code generation, but their outputs frequently contain syntax or type errors that result in compilation failures. Constrained decoding has been proposed as a solution to mitigate compilation errors by construction, improving functional correctness as a byproduct. However, previous works overlook a critical aspect of constrained decoding: the alignment between constrainer (e.g., types), language model and the target specification language (e.g., TypeScript). Misalignment is caused by the constrainer being incomplete--rejecting programs that belong to the target--or unsound--allowing programs that are not part of the target. The bias created by incompleteness distorts the language model distribution, and can be detrimental for code generation. We evaluate this hypothesis using seven language models, two target languages, two constrainers, enforcing types and syntax during decoding, and we study how language models react to varying levels of incompleteness. On three benchmarks, when the constrainer is incomplete, unconstrained decoding significantly outperforms constrained decoding in terms of functional correctness. Incompleteness pushes the model into low-probability regions of the program space, causing the generation to frequently time out, and reducing functional correctness by up to 97%. These contributions make the community aware of the negative effects of misalignment in constrained decoding, and provide quantitative insights on how to design constrainers that are beneficial for code generation systems with formal guarantees.

Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process

arXiv 2026-06-19

Large language models (LLMs) store and recall factual knowledge, yet the precise mechanism of how entity representations are transformed to enable specific attribute retrieval remains underexplored. In this work, we investigate this mechanism through the lens of an "attribute-computation path"-a sequence of computational steps over the entity representation required to elicit a target attribute. We then propose an iterative patching protocol to identify a minimal subset of layers necessary for this computation. Applying our method to LLaMA 3.1 8B and Qwen3 8B, we find that these paths are non-contiguous, often skipping layers, and that models possess multiple, functionally-equivalent paths for the same entity and fact, highlighting a high degree of redundancy in attribute computation. This implies that knowledge computation is highly distributed, potentially explaining the localization-editing mismatch and suggesting that knowledge storage and retrieval in LLMs is far from being well understood.

RocketPFN: Accurate Time Series Classification via In-Context Learning

arXiv 2026-06-19

We introduce RocketPFN, a training-free pipeline for time series classification that combines random convolutional feature extraction (Rocket) with in-context classification via a pretrained tabular foundation model (TabPFN v2.5). On 92 UCR datasets (30-resample protocol), RocketPFN matches HC2, the strongest published method on the archive, in mean accuracy (both 0.900, Wilcoxon p=0.50), with no training on the target data and a median inference time of 30 seconds per fold. It also significantly outperforms every individual classifier in the HC2 ensemble. On UEA (20 datasets) the difference is likewise not statistically significant. A separate comparison concerns TSC foundation models: when paired with the same downstream classifier, MOMENT, Mantis, and MantisV2 are all significantly outperformed by RocketPFN using fewer extracted features and no learned parameters (p<0.001 in each case). This holds even when the encoders were pretrained on corpora that include the UCR training samples. We propose this two-stage pipeline as a reference point for evaluating zero-shot TSC foundation models.

Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers

arXiv 2026-06-19

Language models can generate plausible rationales for their predictions, but these explanations may not faithfully represent the model's internal reasoning. We propose verifier-coupled reasoning, a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs from rationale-span hidden states. The central finding is a gap between decodability and faithfulness: consistency training reliably makes verifier information decodable from rationale representations, but decodability does not guarantee faithful generation. In LeanCheck (formal theorem proving), rationale-only and proof-only pooling achieve perfect directional separation under counterfactual conflict. In KataGo (Go engine), commentary spans encode 10-way win-rate buckets at 81% accuracy. Yet in a code setting, the model achieves 98.6% coupling while its generated explanations remain unfaithful: fluent prose with correct structured claims, but describing unrelated algorithms; a controlled pretrained-vs-from-scratch comparison shows the gap is not capacity-driven. Synthetic activation patching confirms causal influence (73-89% vs. 31% baseline), FEVER reveals that evidence-only pooling isolates genuine evidence sensitivity at the cost of raw accuracy, and per-claim analysis shows that consistency loss disproportionately benefits fine-grained claims over binary ones. These results establish that consistency losses are effective diagnostics and representation-shaping tools, but not sufficient conditions for faithful reasoning.

每日文章

LLM 理论进展