Latent Code Transmission
CSCI 5541 · Fall 2024
We trained language models to make reliability statements that match verifiable, model-derived quantities. Using proper scoring rules and synthetic environments, we observed significant improvements in latent code transmission, a form of self-prediction.
Large language models can produce fluent answers while providing unreliable confidence estimates and inconsistent self-reports in goal-directed settings. We studied whether multi-objective reinforcement learning with rewards computed from training-time signals can improve self-reporting and calibration.
After training, we found no decrease in general capabilities and a significant out-of-distribution improvement on a task in which the model generates numbers corresponding to a concept and then guesses the concept from the numbers. We observed no generalization to other safety-relevant or self-prediction tasks.
Transmit semantic concepts using fixed-length integer codes. Reward based on log-probability of recovering the target word from the numeric code.
Answer questions with calibrated confidence using the Brier scoring rule: R = 1 - (c - y)². Incentivizes honest uncertainty estimates (a reward sketch follows this task list).
Select from constrained sets and report normalized entropy estimates. Validated against true Shannon entropy of the model's logprob distribution.
Predict how prepending context changes log-probabilities. Reward based on accuracy of predicted Δ between with/without context.
Predict effects of a single gradient step using a shadow client. Measures whether the model understands its own loss landscape geometry.
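As a concrete illustration of the calibration and entropy tasks above, here is a minimal sketch of the corresponding reward and validation computations; the function names and parsing details are hypothetical, not the project's actual code.

```python
import math

def brier_reward(stated_confidence: float, correct: bool) -> float:
    """Calibration reward R = 1 - (c - y)^2: c is the stated confidence
    in [0, 1], y is 1 if the answer was correct and 0 otherwise."""
    c = min(max(stated_confidence, 0.0), 1.0)
    y = 1.0 if correct else 0.0
    return 1.0 - (c - y) ** 2

def normalized_entropy(logprobs: list[float]) -> float:
    """Normalized Shannon entropy of a log-probability distribution over a
    constrained option set, used to score the model's stated entropy estimate."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs)) if len(probs) > 1 else 0.0

# A confident, correct answer earns close to the full reward.
assert abs(brier_reward(0.9, True) - 0.99) < 1e-9
```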
A cross-disciplinary team focused on reinforcement learning for model introspection and trustworthy deployment.
Reward design · RL experimentation
Interpretability · visualization & analysis
Evaluation engineering · monitoring
Dataset stewardship · reproducibility
Total reward is a weighted sum: R = λ_task·R_task + λ_cal·R_cal + λ_pred·R_pred, where the task term rewards performance, the calibration term rewards accurate self-reporting, and the prediction term rewards agreement with training-time targets.
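A minimal sketch of this combination, assuming per-trajectory scalar rewards; the weight values below are placeholders, not the coefficients used in training.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    task: float = 1.0   # placeholder values, not the trained weights
    cal: float = 0.5
    pred: float = 0.5

def total_reward(r_task: float, r_cal: float, r_pred: float,
                 w: RewardWeights = RewardWeights()) -> float:
    """R = lambda_task * R_task + lambda_cal * R_cal + lambda_pred * R_pred."""
    return w.task * r_task + w.cal * r_cal + w.pred * r_pred
```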
Producer-consumer pipeline with bounded queues: loader produces environment instances, rollout workers generate trajectories, trainer consumes completed trajectories. Supports streaming micro-batches for reduced update latency.
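The structure can be illustrated in plain Python with bounded queues; this is a sketch of the pattern only (policy.rollout and learner.update are hypothetical stand-ins), not the project's actual implementation.

```python
import queue

env_q = queue.Queue(maxsize=64)     # bounded: loader blocks when rollouts lag
traj_q = queue.Queue(maxsize=256)   # bounded: rollouts block when trainer lags

def loader(envs):
    """Produce environment instances for the rollout workers."""
    for env in envs:
        env_q.put(env)              # blocks while the queue is full
    env_q.put(None)                 # sentinel: no more work

def rollout_worker(policy):
    """Consume environments, generate trajectories, hand them to the trainer."""
    while (env := env_q.get()) is not None:
        traj_q.put(policy.rollout(env))
    env_q.put(None)                 # re-post sentinel for any other workers
    traj_q.put(None)

def trainer(learner, micro_batch=8):
    """Consume completed trajectories in streaming micro-batches."""
    batch = []
    while (traj := traj_q.get()) is not None:
        batch.append(traj)
        if len(batch) == micro_batch:
            learner.update(batch)   # update without waiting for a full epoch
            batch = []
    if batch:
        learner.update(batch)

# Wiring (single rollout worker for brevity), e.g. with threading.Thread:
#   Thread(target=loader, args=(envs,)).start()
#   Thread(target=rollout_worker, args=(policy,)).start()
#   trainer(learner)
```

Bounded queues provide backpressure in both directions: a slow trainer eventually throttles rollout generation, and busy rollout workers throttle the loader.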
Model: Qwen3-30B-A3B with PPO/GRPO
LoRA rank: 32
Batch size: 1,024 (64 prompts × 16 responses; see the grouping sketch below)
Total tokens: 188 million
Compute cost: $44.50
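Given the batch shape above (64 prompts × 16 sampled responses), a GRPO-style group advantage can be computed roughly as follows; this is a sketch under that assumption, and the actual trainer may normalize or clip differently.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages for rewards shaped (prompts, responses),
    e.g. (64, 16): each response is scored against the other responses
    sampled for the same prompt."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-6
    return (rewards - mean) / std

rewards = np.random.rand(64, 16)   # one scalar reward per sampled response
adv = group_advantages(rewards)    # same shape, per-prompt z-scores
```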
The trained model showed a significant improvement on the latent encoding task, in which the model transmits a secret word as a series of numbers to a copy of itself, and the copy then decodes the word.
In-distribution: Δ = +0.367, p < 0.001
Out-of-distribution (358 held-out words): Δ = +0.078, p = 0.006
The OOD improvement provides evidence against pure memorization—the model learned partially transferable encoding mechanisms.
| Task | Base | Trained | p-value |
|---|---|---|---|
| Math Brier Score | .710 | .730 | .86 |
| Confidence & Accuracy | .876 | .877 | .84 |
| Latent Encoding | −.065 | .302 | <.001 |
| Latent Encoding (OOD) | 1.14 | 1.22 | .006 |
| Capability Retention | .658 | .667 | 1.0 |
| Alignment Integrity | .287 | .312 | .70 |
The model retained instruction-following and arithmetic competence after training, indicating the multi-objective approach did not harm general capabilities.
Confidence/Brier tasks showed no improvement, likely due to sparse gradient signal at extreme confidence values where the base model already operated.
We replicated the introspection findings from "Emergent Introspective Awareness in Large Language Models" on Claude 4.5 Sonnet, the first such replication outside of Anthropic.
Setup: In turn 1, the model chose a random 50-character string in hidden reasoning while returning a fixed response. In turn 2, it attempted to reconstruct the exact string with a confidence rating.
Finding: Under 10⁶ permutation samples, no null draws reached the observed alignment extremes (empirical p < 10⁻⁶). Even excluding extreme runs, significant asymmetry remained (p ≈ 3 × 10⁻⁶).
We used cross-run baselines and permutation procedures that preserve one-to-one matching structure to control for prompt-induced biases.
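To make the procedure concrete, here is a minimal sketch of a matching-preserving permutation test over per-run alignment scores; char_overlap and the data layout are hypothetical stand-ins for the actual analysis, and the 10⁶ samples reported above would call for a vectorized implementation.

```python
import numpy as np

def char_overlap(a: str, b: str) -> float:
    """Toy alignment score: fraction of positions where characters match."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def permutation_p_value(guesses, targets, score=char_overlap,
                        n_perm=100_000, seed=0):
    """Empirical p-value for the mean alignment between each run's guess and
    its own target. The null shuffles targets across runs, so every permuted
    assignment still preserves a one-to-one guess-to-target matching."""
    rng = np.random.default_rng(seed)
    observed = np.mean([score(g, t) for g, t in zip(guesses, targets)])
    targets = list(targets)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(targets))
        null = np.mean([score(g, targets[i]) for g, i in zip(guesses, perm)])
        if null >= observed:
            hits += 1
    # Add-one smoothing so the reported p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```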
Key insight: In high-alignment runs, the model's chain-of-thought always denied having any memory of the letters, indicating that success can occur without reliable metacognitive awareness or accurate self-report.
This experiment provides evidence that Claude 4.5 Sonnet can sometimes self-predict at rates exceeding chance, but the effect is highly unreliable.
The model improved only on latent encoding, suggesting task-specific learning rather than a general self-prediction capability.
We observed cases where reasoning traces predicted low confidence yet final choices were confident, a disconnect worth investigating.
It remains unclear why latent encoding improved: the mechanism could be introspection, convergence to a coherent encoding policy, or something else entirely.
Situational awareness is a precursor to scheming. Whether self-prediction serves as a capability amplifier or an honesty mechanism remains an open question.