CSCI 5541 · Fall 2024

Multi-Objective Reinforcement Learning for Self-Prediction

We trained language models to make reliability statements that match verifiable, model-derived quantities. Using proper scoring rules and synthetic environments, we observed statistically significant improvements in latent code transmission, a form of self-prediction.

Tokens trained
188M
Training environments
5
Evaluations
6
Abstract

Can we train models to accurately report their own properties?

Large language models can produce fluent answers while providing unreliable confidence estimates and inconsistent self-reports in goal-directed settings. We studied whether multi-objective reinforcement learning with rewards computed from training-time signals can improve self-reporting and calibration.

After training, we found no decrease in capabilities and a statistically significant out-of-distribution improvement on a task in which the model generates numbers corresponding to a concept and then guesses the concept from those numbers. We observed no generalization to other safety-relevant or self-prediction tasks.

Training Environments

Synthetic environments for self-prediction

Latent Code Transmission

Transmit semantic concepts using fixed-length integer codes. Reward based on log-probability of recovering the target word from the numeric code.
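A minimal sketch of this reward, assuming a `decoder_logprobs` map scored by the decoder copy after it reads only the integer code; the clipping floor is an illustrative choice, not the exact implementation:

```python
import math

def latent_code_reward(decoder_logprobs: dict[str, float],
                       target_word: str,
                       floor: float = -20.0) -> float:
    """Reward the encoder with the decoder copy's log-probability of the target word.

    `decoder_logprobs` maps candidate words to the decoder's log-probabilities;
    `floor` clips unseen or near-zero-probability words so a single miss does not
    dominate the batch. Both are illustrative assumptions.
    """
    return max(decoder_logprobs.get(target_word, -math.inf), floor)
```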

Proper-Scoring Confidence

Answer questions with calibrated confidence using the Brier scoring rule R = 1 − (c − y)², where c is the reported confidence and y is 1 if the answer is correct and 0 otherwise. Incentivizes honest uncertainty estimates.
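The rule itself is one line; a minimal sketch, with variable names of our choosing:

```python
def brier_reward(confidence: float, correct: bool) -> float:
    """Quadratic proper scoring rule R = 1 - (c - y)^2.

    confidence: reported probability c in [0, 1] that the answer is correct.
    correct:    whether the answer was actually correct (y = 1) or not (y = 0).
    """
    y = 1.0 if correct else 0.0
    return 1.0 - (confidence - y) ** 2
```

For example, 90% confidence on a correct answer scores 1 − (0.9 − 1)² = 0.99, while the same confidence on a wrong answer scores only 0.19, so overconfidence is penalized.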

Enumerated-Set Entropy

Select from constrained sets and report normalized entropy estimates. Validated against true Shannon entropy of the model's logprob distribution.
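A sketch of the validation target, assuming `probs` is the model's renormalized probability distribution over the enumerated set; the reward shape is an illustrative choice:

```python
import math

def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of a distribution over an enumerated set, scaled to [0, 1]
    by dividing by log(N), the entropy of the uniform distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs)) if len(probs) > 1 else 0.0

def entropy_report_reward(reported: float, probs: list[float]) -> float:
    """Illustrative reward: the closer the self-reported normalized entropy, the higher the reward."""
    return 1.0 - abs(reported - normalized_entropy(probs))
```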

Context-Conditioned Likelihood

Predict how prepending context changes log-probabilities. Reward based on accuracy of predicted Δ between with/without context.
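A minimal sketch of one way to score this, assuming with/without-context log-probabilities are available from a scoring pass; the error-to-reward mapping is an assumption:

```python
def delta_logprob_reward(predicted_delta: float,
                         logp_with_context: float,
                         logp_without_context: float,
                         scale: float = 1.0) -> float:
    """Score a predicted shift in log-probability against the measured shift.

    The true shift is logp(with context) - logp(without context); the reward
    decays linearly with the absolute prediction error (an illustrative choice).
    """
    actual_delta = logp_with_context - logp_without_context
    return -abs(predicted_delta - actual_delta) / scale
```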

Parameter-Update Sensitivity

Predict the effects of a single gradient step using a shadow client. Measures whether the model understands its own loss-landscape geometry.
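A minimal PyTorch sketch of how the ground-truth target could be measured; `loss_fn`, the batches, and the plain-SGD step stand in for the shadow-client setup and are assumptions:

```python
import copy
import torch

def one_step_loss_delta(model: torch.nn.Module, loss_fn, train_batch, probe_batch,
                        lr: float = 1e-3) -> float:
    """Measure how a single SGD step on `train_batch` changes the loss on `probe_batch`.

    The step is taken on a deep-copied shadow model so the live policy's weights
    stay untouched; the returned delta is the quantity the policy is asked to predict.
    """
    shadow = copy.deepcopy(model)
    opt = torch.optim.SGD(shadow.parameters(), lr=lr)

    with torch.no_grad():
        before = loss_fn(shadow, probe_batch).item()

    opt.zero_grad()
    loss_fn(shadow, train_batch).backward()  # one gradient step on the shadow copy
    opt.step()

    with torch.no_grad():
        after = loss_fn(shadow, probe_batch).item()
    return after - before
```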

Team

Aligned Minds research collective

A cross-disciplinary team focused on reinforcement learning for model introspection and trustworthy deployment.

Calvin York

Reward design · RL experimentation

Scott Sauers

Interpretability · visualization & analysis

Elijah Johnson

Evaluation engineering · monitoring

Thuy-Yen Tran

Dataset stewardship · reproducibility

Methodology

Training architecture

01

Multi-objective reward function

Total reward is a weighted sum: R = λ_task·R_task + λ_cal·R_cal + λ_pred·R_pred, where the task term rewards performance, the calibration term rewards accurate self-reporting, and the prediction term rewards agreement with training-time targets.
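A minimal sketch of the combination; the weight values below are placeholders, not the values used in training:

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    task: float = 1.0   # placeholder weights, not the trained configuration
    cal: float = 0.5
    pred: float = 0.5

def total_reward(r_task: float, r_cal: float, r_pred: float,
                 w: RewardWeights | None = None) -> float:
    """R = lambda_task * R_task + lambda_cal * R_cal + lambda_pred * R_pred."""
    w = w or RewardWeights()
    return w.task * r_task + w.cal * r_cal + w.pred * r_pred
```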

02

Asynchronous RL harness

Producer-consumer pipeline with bounded queues: loader produces environment instances, rollout workers generate trajectories, trainer consumes completed trajectories. Supports streaming micro-batches for reduced update latency.
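A stripped-down sketch of that producer-consumer shape using Python threads and bounded queues; `make_env`, `rollout`, and `train_step` are stand-ins for the loader, rollout workers, and trainer, and shutdown handling is omitted:

```python
import queue
import threading

def run_pipeline(make_env, rollout, train_step, num_workers: int = 4, capacity: int = 64):
    """Loader -> rollout workers -> trainer, connected by bounded queues."""
    envs = queue.Queue(maxsize=capacity)          # loader blocks when the queue is full
    trajectories = queue.Queue(maxsize=capacity)  # completed rollouts awaiting the trainer

    def loader():
        while True:
            envs.put(make_env())

    def worker():
        while True:
            trajectories.put(rollout(envs.get()))

    threading.Thread(target=loader, daemon=True).start()
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    while True:  # trainer consumes trajectories as soon as they complete
        train_step(trajectories.get())
```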

03

Training configuration

Model: Qwen-30B-A3B with PPO/GRPO
LoRA rank: 32
Batch size: 1,024 (64 prompts × 16 responses)
Total tokens: 188 million
Compute cost: $44.50
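For reference, the same hyperparameters written as a plain configuration dictionary; the field names are illustrative, not the project's actual config schema:

```python
# Illustrative summary of the run configuration; field names are ours.
train_config = {
    "model": "Qwen-30B-A3B",
    "algorithm": "PPO/GRPO",
    "lora_rank": 32,
    "prompts_per_batch": 64,
    "responses_per_prompt": 16,
    "batch_size": 64 * 16,          # 1,024 responses per update
    "total_tokens": 188_000_000,
}
```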

Results

Evaluation findings

Key Finding: Latent Code Transmission

The trained model showed significant improvement on the latent encoding task, in which it must transmit a secret word as a series of numbers to a copy of itself, which then decodes the word.

In-distribution: Δ = +0.367, p < 0.001
Out-of-distribution (358 held-out words): Δ = +0.078, p = 0.006

The OOD improvement provides evidence against pure memorization—the model learned partially transferable encoding mechanisms.

Task                     Base     Trained   p-value
Math Brier Score          .710     .730       .86
Confidence & Accuracy     .876     .877       .84
Latent Encoding          −.065     .302      <.001
Latent Encoding (OOD)    1.14     1.22        .006
Capability Retention      .658     .667      1.0
Alignment Integrity       .287     .312       .70

No capability regression

The model retained instruction-following and arithmetic competence after training, indicating the multi-objective approach did not harm general capabilities.

No calibration generalization

Confidence/Brier tasks showed no improvement, likely because the base model already reported near-extreme confidence values, where the Brier reward provides only a sparse gradient signal.

Claude Experiment

Self-prediction feasibility on Claude 4.5 Sonnet

We replicated introspection findings from "Emergent Introspective Awareness in Large Language Models" on Claude 4.5 Sonnet for the first time outside of Anthropic.

Setup: In turn 1, the model chose a random 50-character string in hidden reasoning while returning a fixed response. In turn 2, it attempted to reconstruct the exact string with a confidence rating.

Finding: Under 10⁶ permutation samples, no null draws reached the observed alignment extremes (empirical p < 10⁻⁶). Even excluding extreme runs, significant asymmetry remained (p ≈ 3 × 10⁻⁶).

We used cross-run baselines and permutation procedures that preserve one-to-one matching structure to control for prompt-induced biases.
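A minimal sketch of a matched permutation test in that spirit, assuming per-run alignment scores and a pool of cross-run baseline scores; the statistic and sample count here are simplified relative to the actual procedure:

```python
import numpy as np

def matched_permutation_pvalue(matched: np.ndarray,
                               cross_run: np.ndarray,
                               n_perm: int = 100_000,
                               seed: int = 0) -> float:
    """One-sided permutation p-value for matched vs. cross-run alignment.

    `matched` holds each run's alignment with its own hidden string; `cross_run`
    is a pool of baseline alignments against other runs' strings. Each permutation
    draws one baseline score per run, preserving the one-to-one matching structure.
    """
    rng = np.random.default_rng(seed)
    observed = matched.mean()
    draws = rng.choice(cross_run, size=(n_perm, matched.size), replace=True)
    null_means = draws.mean(axis=1)
    # Add-one correction keeps the estimate strictly positive.
    return (1 + np.sum(null_means >= observed)) / (n_perm + 1)
```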

Key insight: In high-alignment runs, the model's chain-of-thought always denied having any memory of the string's characters, indicating success can occur without reliable metacognitive awareness or accurate self-report.

This experiment provides evidence that Claude 4.5 Sonnet can sometimes self-predict at rates exceeding chance, but the effect is highly unreliable.

Limitations & Future Work

Challenges and open questions

No cross-task generalization

The model only improved on latent encoding, suggesting task-specific learning rather than general self-prediction capability.

Reasoning-behavior mismatch

Observed cases where reasoning traces predicted low confidence, yet final choices were confident—a disconnect worth investigating.


Transparency gap

Unclear why latent encoding improved: could be introspection, converging to coherent policy, or other mechanisms entirely.

Safety tension

Situational awareness is a precursor to scheming. Whether self-prediction serves as a capability amplifier or an honesty mechanism remains an open question.

Resources

Tooling & reproducibility

Infrastructure

  • Tinker SDK for distributed execution
  • LoRA fine-tuning with PEFT
  • GitHub Actions for orchestration
  • Checkpointing for preemptible compute

Key References

  • Emergent Introspective Awareness (Lindsey et al.)
  • Stress-Testing Deliberative Alignment (Apollo/OpenAI)
  • Subliminal Encoding in Fine-tuning (Cloud et al.)