
Reward Engineering and Evaluation: Making Your System Learn What Matters
Executive Summary: Simple metrics create perverse incentives in LLM systems. This post shows how to build multidimensional evaluation that captures what actually matters, handle noisy and delayed feedback, detect hallucinations systematically, and design reward functions that adapt based on observed outcomes. The result: AI systems that learn what drives real success, not just what's easy to measure.
This is the third post in our series on probabilistic prompt pipelines. In our first post, we explored why static prompts become bottlenecks. The second post dove into building intelligent selectors with retrieval and bandits. Now we'll tackle the critical challenge that determines whether your system actually improves: measuring success accurately and designing reward functions that guide the system toward what truly matters.
The Hidden Complexity of Measuring Success
Here's a familiar scenario. You've built a wellness-podcast generation system with sophisticated template selection and you're tracking a simple metric: whether users listen to at least 90 seconds of each episode. The system optimizes beautifully, pushing 90-second retention from 65% to 85% in two weeks. Success, right?
Dig deeper. The system has learned to front-load sensational claims and controversial advice in the first 90 seconds. Users clear the threshold, but completion rates drop—and worse, listeners aren't implementing the advice because it's misleading or impractical. Templates that produce measured, evidence-based content get penalized because they build context before the hook.
Common Pitfall: Optimizing for early retention metrics often rewards clickbait over substance. The easier a metric is to measure, the more likely it is to miss what actually matters.
This is the central challenge of evaluation in LLM systems: simple metrics create perverse incentives. Optimize a single number and the system will maximize that number, often at the expense of what you actually care about. It's like evaluating a doctor solely on appointment speed—fast, but not necessarily good care.
Complication: different stakeholders value different aspects. Content wants factual accuracy and brand alignment. Growth wants engagement and retention. Users want practical advice that improves wellness. Platform partners want policy compliance. Each is valid; optimizing one can harm another.
Understanding Multidimensional Evaluation
The answer isn't "pick a better single metric"—it's embracing the multidimensional nature of quality. Think overall health versus just weight. Weight tells you something; it doesn't capture cardiovascular fitness, sleep, nutrition, or mental wellbeing. Similarly, episode quality spans multiple dimensions that must be measured and balanced.
Let's look at key dimensions for the wellness-podcast system and how they interact.
Engagement: Beyond Simple Retention
Engagement is more than play vs. skip:
- Initial engagement: Did they start listening? (title/description appeal)
- Early retention: Did they pass 90 seconds? (hook effectiveness)
- Completion rate: Did they finish? (sustained value)
- Active engagement: Did they take notes, share, or save? (perceived value)
- Repeat engagement: Did they return for more? (trust)
Each signal reflects a different quality facet. High starts but poor completion signals overpromising. High completion but low repeat suggests "fine, not memorable." The challenge is combining these signals coherently.
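To make that concrete, here is one illustrative way to fold the funnel into a single engagement score. The weights and the clickbait penalty below are placeholders to tune against your own data, not values from a production system.
# Illustrative engagement rollup: inputs are rates in [0, 1] for each funnel
# stage; weights favor completion and repeat listening over the early hook.
def engagement_score(started: float, passed_90s: float, completed: float,
                     active: float, returned: float) -> float:
    weights = {'started': 0.10, 'passed_90s': 0.15, 'completed': 0.30,
               'active': 0.20, 'returned': 0.25}
    score = (weights['started'] * started
             + weights['passed_90s'] * passed_90s
             + weights['completed'] * completed
             + weights['active'] * active
             + weights['returned'] * returned)
    # Penalize the clickbait signature: strong hook, weak follow-through
    if passed_90s > 0.8 and completed < 0.4:
        score *= 0.7
    return score
With these placeholder weights, a template that clears 90 seconds for most listeners but loses them afterwards now scores worse than one with a slower start and strong completion.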
Accuracy: The Foundation of Trust
For wellness content, accuracy is non-negotiable—and hard to measure:
- Factual correctness: Are claims evidence-based?
- Contextual appropriateness: Is advice suited to the audience?
- Completeness: Are caveats and warnings included?
- Consistency: Does advice align with established guidelines?
Accuracy often trades off against engagement. Nuanced, careful health advice is less "hooky" than bold claims. Your evaluation must reward accuracy without flattening content into boredom.
Readability and Accessibility
Accurate, engaging content still fails if it's hard to follow—especially in audio:
- Clarity: Simple explanations without condescension
- Structure: Logical flow
- Pacing: Digestible delivery
- Language level: Appropriate vocabulary
- Cultural sensitivity: Respect for diverse perspectives
These directly influence whether listeners benefit, yet they're invisible to naive engagement metrics.
Building the Multidimensional Evaluator
We need both immediate signals (available right away) and delayed signals (which arrive over time), plus a way to weight each by confidence.
Design Tip: Start with high-confidence immediate signals (readability, structure) and low-confidence predictions (engagement). As delayed signals arrive, update your evaluation with confidence-weighted averaging.
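The evaluator below delegates to helper components (EngagementAnalyzer, AccuracyChecker, ReadabilityScorer, DiversityTracker) whose internals aren't the focus of this post. As a rough sketch of the contract they satisfy, here is a stand-in ReadabilityScorer that approximates Flesch Reading Ease from word and sentence counts; a real component could wrap a library such as textstat or a dedicated model.
import re

class ReadabilityScorer:
    """Minimal stand-in: approximates Flesch Reading Ease and maps it to [0, 1]."""

    def score(self, content: str) -> dict:
        sentences = [s for s in re.split(r'[.!?]+', content) if s.strip()]
        words = re.findall(r"[A-Za-z']+", content)
        if not sentences or not words:
            return {'score': 0.0, 'detail': 'empty or unparseable content'}
        # Crude syllable estimate: count vowel groups per word, minimum one
        syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower()))) for w in words)
        flesch = (206.835
                  - 1.015 * (len(words) / len(sentences))
                  - 84.6 * (syllables / len(words)))
        return {'score': max(0.0, min(1.0, flesch / 100.0)), 'flesch': flesch}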
import numpy as np
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass
from datetime import datetime, timedelta
from collections import defaultdict, deque
import logging
import time
from enum import Enum
from scipy import stats
class QualityDimension(Enum):
"""
Enumeration of quality dimensions we evaluate.
Each dimension captures a different aspect of content quality
that contributes to overall episode success.
"""
ENGAGEMENT = "engagement"
ACCURACY = "accuracy"
READABILITY = "readability"
DIVERSITY = "diversity"
BRAND_ALIGNMENT = "brand_alignment"
USER_SATISFACTION = "user_satisfaction"
@dataclass
class EpisodeEvaluation:
"""
Complete evaluation results for a generated episode.
This structure captures both immediate and delayed signals,
along with confidence levels for each measurement.
"""
episode_id: str
template_id: str
generation_timestamp: datetime
# Immediate signals (available within minutes)
initial_quality_scores: Dict[QualityDimension, float]
initial_confidence: Dict[QualityDimension, float]
# Delayed signals (available after hours/days)
delayed_quality_scores: Dict[QualityDimension, float]
delayed_confidence: Dict[QualityDimension, float]
# Combined evaluation
final_scores: Dict[QualityDimension, float]
overall_success_score: float
evaluation_metadata: Dict
class MultiDimensionalEvaluator:
"""
Evaluates generated content across multiple quality dimensions,
combining immediate and delayed signals into actionable insights.
This evaluator understands that different signals have different
reliability and availability timelines, and it adapts its
evaluation strategy accordingly.
"""
def __init__(self, dimension_weights: Optional[Dict[QualityDimension, float]] = None):
# Default weights if not specified - these should be tuned for your use case
self.dimension_weights = dimension_weights or {
QualityDimension.ENGAGEMENT: 0.25,
QualityDimension.ACCURACY: 0.30, # Higher weight for accuracy in health content
QualityDimension.READABILITY: 0.15,
QualityDimension.DIVERSITY: 0.10,
QualityDimension.BRAND_ALIGNMENT: 0.10,
QualityDimension.USER_SATISFACTION: 0.10
}
self.logger = logging.getLogger(__name__)
# Components for specific evaluation tasks
self.engagement_analyzer = EngagementAnalyzer()
self.accuracy_checker = AccuracyChecker()
self.readability_scorer = ReadabilityScorer()
self.diversity_tracker = DiversityTracker()
def evaluate_immediate(self, episode_content: str, episode_metadata: Dict) -> Dict:
"""
Perform immediate evaluation using signals available right after generation.
These evaluations can be done without user interaction and provide
early indicators of content quality. They're less reliable than
delayed signals but available immediately for rapid feedback.
"""
immediate_scores = {}
immediate_confidence = {}
# Evaluate readability (high confidence - can be measured directly)
readability_result = self.readability_scorer.score(episode_content)
immediate_scores[QualityDimension.READABILITY] = readability_result['score']
immediate_confidence[QualityDimension.READABILITY] = 0.9 # High confidence
# Evaluate accuracy using automated checks (medium confidence)
accuracy_result = self.accuracy_checker.check_immediate(episode_content, episode_metadata)
immediate_scores[QualityDimension.ACCURACY] = accuracy_result['score']
immediate_confidence[QualityDimension.ACCURACY] = accuracy_result['confidence']
# Predict engagement using content features (low confidence)
predicted_engagement = self.engagement_analyzer.predict_engagement(episode_content, episode_metadata)
immediate_scores[QualityDimension.ENGAGEMENT] = predicted_engagement['score']
immediate_confidence[QualityDimension.ENGAGEMENT] = 0.4 # Low confidence in prediction
# Check diversity against recent content
diversity_score = self.diversity_tracker.evaluate_diversity(episode_content, episode_metadata)
immediate_scores[QualityDimension.DIVERSITY] = diversity_score
immediate_confidence[QualityDimension.DIVERSITY] = 0.8 # Fairly confident
# Brand alignment through keyword and tone analysis
brand_score = self._evaluate_brand_alignment(episode_content, episode_metadata)
immediate_scores[QualityDimension.BRAND_ALIGNMENT] = brand_score
immediate_confidence[QualityDimension.BRAND_ALIGNMENT] = 0.7
# User satisfaction must be predicted (very low confidence)
immediate_scores[QualityDimension.USER_SATISFACTION] = 0.6 # Neutral prior
immediate_confidence[QualityDimension.USER_SATISFACTION] = 0.2 # Very uncertain
        return {
            'scores': immediate_scores,
            'confidence': immediate_confidence,
            'evaluation_type': 'immediate',
            'timestamp': datetime.now(),
            # Carry identifiers through so combine_evaluations can attach them
            'episode_id': episode_metadata.get('episode_id', 'unknown'),
            'template_id': episode_metadata.get('template_id', 'unknown')
        }
def evaluate_delayed(self, episode_id: str, user_interaction_data: Dict,
feedback_data: Dict, time_elapsed: timedelta) -> Dict:
"""
Perform delayed evaluation using actual user behavior and feedback.
These evaluations use real user signals and are much more reliable
than immediate predictions, but they're only available after users
have had time to interact with the content.
"""
delayed_scores = {}
delayed_confidence = {}
# Measure actual engagement from user behavior
engagement_result = self.engagement_analyzer.analyze_actual_engagement(user_interaction_data)
delayed_scores[QualityDimension.ENGAGEMENT] = engagement_result['score']
delayed_confidence[QualityDimension.ENGAGEMENT] = min(0.9, 0.5 + 0.1 * engagement_result['sample_size'] / 100)
# Accuracy can be refined with user reports and fact-checking
if feedback_data.get('accuracy_reports'):
accuracy_score = self.accuracy_checker.check_with_feedback(feedback_data['accuracy_reports'])
delayed_scores[QualityDimension.ACCURACY] = accuracy_score
delayed_confidence[QualityDimension.ACCURACY] = 0.95
# User satisfaction from explicit feedback
if feedback_data.get('ratings'):
satisfaction_score = self._calculate_satisfaction_score(feedback_data['ratings'])
delayed_scores[QualityDimension.USER_SATISFACTION] = satisfaction_score['score']
delayed_confidence[QualityDimension.USER_SATISFACTION] = satisfaction_score['confidence']
# Some dimensions don't change with delayed evaluation
# We'll carry forward the immediate scores for these
return {
'scores': delayed_scores,
'confidence': delayed_confidence,
'evaluation_type': 'delayed',
'timestamp': datetime.now(),
'time_elapsed': time_elapsed.total_seconds()
}
def combine_evaluations(self, immediate_eval: Dict, delayed_eval: Optional[Dict] = None) -> EpisodeEvaluation:
"""
Combine immediate and delayed evaluations using confidence-weighted averaging.
This method creates a unified quality assessment that uses the best
available information for each dimension, weighting more confident
signals more heavily in the final score.
"""
final_scores = {}
for dimension in QualityDimension:
immediate_score = immediate_eval['scores'].get(dimension, 0.5)
immediate_conf = immediate_eval['confidence'].get(dimension, 0.1)
if delayed_eval and dimension in delayed_eval['scores']:
delayed_score = delayed_eval['scores'][dimension]
delayed_conf = delayed_eval['confidence'][dimension]
# Confidence-weighted combination
total_confidence = immediate_conf + delayed_conf
if total_confidence > 0:
final_scores[dimension] = (
immediate_score * immediate_conf +
delayed_score * delayed_conf
) / total_confidence
else:
final_scores[dimension] = immediate_score
else:
# Only immediate evaluation available
final_scores[dimension] = immediate_score
# Calculate overall success score using dimension weights
overall_score = self._calculate_overall_score(final_scores)
return EpisodeEvaluation(
episode_id=immediate_eval.get('episode_id', 'unknown'),
template_id=immediate_eval.get('template_id', 'unknown'),
generation_timestamp=immediate_eval['timestamp'],
initial_quality_scores=immediate_eval['scores'],
initial_confidence=immediate_eval['confidence'],
delayed_quality_scores=delayed_eval['scores'] if delayed_eval else {},
delayed_confidence=delayed_eval['confidence'] if delayed_eval else {},
final_scores=final_scores,
overall_success_score=overall_score,
evaluation_metadata={
'has_delayed_signals': delayed_eval is not None,
'evaluation_timestamp': datetime.now()
}
)
def _calculate_overall_score(self, dimension_scores: Dict[QualityDimension, float]) -> float:
"""
Calculate overall success score from individual dimension scores.
This method implements a weighted average with optional threshold
requirements. For example, we might require minimum accuracy
regardless of other dimensions for health content.
"""
# Check critical thresholds
        if dimension_scores.get(QualityDimension.ACCURACY, 0.0) < 0.6:
            # Accuracy below threshold (or missing) - heavily penalize
            return dimension_scores.get(QualityDimension.ACCURACY, 0.0) * 0.5
# Calculate weighted average
total_weight = 0
weighted_sum = 0
for dimension, weight in self.dimension_weights.items():
if dimension in dimension_scores:
score = dimension_scores[dimension]
# Apply non-linear transformation to emphasize high quality
transformed_score = score ** 1.5 # Rewards excellence
weighted_sum += transformed_score * weight
total_weight += weight
if total_weight > 0:
return weighted_sum / total_weight
return 0.5 # Neutral score if no dimensions available
def _evaluate_brand_alignment(self, content: str, metadata: Dict) -> float:
"""
Evaluate how well content aligns with brand values and guidelines.
This is crucial for maintaining consistency and trust, especially
in health and wellness content where brand reputation matters.
"""
score = 1.0
# Check for required disclaimers
required_disclaimers = ["consult your healthcare provider", "individual results may vary"]
for disclaimer in required_disclaimers:
if disclaimer.lower() not in content.lower():
score -= 0.1
# Check tone alignment
if metadata.get('target_tone') == 'supportive':
supportive_phrases = ["you've got this", "be patient with yourself", "progress not perfection"]
if not any(phrase in content.lower() for phrase in supportive_phrases):
score -= 0.15
# Check for prohibited content
prohibited_terms = ["miracle cure", "guaranteed results", "doctors hate this"]
for term in prohibited_terms:
if term in content.lower():
score -= 0.3
        return max(0.0, min(1.0, score))
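Assuming those helper components are wired up, using the evaluator end to end looks roughly like this; `script_text`, `interaction_log`, and `ratings` are placeholders for your own data.
evaluator = MultiDimensionalEvaluator()

# Right after generation: immediate, lower-confidence signals
immediate = evaluator.evaluate_immediate(
    episode_content=script_text,
    episode_metadata={'episode_id': 'ep-123', 'target_tone': 'supportive'}
)

# Hours or days later: actual behavior and explicit feedback
delayed = evaluator.evaluate_delayed(
    episode_id='ep-123',
    user_interaction_data=interaction_log,
    feedback_data={'ratings': ratings},
    time_elapsed=timedelta(hours=24)
)

evaluation = evaluator.combine_evaluations(immediate, delayed)
print(evaluation.final_scores, evaluation.overall_success_score)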
Handling Noisy and Delayed Feedback
Real feedback is messy. Ratings trickle in. Engagement unfolds over days. Some value only shows up later (e.g., advice that works after weeks of practice). Your system must model this temporal reality.
Common Pitfall: Treating all feedback as equally reliable regardless of when it arrives. Early signals are often biased toward reactive responses; true value emerges over time.
class TemporalFeedbackProcessor:
"""
Handles the complex temporal dynamics of feedback in content evaluation.
This processor understands that different signals arrive at different times
and have different reliability patterns. It maintains temporal models of
feedback evolution and adjusts evaluations as new information arrives.
"""
def __init__(self, feedback_window_hours: int = 72):
self.feedback_window_hours = feedback_window_hours
self.feedback_cache = {}
self.temporal_models = {}
self.logger = logging.getLogger(__name__)
# Configure expected feedback timelines for different signals
self.feedback_timelines = {
'initial_play': timedelta(minutes=5),
'early_retention': timedelta(minutes=10),
'completion': timedelta(hours=2),
'rating': timedelta(hours=24),
'share': timedelta(hours=12),
'implement_advice': timedelta(days=7),
'health_outcome': timedelta(days=30)
}
# Reliability curves - how much to trust signals at different time delays
self.reliability_curves = {
'immediate': lambda t: 0.3 + 0.7 * (1 - np.exp(-t / 3600)), # Rises quickly
'short_term': lambda t: 0.1 + 0.9 * (1 - np.exp(-t / 86400)), # Rises over a day
'long_term': lambda t: 0.05 + 0.95 * (1 - np.exp(-t / 604800)) # Rises over a week
}
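        # For intuition (t in seconds): an hour-old 'immediate' signal is weighted
        # about 0.3 + 0.7 * (1 - e^-1) ≈ 0.74, a day-old 'short_term' signal about
        # 0.67, and a week-old 'long_term' signal about 0.65; each curve only
        # approaches full trust on its own timescale.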
def process_feedback_stream(self, episode_id: str, feedback_events: List[Dict]) -> Dict:
"""
Process a stream of feedback events with different timestamps and types.
This method handles the reality that feedback arrives asynchronously
and must be integrated into a coherent quality assessment over time.
"""
# Group events by type and time
grouped_events = self._group_feedback_events(feedback_events)
# Build temporal profile of engagement
temporal_profile = self._build_temporal_profile(grouped_events)
# Detect anomalies that might indicate problems
anomalies = self._detect_feedback_anomalies(temporal_profile)
# Estimate missing signals using temporal models
estimated_signals = self._estimate_missing_signals(temporal_profile, episode_id)
# Combine observed and estimated signals
combined_feedback = self._combine_feedback_signals(
observed=temporal_profile,
estimated=estimated_signals,
anomalies=anomalies
)
return combined_feedback
def _detect_feedback_anomalies(self, temporal_profile: Dict) -> List[Dict]:
"""
Detect anomalies in feedback patterns that might indicate quality issues.
Anomalies can reveal problems that aggregate metrics miss. For example,
a bimodal distribution in listening time might indicate that content
works well for one audience segment but not another.
"""
anomalies = []
# Check for unusual dropout patterns
if temporal_profile['quality_signals'].get('early_dropout_rate', 0) > 0.5:
if temporal_profile['quality_signals'].get('completion_rate', 0) > 0.7:
                # High early dropout yet high completion among those who stay - bimodal audience
anomalies.append({
'type': 'bimodal_engagement',
'severity': 'medium',
'description': 'Content strongly polarizes audience',
'recommendation': 'Consider audience segmentation'
})
# Check for delayed negative feedback
if 'negative_feedback_delay' in temporal_profile['quality_signals']:
delay = temporal_profile['quality_signals']['negative_feedback_delay']
if delay > 86400: # More than a day
anomalies.append({
'type': 'delayed_negative_reaction',
'severity': 'high',
'description': 'Users report problems after trying advice',
'recommendation': 'Review accuracy and safety of recommendations'
})
        return anomalies
Word Error Rate (WER) for Hallucination Detection
Hallucinations—plausible but false statements—are especially risky in wellness. Our approach combines WER against references with semantic verification and pattern checks to flag likely issues for review.
Design Tip: Don't rely on WER alone. High WER in factual segments combined with specific linguistic patterns (exact percentages, absolute claims) provides much stronger hallucination signals.
import difflib
from typing import List, Tuple, Set
import re
import spacy
class HallucinationDetector:
"""
Detects potential hallucinations in generated wellness content using
WER analysis, semantic verification, and pattern recognition.
This detector is specifically tuned for health and wellness content
where accuracy is critical and hallucinations could cause harm.
"""
    def __init__(self, reference_database=None, medical_entity_recognizer=None):
        # A curated reference corpus should be supplied in production;
        # defaulting to None keeps this sketch instantiable as shown later.
        self.reference_database = reference_database
self.nlp = spacy.load("en_core_web_md")
self.medical_ner = medical_entity_recognizer or self._load_medical_ner()
self.logger = logging.getLogger(__name__)
# Patterns that often indicate hallucinations
self.hallucination_patterns = [
r'\b\d+\.?\d*\s*%\s*of\s*people\b', # Specific percentages
r'\bstudies\s+show\b(?!\s+that\s+some)', # Unqualified study claims
r'\balways\s+\w+s?\b', # Absolute statements
r'\bnever\s+\w+s?\b', # Absolute negatives
r'\bguaranteed\s+to\b', # Certainty claims
r'\bclinically\s+proven\b', # Medical claims without citation
r'\b\d+\s*calories?\s*per\b', # Specific nutritional claims
r'\bexactly\s+\d+\b', # Overly precise numbers
]
# Hedge phrases that indicate appropriate uncertainty
self.hedge_phrases = [
'may', 'might', 'could', 'typically', 'often', 'sometimes',
'generally', 'usually', 'tends to', 'in many cases', 'for some people'
]
def detect_hallucinations(self, generated_content: str, episode_context: Dict) -> Dict:
"""
Comprehensive hallucination detection combining multiple techniques.
This method uses WER analysis against references, pattern matching,
entity verification, and claim extraction to identify potential
hallucinations with different confidence levels.
"""
# Find relevant reference content
references = self._find_relevant_references(generated_content, episode_context)
# Perform WER analysis against references
wer_results = self._calculate_wer_segments(generated_content, references)
# Extract and verify medical claims
medical_claims = self._extract_medical_claims(generated_content)
claim_verification = self._verify_medical_claims(medical_claims, references)
# Check for hallucination patterns
pattern_matches = self._check_hallucination_patterns(generated_content)
# Analyze hedge phrase usage
hedge_analysis = self._analyze_hedge_usage(generated_content)
# Combine all signals
hallucination_assessment = self._combine_hallucination_signals(
wer_results, claim_verification, pattern_matches, hedge_analysis
)
        return hallucination_assessment
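The detector's internal helpers aren't shown above. As a hedged sketch of two of them: a word-level error rate can be approximated with difflib's sequence matching (a production system would use a proper WER implementation such as jiwer), and the risk regexes can be applied directly. Inside the class these would read `self.hallucination_patterns` rather than taking a `patterns` argument.
def approximate_wer(generated_segment: str, reference_segment: str) -> float:
    """Rough word error rate: share of reference words not matched in the output."""
    gen_words = generated_segment.lower().split()
    ref_words = reference_segment.lower().split()
    if not ref_words:
        return 1.0
    matcher = difflib.SequenceMatcher(None, ref_words, gen_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return max(0.0, 1.0 - matched / len(ref_words))

def check_hallucination_patterns(content: str, patterns: List[str]) -> List[Dict]:
    """Return every match of the risk patterns, with position for reviewer context."""
    matches = []
    for pattern in patterns:
        for m in re.finditer(pattern, content, flags=re.IGNORECASE):
            matches.append({'pattern': pattern, 'text': m.group(0), 'position': m.start()})
    return matches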
Building Adaptive Reward Functions
Static rewards fail for the same reason static prompts do: they can't learn. An adaptive reward function should discover which signals predict long-term success and update itself continuously.
class AdaptiveRewardFunction:
"""
Self-learning reward function that adapts based on observed outcomes.
This system learns which quality signals actually predict long-term success
and adjusts its reward calculations accordingly. It's like having a
reward function that gets smarter over time.
"""
def __init__(self, initial_weights: Optional[Dict] = None):
self.current_weights = initial_weights or self._get_default_weights()
self.weight_history = [self.current_weights.copy()]
self.learning_rate = 0.01
self.adaptation_window = 1000 # Episodes to consider for adaptation
self.logger = logging.getLogger(__name__)
# Track correlations between signals and outcomes
self.signal_outcome_correlations = defaultdict(list)
self.outcome_buffer = deque(maxlen=self.adaptation_window)
def calculate_reward(self, evaluation: EpisodeEvaluation) -> float:
"""
Calculate reward using current adaptive weights.
This method applies learned weights to various quality signals,
emphasizing those that have proven predictive of success.
"""
reward = 0.0
# Apply adaptive weights to each dimension
for dimension, score in evaluation.final_scores.items():
weight = self.current_weights.get(dimension, 0.1)
# Non-linear transformation based on learned importance
if weight > 0.3: # High importance dimensions
transformed_score = score ** 0.8 # Less harsh transformation
else: # Lower importance dimensions
transformed_score = score ** 1.5 # Steeper transformation
reward += weight * transformed_score
# Apply special bonuses/penalties based on learned patterns
reward = self._apply_learned_adjustments(reward, evaluation)
return max(0.0, min(1.0, reward))
def observe_outcome(self, episode_id: str, evaluation: EpisodeEvaluation,
long_term_outcome: Dict):
"""
Learn from observed long-term outcomes to improve reward function.
This method creates the learning loop that makes the reward function
increasingly accurate at predicting what will lead to success.
"""
# Store outcome for correlation analysis
self.outcome_buffer.append({
'episode_id': episode_id,
'evaluation': evaluation,
'outcome': long_term_outcome,
'timestamp': datetime.now()
})
# Update signal-outcome correlations
self._update_correlations(evaluation, long_term_outcome)
# Adapt weights if we have enough data
if len(self.outcome_buffer) >= 100:
            self._adapt_weights()
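The `_adapt_weights` method is where the learning actually happens, and it isn't spelled out above. One plausible sketch, assuming each long-term outcome record carries a scalar `success_score`, nudges every dimension's weight toward its observed correlation with that outcome and renormalizes:
    def _adapt_weights(self):
        # Sketch only: correlate each dimension's evaluation score with observed
        # long-term success, then move weights a small step toward those correlations.
        outcomes = [o['outcome'].get('success_score', 0.0) for o in self.outcome_buffer]
        if np.std(outcomes) < 1e-6:
            return  # Outcomes carry no signal yet
        for dimension in list(self.current_weights):
            scores = [o['evaluation'].final_scores.get(dimension, 0.5) for o in self.outcome_buffer]
            if np.std(scores) < 1e-6:
                continue  # Dimension never varied; nothing to learn from
            correlation = np.corrcoef(scores, outcomes)[0, 1]
            target = max(0.0, correlation)  # Negatively correlated signals decay toward zero
            self.current_weights[dimension] += self.learning_rate * (target - self.current_weights[dimension])
        total = sum(self.current_weights.values())
        if total > 0:
            self.current_weights = {d: w / total for d, w in self.current_weights.items()}
        self.weight_history.append(self.current_weights.copy())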
Designing Evaluation as a System
Treat evaluation as a living system that needs monitoring, maintenance, and adjustment. Evaluation drift is real.
Common Pitfall: Building evaluation once and assuming it works forever. Evaluation systems degrade as user behavior changes, new content types emerge, and model capabilities evolve.
class EvaluationSystemManager:
"""
Manages the entire evaluation system as a living, breathing entity.
This manager ensures that our evaluation remains accurate, relevant,
and aligned with actual success metrics over time.
"""
def __init__(self):
self.evaluator = MultiDimensionalEvaluator()
self.hallucination_detector = HallucinationDetector()
self.reward_function = AdaptiveRewardFunction()
self.feedback_processor = TemporalFeedbackProcessor()
# System health monitoring
self.health_metrics = {
'evaluation_latency': deque(maxlen=1000),
'signal_availability': defaultdict(list),
'prediction_accuracy': deque(maxlen=1000),
'drift_indicators': []
}
self.logger = logging.getLogger(__name__)
def monitor_system_health(self) -> Dict:
"""
Monitor the health of the evaluation system itself.
This method detects when the evaluation system is degrading
or drifting from its intended behavior.
"""
health_report = {
'status': 'healthy',
'concerns': [],
'metrics': {}
}
# Check evaluation latency
if self.health_metrics['evaluation_latency']:
avg_latency = np.mean(list(self.health_metrics['evaluation_latency']))
if avg_latency > 5.0: # More than 5 seconds
health_report['concerns'].append({
'issue': 'high_latency',
'severity': 'medium',
'detail': f'Average evaluation latency: {avg_latency:.2f}s'
})
# Check for evaluation drift
if self._detect_evaluation_drift():
health_report['concerns'].append({
'issue': 'evaluation_drift',
'severity': 'high',
'detail': 'Evaluation predictions diverging from outcomes'
})
# Update overall status
if any(c['severity'] == 'high' for c in health_report['concerns']):
health_report['status'] = 'degraded'
elif health_report['concerns']:
health_report['status'] = 'warning'
        return health_report
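`_detect_evaluation_drift` is referenced but left abstract. A minimal version, assuming `prediction_accuracy` stores per-episode accuracy of the immediate predictions against eventual outcomes, compares a recent window to the longer-run baseline; the window sizes and the 0.1 tolerance are assumptions to tune.
    def _detect_evaluation_drift(self) -> bool:
        # Sketch: flag drift when recent prediction accuracy falls well below baseline.
        accuracy = list(self.health_metrics['prediction_accuracy'])
        if len(accuracy) < 200:
            return False  # Not enough history to judge drift reliably
        baseline = float(np.mean(accuracy[:-50]))
        recent = float(np.mean(accuracy[-50:]))
        return recent < baseline - 0.1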
Bringing It All Together
The full pipeline turns template selection from guesswork into a data-driven loop. You measure across dimensions, respect timing, detect hallucinations, and let rewards adapt based on outcomes.
def production_evaluation_pipeline(generated_episode: Dict) -> Dict:
"""
Complete production pipeline showing how all evaluation components integrate.
This demonstrates the full flow from content generation through
long-term learning and system adaptation.
"""
# Initialize evaluation system
eval_system = EvaluationSystemManager()
# Immediate evaluation and reward calculation
evaluation_result = eval_system.evaluate_episode(
episode_content=generated_episode['content'],
episode_metadata=generated_episode['metadata'],
template_id=generated_episode['template_id']
)
# Make immediate decision based on evaluation
if evaluation_result['hallucination_assessment']['hallucination_risk_score'] > 0.7:
# High hallucination risk - require human review
return {
'action': 'hold_for_review',
'reason': 'High hallucination risk detected',
'evaluation': evaluation_result
}
if evaluation_result['initial_reward'] < 0.3:
# Low quality - regenerate with different template
return {
'action': 'regenerate',
'reason': 'Initial quality below threshold',
'evaluation': evaluation_result
}
# Publish episode and begin collecting feedback
publish_result = publish_episode(generated_episode)
    # The evaluation continues asynchronously; in a real deployment this
    # delayed-evaluation loop would run in a background worker, not inline.
while True:
eval_system.process_delayed_evaluations()
# Monitor system health
health_status = eval_system.monitor_system_health()
if health_status['status'] == 'degraded':
alert_operations_team(health_status)
# The reward function adapts based on outcomes
# Template selection improves based on updated rewards
# The cycle continues, getting smarter with each iteration
        time.sleep(300)  # Check every 5 minutes
Key Takeaways
- Multidimensional evaluation aligns with real goals. Single metrics invite perverse incentives. Measure engagement, accuracy, readability, and more—then combine them thoughtfully.
- Time matters. Immediate signals are convenient, not always reliable. Model how feedback evolves; estimate missing signals rather than pausing progress.
- Hallucination detection needs layered methods. WER + semantic checks + pattern recognition beats keyword heuristics.
- Rewards should learn. Observe which signals predict long-term success and let weights adapt; don't freeze your assumptions in code.
- Operate your evaluator. Monitor latency, signal availability, prediction accuracy, and drift. Adjust before things go off the rails.
TL;DR Checklist
Setting Up Evaluation:
- Identify all quality dimensions that matter to stakeholders
- Design immediate evaluation for each dimension (even if low confidence)
- Plan delayed evaluation timeline for each dimension
- Set critical thresholds (e.g., minimum accuracy for health content)
Building the System:
- Implement confidence-weighted signal combination
- Create temporal feedback models with reliability curves
- Build multi-layer hallucination detection (WER + patterns + verification)
- Design adaptive reward functions with learning loops
- Add system health monitoring and drift detection
Operating in Production:
- Schedule delayed evaluations at key intervals
- Monitor evaluation system health metrics continuously
- Review and adjust dimension weights based on outcomes
- Flag high-risk content for human review
- Let the system learn and adapt, but supervise the learning
Common Mistakes to Avoid:
- Don't optimize for single metrics
- Don't treat all feedback as equally reliable
- Don't ignore temporal patterns in feedback
- Don't use static reward functions
- Don't deploy evaluation without monitoring
Looking Ahead
In our final post, we'll cover how to evolve template portfolios safely, run A/B tests at scale, and maintain these systems for the long haul: adding new templates without destabilizing quality, retiring underperformers gracefully, and ensuring your system keeps improving month after month.
The goal: turn prompt engineering from a one-off optimization into a continuous improvement engine. With intelligent selection, multidimensional evaluation, and adaptive evolution, you build AI systems that don't just work—they get better every day.