
Building the Selector: Retrieval, Bandits, and Cold Start Solutions
This is the second post in our series on probabilistic prompt pipelines. In the first post, we explored why static prompts become bottlenecks and saw a simple working example of probabilistic selection. Now we'll dive deep into the technical heart of production systems: how to build selectors that intelligently match context with templates while continuously learning from performance data.
The Challenge of Intelligent Selection
In our first post, we demonstrated template selection with a simplified example where any template could be used for any episode. But production systems face a more complex challenge: not every template is appropriate for every situation. A template designed for stress management advice might be catastrophically inappropriate for episodes about high-intensity fitness training, even if it has historically high performance scores.
This creates what we call the relevance-performance tension. You need templates that are both contextually appropriate and historically successful. A template that performs brilliantly for one type of content might fail completely when applied to different contexts. The selector must understand this nuance and make decisions that optimize both dimensions simultaneously.
Consider a concrete example from our wellness podcast system. Suppose you have five templates: one optimized for busy professionals managing work-life balance, another for new parents dealing with sleep deprivation and health maintenance, a third for older adults focusing on mobility and chronic condition management, a fourth for college students handling stress and building healthy habits, and a fifth for general wellness education. Each template has been carefully crafted with language, examples, and perspectives that resonate with its target audience.
Now imagine an episode request comes in with context indicating the listener is a 28-year-old working parent struggling with maintaining exercise routines while managing childcare responsibilities. The selector needs to understand that the new parent template is contextually relevant, even if the general wellness template happens to have slightly higher overall performance scores. Using the college student template would create content that's technically well-written but completely misaligned with the listener's life circumstances.
This challenge becomes even more complex when you consider that context isn't just demographic. Current health trends matter too. During periods when gentle movement and stress reduction are trending due to increased awareness of burnout, templates that acknowledge the importance of rest and recovery might be more appropriate than those that assume high-energy fitness goals, regardless of the listener's age or situation. The selector must understand these subtle contextual factors and weigh them appropriately in its decision-making process.
Understanding Two-Phase Selection Architecture
The solution we've developed uses what we call two-phase selection, which separates the relevance problem from the performance optimization problem. Think of this like how you might choose a wellness practitioner for a specific health concern: first you filter by specialization and approach to find practitioners who are qualified for your particular situation, then you pick the best-rated option from that filtered list based on reviews and outcomes.
The first phase, which we call contextual retrieval, uses semantic similarity to identify templates that are appropriate for the current episode's context. This phase asks the question: "Which templates are designed for situations like this one?" The second phase, which we call performance optimization, uses bandit algorithms to select the best-performing template from among the contextually appropriate options. This phase asks: "Which of these relevant templates is most likely to produce a successful episode?"
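Stripped to its essentials, the architecture is just these two calls in sequence. Here is a sketch of the overall shape (both components are implemented in full below):

def select_template(context):
    # Phase 1 (relevance): which templates are designed for situations like this?
    embedding = contextual_retriever.create_context_embedding(context)
    candidates = contextual_retriever.find_similar_templates(embedding)

    # Phase 2 (performance): which relevant template is most likely to succeed?
    return performance_optimizer.thompson_sample_selection(candidates, context)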
This separation of concerns provides several important benefits. First, it prevents high-performing templates from being misapplied to inappropriate contexts. A template that works brilliantly for one audience won't accidentally get used for a completely different audience just because it has good overall statistics. Second, it allows the performance optimization to focus on the relevant choice set, making the bandit algorithm more effective because it's not wasting exploration on fundamentally inappropriate options.
Perhaps most importantly, this architecture makes the system's decision-making process interpretable and debuggable. When you need to understand why a particular template was selected, you can examine both phases independently. You can ask whether the contextual retrieval identified the right set of candidates, and whether the performance optimization chose wisely among those candidates. This interpretability is crucial for maintaining and improving production systems.
Let's examine each phase in detail, starting with the contextual retrieval that ensures relevance before we optimize for performance.
Phase 1: Contextual Retrieval Through Semantic Similarity
The first phase of our selection process focuses entirely on understanding context and finding templates that are designed for similar situations. This is where we transform the rich contextual information about an episode into a mathematical representation that allows us to measure similarity between the current situation and the situations each template was designed to handle.
The process begins by creating an embedding vector that captures the semantic meaning of the episode context. This isn't just a simple concatenation of text fields; it's a thoughtful representation that emphasizes the contextual factors most important for template selection. We include demographic information about the target audience, the primary health topics being addressed, current wellness trends, and any specific goals or constraints that should influence the tone and approach of the content.
Think of this embedding as a multidimensional fingerprint that captures the essence of the episode's requirements. Similar episodes will have similar embeddings, while episodes that require fundamentally different approaches will have embeddings that are far apart in this semantic space. This mathematical representation allows us to search efficiently through potentially hundreds of templates to find those most appropriate for the current context.
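If you want intuition for what "far apart" means here, cosine similarity is one common way to measure closeness in embedding space. A toy sketch with made-up three-dimensional vectors (real embeddings have hundreds of dimensions):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means identical direction; values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

new_parent_context = np.array([0.9, 0.1, 0.2])   # toy embedding of a new-parent episode
new_parent_template = np.array([0.8, 0.2, 0.1])  # toy embedding of the new-parent template
student_template = np.array([0.1, 0.9, 0.3])     # toy embedding of the college-student template

print(cosine_similarity(new_parent_context, new_parent_template))  # ~0.99: a strong match
print(cosine_similarity(new_parent_context, student_template))     # ~0.27: clearly off-topic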
Here's how we implement the contextual retrieval system:
import numpy as np
from typing import List, Dict
import logging
from dataclasses import dataclass


@dataclass
class EpisodeContext:
    """
    Structured representation of all context needed for episode generation.

    This class captures the various dimensions of context that influence
    which templates are appropriate: audience demographics, health topics,
    wellness trends, and specific goals or constraints.
    """
    listener_age_range: str            # "25-34", "35-44", etc.
    listener_life_stage: str           # "early_career", "family_building", "pre_retirement"
    primary_topics: List[str]          # ["stress_management", "nutrition", "exercise"]
    wellness_trend_alignment: float    # 0.0 to 1.0, current wellness trend intensity
    episode_goals: List[str]           # ["actionable_advice", "emotional_support", "education"]
    fitness_level: str                 # "beginner", "intermediate", "advanced"
    special_considerations: List[str]  # ["chronic_pain", "limited_mobility", "time_constraints"]


class ContextualRetriever:
    """
    Handles the first phase of template selection: finding contextually relevant templates.

    This component transforms episode context into semantic embeddings and uses
    vector similarity search to identify templates designed for similar situations.
    The goal is relevance, not performance - we want templates that make sense
    for this context, regardless of their historical success rates.
    """

    def __init__(self, embedding_model, vector_database, similarity_threshold=0.7):
        self.embedding_model = embedding_model
        self.vector_database = vector_database
        self.similarity_threshold = similarity_threshold
        self.logger = logging.getLogger(__name__)

    def create_context_embedding(self, context: EpisodeContext) -> np.ndarray:
        """
        Transform episode context into a semantic embedding vector.

        This method creates a rich textual representation of the episode context
        that captures the nuances important for template matching. The embedding
        model then converts this into a vector that enables similarity search.
        """
        # Create a structured text representation that emphasizes key contextual factors
        context_text_parts = [
            f"Audience: {context.listener_life_stage} aged {context.listener_age_range}",
            f"Topics: {', '.join(context.primary_topics)}",
            f"Goals: {', '.join(context.episode_goals)}",
            f"Fitness level: {context.fitness_level}",
            f"Wellness trends: {'high intensity' if context.wellness_trend_alignment > 0.6 else 'gentle approach'}"
        ]

        # Include special considerations if present
        if context.special_considerations:
            context_text_parts.append(f"Special needs: {', '.join(context.special_considerations)}")

        # Combine into a coherent description
        context_description = ". ".join(context_text_parts)

        # Generate embedding using the same model used for template embeddings
        embedding = self.embedding_model.embed(context_description)
        self.logger.debug(f"Generated embedding for context: {context_description}")

        return embedding

    def find_similar_templates(self, context_embedding: np.ndarray, max_candidates: int = 8) -> List[Dict]:
        """
        Use vector similarity search to find templates appropriate for this context.

        This method searches through all available templates to find those with
        embeddings most similar to the current episode context. The similarity
        threshold ensures we only consider templates that are genuinely relevant.
        """
        # Search for templates with similar context embeddings
        similar_templates = self.vector_database.similarity_search(
            query_vector=context_embedding,
            top_k=max_candidates,
            similarity_threshold=self.similarity_threshold
        )

        # Log the retrieval results for debugging and monitoring
        self.logger.info(f"Found {len(similar_templates)} similar templates above threshold {self.similarity_threshold}")

        if len(similar_templates) < 2:
            # If we don't find enough similar templates, we need fallback strategies
            similar_templates = self._apply_fallback_retrieval(context_embedding, max_candidates)

        return similar_templates

    def _apply_fallback_retrieval(self, context_embedding: np.ndarray, max_candidates: int) -> List[Dict]:
        """
        Handle cases where we don't find enough contextually similar templates.

        This fallback system prevents the selector from failing when encountering
        novel contexts that don't closely match existing templates. We progressively
        relax similarity requirements and ultimately fall back to general-purpose templates.
        """
        self.logger.warning("Insufficient similar templates found, applying fallback strategies")

        # Try progressively lower similarity thresholds
        for fallback_threshold in [0.6, 0.5, 0.4]:
            fallback_results = self.vector_database.similarity_search(
                query_vector=context_embedding,
                top_k=max_candidates,
                similarity_threshold=fallback_threshold
            )
            if len(fallback_results) >= 2:
                self.logger.info(f"Fallback successful with threshold {fallback_threshold}")
                return fallback_results

        # Ultimate fallback: return general-purpose templates
        self.logger.warning("Using general-purpose templates as final fallback")
        return self.vector_database.get_general_purpose_templates(max_candidates)

    def analyze_retrieval_quality(self, context: EpisodeContext, retrieved_templates: List[Dict]) -> Dict:
        """
        Analyze the quality of contextual retrieval for monitoring and debugging.

        This method helps us understand whether the retrieval system is working
        effectively and identifies cases where we might need to improve template
        coverage or adjust similarity thresholds.
        """
        quality_metrics = {
            'retrieval_count': len(retrieved_templates),
            'min_similarity': min(t['similarity_score'] for t in retrieved_templates) if retrieved_templates else 0,
            'max_similarity': max(t['similarity_score'] for t in retrieved_templates) if retrieved_templates else 0,
            'used_fallback': any(t.get('is_fallback', False) for t in retrieved_templates)
        }

        # Check for potential gaps in template coverage
        if quality_metrics['min_similarity'] < 0.5:
            quality_metrics['coverage_concern'] = True
            quality_metrics['suggestion'] = "Consider creating templates for this context type"

        return quality_metrics
The contextual retrieval system creates a foundation for intelligent template selection by ensuring that performance optimization only occurs within the set of templates that actually make sense for the current situation. This prevents the kinds of mismatches that can occur when high-performing templates get applied inappropriately.
Notice how the fallback system ensures that the selector never fails outright, even when it encounters entirely novel contexts. This robustness is essential for production systems, where you can't predict every possible combination of contextual factors that might arise.
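To make the wiring concrete, here is a hypothetical usage sketch; embedding_model and vector_database stand in for whichever embedding service and vector store you already run:

ctx = EpisodeContext(
    listener_age_range="25-34",
    listener_life_stage="family_building",
    primary_topics=["exercise", "stress_management"],
    wellness_trend_alignment=0.4,
    episode_goals=["actionable_advice"],
    fitness_level="beginner",
    special_considerations=["time_constraints"]
)

retriever = ContextualRetriever(embedding_model, vector_database, similarity_threshold=0.7)
embedding = retriever.create_context_embedding(ctx)
candidates = retriever.find_similar_templates(embedding, max_candidates=8)

# Worth logging in production: flags thin template coverage for this context type
quality = retriever.analyze_retrieval_quality(ctx, candidates)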
Phase 2: Performance Optimization with Thompson Sampling
Once we have a set of contextually relevant templates, the second phase focuses entirely on selecting the template most likely to produce a successful episode. This is where we apply bandit algorithms to balance exploiting templates with proven track records against exploring templates that might perform even better.
The algorithm we use, Thompson Sampling, is particularly well-suited for this application because it naturally handles the exploration-exploitation tradeoff without requiring manual tuning of exploration parameters. The key insight behind Thompson Sampling is that instead of trying to estimate each template's exact performance rate, we maintain a probability distribution representing our uncertainty about that performance rate.
Think of this like having confidence intervals around each template's success rate. A template that has been used many times and consistently performed well will have a narrow confidence interval centered on a high rate. A template that has been used only a few times will have a wide confidence interval reflecting our uncertainty about its true performance. A brand new template will have the widest confidence interval of all.
Thompson Sampling works by sampling a performance rate from each template's distribution, then selecting the template with the highest sampled rate. This approach naturally gives more chances to templates with higher uncertainty, ensuring that potentially excellent templates don't get overlooked just because they haven't been tested extensively yet.
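Before the production implementation, a ten-line simulation makes this behavior tangible. With illustrative counts for three templates, sampling a rate from each Beta posterior and keeping the winner mostly picks the proven template, yet still routes real traffic to the uncertain ones:

import numpy as np

rng = np.random.default_rng(42)

# (successes, failures) observed for three templates - illustrative numbers only
templates = {
    "proven":    (80, 20),  # many uses, ~80% success: narrow Beta distribution
    "promising": (4, 1),    # few uses, ~80% success: wide distribution
    "brand_new": (0, 0),    # no data: Beta(1, 1) is uniform over [0, 1]
}

wins = {name: 0 for name in templates}
for _ in range(10_000):
    # Sample one plausible success rate per template, pick the highest
    sampled = {name: rng.beta(s + 1, f + 1) for name, (s, f) in templates.items()}
    wins[max(sampled, key=sampled.get)] += 1

print(wins)  # "proven" wins most draws, but the uncertain templates still get explored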
Here's how we implement the performance optimization phase:
import logging
import numpy as np
from typing import Dict, List
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TemplatePerformanceStats:
    """
    Tracks performance statistics for a single template.

    This data structure captures both the raw performance data and metadata
    that helps us understand the context and reliability of that performance.
    """
    template_id: str
    total_uses: int
    successes: int
    failures: int
    recent_uses: int          # Uses in last 30 days
    last_updated: datetime
    context_tags: List[str]   # What contexts this template has been used for


class PerformanceOptimizer:
    """
    Handles the second phase of template selection: choosing the best performer
    from among contextually relevant templates.

    This component uses Thompson Sampling to balance exploitation of proven
    templates with exploration of potentially better options. It maintains
    detailed performance statistics and handles the complexities of learning
    from sparse, noisy feedback.
    """

    def __init__(self, stats_database, exploration_bonus=0.05, min_uses_for_confidence=10):
        self.stats_database = stats_database
        self.exploration_bonus = exploration_bonus
        self.min_uses_for_confidence = min_uses_for_confidence
        self.logger = logging.getLogger(__name__)

    def thompson_sample_selection(self, candidate_templates: List[Dict], context: EpisodeContext) -> Dict:
        """
        Use Thompson Sampling to select the best template from contextually relevant candidates.

        This method samples from each template's performance distribution and selects
        the template with the highest sampled performance. This naturally balances
        using proven winners while still exploring potentially better options.
        """
        if not candidate_templates:
            raise ValueError("Cannot select from empty candidate list")

        sampled_scores = {}
        template_info = {}

        for template in candidate_templates:
            template_id = template['id']

            # Get performance statistics for this template
            stats = self._get_template_stats(template_id)

            # Apply cold start handling for templates with limited data
            adjusted_stats = self._apply_cold_start_adjustment(stats, template)

            # Sample from the template's performance distribution
            sampled_score = self._sample_performance_distribution(adjusted_stats)

            sampled_scores[template_id] = sampled_score
            template_info[template_id] = {
                'template_data': template,
                'stats': stats,
                'adjusted_stats': adjusted_stats,
                'sampled_score': sampled_score
            }

        # Select the template with the highest sampled score
        best_template_id = max(sampled_scores.keys(), key=lambda tid: sampled_scores[tid])
        selected_info = template_info[best_template_id]

        # Log the selection decision for monitoring and debugging
        self._log_selection_decision(selected_info, sampled_scores, context)

        return {
            'template_id': best_template_id,
            'template_data': selected_info['template_data'],
            'selection_metadata': {
                'sampled_score': selected_info['sampled_score'],
                'all_scores': sampled_scores,
                'stats_used': selected_info['adjusted_stats'],
                'selection_reason': self._explain_selection(selected_info, template_info)
            }
        }

    def _get_template_stats(self, template_id: str) -> TemplatePerformanceStats:
        """
        Retrieve performance statistics for a template, with sensible defaults for new templates.

        This method handles the case where we don't have any performance data yet
        by providing reasonable prior beliefs about template performance.
        """
        stored_stats = self.stats_database.get_template_stats(template_id)

        if stored_stats is None:
            # New template - use optimistic priors to encourage exploration
            return TemplatePerformanceStats(
                template_id=template_id,
                total_uses=0,
                successes=1,  # Optimistic prior
                failures=1,   # But not overconfident
                recent_uses=0,
                last_updated=datetime.now(),
                context_tags=[]
            )

        return stored_stats

    def _apply_cold_start_adjustment(self, stats: TemplatePerformanceStats, template: Dict) -> TemplatePerformanceStats:
        """
        Apply special handling for templates with limited performance data.

        This method gives newer templates a fighting chance by adding an exploration
        bonus that decreases as we gather more data about their true performance.
        The bonus prevents new templates from being ignored just because they
        lack extensive track records.
        """
        if stats.total_uses >= self.min_uses_for_confidence:
            # Template has enough data, no adjustment needed
            return stats

        # Calculate exploration bonus based on how little data we have.
        # Kept as a float: the Beta distribution accepts fractional pseudo-counts,
        # and truncating to int would round small bonuses down to zero.
        data_scarcity = 1.0 - (stats.total_uses / self.min_uses_for_confidence)
        exploration_boost = self.exploration_bonus * data_scarcity * 10

        # Create adjusted stats with exploration bonus
        adjusted_stats = TemplatePerformanceStats(
            template_id=stats.template_id,
            total_uses=stats.total_uses,
            successes=stats.successes + exploration_boost,
            failures=stats.failures,
            recent_uses=stats.recent_uses,
            last_updated=stats.last_updated,
            context_tags=stats.context_tags
        )

        self.logger.debug(f"Applied cold start bonus of {exploration_boost:.2f} to template {stats.template_id}")
        return adjusted_stats

    def _sample_performance_distribution(self, stats: TemplatePerformanceStats) -> float:
        """
        Sample from a template's performance distribution using the Beta distribution.

        The Beta distribution is perfect for modeling success rates because it's
        bounded between 0 and 1, and its shape is determined by the number of
        successes and failures we've observed. High uncertainty (few observations)
        leads to wide distributions that encourage exploration.
        """
        alpha = stats.successes + 1  # Add 1 for Bayesian smoothing
        beta = stats.failures + 1

        # Sample from Beta(alpha, beta) distribution
        sampled_rate = np.random.beta(alpha, beta)
        return sampled_rate

    def _explain_selection(self, selected_info: Dict, all_template_info: Dict) -> str:
        """
        Generate a human-readable explanation of why this template was selected.

        This explanation helps with debugging and monitoring by making the
        selection process transparent and interpretable.
        """
        selected_stats = selected_info['adjusted_stats']
        selected_score = selected_info['sampled_score']

        # Analyze why this template won
        if selected_stats.total_uses < self.min_uses_for_confidence:
            return f"Selected due to exploration bonus (only {selected_stats.total_uses} uses)"
        elif selected_score > 0.8:
            return f"Selected as high-confidence winner (sampled {selected_score:.3f})"
        else:
            return f"Selected as best available option (sampled {selected_score:.3f})"

    def _log_selection_decision(self, selected_info: Dict, all_scores: Dict, context: EpisodeContext):
        """
        Log the selection decision for monitoring and analysis.

        This logging provides the data needed to understand system behavior,
        debug selection issues, and identify opportunities for improvement.
        """
        selection_log = {
            'selected_template': selected_info['template_data']['id'],
            'sampled_score': selected_info['sampled_score'],
            'all_sampled_scores': all_scores,
            'context_summary': f"{context.listener_life_stage}_{context.fitness_level}",
            'selection_timestamp': datetime.now().isoformat()
        }
        self.logger.info(f"Template selection: {selection_log}")

    def update_performance(self, template_id: str, episode_success: bool, context: EpisodeContext, performance_details: Dict):
        """
        Update performance statistics based on episode results.

        This method closes the learning loop by incorporating new performance
        data into our statistical models. The context and performance details
        help us understand when and why templates succeed or fail.
        """
        current_stats = self._get_template_stats(template_id)

        # Update success/failure counts
        if episode_success:
            new_successes = current_stats.successes + 1
            new_failures = current_stats.failures
        else:
            new_successes = current_stats.successes
            new_failures = current_stats.failures + 1

        # Update context tags to track what situations this template has been used for
        context_tag = f"{context.listener_life_stage}_{context.fitness_level}"
        updated_context_tags = list(set(current_stats.context_tags + [context_tag]))

        # Create updated statistics
        updated_stats = TemplatePerformanceStats(
            template_id=template_id,
            total_uses=current_stats.total_uses + 1,
            successes=new_successes,
            failures=new_failures,
            recent_uses=current_stats.recent_uses + 1,
            last_updated=datetime.now(),
            context_tags=updated_context_tags
        )

        # Store the updated statistics
        self.stats_database.save_template_stats(updated_stats)

        # Also store detailed performance data for analysis
        self.stats_database.save_performance_detail(
            template_id=template_id,
            context=context,
            success=episode_success,
            performance_metrics=performance_details,
            timestamp=datetime.now()
        )

        self.logger.info(f"Updated stats for template {template_id}: {new_successes} successes / {new_failures} failures (priors included)")
The performance optimization phase completes our intelligent selection system by ensuring that we consistently choose the most promising template from among the contextually appropriate options. The Thompson Sampling algorithm naturally handles the complex tradeoffs between proven performance and potential upside, creating a system that learns continuously without getting stuck in local optima.
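Continuing the Phase 1 sketch, here is roughly how this phase slots into the episode loop (stats_database and generate_and_measure_episode are placeholders for your own storage and generation pipeline):

optimizer = PerformanceOptimizer(stats_database)

selection = optimizer.thompson_sample_selection(candidates, ctx)
metrics = generate_and_measure_episode(selection['template_data'], ctx)  # your pipeline

# Close the learning loop once results are in
optimizer.update_performance(
    template_id=selection['template_id'],
    episode_success=metrics.get('engagement_90s', 0.0) >= 0.7,
    context=ctx,
    performance_details=metrics
)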
Handling Content Diversity and Pattern Recognition
As we've built and operated this selection system in production, we've discovered an additional layer of intelligence that significantly improves content quality: tracking content diversity and recognizing successful patterns across recent episodes. The basic two-phase approach works well for individual episode optimization, but it doesn't consider the broader content strategy across multiple episodes.
Think about this from a listener's perspective. If someone follows your podcast regularly, they don't want to hear the same themes or approaches repeated frequently, even if those approaches are individually successful. A template that works brilliantly for stress management episodes might produce excellent content every time it's used, but if it gets selected for three episodes in a row, regular listeners will notice the repetition and start losing interest.
Conversely, we've observed that certain content patterns tend to perform well during specific time periods or wellness trends. For example, during periods of high stress awareness, episodes that acknowledge mental health and provide emotional support tend to outperform those that focus purely on physical fitness advice, regardless of which specific template is used. The system should learn to recognize these patterns and factor them into selection decisions.
Here's how we enhance the selector to consider content diversity and emerging patterns:
import logging
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Dict, List


class EnhancedPromptSelector:
    """
    Enhanced selector that considers context, performance, diversity, and emerging patterns.

    This version builds on the two-phase approach by adding intelligence about
    content variety and pattern recognition. It ensures that the podcast maintains
    freshness for regular listeners while adapting to emerging trends in what
    content resonates with audiences.
    """

    def __init__(self, contextual_retriever, performance_optimizer, diversity_tracker=None, pattern_analyzer=None):
        self.contextual_retriever = contextual_retriever
        self.performance_optimizer = performance_optimizer
        self.diversity_tracker = diversity_tracker or ContentDiversityTracker()
        self.pattern_analyzer = pattern_analyzer or ContentPatternAnalyzer()

        # Track recent episode content for diversity and pattern analysis
        self.recent_episodes = deque(maxlen=50)  # Last 50 episodes
        self.logger = logging.getLogger(__name__)

    def select_template(self, context: EpisodeContext) -> Dict:
        """
        Enhanced template selection that considers relevance, performance, diversity, and patterns.

        This method orchestrates all four factors: contextual relevance ensures
        appropriateness, performance optimization drives quality, diversity
        tracking prevents staleness, and pattern recognition adapts to what's
        currently working well.
        """
        # Phase 1: Find contextually relevant templates
        context_embedding = self.contextual_retriever.create_context_embedding(context)
        relevant_templates = self.contextual_retriever.find_similar_templates(context_embedding)

        # Phase 2: Apply diversity filtering to prevent repetitive content
        diversity_filtered = self._apply_diversity_considerations(relevant_templates, context)

        # Phase 3: Enhanced performance optimization with pattern awareness
        selected_template = self._pattern_aware_selection(diversity_filtered, context)

        return selected_template

    def _apply_diversity_considerations(self, templates: List[Dict], context: EpisodeContext) -> List[Dict]:
        """
        Adjust template selection to promote content diversity and prevent theme staleness.

        This method analyzes recent episode content to identify overused themes
        or approaches, then adjusts template selection probabilities to encourage
        variety while still respecting performance data.
        """
        if len(self.recent_episodes) < 5:
            # Not enough history for diversity analysis
            return templates

        # Analyze recent content patterns
        recent_themes = self._extract_recent_themes()
        recent_approaches = self._extract_recent_approaches()
        current_theme = self._classify_episode_theme(context)

        # Check if current theme is oversaturated
        theme_frequency = recent_themes.get(current_theme, 0) / len(self.recent_episodes)

        for template in templates:
            template_approach = template.get('approach_style', 'standard')

            # Calculate diversity adjustments
            diversity_penalty = 0.0
            diversity_bonus = 0.0

            # Penalize overused themes
            if theme_frequency > 0.3:  # More than 30% of recent episodes
                if template_approach in recent_approaches.get(current_theme, []):
                    diversity_penalty = 0.1  # We've used this approach for this theme recently
                else:
                    diversity_bonus = 0.1    # Fresh approach to familiar theme

            # Bonus for underused approaches
            approach_frequency = sum(1 for ep in self.recent_episodes
                                     if ep.get('approach_style') == template_approach) / len(self.recent_episodes)
            if approach_frequency < 0.1:  # Less than 10% recent usage
                diversity_bonus += 0.05

            # Store diversity adjustments for use in selection
            template['diversity_adjustment'] = diversity_bonus - diversity_penalty
            self.logger.debug(f"Template {template['id']} diversity adjustment: {template['diversity_adjustment']:.3f}")

        return templates

    # Simple placeholder implementations of the helpers used above;
    # adapt these to your own theme taxonomy and summarization approach.

    def _extract_recent_themes(self) -> Dict[str, int]:
        """Count how often each theme appears in the recent episode history."""
        theme_counts: Dict[str, int] = defaultdict(int)
        for episode in self.recent_episodes:
            theme_counts[episode.get('episode_theme', 'general')] += 1
        return theme_counts

    def _extract_recent_approaches(self) -> Dict[str, List[str]]:
        """Map each recent theme to the approach styles already used for it."""
        theme_approaches: Dict[str, List[str]] = defaultdict(list)
        for episode in self.recent_episodes:
            theme_approaches[episode.get('episode_theme', 'general')].append(
                episode.get('approach_style', 'standard'))
        return theme_approaches

    def _classify_episode_theme(self, context: EpisodeContext) -> str:
        """Derive a coarse theme label from the context (placeholder: first primary topic)."""
        return context.primary_topics[0] if context.primary_topics else 'general'

    def _summarize_content(self, generated_content: str) -> str:
        """Produce a compact summary for the episode record (placeholder: truncated excerpt)."""
        return generated_content[:280]

    def _pattern_aware_selection(self, templates: List[Dict], context: EpisodeContext) -> Dict:
        """
        Perform Thompson sampling enhanced with recent content pattern recognition.

        This method identifies patterns in recent high-performing content and
        adjusts selection probabilities to favor templates that align with
        successful emerging trends.
        """
        # Identify successful patterns from recent episodes
        success_patterns = self.pattern_analyzer.identify_current_patterns(self.recent_episodes)

        # Apply pattern bonuses to templates
        for template in templates:
            pattern_bonus = self._calculate_pattern_alignment_bonus(template, success_patterns, context)
            template['pattern_bonus'] = pattern_bonus

        # Perform enhanced Thompson sampling with all adjustments
        return self._enhanced_thompson_sampling(templates, context)

    def _calculate_pattern_alignment_bonus(self, template: Dict, success_patterns: Dict, context: EpisodeContext) -> float:
        """
        Calculate bonus for templates that align with recently successful content patterns.

        This method looks at what characteristics of recent content have driven
        success and boosts templates that embody those characteristics. The bonus
        helps the system adapt to changing preferences or wellness trends.
        """
        bonus = 0.0
        template_features = template.get('features', {})

        # Check alignment with each successful pattern
        for pattern_name, pattern_strength in success_patterns.items():
            if pattern_name == 'concrete_examples' and template_features.get('encourages_examples', False):
                bonus += 0.08 * pattern_strength
            elif pattern_name == 'emotional_support' and template_features.get('supportive_tone', False):
                bonus += 0.06 * pattern_strength
            elif pattern_name == 'actionable_advice' and template_features.get('action_oriented', False):
                bonus += 0.10 * pattern_strength
            elif pattern_name == 'trend_awareness' and context.wellness_trend_alignment > 0.6:
                if template_features.get('trend_conscious', False):
                    bonus += 0.12 * pattern_strength

        # Cap the bonus to prevent it from overwhelming other factors
        return min(bonus, 0.15)

    def _enhanced_thompson_sampling(self, templates: List[Dict], context: EpisodeContext) -> Dict:
        """
        Perform Thompson sampling with diversity and pattern adjustments.

        This method combines all our intelligence sources: base performance data,
        diversity considerations, and pattern recognition to make the most
        informed selection possible.
        """
        best_score = -1
        best_template = None
        sampling_details = {}

        for template in templates:
            # Get base Thompson sampling score
            base_score = self.performance_optimizer._sample_performance_distribution(
                self.performance_optimizer._get_template_stats(template['id'])
            )

            # Apply all adjustments
            diversity_adj = template.get('diversity_adjustment', 0.0)
            pattern_bonus = template.get('pattern_bonus', 0.0)
            final_score = base_score + diversity_adj + pattern_bonus

            sampling_details[template['id']] = {
                'base_score': base_score,
                'diversity_adjustment': diversity_adj,
                'pattern_bonus': pattern_bonus,
                'final_score': final_score
            }

            if final_score > best_score:
                best_score = final_score
                best_template = template

        # Log the enhanced selection for analysis
        self.logger.info(f"Enhanced selection details: {sampling_details}")

        return {
            'template_id': best_template['id'],
            'template_data': best_template,
            'selection_metadata': {
                'final_score': best_score,
                'sampling_breakdown': sampling_details[best_template['id']],
                'selection_factors': 'base_performance + diversity + patterns'
            }
        }

    def record_episode_completion(self, template_id: str, template_data: Dict, context: EpisodeContext,
                                  generated_content: str, performance_metrics: Dict):
        """
        Record completed episode for future diversity and pattern analysis.

        This method creates the feedback loop that enables learning about
        content strategy beyond just individual template performance.
        """
        episode_record = {
            'template_id': template_id,
            'template_data': template_data,
            'context': context,
            'content_summary': self._summarize_content(generated_content),
            'performance_metrics': performance_metrics,
            'episode_theme': self._classify_episode_theme(context),
            'approach_style': template_data.get('approach_style', 'standard'),
            'timestamp': datetime.now(),
            'success': performance_metrics.get('overall_success', False)
        }

        # Add to recent episodes for future analysis
        self.recent_episodes.append(episode_record)

        # Update diversity tracker and pattern analyzer
        self.diversity_tracker.update_with_episode(episode_record)
        self.pattern_analyzer.update_with_episode(episode_record)

        # Also update base performance statistics
        self.performance_optimizer.update_performance(
            template_id,
            episode_record['success'],
            context,
            performance_metrics
        )


class ContentDiversityTracker:
    """
    Tracks content themes and approaches to ensure variety across episodes.

    This component helps prevent the podcast from becoming repetitive by
    monitoring theme frequency and approach diversity, providing data that
    influences template selection to maintain listener engagement.
    """

    def __init__(self, max_history=100):
        self.theme_history = deque(maxlen=max_history)
        self.approach_history = deque(maxlen=max_history)
        self.theme_approach_combinations = defaultdict(list)

    def update_with_episode(self, episode_record: Dict):
        """Update tracking with new episode data."""
        theme = episode_record['episode_theme']
        approach = episode_record['approach_style']

        self.theme_history.append(theme)
        self.approach_history.append(approach)
        self.theme_approach_combinations[theme].append(approach)

    def get_theme_saturation(self, theme: str) -> float:
        """Calculate how frequently a theme has appeared recently."""
        if not self.theme_history:
            return 0.0
        return list(self.theme_history).count(theme) / len(self.theme_history)

    def get_underused_approaches(self, threshold: float = 0.1) -> List[str]:
        """Identify approaches that haven't been used much recently."""
        if not self.approach_history:
            return []

        approach_frequencies = defaultdict(int)
        for approach in self.approach_history:
            approach_frequencies[approach] += 1

        total_episodes = len(self.approach_history)
        underused = []
        for approach, count in approach_frequencies.items():
            if count / total_episodes < threshold:
                underused.append(approach)

        return underused


class ContentPatternAnalyzer:
    """
    Identifies patterns in successful content to guide future template selection.

    This component goes beyond individual template performance to understand
    what content characteristics drive success, helping the system adapt to
    changing audience preferences and wellness trends.
    """

    def __init__(self, success_threshold=0.7, min_episodes_for_pattern=10):
        self.success_threshold = success_threshold
        self.min_episodes_for_pattern = min_episodes_for_pattern
        self.pattern_cache = {}
        self.cache_timestamp = None

    def identify_current_patterns(self, recent_episodes: List[Dict]) -> Dict[str, float]:
        """
        Analyze recent episodes to identify patterns that correlate with success.

        This method looks for content characteristics that appear more frequently
        in successful episodes than in unsuccessful ones, indicating they might
        be driving the success.
        """
        if len(recent_episodes) < self.min_episodes_for_pattern:
            return {}

        # Check if we can use cached results
        if self._can_use_cache():
            return self.pattern_cache

        # Separate successful from unsuccessful episodes
        successful_episodes = [ep for ep in recent_episodes
                               if ep['performance_metrics'].get('engagement_90s', 0) >= self.success_threshold]

        if len(successful_episodes) < 5:  # Need minimum successful episodes
            return {}

        patterns = {}

        # Analyze various content characteristics
        patterns.update(self._analyze_content_features(successful_episodes, recent_episodes))
        patterns.update(self._analyze_contextual_factors(successful_episodes, recent_episodes))
        patterns.update(self._analyze_timing_patterns(successful_episodes, recent_episodes))

        # Cache the results
        self.pattern_cache = patterns
        self.cache_timestamp = datetime.now()

        return patterns

    def _analyze_content_features(self, successful: List[Dict], all_episodes: List[Dict]) -> Dict[str, float]:
        """Analyze content features that correlate with success."""
        patterns = {}

        # Check for concrete examples pattern
        concrete_in_successful = sum(1 for ep in successful
                                     if ep['template_data'].get('features', {}).get('encourages_examples', False))
        concrete_in_all = sum(1 for ep in all_episodes
                              if ep['template_data'].get('features', {}).get('encourages_examples', False))

        if concrete_in_all > 0:
            success_rate_with_concrete = concrete_in_successful / len(successful)
            overall_rate_concrete = concrete_in_all / len(all_episodes)
            if success_rate_with_concrete > overall_rate_concrete * 1.2:  # 20% lift
                patterns['concrete_examples'] = min(success_rate_with_concrete - overall_rate_concrete, 1.0)

        # Similar analysis for other features
        return patterns

    def _analyze_contextual_factors(self, successful: List[Dict], all_episodes: List[Dict]) -> Dict[str, float]:
        """Analyze contextual factors behind success (placeholder for the full analysis)."""
        return {}

    def _analyze_timing_patterns(self, successful: List[Dict], all_episodes: List[Dict]) -> Dict[str, float]:
        """Analyze timing-related success patterns (placeholder for the full analysis)."""
        return {}

    def _can_use_cache(self) -> bool:
        """Check if cached pattern analysis is still valid."""
        if not self.cache_timestamp:
            return False
        # Cache is valid for 24 hours
        return (datetime.now() - self.cache_timestamp) < timedelta(hours=24)

    def update_with_episode(self, episode_record: Dict):
        """Update pattern analysis with new episode data."""
        # Invalidate cache when new data arrives
        self.cache_timestamp = None
This enhanced selection system creates a sophisticated understanding of content strategy that goes far beyond individual template performance. It ensures that regular listeners encounter varied, fresh content while the system continuously adapts to emerging patterns in what resonates with audiences.
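Wiring it together looks roughly like this (generate_episode_content and measure_episode_performance are placeholders for your own generation and measurement steps):

selector = EnhancedPromptSelector(retriever, optimizer)

selection = selector.select_template(ctx)
content = generate_episode_content(selection['template_data'], ctx)  # placeholder
metrics = measure_episode_performance(content)                       # placeholder

# One call feeds diversity tracking, pattern analysis, and the bandit statistics
selector.record_episode_completion(
    template_id=selection['template_id'],
    template_data=selection['template_data'],
    context=ctx,
    generated_content=content,
    performance_metrics=metrics
)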
Building Your Production Selector
Now that we've explored the complete architecture of intelligent template selection, let's discuss how to implement this system in your own production environment. The key is to start simple and add sophistication gradually as you gain experience and gather data.
Begin with the basic two-phase approach we demonstrated. Implement contextual retrieval using embeddings and similarity search, then add Thompson sampling for performance optimization. This foundation provides immediate benefits over static prompts while establishing the infrastructure for more advanced features.
Focus first on getting the data collection right. You need reliable ways to measure template performance, track contextual factors, and store the statistics that drive learning. Without good data, even the most sophisticated algorithms won't help. Start with simple metrics like engagement rates or completion rates, then gradually add more nuanced measures as you understand what drives success in your specific application.
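As a first pass, something this simple is enough to start the learning loop. The thresholds below are illustrative and should be tuned against your own baselines, and completion_rate is a hypothetical signal alongside the engagement_90s measure the pattern analyzer uses:

from typing import Dict

def episode_success(metrics: Dict[str, float]) -> bool:
    """Collapse raw engagement signals into the boolean outcome the bandit learns from."""
    return (
        metrics.get('completion_rate', 0.0) >= 0.6     # hypothetical completion signal
        and metrics.get('engagement_90s', 0.0) >= 0.7  # the engagement signal used elsewhere
    )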
Once you have the basic system working and collecting data, you can add diversity tracking and pattern recognition. These enhancements provide significant value, but they require sufficient historical data to work effectively. Don't try to build everything at once; let each component prove its value before adding the next layer of complexity.
Remember that the specific implementation details will depend heavily on your use case, infrastructure, and performance requirements. The concepts we've explored—contextual relevance, performance optimization, diversity considerations, and pattern recognition—are universal, but how you implement them should reflect your specific constraints and goals.
Looking Forward: Reward Engineering and Evaluation
This post has shown you how to build sophisticated template selection systems that intelligently balance relevance, performance, diversity, and emerging patterns. But selection is only half the story. To create systems that truly improve over time, you need evaluation frameworks that accurately measure what you care about and reward functions that guide the system toward your actual objectives.
In our next post, we'll dive deep into reward engineering and evaluation design. We'll explore how to combine multiple quality signals into reliable performance measures, how to handle the inevitable noise and delays in feedback, and how to design evaluation systems that remain aligned with your goals as your application evolves. We'll also examine the sophisticated use of Word Error Rate (WER) for detecting hallucinations in generated content, and how to build adaptive reward functions that learn what matters most for your specific use case.
The goal is to transform evaluation from an afterthought into the engine that drives continuous improvement. When you get evaluation right, your template selection system becomes truly autonomous, adapting automatically to changing conditions while maintaining the quality standards that matter most to your users.