Technical Guide
Analysis methods, models, and metrics behind VideoTracker
System Overview
After you create a tracker and add channels or videos, analysis runs asynchronously; you do not need to stay on the page.
Pipeline
- Channel and video metadata are fetched from the YouTube API and stored
- Transcripts are collected and stored for text-based analysis
- Stored text is fed into pipelines for sentiment and emotion, topic distribution, clustering, and narrative detection
- The web dashboard calls the VideoTracker API and renders visualizations for posting frequency, daily video views, top keywords, social media footprint, language distribution, related videos, sentiment analysis, topic distribution, narratives, and clusters
Content Collection
When a channel or video is added, VideoTracker fetches and stores metadata and runs background tasks for related data.
What Happens in the Background When a Channel or Video Is Added
- When a video is added: The YouTube API is called to fetch full video metadata and store it. Background tasks then run in parallel: related videos are fetched, comments are fetched, and the transcript is fetched. After the transcript is stored, sentiment/emotion and keyword extraction run for that video.
- When a channel is added: The YouTube API is called to fetch channel metadata and store it. Background tasks then run in parallel: the channel's videos are fetched and added; for each new video, the same post-add pipeline runs (related videos, comments, transcript, then sentiment/keywords). Social media links for the channel are scraped and stored for the social media footprint.
- All of this runs asynchronously; you do not need to stay on the page. Data is stored in the database and forms the foundation for topic extraction, clustering, and narrative generation.
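The ordering described above (three background tasks in parallel, with sentiment and keyword extraction waiting on the transcript) can be sketched with asyncio. This is an illustration only: every function name below is a hypothetical stand-in, and the real service may use a different task runner entirely.

```python
import asyncio

# Hypothetical stand-ins for the real background tasks; each only records that it ran.
async def fetch_related(video_id, log):
    log.append(("related", video_id))

async def fetch_comments(video_id, log):
    log.append(("comments", video_id))

async def fetch_transcript(video_id, log):
    log.append(("transcript", video_id))
    return f"transcript of {video_id}"

async def analyze_text(video_id, transcript, log):
    # Sentiment/emotion and keyword extraction need the stored transcript.
    log.append(("sentiment_keywords", video_id))

async def transcript_then_analysis(video_id, log):
    transcript = await fetch_transcript(video_id, log)
    await analyze_text(video_id, transcript, log)

async def on_video_added(video_id):
    log = []
    # Related videos, comments, and the transcript chain run concurrently.
    await asyncio.gather(
        fetch_related(video_id, log),
        fetch_comments(video_id, log),
        transcript_then_analysis(video_id, log),
    )
    return log

events = asyncio.run(on_video_added("abc123"))
```

The only ordering guarantee in this sketch is that the text analysis step runs after the transcript step for the same video.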
Posting Frequency
Tracks when videos are published over time, overall or per channel. Optional date filters apply to the range of data shown.
Overview
- Dashboard shows aggregate posting frequency; the deeper analysis page adds per channel series, video category distribution, and channel details
How It Works
- Videos come from the tracker (directly added or via channels). Only videos with a valid publication date are counted.
- When "per channel" is requested, the same logic is applied per channel so you can compare posting patterns across channels.
- Backend: the posting frequency endpoint returns a time series. A separate mode returns the same series broken down by channel. Date range is supported as a query parameter.
- Dashboard: the main dashboard loads aggregate posting frequency and passes it to the posting frequency chart. Filters are applied when calling the API.
- Deeper analysis page: the posting frequency analysis page loads the channel list (from the videos or channels endpoint), overall and per channel posting frequency, and video category distribution. Selecting a category loads that category's videos via the videos by category endpoint. Chart zoom can narrow the date range used for category distribution.
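As a rough illustration of the counting logic above, the sketch below builds an overall or per-channel series from a list of video records. The record shape (`channel_id`, `published_at`) and the function name are assumptions made for the example, not the actual endpoint code.

```python
from collections import Counter, defaultdict
from datetime import date

def posting_frequency(videos, start=None, end=None, per_channel=False):
    """Count videos per publication date, optionally split by channel."""
    overall = Counter()
    by_channel = defaultdict(Counter)
    for video in videos:
        published = video.get("published_at")
        if published is None:
            continue  # only videos with a valid publication date are counted
        if start and published < start:
            continue
        if end and published > end:
            continue
        overall[published] += 1
        by_channel[video["channel_id"]][published] += 1
    if per_channel:
        return {ch: sorted(counts.items()) for ch, counts in by_channel.items()}
    return sorted(overall.items())

videos = [
    {"channel_id": "ch1", "published_at": date(2024, 1, 1)},
    {"channel_id": "ch1", "published_at": date(2024, 1, 1)},
    {"channel_id": "ch2", "published_at": date(2024, 1, 2)},
    {"channel_id": "ch2", "published_at": None},  # skipped: no valid date
]
overall_series = posting_frequency(videos)
channel_series = posting_frequency(videos, per_channel=True)
filtered_series = posting_frequency(videos, start=date(2024, 1, 2))
```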
What the Results Show
- Publication volume over time, per channel comparison, video category distribution, and channel level details for monitoring and research
Daily Video Views
Shows how statistical metrics for tracked videos change over time. Data comes from stored snapshots of video stats (updated periodically from YouTube).
Overview
- Dashboard shows an aggregate statistical metric time series; the deeper analysis page (engagement) adds per channel breakdown and other engagement metrics (likes, comments, etc.)
How It Works
- View counts and related stats are collected over time and stored. The component aggregates them into a time series for the tracker's videos. Data is read from the Elasticsearch videos_daily index when available.
- Backend: the daily views endpoint returns a time series (date and value) read from the videos_daily index. The dashboard and analysis page pass the aggregation interval and filters as query parameters.
- Dashboard: the main dashboard loads daily views (weekly aggregation) and passes the series to the daily video views chart. Filters apply when calling the API.
- Deeper analysis page: the engagement page (linked from the Daily Video Views widget) shows the same daily stats series plus engagement stats and per channel breakdown. Channel selection and metric selection (views, likes, comments, etc.) are available there.
What the Results Show
- How views accumulate over time and which videos or channels drive the most traffic. The engagement page adds video rankings and engagement trends.
Top Keywords
Identifies and tracks the most important terms in video titles, descriptions and transcripts. Surfaces which themes are emerging or dominant across the tracker.
How It Works
- Each video is processed to extract keywords and frequencies from its text. Terms are preprocessed (e.g. tokenized, lemmatized, stopwords removed), then weighted with TF-IDF. Results are stored per video in the top_terms JSONB column. The dashboard aggregates these across all tracker videos and returns keywords sorted by total frequency.
- The aggregated keywords endpoint reads from the database (videos.top_terms) for the tracker's videos. Date range can be passed. The dashboard calls this endpoint and displays the top keywords.
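The aggregation step can be pictured as summing per-video term frequencies and sorting by the total. The sketch below assumes each video's top_terms value is a mapping of term to frequency; the actual JSONB layout may differ, and the function name is hypothetical.

```python
from collections import Counter

def aggregate_top_terms(videos, limit=10):
    """Sum per-video term frequencies and return the most frequent terms."""
    totals = Counter()
    for video in videos:
        # A missing or null top_terms value contributes nothing.
        totals.update(video.get("top_terms") or {})
    return totals.most_common(limit)

videos = [
    {"top_terms": {"election": 5, "vote": 2}},
    {"top_terms": {"election": 3, "economy": 4}},
    {"top_terms": None},
]
top_keywords = aggregate_top_terms(videos, limit=2)
```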
Libraries Used
- scikit-learn TfidfVectorizer for term weighting; NLTK for tokenization and stopword removal.
What the Results Show
- Which topics and terms matter most across the tracker, and how often they appear. The list helps you see which terms dominate and how their use develops across the content.
Language Distribution
Shows how many videos in the tracker fall into each language. Language comes from the stored video metadata.
How It Works
- Video IDs for the tracker are resolved from the database. Language distribution is read from the Elasticsearch videos index by aggregating on the language field. Counts per language are returned sorted by count descending.
- The language-distribution endpoint takes a tracker ID and calls the Elasticsearch client to aggregate by language. The dashboard uses the result for the language distribution chart.
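A terms aggregation is one plausible way to express this in the Elasticsearch query DSL. In the sketch below the `video_id` filter field, the aggregation name, and both helper functions are assumptions; only the `language` field and the videos index come from the description above.

```python
def language_distribution_query(video_ids, max_languages=50):
    """Build a request body that counts documents per language."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"terms": {"video_id": list(video_ids)}},  # assumed field name
        "aggs": {
            "by_language": {"terms": {"field": "language", "size": max_languages}}
        },
    }

def parse_language_buckets(response):
    """Turn the aggregation response into (language, count) pairs."""
    buckets = response["aggregations"]["by_language"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]

query = language_distribution_query(["v1", "v2", "v3"])
# Hand-written response in the shape Elasticsearch returns for a terms aggregation.
sample_response = {
    "aggregations": {
        "by_language": {
            "buckets": [{"key": "en", "doc_count": 2}, {"key": "es", "doc_count": 1}]
        }
    }
}
distribution = parse_language_buckets(sample_response)
```

Terms aggregations return buckets ordered by document count, highest first, by default, which matches the descending sort described above.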
What the Results Show
- The distribution of languages used in your tracked content (e.g. bar chart of language vs count). You can use this to see which languages dominate.
Sentiment Analysis
Classifies the tone of video titles, descriptions, and comments as positive, neutral, or negative. The analysis uses full context and can include emotion statistics (e.g. joy, anger, fear, disgust, sadness).
How It Works
- Each piece of text is processed by a language model that evaluates semantic context. The model considers word choice, phrasing, and sentence relationships and produces a continuous sentiment score and emotion scores. Scores are normalized and mapped to sentiment labels, and results are aggregated by video and channel.
- Sentiment and emotion run as background tasks after video content (transcript) is stored. The dashboard and deeper sentiment page read stored scores from the API and render timelines and distributions.
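To make the "normalized and mapped to sentiment labels" step concrete, here is a minimal sketch. The score range of [-1, 1], the neutral band of 0.2, and the function names are all assumptions for illustration; the thresholds actually used are not stated here.

```python
from collections import defaultdict

def to_label(score, neutral_band=0.2):
    """Map a continuous score to a label (assumed range [-1, 1])."""
    score = max(-1.0, min(1.0, score))  # clamp out-of-range model output
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

def mean_score_by_channel(rows):
    """Average per-video scores for each channel."""
    grouped = defaultdict(list)
    for channel_id, score in rows:
        grouped[channel_id].append(score)
    return {ch: sum(scores) / len(scores) for ch, scores in grouped.items()}

rows = [("ch1", 0.8), ("ch1", 0.4), ("ch2", -0.7), ("ch2", -0.1)]
channel_means = mean_score_by_channel(rows)
channel_labels = {ch: to_label(mean) for ch, mean in channel_means.items()}
```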
Models & Libraries Used
- Qwen-based LLM (qwen3-vl-32b) for contextual sentiment and emotion. Guided JSON output is used for structured sentiment and emotion scores.
What the Results Show
- How tone and emotions vary across channels and over time, and which videos or channels drive positive or negative sentiment.
Topic Distribution
Identifies dominant topics across video content using topic modeling. Shows how topics distribute across channels and over time, with a chord diagram for topic relationships and n/r/t metrics (Novelty, Resonance, Transience).
How It Works
- Videos are analyzed with LLM-based topic extraction and statistical topic modeling on titles, descriptions, and transcripts. Each video receives topic weights, and the dominant topic is the one with the highest weight. n/r/t metrics (Novelty, Resonance, Transience) use TF-IDF, cosine similarity, and KL divergence to capture topic dynamics
- Topic extraction runs as a service (e.g. per channel or tracker). Stored topic weights and labels are exposed via API. The dashboard and topic distribution page render chord diagrams, topic trends, and n/r/t metrics from that data
Models & Libraries Used
- LLM (GPT-OSS-120B) for topic extraction, scikit-learn for weights and similarity, NumPy for numerical calculations. Chart libraries (Recharts, D3) are used for chord diagrams and trend visualizations
What the Results Show
- Major topics in the tracker and how different channels contribute to each topic, plus temporal patterns in topic popularity. The chord diagram shows relationships between topics, and n/r/t metrics show Novelty, Resonance, and Transience of topics over time.
Narrative Detection
Identifies dominant narratives across videos. Builds on previously identified clusters.
How Narratives Are Calculated
- Videos are first grouped into clusters. Each cluster is analyzed to identify common themes and perspectives, and representative narratives are generated from it via LLM. A single cluster can yield multiple narratives.
Models & Libraries Used
- GPT-OSS-120B: used to generate narratives, compute similarity between narratives and videos, and extract keywords
Narrative–Video Association
- Each narrative is compared against videos within its cluster
- Videos are ranked based on how strongly they express a given narrative
- Similarity scores (0–100) indicate the degree of narrative alignment
- Narrative extraction runs after clustering. The narratives API returns keywords and narrative list. The deeper analysis page shows keywords, narratives, videos per narrative with similarity scores, and a video preview
What the Results Show
- Dominant narratives shaping the discussion, how different channels frame the same topic, and which videos most strongly reinforce each narrative. The results help explain not only what is being discussed but how ideas and perspectives are formed, and let you explore each narrative with ranked supporting evidence.
Clusters
Groups videos based on meaning rather than shared keywords, bringing together content that uses different language to express similar ideas. Clusters form the foundation for narrative detection, and the UI can show trend analysis over time.
How Clusters Are Calculated
- LLM-based clustering: An LLM first analyzes a sample of videos (e.g. 30% of the tracker) and generates a taxonomy of 4–10 semantic clusters (plus a noise cluster), each with a description and inclusion/exclusion criteria. Then each video is classified into exactly one cluster by sending its title, description, and transcript to the LLM along with the taxonomy; the model returns the assigned cluster. Cluster names and keywords come from the LLM-generated taxonomy. Classification runs in parallel with rate limiting.
- Clustering is triggered after new videos are added (or manually via the clusters regenerate endpoint). The clusters API returns the cluster list with keywords and video membership. The deeper analysis page shows clusters, trend over time, and drill down into cluster videos.
Models & Libraries Used
- LLM clustering: gemma3-27b is used for taxonomy generation from samples and for per-video classification into taxonomy clusters.
What the Results Show
- Major discussion groups within the tracker and how videos organize around shared ideas, and how clusters evolve over time. The clusters provide the analytical basis for narrative detection.
YouTube Characterization (content, barcode, commenter)
Characterization is produced by the vtracker background tasks service: each pipeline is a Kafka-chained sequence of stage handlers. Stages read/write Postgres tables (e.g. characterization_run, characterization_cluster_result, characterization_euclidean_distance, characterization_clique, and comment-prefixed variants). The web app calls REST endpoints such as cluster-plots and top-clique payloads for a given run_id.
UI surfaces
- Dashboard CharacterizationDashboardWidget: GET …/characterization/cluster-plots with top_n_cliques; highlights channels that share top clique groups for the active similarity combination.
- Deeper page /dashboard/characterization: combination pickers, content and commenter cluster plots, regenerate-characterization and regenerate-commenter-characterization actions that enqueue backend work.
1. Content behaviour — youtube_content_behaviour_pipeline (6 stages)
- Stage 1 (group_single_video_channels): Consumes the pipeline input topic; resolves tracker/run scope and builds per-channel work units (including grouped pseudo-channels where needed). Writes grouped membership rows for downstream SQL joins.
- Stage 2 (unified_similarity_score_caluclator): Computes LLM-driven pairwise similarity features over video text (titles, descriptions, transcripts) using chunked batching (config: VIDEOS_PER_REQUEST, INNER_VIDEOS_PER_REQUEST, token budgets, concurrent LLM batches). Outputs aggregate avg_* text similarity scores per channel pair / inner feature slots used later as 2D axes.
- Stage 3 (barcode_unified_similarity_score_caluclator): For each channel’s movie_barcode PNGs, decodes to 224×224 grayscale, runs pairwise Structural Similarity Index (SSIM) on CPU in parallel batches (SSIM_PAIR_PARALLEL_BATCH, CHANNEL_PARALLEL_PROCESSES), and stores mean barcode similarity as inner_barcode_similarity and related aggregates for clustering views.
- Stage 4 (unified_clustering_analysis): Loads per-channel scores for the run; builds fixed 2D feature spaces (six text-inner pairs plus text-vs-barcode views, plus fifteen avg_* LLM pair combinations). Runs multiple sklearn clusterers per view, e.g. KMeans, AgglomerativeClustering, SpectralClustering, GaussianMixture, AffinityPropagation, MeanShift, and optional fuzzy c-means (skfuzzy), with StandardScaler preprocessing and silhouette_score where applicable. Majority vote across algorithms yields a consensus cluster label per channel per combination; results upsert to characterization_cluster_result.
- Stage 5 (unified_pairwise_channels_distance): For each 2D combination, computes Euclidean distance d(u,v) = √(Σ_i (x_i(u) − x_i(v))²) between all channel pairs and persists it to characterization_euclidean_distance. Clique mining: for subsets of combination names of size r ≥ 2, builds a graph where an edge exists iff the distance between its endpoints lies strictly below the per-combination quantile of pairwise distances in every combination of the subset (CLIQUE_PARAMS: quantile, min_group_size, top_k). Uses NetworkX maximal cliques, ranks by group size, keeps top_k groups, and persists them to characterization_clique.
- Stage 6 (unified_visualize_top_clique_groups): Consumes final-stage topic output; prepares top-N clique visualization payloads (e.g. TOP_N_GROUPS) consumed by the API for green-ring highlights and tables.
Content pipeline — formulas & libraries
- SSIM: classic Wang/Bovik-style SSIM on luminance windows after resize; channel score is mean SSIM over eligible unordered video pairs (implementation in barcode_unified_similarity_score_caluclator).
- Cluster consensus: per (channel, combination), each algorithm votes a label; the mode (Counter majority) becomes the stored consensus cluster when algorithms agree enough; ambiguous cases follow implementation fallbacks.
- Euclidean 2D distance and quantile edge test as in stage 5; cliques are maximal sets mutually close under simultaneous thresholds.
- Libraries: OpenAI-compatible LLM client for text stage; NumPy; scikit-learn (clustering, StandardScaler, silhouette_score); NetworkX (cliques); image decode/resize for barcodes; psycopg2 for Postgres.
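The cluster consensus rule above can be sketched as a Counter-based majority vote. Two assumptions are made here: the labels from different algorithms have already been aligned so that equal numbers mean the same group, and "agree enough" means a strict majority. The fallback for ambiguous cases is not described in this guide, so the sketch simply returns None.

```python
from collections import Counter

def consensus_label(votes, min_agreement=0.5):
    """Return the majority label, or None when no label wins clearly.

    votes maps an algorithm name to the cluster label it assigned.
    min_agreement is a hypothetical threshold, not a documented setting.
    """
    counts = Counter(votes.values())
    label, count = counts.most_common(1)[0]
    if count / len(votes) > min_agreement:
        return label
    return None  # ambiguous: the real pipeline applies its own fallback

clear = consensus_label({"kmeans": 0, "agglomerative": 0, "spectral": 1})
tied = consensus_label({"kmeans": 0, "spectral": 1})
```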
2. Barcode behaviour — youtube_barcode_behaviour_pipeline (5 stages)
- Stage 1 (group_single_video_channels): Same grouping role as content pipeline but on barcode_behaviour_pipeline Kafka topics; prepares channel lists for barcode-only similarity.
- Stage 2 (unified_similarity_score_caluclator, barcode module): CPU SSIM pipeline only, with no LLM text stage in this pipeline. Pairwise SSIM over decoded barcodes per channel; parallel batching identical in spirit to content stage 3.
- Stage 3 (unified_clustering_analysis): Same multi-algorithm + majority-vote pattern as content, but feature space excludes the dedicated LLM-only barcode bridge stage present on the full content pipeline.
- Stage 4 (unified_pairwise_channels_distance): Same Euclidean + quantile-based maximal-clique workflow as content, scoped to barcode-derived feature combinations; persists rows in characterization_euclidean_distance and characterization_clique for the run_id (handlers delete prior rows for that run before insert).
- Stage 5 (unified_visualize_top_clique_groups): Emits final Kafka message on barcode_behaviour_pipeline.final.out for downstream persistence and UI polling.
Barcode pipeline — notes
- Omits content-only stages (no LLM unified_similarity_score_caluclator text pass and no barcode_unified_similarity_score_caluclator bridge — SSIM is the primary similarity signal).
- Tuning via stage_config mirrors content where stages are shared: SSIM_PAIR_PARALLEL_BATCH, CHANNEL_PARALLEL_PROCESSES, NUMBER_OF_CLUSTERS, CLIQUE_PARAMS.quantile / min_group_size / top_k.
3. Commenter behaviour — youtube_commenter_behaviour_pipeline (7 stages)
- Stage 1 (group_single_video_channels): Commenter-specific grouping; writes characterization_channel_grouping_comment and run channel lists in characterization_run_channels_comment.
- Stage 2 (video_network_analysis): Builds a commenter co-occurrence graph (nodes = commenters or videos per implementation paths) with NetworkX; edges reflect shared commenting across videos for channels in the run. Outputs network features for downstream clique extraction.
- Stage 3 (extract_cliques): Enumerates dense subgraphs / cliques in the commenter network to find coordinated commenting structures.
- Stage 4 (suspicious_clique_analysis): Scores cliques using commenter behaviour aggregates (toxicity, spam, edit distance, timing, duplication, etc.) with configurable edit_distance_method and published_date_method; flags anomalous coordination patterns.
- Stage 5 (unified_clustering_analysis): Uses ten predefined 2D metric pairs (e.g. mean_toxicity_score × mean_spam_promotion_score, mean_time_gap × mean_edit_distance_normalized, …) as axes; runs the same ensemble clustering + silhouette + majority consensus pattern, writing comment-suffixed cluster tables.
- Stage 6 (unified_pairwise_channels_distance): For each of the ten COMMENTER_COMBINATIONS, loads per-channel aggregate means from characterization_run_similarity_override_comment (fallback characterization_channel_similarity_cache), computes all-pairs Euclidean distances, stores characterization_euclidean_distance_comment, then applies the same quantile-graph + NetworkX maximal clique + top_k selection to populate characterization_clique_comment.
- Stage 7 (unified_visualize_top_clique_groups): Packages top clique groups for the commenter cluster-plots-commenters API path.
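One way to picture the co-occurrence graph from stage 2 is as weighted edges between commenters who appear on the same videos. The sketch below takes the commenter-as-node reading and uses plain dictionaries rather than NetworkX; the input shape, the function name, and the min_shared_videos threshold are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def co_commenter_edges(comments, min_shared_videos=2):
    """Return {(commenter_a, commenter_b): shared_video_count} for frequent pairs.

    comments is an iterable of (video_id, commenter_id) pairs.
    """
    commenters_by_video = defaultdict(set)
    for video_id, commenter_id in comments:
        commenters_by_video[video_id].add(commenter_id)

    shared = defaultdict(int)
    for commenters in commenters_by_video.values():
        # Sorting keeps each unordered pair under a single key.
        for a, b in combinations(sorted(commenters), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared_videos}

comments = [
    ("v1", "alice"), ("v1", "bob"),
    ("v2", "alice"), ("v2", "bob"), ("v2", "carol"),
    ("v3", "carol"),
]
edges = co_commenter_edges(comments)
```

Edges produced this way could then be loaded into a graph library for the clique extraction described in stage 3.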
Commenter pipeline — metrics & distance
- Axes include mean_sentiment_distribution, mean_toxicity_score, mean_spam_promotion_score, mean_duplicate_comment_ratio, mean_time_gap, mean_edit_distance_normalized, mean_average_normalized_published_date, mean_length_variability, mean_sentiment_variance, mean_vocab_uniqueness — paired into ten clustering scatter views.
- Pairwise channel distance in each view is Euclidean in the corresponding 2D mean-feature space; clique edges require simultaneous nearness under per-view quantile thresholds, matching the content pipeline’s cross-plot clique logic.
Euclidean distance (all pairwise stages)
d(u, v) = √( (x₁(u) − x₁(v))² + (x₂(u) − x₂(v))² )
For higher-dimensional extensions, sum squared differences across all axes before the square root.
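As a small worked example of the formula, the sketch below computes all-pairs distances for channels placed in one 2D view. The channel names and coordinates are made up; math.dist also handles the higher-dimensional case without changes.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of channels.

    points maps a channel id to its coordinates in one feature view.
    """
    return {
        (u, v): math.dist(points[u], points[v])
        for u, v in combinations(sorted(points), 2)
    }

# Made-up coordinates for three channels in a single 2D view.
view = {"ch_a": (0.0, 0.0), "ch_b": (3.0, 4.0), "ch_c": (0.0, 1.0)}
distances = pairwise_distances(view)
```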
Quantile clique rule (cross-metric)
For a set of r combination plots, channels u and v are adjacent only if, for every plot in the set, their Euclidean distance is strictly below the q-quantile of all pairwise distances observed on that plot (default q=0.3). Maximal cliques in that graph are candidates; the pipeline retains the largest groups subject to min_group_size and top_k caps per configuration.
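The rule can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the pipeline's code: the pipeline uses NetworkX for maximal cliques, whereas this sketch substitutes a small Bron-Kerbosch routine to stay dependency-free, and the quantile uses linear interpolation, which may not match the configured method. All function names and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations

def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def close_pairs(points, q):
    """Pairs whose distance is strictly below the q-quantile on one plot."""
    dists = {(u, v): math.dist(points[u], points[v])
             for u, v in combinations(sorted(points), 2)}
    threshold = quantile(dists.values(), q)
    return {pair for pair, d in dists.items() if d < threshold}

def maximal_cliques(nodes, edges):
    """Bron-Kerbosch enumeration without pivoting."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(sorted(clique))
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & neighbours[node],
                   excluded & neighbours[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(nodes), set())
    return found

def top_groups(plots, q=0.3, min_group_size=2, top_k=5):
    """Keep the largest groups that are close on every plot simultaneously."""
    nodes = sorted(next(iter(plots.values())))
    edges = set.intersection(*(close_pairs(points, q) for points in plots.values()))
    groups = [c for c in maximal_cliques(nodes, edges) if len(c) >= min_group_size]
    return sorted(groups, key=len, reverse=True)[:top_k]

# Made-up coordinates: a, b, c sit together on both plots; d and e do not.
plots = {
    "plot_1": {"a": (0, 0), "b": (0.1, 0), "c": (0, 0.1), "d": (5, 5), "e": (9, 0)},
    "plot_2": {"a": (1, 1), "b": (1.1, 1), "c": (1, 1.2), "d": (8, 8), "e": (0, 9)},
}
groups = top_groups(plots)
```

Intersecting the per-plot edge sets is what enforces the "for every plot in the set" condition: a pair that is close on one plot but not another never becomes an edge.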
Operational
- Each pipeline exposes a dead-letter topic {pipeline_name}.dead_letter for poison messages.
- Internal/manual triggers may exist per stage via the background service internal API when enabled.
- characterization_run rows tie UI run_id to the Postgres state produced by these stages; failures in listed pipelines can mark runs failed via characterization_run_status handling.
Topic Analysis Methodology
A hybrid approach that combines Large Language Model (LLM) topic extraction with statistical topic distribution. Analysis runs at tracker, channel, and video levels for context-aware insights, and Novelty, Resonance, and Transience (NRT) metrics are computed using KL divergence.
Analysis Pipeline
1. Data Collection – Retrieve videos with full content text (titles, descriptions, transcripts) and metadata
2. LLM Topic Extraction – Extract distinct topics using parallel LLM calls with chunked processing
3. Topic Consolidation – Merge topics from different chunks into coherent topic sets
4. Topic Distribution – Assign topic weights to each video using TF-IDF cosine similarity
5. NRT Calculation – Compute Novelty, Resonance, and Transience metrics for each video
6. Database Storage – Save results with cascade support for different analysis levels
LLM Topic Extraction Process
- Parallel Processing – Uses thread pooling to process multiple chunks simultaneously
- Chunked Analysis – Divides videos into manageable chunks for efficient processing
- Structured Output – LLM returns topics in consistent JSON format with labels, keywords, and descriptions
- Error Handling – Robust fallback mechanisms for LLM failures or malformed responses
- Consolidation – Merges topics from different chunks using keyword similarity and LLM consolidation
Topic Distribution Algorithm
- TF-IDF Vectorization – Converts video texts and topic descriptions into numerical vectors
- Cosine Similarity – Measures similarity between videos and topics in high-dimensional space
- Weight Normalization – Converts similarity scores into probability distributions
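The three steps above can be illustrated with a simplified, dependency-free stand-in for a TF-IDF vectorizer. Whitespace tokenization, the smoothed IDF formula, fitting on a single video plus the topic descriptions, and the uniform fallback when nothing matches are all simplifying assumptions; the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Sparse TF-IDF vectors using a smoothed inverse document frequency."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def topic_weights(video_text, topic_descriptions):
    """Similarity of one video to each topic, normalized to sum to 1."""
    labels = list(topic_descriptions)
    vectors = tfidf_vectors([video_text] + [topic_descriptions[l] for l in labels])
    sims = [cosine(vectors[0], topic_vec) for topic_vec in vectors[1:]]
    total = sum(sims)
    if total == 0:
        return {label: 1 / len(labels) for label in labels}  # assumed fallback
    return {label: s / total for label, s in zip(labels, sims)}

topics = {
    "politics": "election vote government campaign",
    "sports": "football match goal league",
}
weights = topic_weights("election vote campaign results", topics)
dominant_topic = max(weights, key=weights.get)
```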
Novelty, Resonance, and Transience (NRT) Calculations
NRT metrics measure how topics evolve over time within video content streams. They are based on KL divergence (Kullback–Leibler divergence) between topic distributions and use a window-based analysis that compares each video to its temporal neighbors. The approach is inspired by computational social science research on discourse evolution.
KL Divergence Formula
D_KL(P ‖ Q) = Σ_i P(i) log₂(P(i)/Q(i))
where P and Q are probability distributions over topics.
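A direct translation of the formula into Python might look like the sketch below. Terms with P(i) = 0 contribute nothing, and a small floor on Q(i) avoids division by zero; that floor (eps) is an assumption, since the smoothing used in practice is not stated.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits for two distributions over the same topics."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:  # 0 * log(0 / q) is taken as 0
            total += p_i * math.log2(p_i / max(q_i, eps))
    return total

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

Note that the measure is asymmetric: swapping P and Q generally gives a different value, which is why the direction matters in the novelty and transience formulas.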
Novelty Calculation
- Measures how different a video's topics are from preceding videos
- Average KL divergence between current video and videos in preceding window (20 videos)
- Higher novelty = more departure from recent discussion patterns
- Formula: Novelty(i) = avg(D_KL(P_i ‖ P_j)) for j in [i−window, i−1]
Transience Calculation
- Measures how quickly topics change after a video
- Average KL divergence between current video and videos in following window
- Higher transience = less lasting influence on subsequent discussion
- Formula: Transience(i) = avg(D_KL(P_j ‖ P_i)) for j in [i+1, i+window]
Resonance Calculation
- Measures the lasting impact and influence of a video's topics
- Difference between novelty and transience: Resonance = Novelty − Transience
- Higher resonance = videos that introduce new topics that persist in discussion
- Negative resonance = topics that appear briefly then fade quickly
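Putting the three definitions together, the sketch below computes novelty, transience, and resonance for a chronologically ordered list of topic distributions, following the formulas exactly as written above. How the edges of the series are handled is an assumption: windows are truncated, and a metric is None when its window is empty.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits; eps is an assumed floor to avoid division by zero."""
    return sum(p_i * math.log2(p_i / max(q_i, eps))
               for p_i, q_i in zip(p, q) if p_i > 0)

def nrt(distributions, window=20):
    """Novelty, transience, and resonance per video, in chronological order."""
    results = []
    for i, p_i in enumerate(distributions):
        before = distributions[max(0, i - window):i]
        after = distributions[i + 1:i + 1 + window]
        novelty = (sum(kl_divergence(p_i, p_j) for p_j in before) / len(before)
                   if before else None)
        transience = (sum(kl_divergence(p_j, p_i) for p_j in after) / len(after)
                      if after else None)
        resonance = (novelty - transience
                     if novelty is not None and transience is not None else None)
        results.append({"novelty": novelty, "transience": transience,
                        "resonance": resonance})
    return results

# Four videos over two topics: the discussion shifts at the third video and stays there.
series = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
metrics = nrt(series, window=1)
```

In this toy series the third video departs from what came before and is followed by a video with the same distribution, so its resonance is positive; the second video is followed by a shift away from it, so its resonance is negative.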
Cascade Analysis System
- Tracker-Level Analysis – Identifies broad topics across all channels and videos in the tracker
- Channel-Level Analysis – Extracts topics specific to individual channels
- Video-Level Analysis – Topic weights and NRT metrics per video for drill-down and ranking
- Automatic Context Switching – System uses appropriate topic set based on analysis level (dashboard vs. deeper analysis)
- Parallel Processing – Analyzes multiple channels simultaneously using thread pooling
What NRT Metrics Reveal
- High Novelty + High Resonance – Videos that introduce new, lasting topics (topic leadership)
- High Novelty + Low Resonance – Videos that introduce topics that don't catch on (failed innovations)
- Low Novelty + Low Transience – Videos that reinforce existing stable topics (discussion maintenance)
- High Transience – Topics that appear briefly then disappear (trends or noise)
Social Media Footprint
Shows the presence of tracker channels on other social platforms (e.g. Twitter, Instagram). Links are scraped from channel pages and stored in the smlinks table.
How It Works
What the Results Show