Technical Guide
Analysis methods, models, and metrics behind VideoTracker
System Overview
After you create a tracker and add channels or videos, analysis runs asynchronously; you do not need to stay on the page.
Pipeline
- Channel and video metadata are fetched from the YouTube API and stored
- Transcripts are collected and stored for text-based analysis
- Stored text is fed into pipelines for sentiment and emotion, topic distribution, clustering, and narrative detection
- The web dashboard calls the VideoTracker API and renders visualizations for posting frequency, daily video views, top keywords, social media footprint, language distribution, related videos, sentiment analysis, topic distribution, narratives, and clusters
Content Collection
When a channel or video is added, VideoTracker fetches and stores metadata and runs background tasks for related data.
What Happens in the Background When a Channel or Video Is Added
- When a video is added: The YouTube API is called to fetch full video metadata and store it. Background tasks then run in parallel: related videos are fetched, comments are fetched, and the transcript is fetched. After the transcript is stored, sentiment/emotion and keyword extraction run for that video.
- When a channel is added: The YouTube API is called to fetch channel metadata and store it. Background tasks then run in parallel: the channel's videos are fetched and added; for each new video, the same post-add pipeline runs (related videos, comments, transcript, then sentiment/keywords). Social media links for the channel are scraped and stored for the social media footprint.
- All of this runs asynchronously. Data is stored in the database and forms the foundation for topic extraction, clustering, and narrative generation.
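A minimal sketch of this post-add fan-out, assuming illustrative fetcher names (the real task names and APIs are internal): related videos, comments, and the transcript are fetched in parallel, and the NLP steps run only after the transcript lands.

```python
import asyncio

# Stub fetchers standing in for the real YouTube API / scraping calls.
async def fetch_related(video_id): return f"related:{video_id}"
async def fetch_comments(video_id): return f"comments:{video_id}"
async def fetch_transcript(video_id): return f"transcript:{video_id}"
async def run_sentiment_and_keywords(video_id, transcript): return f"nlp:{video_id}"

async def on_video_added(video_id):
    # Related videos, comments, and the transcript are fetched in parallel.
    related, comments, transcript = await asyncio.gather(
        fetch_related(video_id),
        fetch_comments(video_id),
        fetch_transcript(video_id),
    )
    # Sentiment/emotion and keyword extraction run only after the transcript is stored.
    await run_sentiment_and_keywords(video_id, transcript)
    return related, comments, transcript

result = asyncio.run(on_video_added("abc123"))
```

The same pattern applies per video when a channel is added, with the social-link scrape running alongside.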
Posting Frequency
Tracks when videos are published over time, overall or per channel. Optional date filters apply to the range of data shown.
Overview
- Dashboard shows aggregate posting frequency; the deeper analysis page adds per channel series, video category distribution, and channel details
How It Works
- Videos come from the tracker (directly added or via channels). Only videos with a valid publication date are counted.
- When "per channel" is requested, the same logic is applied per channel so you can compare posting patterns across channels.
- Backend: the posting frequency endpoint returns a time series. A separate mode returns the same series broken down by channel. Date range is supported as a query parameter.
- Dashboard: the main dashboard loads aggregate posting frequency and passes it to the posting frequency chart. Filters are applied when calling the API.
- Deeper analysis page: the posting frequency analysis page loads the channel list (from the videos or channels endpoint), overall and per channel posting frequency, and video category distribution. Selecting a category loads that category's videos via the videos by category endpoint. Chart zoom can narrow the date range used for category distribution.
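The counting logic above can be sketched as follows; the record shape and function name are illustrative, not the real API.

```python
from collections import Counter
from datetime import date

# Hypothetical video records; only videos with a valid publication date count.
videos = [
    {"channel": "A", "published_at": date(2024, 1, 1)},
    {"channel": "A", "published_at": date(2024, 1, 1)},
    {"channel": "B", "published_at": date(2024, 1, 2)},
    {"channel": "B", "published_at": None},  # skipped: no valid publication date
]

def posting_frequency(videos, channel=None, start=None, end=None):
    counts = Counter()
    for v in videos:
        d = v["published_at"]
        if d is None:
            continue  # only videos with a valid publication date are counted
        if channel and v["channel"] != channel:
            continue  # "per channel" mode applies the same logic per channel
        if (start and d < start) or (end and d > end):
            continue  # optional date-range filter
        counts[d] += 1
    return sorted(counts.items())

overall = posting_frequency(videos)
per_channel_a = posting_frequency(videos, channel="A")
```

Calling it once overall and once per channel yields the aggregate and per-channel series the endpoint returns.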
What the Results Show
- Publication volume over time, per channel comparison, video category distribution, and channel level details for monitoring and research
Daily Video Views
Shows how statistical metrics for tracked videos change over time. Data comes from stored snapshots of video stats (updated periodically from YouTube).
Overview
- Dashboard shows an aggregate statistical metric time series; the deeper analysis page (engagement) adds per channel breakdown and other engagement metrics (likes, comments, etc.)
How It Works
- View counts and related stats are collected over time and stored. The component aggregates them into a time series for the tracker's videos. Data is read from the Elasticsearch videos_daily index when available.
- Backend: the daily views endpoint returns a time series (date and value). Data is read from the videos_daily index. The dashboard and analysis page use an interval and filters are passed as query parameters.
- Dashboard: the main dashboard loads daily views (weekly aggregation) and passes the series to the daily video views chart. Filters apply when calling the API.
- Deeper analysis page: the engagement page (linked from the Daily Video Views widget) shows the same daily stats series plus engagement stats and per channel breakdown. Channel selection and metric selection (views, likes, comments, etc.) are available there.
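A sketch of the weekly aggregation over stored snapshots, assuming one row per video per day (the actual videos_daily document shape is internal):

```python
from collections import defaultdict
from datetime import date

# Hypothetical stat snapshots, one row per video per day.
snapshots = [
    {"video": "v1", "date": date(2024, 1, 1), "views": 100},
    {"video": "v2", "date": date(2024, 1, 1), "views": 50},
    {"video": "v1", "date": date(2024, 1, 8), "views": 300},
]

def weekly_series(snapshots, metric="views"):
    buckets = defaultdict(int)
    for s in snapshots:
        # Bucket each snapshot by ISO (year, week), summing the chosen metric.
        iso = s["date"].isocalendar()
        buckets[(iso[0], iso[1])] += s[metric]
    return sorted(buckets.items())

series = weekly_series(snapshots)
```

Swapping `metric` for "likes" or "comments" gives the other engagement series the analysis page offers.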
What the Results Show
- How views accumulate over time and which videos or channels drive the most traffic; the engagement page adds video rankings and engagement trends
Top Keywords
Identifies and tracks the most important terms in video titles, descriptions, and transcripts. Surfaces which themes are emerging or dominant across the tracker.
How It Works
- Each video is processed to extract keywords and frequencies from its text. Terms are preprocessed (e.g. tokenized, lemmatized, stopwords removed), then weighted with TF-IDF. Results are stored per video in the top_terms JSONB column. The dashboard aggregates these across all tracker videos and returns keywords sorted by total frequency.
- The aggregated keywords endpoint reads from the database (videos.top_terms) for the tracker's videos. Date range can be passed. The dashboard calls this endpoint and displays the top keywords.
Libraries Used
- scikit-learn TfidfVectorizer for term weighting; NLTK for tokenization and stopword removal.
What the Results Show
- Which topics and terms matter most across the tracker, and how often they appear. The list highlights dominant terms and how their prominence shifts across the content.
Language Distribution
Shows how many videos in the tracker fall into each language. Language comes from the stored video metadata.
How It Works
- Video IDs for the tracker are resolved from the database. Language distribution is read from the Elasticsearch videos index by aggregating on the language field. Counts per language are returned sorted by count descending.
- The language-distribution endpoint takes a tracker ID. It calls the Elasticsearch client to aggregate by language. The dashboard uses this for the language distribution chart.
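The aggregation request might look like the following sketch. The index (`videos`) and field name (`language`) follow the text; the exact video-ID filter shape is an assumption.

```python
def language_agg_body(video_ids):
    """Build an Elasticsearch request body that counts videos per language."""
    return {
        "size": 0,  # aggregations only, no document hits
        "query": {"terms": {"video_id": video_ids}},  # restrict to the tracker's videos
        "aggs": {
            "languages": {
                "terms": {
                    "field": "language",
                    "order": {"_count": "desc"},  # sorted by count descending
                }
            }
        },
    }

body = language_agg_body(["v1", "v2"])
```

The response's `languages` buckets map directly onto the chart's language/count pairs.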
What the Results Show
- The distribution of languages used in your tracked content (e.g. bar chart of language vs count). You can use this to see which languages dominate.
Sentiment Analysis
Classifies the tone of video titles, descriptions, and comments as positive, neutral, or negative. Uses full sentence context, and can include emotion statistics (e.g. joy, anger, fear, disgust, sadness).
How It Works
- Each piece of text is processed by a language model that evaluates semantic context. The model considers word choice, phrasing, and sentence relationships and produces a continuous sentiment score plus emotion scores. Scores are normalized and mapped to sentiment labels, and results are aggregated by video and channel.
- Sentiment and emotion run as background tasks after video content (transcript) is stored. The dashboard and deeper sentiment page read stored scores from the API and render timelines and distributions.
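The score-to-label mapping can be sketched as below; the thresholds are illustrative assumptions, not the system's actual cutoffs.

```python
def sentiment_label(score, pos=0.2, neg=-0.2):
    """Map a continuous sentiment score in [-1, 1] to a discrete label.

    The pos/neg thresholds here are illustrative; the real cutoffs are internal.
    """
    if score >= pos:
        return "positive"
    if score <= neg:
        return "negative"
    return "neutral"

labels = [sentiment_label(s) for s in (0.7, 0.0, -0.5)]
```

Aggregating these labels per video and per channel yields the distributions the dashboard renders.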
Models & Libraries Used
- Qwen-based LLM (qwen3-vl-32b) for contextual sentiment and emotion. Guided JSON output is used for structured sentiment and emotion scores.
What the Results Show
- How tone and emotions vary across channels and over time, and which videos or channels drive positive or negative sentiment.
Topic Distribution
Identifies dominant topics across video content using topic modeling. Shows how topics distribute across channels and over time, with a chord diagram for topic relationships and n/r/t metrics (Novelty, Resonance, Transience).
How It Works
- Videos are analyzed with LLM-based topic extraction and statistical topic modeling on titles, descriptions, and transcripts. Each video receives topic weights, and the dominant topic is the one with the highest weight. n/r/t metrics (Novelty, Resonance, Transience) use TF-IDF, cosine similarity, and KL divergence to capture topic dynamics.
- Topic extraction runs as a service (e.g. per channel or tracker). Stored topic weights and labels are exposed via API. The dashboard and topic distribution page render chord diagrams, topic trends, and n/r/t metrics from that data.
Models & Libraries Used
- LLM (GPT-OSS-120B) for topic extraction, scikit-learn for weights and similarity, NumPy for numerical calculations. Chart libraries (Recharts, D3) are used for chord diagrams and trend visualizations
What the Results Show
- Major topics in the tracker and how different channels contribute to each topic, plus temporal patterns in topic popularity. The chord diagram shows relationships between topics, and n/r/t metrics show Novelty, Resonance, and Transience of topics over time.
Narrative Detection
Identifies dominant narratives across videos. Builds on previously identified clusters.
How Narratives Are Calculated
- Videos are first grouped by clusters. Each cluster is analyzed to identify common themes and perspectives. Representative narratives are generated from clusters via LLM. Multiple narratives are extracted from a single cluster.
Models & Libraries Used
- GPT-OSS-120B – used to generate narratives, compute similarity between narratives and videos, and extract keywords
Narrative–Video Association
- Each narrative is compared against videos within its cluster
- Videos are ranked based on how strongly they express a given narrative
- Similarity scores (0–100) indicate the degree of narrative alignment
- Narrative extraction runs after clustering. The narratives API returns the keyword and narrative lists. The deeper analysis page shows keywords, narratives, videos per narrative with similarity scores, and a video preview
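The ranking step can be sketched as follows; the record shape and function name are illustrative.

```python
# Hypothetical narrative-video similarity records for one cluster (scores 0-100).
scores = [
    {"video": "v1", "similarity": 87},
    {"video": "v2", "similarity": 42},
    {"video": "v3", "similarity": 95},
]

def rank_videos(scores, min_similarity=0):
    """Rank a cluster's videos by how strongly they express a given narrative."""
    kept = [s for s in scores if s["similarity"] >= min_similarity]
    return sorted(kept, key=lambda s: s["similarity"], reverse=True)

ranked = [s["video"] for s in rank_videos(scores)]  # strongest alignment first
```

A `min_similarity` floor can hide weakly aligned videos from the per-narrative list.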
What the Results Show
- Dominant narratives shaping the discussion, how different channels frame the same topic, and which videos most strongly reinforce each narrative. The results help explain not only what is being discussed but how ideas and perspectives are formed, and let you explore each narrative with ranked supporting evidence.
Clusters
Groups videos based on meaning rather than shared keywords. Brings together content that uses different language to express similar ideas. Clusters form the foundation for narrative detection, and the UI can show trend analysis over time.
How Clusters Are Calculated
- LLM-based clustering: An LLM first analyzes a sample of videos (e.g. 30% of the tracker) and generates a taxonomy of 4–10 semantic clusters (plus a noise cluster), each with a description and inclusion/exclusion criteria. Then each video is classified into exactly one cluster by sending its title, description, and transcript to the LLM along with the taxonomy; the model returns the assigned cluster. Cluster names and keywords come from the LLM-generated taxonomy. Classification runs in parallel with rate limiting.
- Clustering is triggered after new videos are added (or manually via the clusters regenerate endpoint). The clusters API returns the cluster list with keywords and video membership. The deeper analysis page shows clusters, trend over time, and drill down into cluster videos.
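The parallel, rate-limited classification step can be sketched as below; the LLM call is stubbed, and the taxonomy labels and concurrency limit are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

TAXONOMY = ["politics", "health", "noise"]  # illustrative cluster labels

_sem = threading.Semaphore(2)  # crude rate limit: at most 2 in-flight LLM calls

def classify(video):
    """Stand-in for the LLM call that assigns one video to one taxonomy cluster."""
    with _sem:
        time.sleep(0.01)  # simulate request latency
        # A real call would send title/description/transcript plus the taxonomy
        # and return the model's chosen cluster; here we always pick the first.
        return video["id"], TAXONOMY[0]

videos = [{"id": f"v{i}"} for i in range(5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    assignments = dict(pool.map(classify, videos))
```

Each video ends up in exactly one cluster, matching the single-assignment rule above.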
Models & Libraries Used
- LLM clustering: gemma3-27b is used for taxonomy generation from samples and for per-video classification into taxonomy clusters.
What the Results Show
- Major discussion groups within the tracker and how videos organize around shared ideas, and how clusters evolve over time. The clusters provide the analytical basis for narrative detection.
Topic Analysis Methodology
Hybrid approach combining Large Language Model (LLM) topic extraction with statistical topic distribution. Multi-level analysis at tracker, channel, and video levels for context-aware insights. Computes Novelty, Resonance, and Transience (NRT) metrics using KL divergence.
Analysis Pipeline
1. Data Collection – Retrieve videos with full content text (titles, descriptions, transcripts) and metadata
2. LLM Topic Extraction – Extract distinct topics using parallel LLM calls with chunked processing
3. Topic Consolidation – Merge topics from different chunks into coherent topic sets
4. Topic Distribution – Assign topic weights to each video using TF-IDF cosine similarity
5. NRT Calculation – Compute Novelty, Resonance, and Transience metrics for each video
6. Database Storage – Save results with cascade support for different analysis levels
LLM Topic Extraction Process
- Parallel Processing – Uses thread pooling to process multiple chunks simultaneously
- Chunked Analysis – Divides videos into manageable chunks for efficient processing
- Structured Output – LLM returns topics in consistent JSON format with labels, keywords, and descriptions
- Error Handling – Robust fallback mechanisms for LLM failures or malformed responses
- Consolidation – Merges topics from different chunks using keyword similarity and LLM consolidation
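The chunked, parallel extraction can be sketched as below; the LLM call is stubbed and the chunk size is an illustrative assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, size):
    """Divide the video list into manageable chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def extract_topics(videos_chunk):
    # Stand-in for the LLM call that returns structured topics for one chunk;
    # the real call returns JSON with labels, keywords, and descriptions.
    return [{"label": f"topic-for-{v}", "keywords": [], "description": ""}
            for v in videos_chunk]

videos = [f"v{i}" for i in range(7)]
with ThreadPoolExecutor(max_workers=4) as pool:
    per_chunk_topics = list(pool.map(extract_topics, chunk(videos, 3)))
# Consolidation then merges per-chunk topics into one coherent set.
```

In practice each `extract_topics` result is validated, with fallbacks for malformed LLM responses, before consolidation merges the chunks.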
Topic Distribution Algorithm
- TF-IDF Vectorization – Converts video texts and topic descriptions into numerical vectors
- Cosine Similarity – Measures similarity between videos and topics in high-dimensional space
- Weight Normalization – Converts similarity scores into probability distributions
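The similarity-and-normalization steps can be sketched in pure Python (the production code uses scikit-learn vectors; the toy 3-dimensional vectors here are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_weights(video_vec, topic_vecs):
    """Cosine similarity to each topic, normalized into a probability distribution."""
    sims = [max(cosine(video_vec, t), 0.0) for t in topic_vecs]
    total = sum(sims)
    if total == 0:
        return [1 / len(sims)] * len(sims)  # uniform fallback for no overlap
    return [s / total for s in sims]

# Toy TF-IDF vectors: one video against three topic descriptions.
video = [0.9, 0.1, 0.0]
topics = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = topic_weights(video, topics)
```

The dominant topic is simply the index of the largest weight in the resulting distribution.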
Novelty, Resonance, and Transience (NRT) Calculations
NRT metrics measure how topics evolve over time within video content streams. Based on KL divergence (Kullback–Leibler divergence) between topic distributions. Window-based analysis comparing each video to its temporal neighbors. Inspired by computational social science research on discourse evolution.
KL Divergence Formula
D_KL(P ‖ Q) = Σ_i P(i) log₂(P(i)/Q(i))
where P and Q are probability distributions over topics.
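The formula above translates directly to code; the epsilon smoothing is a common guard against zero probabilities and is an assumption here, not necessarily the system's exact choice.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log2(P(i) / Q(i)).

    eps guards against division by zero when a topic has zero weight in Q.
    """
    return sum(pi * math.log2((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
d_pq = kl_divergence(p, q)  # > 0: the distributions differ
d_pp = kl_divergence(p, p)  # ~ 0: identical distributions
```

Note that KL divergence is asymmetric, which is why Novelty and Transience below swap the argument order.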
Novelty Calculation
- Measures how different a video's topics are from preceding videos
- Average KL divergence between current video and videos in preceding window (20 videos)
- Higher novelty = more departure from recent discussion patterns
- Formula: Novelty(i) = avg(D_KL(P_i ‖ P_j)) for j in [i−window, i−1]
Transience Calculation
- Measures how quickly topics change after a video
- Average KL divergence between current video and videos in following window
- Higher transience = less lasting influence on subsequent discussion
- Formula: Transience(i) = avg(D_KL(P_j ‖ P_i)) for j in [i+1, i+window]
Resonance Calculation
- Measures the lasting impact and influence of a video's topics
- Difference between novelty and transience: Resonance = Novelty − Transience
- Higher resonance = videos that introduce new topics that persist in discussion
- Negative resonance = topics that appear briefly then fade quickly
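Putting the three formulas together over a chronological stream of topic distributions (the toy two-topic stream below is illustrative; the real window is 20 videos):

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(P || Q) in bits, with epsilon smoothing."""
    return sum(pi * math.log2((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def nrt(dists, i, window=20):
    """Novelty, transience, and resonance for the i-th video's topic distribution."""
    prev = dists[max(0, i - window):i]
    nxt = dists[i + 1:i + 1 + window]
    # Novelty(i) = avg D_KL(P_i || P_j) over the preceding window
    nov = sum(kl(dists[i], d) for d in prev) / len(prev) if prev else 0.0
    # Transience(i) = avg D_KL(P_j || P_i) over the following window
    tra = sum(kl(d, dists[i]) for d in nxt) / len(nxt) if nxt else 0.0
    return nov, tra, nov - tra  # resonance = novelty - transience

# Toy stream: a new topic appears at index 2 and persists afterwards.
dists = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
nov, tra, res = nrt(dists, 2, window=2)
```

The video at index 2 departs sharply from its predecessors (high novelty) and its topics persist (low transience), so its resonance is positive, matching the "topic leadership" pattern described below.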
Cascade Analysis System
- Tracker-Level Analysis – Identifies broad topics across all channels and videos in the tracker
- Channel-Level Analysis – Extracts topics specific to individual channels
- Video-Level Analysis – Topic weights and NRT metrics per video for drill-down and ranking
- Automatic Context Switching – System uses appropriate topic set based on analysis level (dashboard vs. deeper analysis)
- Parallel Processing – Analyzes multiple channels simultaneously using thread pooling
What NRT Metrics Reveal
- High Novelty + High Resonance – Videos that introduce new, lasting topics (topic leadership)
- High Novelty + Low Resonance – Videos that introduce topics that don't catch on (failed innovations)
- Low Novelty + Low Transience – Videos that reinforce existing stable topics (discussion maintenance)
- High Transience – Topics that appear briefly then disappear (trends or noise)
Social Media Footprint
Shows the presence of tracker channels on other social platforms (e.g. Twitter, Instagram). Links are scraped from channel pages and stored in the smlinks table.
How It Works
- Social media links are scraped from each channel's YouTube page when the channel is added and stored in the smlinks table. The dashboard reads the stored links via the API for the social media footprint view.
What the Results Show
- Which external platforms each tracked channel links to, giving an overview of the tracker's cross-platform presence.