Technical Documentation
Complete reference for the Advancement Index scoring engine, data pipeline, real-time architecture, and GLP-1 market tracking system. This document covers every formula, data source, refresh cycle, and API endpoint that powers the platform.
System architecture
The platform continuously ingests data from 8 independent public sources, scores 190+ compounds across 20 categories, and serves real-time updates to the dashboard. The scoring engine runs on FastAPI with SQLite time-series storage, Bayesian rating persistence, and server-sent events for live updates.
Three-layer composite scoring
Every compound score is constructed through three sequential layers. Layer 1 combines six raw signal dimensions using fixed weights. Layer 2 applies exponential moving average smoothing to filter transient noise. Layer 3 blends the EMA with a Bayesian rating that encodes longer-term trajectory confidence.
EMA_t = EMA_{t-1} + α × (Raw_t − EMA_{t-1})
α = 1 − 2^(−Δt / 12h)
Advancement Index = 0.7 × EMA + 0.3 × Bayesian μ
All raw counts undergo a ln(x + 1) transform before normalization. This stabilizes variance across power-law distributions: a compound with 10,000 papers and one with 10 are both scored on a comparable logarithmic scale.
Six independent signal vectors
Each dimension is scored independently on a 0-100 scale, then combined using fixed weights. Research velocity carries the highest weight because clinical and academic evidence is the least gameable signal.
Social Signal (25%)
Aggregated from Reddit post volume, engagement, and subreddit spread across 65+ monitored communities. Volume and engagement each contribute 40%, spread contributes 20%. Within engagement, each comment counts three times as much as one point of post score.
engagement = ln_norm(score + comments × 3, scale=5000)
spread = ln_norm(subreddits, scale=15)
social = volume × 0.4 + engagement × 0.4 + spread × 0.2
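The three lines above can be sketched in Python as follows. This is a minimal sketch, not the engine's code: the exact form of `ln_norm` is an assumption (a log curve on which `scale` maps to 100, capped at 100), and the `volume` term is assumed to be the 7-day post count normalized with the scale of 200 listed in the scale-factor table.

```python
import math


def ln_norm(count: float, scale: float) -> float:
    """Assumed form: ln(x + 1) normalized so that `scale` maps to 100, capped at 100."""
    if count <= 0:
        return 0.0
    return min(100.0, 100.0 * math.log(count + 1) / math.log(scale + 1))


def social_signal(posts_7d: int, score: int, comments: int, subreddits: int) -> float:
    """Combine volume, engagement, and spread with the 40/40/20 weights."""
    volume = ln_norm(posts_7d, scale=200)  # assumption: volume uses the 7-day post count
    engagement = ln_norm(score + comments * 3, scale=5000)
    spread = ln_norm(subreddits, scale=15)
    return volume * 0.4 + engagement * 0.4 + spread * 0.2
```

Under these assumptions, a compound that hits every scale factor at once (200 posts, 5,000 engagement, 15 subreddits) scores 100, and one with no Reddit activity scores 0.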
Search Momentum (15%)
Google Trends interest score over 90-day windows. Composite of average interest (50%), momentum percentage change (30%), and count of rising related queries (20%).
Research Velocity (30%)
Papers, citations, and clinical trial activity across OpenAlex, PubMed, and ClinicalTrials.gov. Recent papers carry the highest sub-weight (30%), followed by actively recruiting trials (25%), which represent current investment in the compound.
recent = ln_norm(recent_papers, scale=5000) × 0.30
citations = ln_norm(citations, scale=50000) × 0.15
trials = ln_norm(trials_total, scale=500) × 0.15
active = ln_norm(recruiting, scale=50) × 0.25
Regulatory Signal (15%)
FDA approval status provides a 60-point base. Additional signal comes from label count (10%), completed trial phase progression (15%), and the active recruiting pipeline (15%).
Sentiment (10%)
Keyword-based sentiment analysis on Reddit discourse. Positive and negative keyword dictionaries produce a score from -1 to +1, mapped linearly to 0-100.
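As an illustration of the mapping described above, the sketch below scores a text against two keyword sets and rescales the result. The polarity formula `(pos - neg) / (pos + neg)` and the example keywords are assumptions made for the sketch; only the -1 to +1 range and the linear mapping to 0-100 come from the description.

```python
import re


def keyword_polarity(text: str, positive: set[str], negative: set[str]) -> float:
    """Return a polarity in [-1, 1]; 0.0 when no keyword matches (assumed formula)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


def to_0_100(polarity: float) -> float:
    """Linear map from [-1, 1] to [0, 100]."""
    return (polarity + 1.0) * 50.0


# Hypothetical keyword sets, for illustration only.
POSITIVE = {"great", "effective", "improved"}
NEGATIVE = {"nausea", "worse", "useless"}
```

With these hypothetical sets, a post with one positive and one negative keyword has polarity 0 and maps to 50, the neutral midpoint.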
Market Signal (5%)
Stock price momentum and trading volume for companies with exposure to each compound. Sourced from Yahoo Finance. Weighted lowest because market signals follow advancement rather than leading it.
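The fixed-weight combination of the six dimensions can be sketched as a plain weighted sum. The weights are the ones given in the headings above; the dictionary keys and the function name are hypothetical.

```python
# Weights from the six dimension headings; they sum to 1.0.
WEIGHTS = {
    "social": 0.25,
    "search": 0.15,
    "research": 0.30,
    "regulatory": 0.15,
    "sentiment": 0.10,
    "market": 0.05,
}


def raw_composite(dimensions: dict[str, float]) -> float:
    """Weighted sum of six 0-100 dimension scores, giving a 0-100 raw score."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
```

For example, dimension scores of 80 (social), 60 (search), 90 (research), 60 (regulatory), 50 (sentiment), and 40 (market) combine to 20 + 9 + 27 + 9 + 5 + 2 = 72.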
Logarithmic scale factors
Every raw count passes through ln(x + 1) before normalization. The scale factor defines what count maps to 100 points on the logarithmic curve. Counts beyond the scale factor are capped at 100.
| Metric | Scale Factor | Meaning |
|---|---|---|
| Reddit posts (7d) | 200 | 200 posts/week = max score |
| Reddit engagement | 5,000 | score + comments × 3 |
| Subreddit spread | 15 | 15 unique subreddits |
| Total papers | 30,000 | OpenAlex + PubMed combined |
| Recent papers | 5,000 | Last 2 years |
| Citations | 50,000 | Total citation count |
| Clinical trials | 500 | Total registered trials |
| Recruiting trials | 50 | Actively recruiting |
| FDA labels | 10 | Approved label count |
| Completed trials | 100 | Phase progression signal |
| Stock volume | 50,000,000 | Daily trading volume |
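One way to read the scale factor is the normalization below. It is a sketch under an assumption: the engine's exact formula is not given here, so `ln_norm` is written as the simplest function that sends a count equal to the scale factor to 100 and caps anything larger.

```python
import math


def ln_norm(count: float, scale: float) -> float:
    """Map a raw count to 0-100 so that `count == scale` gives 100 (assumed form)."""
    if count <= 0:
        return 0.0
    return min(100.0, 100.0 * math.log(count + 1) / math.log(scale + 1))


# Total papers use a scale factor of 30,000.
few_papers = ln_norm(10, scale=30_000)
many_papers = ln_norm(10_000, scale=30_000)
```

Under this form, 10 papers score about 23 and 10,000 papers score about 89: a thousandfold difference in raw counts becomes a difference of roughly 66 points.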
Exponential moving average with 12-hour half-life
Raw scores are noisy. A single viral Reddit post can move a compound's social signal by 40 points in a day. The EMA layer smooths this by weighting recent observations exponentially, with a half-life of 12 hours.
α = clamp(α, 0.001, 1.0)
EMA_new = EMA_old + α × (Raw − EMA_old)
After 12 hours, a new observation has ~50% influence on the EMA. After 24 hours, ~75%. After 48 hours, ~94%. This means genuine sustained trends propagate within 1-2 days, while transient spikes are damped within hours.
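The update rule and the half-life arithmetic can be checked with a short sketch. The function names are hypothetical; the formulas are the ones stated in this section, with Δt expressed in hours.

```python
HALF_LIFE_HOURS = 12.0


def alpha(dt_hours: float) -> float:
    """Smoothing factor for an observation arriving dt_hours after the previous one."""
    raw_alpha = 1.0 - 2.0 ** (-dt_hours / HALF_LIFE_HOURS)
    return max(0.001, min(1.0, raw_alpha))  # clamp(α, 0.001, 1.0)


def ema_update(ema_old: float, raw: float, dt_hours: float) -> float:
    """Move the EMA toward the new raw score by a fraction α of the gap."""
    return ema_old + alpha(dt_hours) * (raw - ema_old)


print(alpha(12), alpha(24), alpha(48))  # 0.5 0.75 0.9375
print(ema_update(40.0, 80.0, 12))       # 60.0
```

Because α depends on the elapsed time rather than on the number of updates, irregular refresh intervals are handled consistently: two updates 6 hours apart close the same share of the gap as one update after 12 hours.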
Encoding compound trajectory as a probability distribution
Each compound is modeled as a Bayesian rating with a mean (μ) and uncertainty (σ), using the OpenSkill library's Plackett-Luce model. New compounds start with high uncertainty (μ = 25, σ = 8.33). As observations accumulate, the model tightens its confidence interval.
Each scoring cycle compares the compound's current raw score against its previous score. Score increases are treated as "wins" against a baseline, decreases as "losses." The Plackett-Luce model updates μ and σ accordingly.
A compound that scores consistently high earns a stable prior that resists transient drops. A compound riding a single spike will see its Bayesian score remain conservative until sustained evidence confirms the trend. State is persisted to disk between server restarts.
Eight independent public sources
| Source | What We Collect | Method |
|---|---|---|
| Reddit | Post volume, score, comments, sentiment, subreddit spread across 65+ communities (peptides, Semaglutide, Biohackers, longevity, DrugNerds, etc.) | Apify Reddit Scraper Pro, batched 10 subreddits/request |
| Google Trends | Average interest (0-100), momentum %, peak interest, rising queries, regional breakdown | pytrends library, 90-day windows |
| OpenAlex | Total papers, recent papers (last 2 years), citation count, top titles | REST API, sorted by citation count |
| PubMed | Total papers, recent papers, paper titles, publication dates | NCBI E-utilities (esearch + esummary) |
| ClinicalTrials.gov | Total trials, recruiting trials, completed trials, recent trial details | v2 REST API, paginated |
| FDA / openFDA | Approval status, label count, brand names | Drug label API |
| arXiv | Preprint count, recent preprints, titles | Atom feed API |
| Yahoo Finance | Stock price, change %, volume, 52-week range for pharma companies mapped to compounds | Chart API (free, no auth), US (.NYQ) and India (.NS) tickers |
Reddit communities monitored
Tier 1 (core): peptides, PeptideScience, researchchemicals. Tier 2 (compound-specific): Semaglutide, Ozempic, Wegovy, tirzepatide, Mounjaro. Tier 3 (health): Biohackers, longevity, Nootropics, PEDs, fitness, diabetes. Tier 4 (research): DrugNerds, medicine, pharmacology, neuroscience. Total: 65+ subreddits.
Stock ticker mappings
40+ global pharma companies are mapped to compounds. US tickers include NVO (Novo Nordisk), LLY (Eli Lilly), AZN (AstraZeneca), PFE (Pfizer), AMGN (Amgen). Indian NSE tickers include SUNPHARMA.NS, DRREDDY.NS, CIPLA.NS, ZYDUSLIFE.NS, LUPIN.NS, GLENMARK.NS for post-patent generic manufacturers.
Multi-layer real-time data pipeline
Data freshness varies by source. Social signals refresh most frequently because they move fastest. Research data refreshes less often because papers and trials update on longer timescales.
US and India GLP-1 market intelligence
Dedicated market views track the GLP-1 therapeutic landscape in the United States and India. Both views pull live stock data from Yahoo Finance, map compounds to companies and brands, and display macro health statistics.
US Market
Tracks 11 GLP-1 compounds with FDA status, brand names, originator companies, and stock performance. Top-line stats: CDC obesity rate (40.3%, 100M+ adults), diabetes prevalence (12%, 37M adults), annual economic burden ($260B), GLP-1 market size ($54B 2024), and top combined revenue ($41.2B for semaglutide + tirzepatide).
Key tickers: NVO (Novo Nordisk, semaglutide), LLY (Eli Lilly, tirzepatide/retatrutide), AZN (AstraZeneca), ALT (Altimmune, pemvidutide).
India Market
Following semaglutide's patent expiry on March 20, 2026, 40+ Indian companies launched 50+ generic brands. The India view tracks originator products alongside domestic generics, with live NSE stock tickers for companies like Dr. Reddy's, Sun Pharma, Cipla, Zydus Lifesciences, Lupin, and Glenmark.
India stats: obesity rate (24%, 350M+ overweight), diabetes prevalence (89.8M, 10.5%), annual burden ($29B), GLP-1 market ($118M, projected $530M by 2030), and 40+ generic manufacturers post-patent.
Key generic brands (India, post-patent)
| Company | Brand | NSE Ticker |
|---|---|---|
| Dr. Reddy's | Obeda | DRREDDY.NS |
| Sun Pharma | Noveltreat | SUNPHARMA.NS |
| Zydus Lifesciences | Semaglyn, Alterme | ZYDUSLIFE.NS |
| Cipla | Yurpeak (tirzepatide) | CIPLA.NS |
| Glenmark | GLIPIQ, Lirafit | GLENMARK.NS |
| Natco Pharma | Semanat | NATCOPHARMA.NS |
| Lupin | Semanext | LUPIN.NS |
| Alkem | Semasize | ALKEM.NS |
| Torrent | Sembolic | TORNTPHARM.NS |
| Mankind Pharma | Samakind | MANKIND.NS |
Signal classification and automated alerts
Each compound receives a signal label based on its Advancement Index and search momentum. Labels are descriptive, not predictive.
| Label | Criteria |
|---|---|
| Surging | Score ≥ 80, or score ≥ 70 with momentum > 5% |
| Rising | Score ≥ 55, or score ≥ 40 with momentum > 10% |
| Stable | Score ≥ 30 |
| Cooling | Score ≥ 15 |
| Dormant | Score < 15 |
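The label table can be read as a first-match cascade, sketched below. Evaluating the rows top to bottom is an assumption (the table does not state precedence), and the function name is hypothetical. Momentum is the search momentum in percent.

```python
def classify(score: float, momentum_pct: float) -> str:
    """Return the first label whose criteria match, checking rows top to bottom."""
    if score >= 80 or (score >= 70 and momentum_pct > 5):
        return "Surging"
    if score >= 55 or (score >= 40 and momentum_pct > 10):
        return "Rising"
    if score >= 30:
        return "Stable"
    if score >= 15:
        return "Cooling"
    return "Dormant"
```

Under this reading, a compound at 45 with 12% momentum is labeled Rising, while the same score with flat momentum is Stable. Momentum never lifts a compound scoring below 40.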
Automated alert rules
SQLite append-only signal history
Every 15 minutes, the system snapshots all compound scores into an SQLite database. This produces ~96 data points per compound per day, enabling trend visualization and historical analysis.
| Column | Type | Description |
|---|---|---|
| slug | TEXT | Compound identifier |
| timestamp | TEXT | ISO 8601 timestamp |
| signal_score | REAL | Raw composite score |
| advancement_index | REAL | Final blended score |
| social_buzz | REAL | Social dimension (0-100) |
| search_momentum | REAL | Search dimension (0-100) |
| research_velocity | REAL | Research dimension (0-100) |
| sentiment | REAL | Sentiment dimension (0-100) |
| mu | REAL | Bayesian mean |
| sigma | REAL | Bayesian uncertainty |
| reddit_posts_7d | INTEGER | Reddit post count |
| google_interest | REAL | Google Trends score |
| google_momentum_pct | REAL | Search momentum % |
| openalex_recent | INTEGER | Recent paper count |
| trials_recruiting | INTEGER | Active recruiting trials |
| fda_approved | INTEGER | FDA approval flag |
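A minimal sketch of this schema with Python's built-in `sqlite3` module follows. The column names and types come from the table above; the table name `signal_history`, the index, and the helper functions are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Columns and types from the schema table; the table name is hypothetical.
COLUMNS = {
    "slug": "TEXT", "timestamp": "TEXT", "signal_score": "REAL",
    "advancement_index": "REAL", "social_buzz": "REAL", "search_momentum": "REAL",
    "research_velocity": "REAL", "sentiment": "REAL", "mu": "REAL", "sigma": "REAL",
    "reddit_posts_7d": "INTEGER", "google_interest": "REAL",
    "google_momentum_pct": "REAL", "openalex_recent": "INTEGER",
    "trials_recruiting": "INTEGER", "fda_approved": "INTEGER",
}


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the history table and a lookup index if they do not exist."""
    conn = sqlite3.connect(path)
    column_sql = ", ".join(f"{name} {kind}" for name, kind in COLUMNS.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS signal_history ({column_sql})")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_history_slug_ts "
        "ON signal_history (slug, timestamp)"
    )
    return conn


def snapshot(conn: sqlite3.Connection, slug: str, values: dict) -> None:
    """Append one row; existing rows are never updated or deleted."""
    unknown = values.keys() - COLUMNS.keys()
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    row = {"slug": slug, "timestamp": datetime.now(timezone.utc).isoformat(), **values}
    columns = ", ".join(row)
    placeholders = ", ".join(f":{name}" for name in row)
    conn.execute(f"INSERT INTO signal_history ({columns}) VALUES ({placeholders})", row)
    conn.commit()


def history(conn: sqlite3.Connection, slug: str) -> list[tuple]:
    """Return (timestamp, advancement_index) pairs, oldest first."""
    query = (
        "SELECT timestamp, advancement_index FROM signal_history "
        "WHERE slug = ? ORDER BY timestamp"
    )
    return conn.execute(query, (slug,)).fetchall()
```

Storing timestamps as ISO 8601 text in a single timezone keeps lexicographic order equal to chronological order, so `ORDER BY timestamp` works without date parsing.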
REST endpoints
All endpoints are served from the FastAPI application on port 8420. CORS is enabled. Responses are JSON unless otherwise noted.
| Endpoint | Method | Description |
|---|---|---|
| /api/signals/cache | GET | Full ranked compound cache, sorted by advancement_index |
| /api/signals/compound/{slug} | GET | Single compound detail. Pass ?refresh=true to force live rescore. |
| /api/signals/top?limit=20 | GET | Top N movers with score, label, momentum, alerts |
| /api/signals/history/{slug}?days=30 | GET | Time-series history from SQLite |
| /api/signals/alerts | GET | All active alerts across all compounds |
| /api/signals/refresh | POST | Trigger full rescan and rescore (10-min timeout) |
| /api/glp/market?region=us | GET | GLP-1 market data, stats, and live stock tickers. Regions: us, india |
| /api/signals/stream | GET | Server-sent events stream for live dashboard updates (30s interval) |
| /api/waitlist | POST | Email subscription. Stores locally, sends confirmation via Resend. |
| /api/contact | POST | Contact form. Stores locally, notifies via Resend. |
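A small client for the GET endpoints might look like the sketch below, using only the standard library. The host (`localhost`) is an assumption, since only the port is documented, and the helper names are hypothetical. Response shapes are not specified here, so the sketch returns the decoded JSON as-is.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8420"  # assumption: host is not documented, only the port


def endpoint(path: str, **params) -> str:
    """Build a full URL for an API path with optional query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


def history_url(slug: str, days: int = 30) -> str:
    """URL for a compound's time-series history."""
    return endpoint(f"/api/signals/history/{quote(slug, safe='')}", days=days)


def get_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `get_json(endpoint("/api/signals/top", limit=20))` would fetch the top movers, and `get_json(history_url("semaglutide", days=7))` a week of history.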
Self-expanding compound universe
The system does not maintain a fixed list. It continuously scans social and research sources for new compound mentions, validates them against academic databases, and adds them to the catalog automatically. It started with 73 seed compounds and now tracks 190+ across 20 categories.
Categories
| Category | Examples |
|---|---|
| GLP-1 / Weight Loss | Semaglutide, Tirzepatide, Retatrutide, Survodutide, Cagrisema, Orforglipron |
| Growth Hormone | Tesamorelin, MK-677, CJC-1295, Ipamorelin, Sermorelin, Hexarelin |
| Healing & Repair | BPC-157, TB-500, TB4-FRAG |
| Nootropic | Semax, Selank, Dihexa, P21, Cerebrolysin, Cortexin |
| Longevity | SS-31, Rapamycin, NAD+, Humanin, MOTS-c, FOXO4-DRI, Epitalon |
| Skin, Hair & Aging | Melanotan-2, GHK-Cu, SNAP-8, Argireline, Matrixyl |
| Immune | Thymosin Alpha-1, LL-37, KPV, Larazotide |
| Muscle & Performance | Follistatin-344, ACE-031, AOD-9604, IGF-1 LR3 |
| SARMs | Enclomiphene, Ostarine, RAD-140, LGD-4033, Cardarine |
| Bioregulators | Thymalin, Thymogen, Ovagen, Vesugen, Chelohart (Khavinson peptides) |
| Sexual Health | PT-141, Kisspeptin |
| Clinical Pipeline | VX-548, ER-100 |
| Nucleotides | NAD+, NMN, NR (Dinucleotide, Mononucleotide) |
| Antimicrobial | LL-37, Thymosin peptides |
Fuzzy matching
Compounds are discovered via fuzzy string matching against social mentions. Thresholds: basic similarity ≥ 85%, token-flexible ≥ 80%, phonetic similarity ≥ 0.9. This catches misspellings (e.g., "semiglutide" → semaglutide) and brand name references.
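The basic-similarity check can be illustrated with the standard library's `difflib`. This is a stand-in, not the platform's matcher: the actual library is not named here, `SequenceMatcher.ratio()` may score pairs differently from it, and the token-flexible and phonetic checks are not covered. The catalog below is a hypothetical three-item sample.

```python
from difflib import SequenceMatcher


def best_match(mention: str, catalog: list[str], threshold: float = 0.85):
    """Return the catalog name most similar to `mention`, or None below the threshold."""
    needle = mention.strip().lower()
    scored = [
        (SequenceMatcher(None, needle, name.lower()).ratio(), name)
        for name in catalog
    ]
    score, name = max(scored, default=(0.0, None))
    return name if score >= threshold else None


CATALOG = ["semaglutide", "tirzepatide", "retatrutide"]  # hypothetical sample
match = best_match("semiglutide", CATALOG)
```

Here "semiglutide" shares 10 of 11 characters in order with "semaglutide", giving a ratio of about 0.91, which clears the 85% threshold; an unrelated word returns `None`.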
Not medical advice. Not investment advice. The Advancement Index is provided for informational and research purposes only. Signal data reflects observable public activity and does not constitute a recommendation to buy, sell, or use any compound. Data sources are public APIs subject to their own terms of service.