Technical Documentation

Complete reference for the Advancement Index scoring engine, data pipeline, real-time architecture, and GLP-1 market tracking system. This document covers every formula, data source, refresh cycle, and API endpoint that powers the platform.

System architecture

The platform continuously ingests data from 8 independent public sources, scores 190+ compounds across 20 categories, and serves real-time updates to the dashboard. The scoring engine runs on FastAPI with SQLite time-series storage, Bayesian rating persistence, and server-sent events for live updates.

Sources: Reddit, Google Trends, OpenAlex, PubMed, ClinicalTrials.gov, FDA, arXiv, Yahoo Finance
Pipeline: Scrapers → ln(x+1) Transform → Weighted Composite → EMA Smoothing → Bayesian Update → Advancement Index

Three-layer composite scoring

Every compound score is constructed through three sequential layers. Layer 1 combines six raw signal dimensions using fixed weights. Layer 2 applies exponential moving average smoothing to filter transient noise. Layer 3 blends the EMA with a Bayesian rating that encodes longer-term trajectory confidence.

Raw Score = Σ_i (dimension_i × weight_i)
EMA_t = EMA_(t−1) + α × (Raw_t − EMA_(t−1))
α = 1 − 2^(−Δt / 12h)
Advancement Index = 0.7 × EMA + 0.3 × Bayesian μ

All raw counts undergo a ln(x + 1) transform before normalization. This stabilizes variance across power-law distributions: a compound with 10,000 papers and one with 10 are both scored on a comparable logarithmic scale.


Six independent signal vectors

Each dimension is scored independently on a 0-100 scale, then combined using fixed weights. Research velocity carries the highest weight because clinical and academic evidence is the least gameable signal.

Research Velocity: 30%
Social Signal: 25%
Search Momentum: 15%
Regulatory Signal: 15%
Sentiment: 10%
Market Signal: 5%
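The weighted combination can be sketched in a few lines of Python. The dictionary keys and the function name `raw_score` are illustrative assumptions; only the six weights come from the list above.

```python
# Fixed dimension weights as listed above (they sum to 1.0).
# The key names are hypothetical; only the percentages are documented.
WEIGHTS = {
    "research_velocity": 0.30,
    "social_signal": 0.25,
    "search_momentum": 0.15,
    "regulatory_signal": 0.15,
    "sentiment": 0.10,
    "market_signal": 0.05,
}


def raw_score(dimensions: dict[str, float]) -> float:
    """Combine six 0-100 dimension scores into a single 0-100 raw score."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(dimensions[name] * weight for name, weight in WEIGHTS.items())
```

Because the weights sum to 1.0, a compound scoring 100 on every dimension gets a raw score of 100, and no single dimension can move the composite by more than its own weight.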

Social Signal (25%)

Aggregated from Reddit post volume, engagement, and subreddit spread across 65+ monitored communities. Volume and engagement each contribute 40%, and spread contributes 20%. Within engagement, each comment counts three times as much as one point of post score.

volume = ln_norm(posts_7d, scale=200)
engagement = ln_norm(score + comments × 3, scale=5000)
spread = ln_norm(subreddits, scale=15)
social = volume × 0.4 + engagement × 0.4 + spread × 0.2

Search Momentum (15%)

Google Trends interest score over 90-day windows. Composite of average interest (50%), momentum percentage change (30%), and count of rising related queries (20%).

Research Velocity (30%)

Papers, citations, and clinical trial activity across OpenAlex, PubMed, and ClinicalTrials.gov. Recent papers carry the highest sub-weight (30%), followed by actively recruiting trials (25%), which are weighted heavily because they represent current investment in the compound.

papers = ln_norm(openalex + pubmed, scale=30,000) × 0.15
recent = ln_norm(recent_papers, scale=5,000) × 0.30
citations = ln_norm(citations, scale=50,000) × 0.15
trials = ln_norm(trials_total, scale=500) × 0.15
active = ln_norm(recruiting, scale=50) × 0.25

Regulatory Signal (15%)

FDA approval status provides a +60 point base. Additional signal comes from label count (10%), completed trial phase progression (15%), and the active recruiting pipeline (15%).

Sentiment (10%)

Keyword-based sentiment analysis on Reddit discourse. Positive and negative keyword dictionaries produce a score from -1 to +1, mapped linearly to 0-100.

sentiment = max(0, min(100, (avg + 1) × 50))
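A minimal sketch of this mapping follows. The keyword sets and the per-post polarity formula are assumptions for illustration, since the actual dictionaries and counting rules are not documented here. Only the final clamp-and-scale step is taken from the formula above.

```python
# Hypothetical keyword sets, for illustration only.
POSITIVE = {"effective", "improved", "great", "works"}
NEGATIVE = {"nausea", "worse", "scam", "dangerous"}


def keyword_polarity(text: str) -> float:
    """Return a polarity in [-1, 1]: (pos - neg) / (pos + neg), or 0 with no hits.

    This ratio is an assumed scoring rule, not the documented one.
    """
    words = [w.strip(".,!?;:()\"'").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def sentiment_score(polarities: list[float]) -> float:
    """Average per-post polarities and map [-1, 1] linearly onto [0, 100]."""
    avg = sum(polarities) / len(polarities) if polarities else 0.0
    return max(0.0, min(100.0, (avg + 1) * 50))
```

Under this mapping a neutral average of 0 lands at 50, so a compound with no sentiment data sits in the middle of the scale rather than at zero.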

Market Signal (5%)

Stock price momentum and trading volume for companies with exposure to each compound. Sourced from Yahoo Finance. Weighted lowest because market signals follow advancement rather than leading it.


Logarithmic scale factors

Every raw count passes through ln(x + 1) before normalization. The scale factor defines what count maps to 100 points on the logarithmic curve. Counts beyond the scale factor are capped at 100.

Metric | Scale Factor | Meaning
Reddit posts (7d) | 200 | 200 posts/week = max score
Reddit engagement | 5,000 | score + comments × 3
Subreddit spread | 15 | 15 unique subreddits
Total papers | 30,000 | OpenAlex + PubMed combined
Recent papers | 5,000 | Last 2 years
Citations | 50,000 | Total citation count
Clinical trials | 500 | Total registered trials
Recruiting trials | 50 | Actively recruiting
FDA labels | 10 | Approved label count
Completed trials | 100 | Phase progression signal
Stock volume | 50,000,000 | Daily trading volume
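The `ln_norm` helper used in the dimension formulas is not defined in this document. One plausible reading, consistent with "the scale factor maps to 100 points" and "counts beyond the scale factor are capped", is sketched below. The exact normalization is an assumption. The `social_signal` function reuses the documented Social Signal weights as a usage example.

```python
import math


def ln_norm(count: float, scale: float) -> float:
    """Map a raw count onto 0-100 via ln(x + 1), reaching 100 at the scale factor.

    Assumed form: 100 * ln(count + 1) / ln(scale + 1), capped at 100.
    """
    if count <= 0:
        return 0.0
    return min(100.0, 100.0 * math.log(count + 1) / math.log(scale + 1))


def social_signal(posts_7d: int, score: int, comments: int, subreddits: int) -> float:
    """Social dimension built from the documented sub-weights (40/40/20)."""
    volume = ln_norm(posts_7d, scale=200)
    engagement = ln_norm(score + comments * 3, scale=5000)
    spread = ln_norm(subreddits, scale=15)
    return volume * 0.4 + engagement * 0.4 + spread * 0.2
```

With this form, early counts move the score quickly and later counts move it slowly, which is the variance-stabilizing behavior the transform is meant to provide.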

Exponential moving average with 12-hour half-life

Raw scores are noisy. A single viral Reddit post can move a compound's social signal by 40 points in a day. The EMA layer smooths this by weighting recent observations exponentially, with a half-life of 12 hours.

α = 1 − 2^(−Δt / 12h)
α = clamp(α, 0.001, 1.0)
EMA_new = EMA_old + α × (Raw − EMA_old)

After 12 hours, a new observation has ~50% influence on the EMA. After 24 hours, ~75%. After 48 hours, ~94%, and after 3 days, ~98%. This means genuine sustained trends propagate within 1-2 days, while transient spikes are damped within hours.
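The smoothing step translates directly into Python. The function names are illustrative. The half-life, the clamp bounds, and the update rule are the ones given above, with Δt measured in hours since the previous update.

```python
HALF_LIFE_HOURS = 12.0


def ema_alpha(dt_hours: float, half_life: float = HALF_LIFE_HOURS) -> float:
    """Smoothing factor for an irregular time step, clamped to [0.001, 1.0]."""
    alpha = 1.0 - 2.0 ** (-dt_hours / half_life)
    return max(0.001, min(1.0, alpha))


def ema_update(ema_old: float, raw: float, dt_hours: float) -> float:
    """Move the EMA toward the new raw score by a fraction alpha."""
    alpha = ema_alpha(dt_hours)
    return ema_old + alpha * (raw - ema_old)
```

Deriving α from the elapsed time rather than fixing it keeps the half-life at 12 hours even when scoring cycles are delayed or run at uneven intervals.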


Encoding compound trajectory as a probability distribution

Each compound is modeled as a Bayesian rating with a mean (μ) and uncertainty (σ), using the OpenSkill library's Plackett-Luce model. New compounds start with high uncertainty (μ = 25, σ = 8.33). As observations accumulate, the model tightens its confidence interval.

Each scoring cycle compares the compound's current raw score against its previous score. Score increases are treated as "wins" against a baseline, decreases as "losses." The Plackett-Luce model updates μ and σ accordingly.

Advancement Index = 0.7 × EMA + 0.3 × μ_bayesian

A compound that scores consistently high earns a stable prior that resists transient drops. A compound riding a single spike will see its Bayesian score remain conservative until sustained evidence confirms the trend. State is persisted to disk between server restarts.
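As a worked example of the blend, consider a newly discovered compound whose EMA has spiked to 80 while its rating still sits at the starting mean of 25. The sketch below assumes μ enters the formula on the scale the rating model reports, without rescaling. The document does not say whether any rescaling is applied.

```python
def advancement_index(ema: float, mu: float) -> float:
    """Blend the smoothed score with the Bayesian mean (70/30 split)."""
    return 0.7 * ema + 0.3 * mu


# New compound: EMA has spiked to 80, rating is still at the default mean of 25.
spiking_newcomer = advancement_index(ema=80.0, mu=25.0)
```

Here the blend works out to 0.7 × 80 + 0.3 × 25 = 63.5, so the spike is reflected in the index but held back until the rating catches up.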


Eight independent public sources

Source | What We Collect | Method
Reddit | Post volume, score, comments, sentiment, subreddit spread across 65+ communities (peptides, Semaglutide, Biohackers, longevity, DrugNerds, etc.) | Apify Reddit Scraper Pro, batched 10 subreddits/request
Google Trends | Average interest (0-100), momentum %, peak interest, rising queries, regional breakdown | pytrends library, 90-day windows
OpenAlex | Total papers, recent papers (last 2 years), citation count, top titles | REST API, sorted by citation count
PubMed | Total papers, recent papers, paper titles, publication dates | NCBI E-utilities (esearch + esummary)
ClinicalTrials.gov | Total trials, recruiting trials, completed trials, recent trial details | v2 REST API, paginated
FDA / openFDA | Approval status, label count, brand names | Drug label API
arXiv | Preprint count, recent preprints, titles | Atom feed API
Yahoo Finance | Stock price, change %, volume, 52-week range for pharma companies mapped to compounds | Chart API (free, no auth), US (.NYQ) and India (.NS) tickers

Reddit communities monitored

Tier 1 (core): peptides, Peptides, PeptideScience, researchchemicals. Tier 2 (compound-specific): Semaglutide, Ozempic, Wegovy, tirzepatide, Mounjaro. Tier 3 (health): Biohackers, longevity, Nootropics, PEDs, fitness, diabetes. Tier 4 (research): DrugNerds, medicine, pharmacology, neuroscience. Total: 65+ subreddits.

Stock ticker mappings

40+ global pharma companies are mapped to compounds. US tickers include NVO (Novo Nordisk), LLY (Eli Lilly), AZN (AstraZeneca), PFE (Pfizer), AMGN (Amgen). Indian NSE tickers include SUNPHARMA.NS, DRREDDY.NS, CIPLA.NS, ZYDUSLIFE.NS, LUPIN.NS, GLENMARK.NS for post-patent generic manufacturers.


Multi-layer real-time data pipeline

Data freshness varies by source. Social signals refresh most frequently because they move fastest. Research data refreshes less often because papers and trials update on longer timescales.

Full Rescore (4h): All compounds re-scored via cron. Reddit + all APIs.
History Snapshots (15min): SQLite time-series snapshots for trend charts.
SSE Broadcast (30s): Server-sent events push live scores to connected dashboards.
Frontend Refresh (5min): GLP market views auto-refresh stock tickers and stats.
Reddit Scan (5min): Background auto-refresh of social signals.
Google Trends (1h): Search momentum and rising query refresh.

US and India GLP-1 market intelligence

Dedicated market views track the GLP-1 therapeutic landscape in the United States and India. Both views pull live stock data from Yahoo Finance, map compounds to companies and brands, and display macro health statistics.

US Market

Tracks 11 GLP-1 compounds with FDA status, brand names, originator companies, and stock performance. Top-line stats: CDC obesity rate (40.3%, 100M+ adults), diabetes prevalence (12%, 37M adults), annual economic burden ($260B), GLP-1 market size ($54B 2024), and top combined revenue ($41.2B for semaglutide + tirzepatide).

Key tickers: NVO (Novo Nordisk, semaglutide), LLY (Eli Lilly, tirzepatide/retatrutide), AZN (AstraZeneca), ALT (Altimmune, pemvidutide).

India Market

Following semaglutide's patent expiry on March 20, 2026, 40+ Indian companies launched 50+ generic brands. The India view tracks originator products alongside domestic generics, with live NSE stock tickers for companies like Dr. Reddy's, Sun Pharma, Cipla, Zydus Lifesciences, Lupin, and Glenmark.

India stats: obesity rate (24%, 350M+ overweight), diabetes prevalence (89.8M, 10.5%), annual burden ($29B), GLP-1 market ($118M, projected $530M by 2030), and 40+ generic manufacturers post-patent.

Key generic brands (India, post-patent)

Company | Brand | NSE Ticker
Dr. Reddy's | Obeda | DRREDDY.NS
Sun Pharma | Noveltreat | SUNPHARMA.NS
Zydus Lifesciences | Semaglyn, Alterme | ZYDUSLIFE.NS
Cipla | Yurpeak (tirzepatide) | CIPLA.NS
Glenmark | GLIPIQ, Lirafit | GLENMARK.NS
Natco Pharma | Semanat | NATCOPHARMA.NS
Lupin | Semanext | LUPIN.NS
Alkem | Semasize | ALKEM.NS
Torrent | Sembolic | TORNTPHARM.NS
Mankind Pharma | Samakind | MANKIND.NS

Signal classification and automated alerts

Each compound receives a signal label based on its Advancement Index and search momentum. Labels are descriptive, not predictive.

Label | Criteria
Surging | Score ≥ 80, or score ≥ 70 with momentum > 5%
Rising | Score ≥ 55, or score ≥ 40 with momentum > 10%
Stable | Score ≥ 30
Cooling | Score ≥ 15
Dormant | Score < 15
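The label criteria can be expressed as a short function. The sketch assumes the rules are checked from top to bottom and the first match wins. The function and parameter names are illustrative.

```python
def classify(score: float, momentum_pct: float = 0.0) -> str:
    """Return the signal label for an Advancement Index and search momentum (%)."""
    if score >= 80 or (score >= 70 and momentum_pct > 5):
        return "Surging"
    if score >= 55 or (score >= 40 and momentum_pct > 10):
        return "Rising"
    if score >= 30:
        return "Stable"
    if score >= 15:
        return "Cooling"
    return "Dormant"
```

For example, a score of 72 is "Rising" on its own but becomes "Surging" once search momentum exceeds 5%, which is how momentum lets a compound cross a label boundary early.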

Automated alert rules

Social buzz ≥ 80: Exceptional social media activity
Search +30%: Search interest surging
Search -20%: Search interest declining
Recruiting ≥ 3: Multiple trials actively recruiting
FDA approved: Regulatory milestone reached
Posts ≥ 20/week: High Reddit activity
5+ subreddits: Cross-community discussion
Sentiment < -0.3: Negative sentiment, possible safety concerns
50+ recent papers: Active research surge
5,000+ citations: Highly cited compound
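A rule table like this is commonly evaluated as a list of (condition, message) pairs. The sketch below assumes a compound record is a plain dict. Field names such as `subreddits`, `sentiment_avg`, and `citations` are hypothetical, and the search thresholds are treated as inclusive, which the rules above do not specify.

```python
def active_alerts(c: dict) -> list[str]:
    """Return the alert messages whose conditions hold for one compound record."""
    momentum = c.get("google_momentum_pct", 0)
    rules = [
        (c.get("social_buzz", 0) >= 80, "Exceptional social media activity"),
        (momentum >= 30, "Search interest surging"),
        (momentum <= -20, "Search interest declining"),
        (c.get("trials_recruiting", 0) >= 3, "Multiple trials actively recruiting"),
        (bool(c.get("fda_approved", 0)), "Regulatory milestone reached"),
        (c.get("reddit_posts_7d", 0) >= 20, "High Reddit activity"),
        (c.get("subreddits", 0) >= 5, "Cross-community discussion"),
        # sentiment_avg is the raw -1..+1 average, not the 0-100 dimension score
        (c.get("sentiment_avg", 0.0) < -0.3, "Negative sentiment, possible safety concerns"),
        (c.get("openalex_recent", 0) >= 50, "Active research surge"),
        (c.get("citations", 0) >= 5000, "Highly cited compound"),
    ]
    return [message for triggered, message in rules if triggered]
```

Keeping the rules as data makes it straightforward to add or retune a threshold without touching the evaluation logic.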

SQLite append-only signal history

Every 15 minutes, the system snapshots all compound scores into an SQLite database. This produces ~96 data points per compound per day, enabling trend visualization and historical analysis.

Column | Type | Description
slug | TEXT | Compound identifier
timestamp | TEXT | ISO 8601 timestamp
signal_score | REAL | Raw composite score
advancement_index | REAL | Final blended score
social_buzz | REAL | Social dimension (0-100)
search_momentum | REAL | Search dimension (0-100)
research_velocity | REAL | Research dimension (0-100)
sentiment | REAL | Sentiment dimension (0-100)
mu | REAL | Bayesian mean
sigma | REAL | Bayesian uncertainty
reddit_posts_7d | INTEGER | Reddit post count
google_interest | REAL | Google Trends score
google_momentum_pct | REAL | Search momentum %
openalex_recent | INTEGER | Recent paper count
trials_recruiting | INTEGER | Active recruiting trials
fda_approved | INTEGER | FDA approval flag
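The schema above maps onto Python's built-in `sqlite3` module as follows. The table name `signal_history`, the index, and the helper names are assumptions; the columns and types are the documented ones. The sketch only ever inserts rows, matching the append-only design.

```python
import sqlite3
from datetime import datetime, timezone

# Column name -> SQLite type, in the documented order.
COLUMNS = {
    "slug": "TEXT", "timestamp": "TEXT", "signal_score": "REAL",
    "advancement_index": "REAL", "social_buzz": "REAL", "search_momentum": "REAL",
    "research_velocity": "REAL", "sentiment": "REAL", "mu": "REAL", "sigma": "REAL",
    "reddit_posts_7d": "INTEGER", "google_interest": "REAL",
    "google_momentum_pct": "REAL", "openalex_recent": "INTEGER",
    "trials_recruiting": "INTEGER", "fda_approved": "INTEGER",
}
TABLE = "signal_history"  # hypothetical table name


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the history table if it does not exist."""
    conn = sqlite3.connect(path)
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in COLUMNS.items())
    conn.execute(f"CREATE TABLE IF NOT EXISTS {TABLE} ({cols})")
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_slug_ts ON {TABLE} (slug, timestamp)")
    return conn


def snapshot(conn: sqlite3.Connection, row: dict) -> None:
    """Append one snapshot; missing columns are stored as NULL."""
    values = {"timestamp": datetime.now(timezone.utc).isoformat(), **row}
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO {TABLE} ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        [values.get(name) for name in COLUMNS],
    )
    conn.commit()


def history(conn: sqlite3.Connection, slug: str) -> list[tuple]:
    """Return (timestamp, advancement_index) pairs for one compound, oldest first."""
    cur = conn.execute(
        f"SELECT timestamp, advancement_index FROM {TABLE} "
        "WHERE slug = ? ORDER BY timestamp",
        (slug,),
    )
    return cur.fetchall()
```

Storing timestamps as ISO 8601 text in a single time zone means lexical order matches chronological order, so `ORDER BY timestamp` works without date parsing.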

REST endpoints

All endpoints are served from the FastAPI application on port 8420. CORS is enabled. Responses are JSON unless otherwise noted.

Endpoint | Method | Description
/api/signals/cache | GET | Full ranked compound cache, sorted by advancement_index
/api/signals/compound/{slug} | GET | Single compound detail. Pass ?refresh=true to force live rescore.
/api/signals/top?limit=20 | GET | Top N movers with score, label, momentum, alerts
/api/signals/history/{slug}?days=30 | GET | Time-series history from SQLite
/api/signals/alerts | GET | All active alerts across all compounds
/api/signals/refresh | POST | Trigger full rescan and rescore (10-min timeout)
/api/glp/market?region=us | GET | GLP-1 market data, stats, and live stock tickers. Regions: us, india
/api/signals/stream | GET | Server-sent events stream for live dashboard updates (30s interval)
/api/waitlist | POST | Email subscription. Stores locally, sends confirmation via Resend.
/api/contact | POST | Contact form. Stores locally, notifies via Resend.
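A minimal client for the GET endpoints can be written with the standard library alone. The host in `BASE_URL` is an assumption (only port 8420 is documented), the helper names are illustrative, and the response bodies are returned as parsed JSON without assuming any particular shape.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8420"  # assumed host; the port is the documented one


def endpoint(path: str, **params) -> str:
    """Build a full URL for an API path with optional query parameters."""
    url = BASE_URL + quote(path)
    return f"{url}?{urlencode(params)}" if params else url


def get_json(path: str, **params):
    """Perform a GET request and decode the JSON response."""
    with urlopen(endpoint(path, **params), timeout=30) as response:
        return json.load(response)


def history_url(slug: str, days: int = 30) -> str:
    """URL for the time-series history of one compound."""
    return endpoint(f"/api/signals/history/{slug}", days=days)
```

With a server running, `get_json("/api/signals/top", limit=20)` would fetch the top movers and `get_json("/api/glp/market", region="india")` the India market view.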

Self-expanding compound universe

The system does not maintain a fixed list. It continuously scans social and research sources for new compound mentions, validates them against academic databases, and adds them to the catalog automatically. It started with 73 seed compounds and now tracks 190+ across 20 categories.

Categories

Category | Examples
GLP-1 / Weight Loss | Semaglutide, Tirzepatide, Retatrutide, Survodutide, Cagrisema, Orforglipron
Growth Hormone | Tesamorelin, MK-677, CJC-1295, Ipamorelin, Sermorelin, Hexarelin
Healing & Repair | BPC-157, TB-500, TB4-FRAG
Nootropic | Semax, Selank, Dihexa, P21, Cerebrolysin, Cortexin
Longevity | SS-31, Rapamycin, NAD+, Humanin, MOTS-c, FOXO4-DRI, Epitalon
Skin, Hair & Aging | Melanotan-2, GHK-Cu, SNAP-8, Argireline, Matrixyl
Immune | Thymosin Alpha-1, LL-37, KPV, Larazotide
Muscle & Performance | Follistatin-344, ACE-031, AOD-9604, IGF-1 LR3
SARMs | Enclomiphene, Ostarine, RAD-140, LGD-4033, Cardarine
Bioregulators | Thymalin, Thymogen, Ovagen, Vesugen, Chelohart (Khavinson peptides)
Sexual Health | PT-141, Kisspeptin
Clinical Pipeline | VX-548, ER-100
Nucleotides | NAD+, NMN, NR (Dinucleotide, Mononucleotide)
Antimicrobial | LL-37, Thymosin peptides

Fuzzy matching

Compounds are discovered via fuzzy string matching against social mentions. Thresholds: basic similarity ≥ 85%, token-flexible ≥ 80%, phonetic similarity ≥ 0.9. This catches misspellings (e.g., "semiglutide" → semaglutide) and brand name references.
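The first two thresholds can be illustrated with the standard library. This sketch substitutes `difflib.SequenceMatcher` for whatever similarity library the platform actually uses, approximates the token-flexible check by comparing each word of the mention on its own, and omits the phonetic check because the standard library has no phonetic encoder. All function names are illustrative.

```python
from difflib import SequenceMatcher
from typing import Optional

BASIC_THRESHOLD = 0.85  # whole-string similarity
TOKEN_THRESHOLD = 0.80  # best single-token similarity


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(mention: str, catalog: list[str]) -> Optional[str]:
    """Return the catalog name that best matches a mention, or None."""
    best_name, best_score = None, 0.0
    for name in catalog:
        whole = similarity(mention, name)
        per_token = max((similarity(tok, name) for tok in mention.split()), default=0.0)
        if whole >= BASIC_THRESHOLD or per_token >= TOKEN_THRESHOLD:
            score = max(whole, per_token)
            if score > best_score:
                best_name, best_score = name, score
    return best_name
```

Here "semiglutide" scores about 0.91 against "semaglutide" and is matched, while an unrelated word such as "aspirin" falls below both thresholds and returns None.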


Not medical advice. Not investment advice. The Advancement Index is provided for informational and research purposes only. Signal data reflects observable public activity and does not constitute a recommendation to buy, sell, or use any compound. Data sources are public APIs subject to their own terms of service.