Gaussia
Born at Alquimia AI Labs
Scientific metrics for intelligent behaviors.
If you can't trace it to a paper, it's not a metric. It's an opinion. Gaussia implements peer-reviewed research so your evaluations become part of the scientific record.
The Problem No One Wants to Admit
There's a thriving ecosystem of tools for evaluating language models. Most of them work. And yet, if you ask an engineering team why a faithfulness score of 0.83 is trustworthy, the most common answer is:
"Because that's what the dashboard says."
That's not a technical problem. It's an epistemological one. Evaluating AI systems with metrics that no one can trace, cite, or reproduce is exactly the kind of magical thinking those systems taught us to avoid.
The Current Landscape
DeepEval
Quality metrics. Strengths: 50+ metrics for RAG, agents, and conversation. Limitations: most metrics require LLM API calls; scientific backing is partial.
RAGAS
RAG evaluation. Strengths: academic origin (arXiv paper); no ground truth needed. Limitations: scope limited to retrieval and generation; no safety or fairness coverage.
Promptfoo
Red teaming. Strengths: 50+ attack types; adopted by OpenAI and Anthropic. Limitations: exclusively offensive; no quality or fairness metrics.
Garak
Vulnerability scanning. Strengths: 3000+ static probes; research-oriented. Limitations: static attacks only; doesn't learn from context.
What These Tools Define by Omission
No Integration
Quality, security, and ethics exist in silos. No single tool covers all three with scientific rigor.
No Traceability
When a tool gives you a score of 0.83, can you cite the paper that defines what 0.83 means?
Models Only
Every tool assumes you're evaluating an AI model. But intelligent behavior can come from humans too.
Gaussia starts from a different premise:
The unit of analysis is the behavior, not the architecture.
A behavior can come from an LLM, a voice agent in a call center, a human operator, or a hybrid system. If you want to measure quality, coherence, or bias in responses, it shouldn't matter who or what produced them.
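To make that premise concrete, here is a toy sketch in plain Python (not the Gaussia API): the metric only ever sees the behavior, so who produced it never enters the score.

```python
# Toy sketch, not the Gaussia API: the metric sees only the behavior itself,
# so an LLM output, a voice-agent transcript, and a human reply are all scored
# the same way. The producer is metadata, never an input to the computation.
from dataclasses import dataclass

@dataclass
class Behavior:
    response: str
    produced_by: str  # "llm", "voice-agent", "human", "hybrid" -- informational only

def term_coverage(behavior: Behavior, required_terms: list[str]) -> float:
    """Toy metric: fraction of required terms mentioned in the response."""
    if not required_terms:
        return 1.0
    text = behavior.response.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)

# Same metric, different producers, identical treatment.
print(term_coverage(Behavior("Your refund was issued today.", "llm"), ["refund", "issued"]))        # 1.0
print(term_coverage(Behavior("The refund will arrive next week.", "human"), ["refund", "issued"]))  # 0.5
```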
The Contract
Every metric in Gaussia comes with explicit scientific backing. When you use a metric, you know exactly what paper defined it, how it was validated, and how to cite it in your own work.
Paper First
No metric exists without its paper. Title, authors, year, venue, arXiv/DOI, implementation notes, validation datasets, and BibTeX entry. All included.
Reproducible
Every implementation follows the exact methodology described in the paper. Run the same validation the authors did. Trust the numbers because they come from rigorous science.
Citeable
When you use Gaussia in production or research, you can cite the underlying papers. Your evaluations become part of the scientific record, not just a number in a dashboard.
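As a sketch of what that backing looks like as data, here is a minimal paper-metadata record. The schema is illustrative, not Gaussia's actual data model; the bibliographic fields are filled in with the RAGAS arXiv reference as a real-world example.

```python
# Illustrative schema only, not Gaussia's real metadata format. The point is that
# a metric ships with its paper: title, authors, year, venue, arXiv ID, BibTeX,
# and implementation notes, exactly as the contract above describes.
from dataclasses import dataclass

@dataclass(frozen=True)
class PaperReference:
    title: str
    authors: tuple[str, ...]
    year: int
    venue: str
    arxiv_id: str
    bibtex: str
    implementation_notes: str = ""

RAGAS_FAITHFULNESS = PaperReference(
    title="RAGAS: Automated Evaluation of Retrieval Augmented Generation",
    authors=("Shahul Es", "Jithin James", "Luis Espinosa-Anke", "Steven Schockaert"),
    year=2023,
    venue="arXiv preprint",
    arxiv_id="2309.15217",
    bibtex=(
        "@misc{es2023ragas, title={RAGAS: Automated Evaluation of Retrieval Augmented "
        "Generation}, author={Es, Shahul and James, Jithin and Espinosa-Anke, Luis and "
        "Schockaert, Steven}, year={2023}, eprint={2309.15217}, archivePrefix={arXiv}}"
    ),
    implementation_notes="Faithfulness = supported claims / total claims in the answer.",
)
```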
Why Gaussia is Different
Multi-Environment Native
Rust for low-level performance and WebAssembly. Python for deep analysis. JavaScript/TypeScript for edge and browser. Not wrappers. Native implementations.
Multimodal by Design
Text, audio, image, video. Intelligence doesn't communicate only with text. Neither should evaluation. Same scientific rigor across all modalities.
Infrastructure, Not Platform
MIT license. No telemetry. No lock-in. Build your own dashboards, auditing services, or compliance tools on top. Gaussia is the foundation, not the product.
The Architecture
Gaussia is organized into four modules, each with well-defined responsibilities and environment-specific implementations.
Papers First, SDKs Follow
Every metric starts as a paper. Once approved, the community builds official SDKs that implement the science in your language of choice.
Paper Repository
Scientific papers are proposed, reviewed, and approved by the community in gaussia-labs/papers. Each paper defines a metric with mathematical rigor.
Implementation Specs
Contributors translate papers into language-agnostic specifications through the RFC process. These specs define interfaces, edge cases, and test vectors (see the sketch below).
Community SDKs
Official SDKs maintained by the community implement specs in every major language. Same science, native performance.
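To make the Implementation Specs step concrete, here is a toy test vector and conformance check. The field names and layout are assumptions, not the actual gaussia-labs spec format; the idea is that every SDK must reproduce the same expected scores.

```python
# Illustrative only: the field names below are assumptions, not the real spec format.
# A spec ships test vectors that every official SDK must reproduce, which is what
# makes "same science, native performance" checkable rather than aspirational.
SPEC_TEST_VECTOR = {
    "metric": "faithfulness",
    "input": {"supported_claims": 4, "total_claims": 5},
    "expected": {"score": 0.8, "tolerance": 1e-9},
}

def conforms(implementation, vector: dict) -> bool:
    """True if an implementation reproduces the expected score within tolerance."""
    score = implementation(**vector["input"])
    return abs(score - vector["expected"]["score"]) <= vector["expected"]["tolerance"]

# Example: check a candidate Python implementation against the vector.
print(conforms(lambda supported_claims, total_claims: supported_claims / total_claims,
               SPEC_TEST_VECTOR))  # True
```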
Official SDKs
Python: stable
TypeScript: beta
Rust: planned
C++: planned
Swift: planned
Go: planned
Want to maintain an SDK for your favorite language?
Read the contribution guide
Simple, Powerful API
Get started in minutes with our intuitive API. Same concepts, native implementations for each environment.
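As a sketch of what a call could look like in the Python SDK: the module path, class name, and result fields below are assumptions for illustration, not the published API; check the documentation for the real interface.

```python
# Hypothetical usage sketch. Every name here (gaussia.metrics, Faithfulness,
# result.score, result.paper) is illustrative, not the published Gaussia API.
from gaussia.metrics import Faithfulness  # hypothetical import path

metric = Faithfulness()  # per the contract, backed by a specific, citable paper
result = metric.measure(
    answer="The Eiffel Tower is in Paris.",
    context=["The Eiffel Tower is a wrought-iron tower on the Champ de Mars in Paris."],
)
print(result.score)         # a number you can trace to the paper's definition
print(result.paper.bibtex)  # and cite in your own work
```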
Where We're Going
Gaussia's long-term vision is an auditor model: trained from scratch, with a fully public dataset and transparent methodology. Its training will be as auditable as the metrics it implements.
Foundation
- Core Python library with Quality & RAG metrics
- RAGAS paper implementations (faithfulness, relevance; see the sketch after this list)
- Basic red teaming: prompt injection, jailbreak detection
- Paper index with BibTeX citations
- RFC process for new metric proposals
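As a concrete example of what the Foundation milestone implements, the faithfulness score defined in the RAGAS paper (arXiv:2309.15217) reduces to a simple ratio once claims have been extracted and verified; a minimal sketch:

```python
# Faithfulness as defined in the RAGAS paper: the fraction of claims in a generated
# answer that are supported by the retrieved context. In the paper, claim extraction
# and verification are delegated to an LLM; here they are passed in as counts so the
# arithmetic itself stays transparent.
def faithfulness(supported_claims: int, total_claims: int) -> float:
    if total_claims == 0:
        return 0.0
    return supported_claims / total_claims

# An answer with 5 extracted claims, 4 of them supported by the context, scores 0.8.
print(faithfulness(supported_claims=4, total_claims=5))  # 0.8
```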
Multi-Environment
- JavaScript/TypeScript SDK for edge deployment
- Rust core for WebAssembly and high-performance use cases
- Guardrails module: PII detection, toxicity, content policy (toy sketch after this list)
- Helping module: prompt analysis and templates
- Multimodal support: audio transcription evaluation
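As a toy illustration of what rule-based PII detection involves (not the planned Guardrails module, which will need far broader coverage and locale awareness):

```python
# Toy rule-based PII detection for emails and US-style phone numbers. Real guardrails
# need many more categories, locales, and validation than two regular expressions.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(find_pii("Reach me at jane.doe@example.com or 555-867-5309."))
# {'email': ['jane.doe@example.com'], 'phone': ['555-867-5309']}
```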
The Auditor Model
- Training dataset: fully public, auditable line by line
- Model architecture and hyperparameters: transparent
- Validation against original paper benchmarks
- Sectoral finetuning: medical, legal, technical
- Community governance for model updates
About the Auditor Model
The destination of Gaussia, if the community validates it, is a model trained specifically for auditing AI responses. Not a general LLM fine-tuned for evaluation, but a model built from scratch with three distinguishing characteristics:
1. Completely public training dataset, including annotations
2. Transparent training process: architecture, hyperparameters, loss curves, known biases
3. Sectoral finetuning: medical, legal, technical domains with public or private weights
This model is not a substitute for human judgment. It's a tool that makes human judgment more scalable, consistent, and traceable.
Why Now
Three pressures are converging. Gaussia responds to all of them simultaneously.
Regulation is Here
The EU AI Act is in force. The NIST AI Risk Management Framework is increasingly referenced in US government procurement. Companies deploying LLMs in healthcare, finance, and education are receiving questions they can't answer with a dashboard: what methodology establishes that your system is 'fair'? What paper validates that your hallucination detector works?
Reproducibility Crisis
An ICSE 2025 study found that half of LLM research artifacts with ACM reproducibility badges failed to meet their requirements just one year later. If research can't reproduce itself, how can industry trust the metrics derived from it?
Infrastructure Migration
Deployment is moving away from Python servers: edge computing, browsers, mobile devices, embedded systems. The evaluation ecosystem didn't migrate with it. There's no Rust evaluation library. No native JavaScript implementation. Gaussia fills that gap.
Traceable scientific rigor, native multi-language implementations, and vendor independence: Gaussia's answer to all three pressures at once.
Community First
Gaussia doesn't implement metrics because they sound good. Every metric goes through a public review process before a single line of code is written. The scientific community validates the science. Then we write the implementation.
How Metrics Get Added
Paper Proposal
Anyone can open an issue with a reference to a peer-reviewed paper. The discussion starts with the science, not the code.
Methodology Discussion
Open debate about the paper's assumptions, limitations, and how the implementation should map to the methodology.
RFC & Implementation
Only after consensus on interpretation does the RFC open. Every design decision is documented and traceable.
This means Gaussia has an advantage no commercial tool can easily replicate: its metrics are publicly audited before they exist. The debate is visible. Disagreements are recorded. Implementation is traceable to documented decisions.
Documentation
Guides with mathematical foundations and implementation details.
Read the Docs
GitHub
Source code, issues, RFCs, and contribution guidelines.
View Repository
Discord
Discuss papers, implementations, and evaluation methodology.
Join Discussion
Paper Index
All referenced papers with BibTeX entries and summaries.
Explore Papers
Have a paper you'd like to see implemented?