Gaussia

Born at Alquimia AI Labs

Scientific metrics for intelligent behaviors.

If you can't trace it to a paper, it's not a metric. It's an opinion. Gaussia implements peer-reviewed research so your evaluations become part of the scientific record.

$ pip install gaussia
The Problem

The Problem No One Wants to Admit

There's a thriving ecosystem of tools for evaluating language models. Most of them work. And yet, if you ask an engineering team why a faithfulness score of 0.83 is trustworthy, the most common answer is:

"Because that's what the dashboard says."

That's not a technical problem. It's an epistemological one. Evaluating AI systems with metrics that no one can trace, cite, or reproduce is exactly the kind of magical thinking those systems taught us to avoid.

The Current Landscape

DeepEval (quality metrics)

50+ metrics for RAG, agents, and conversation. Caveat: most metrics require LLM API calls, and the scientific backing is partial.

RAGAS (RAG evaluation)

Academic origin (arXiv paper); needs no ground truth. Caveat: scope is limited to retrieval and generation, with no safety or fairness coverage.

Promptfoo (red teaming)

50+ attack types; adopted by OpenAI and Anthropic. Caveat: exclusively offensive, with no quality or fairness metrics.

Garak (vulnerability scanning)

3000+ static probes; research-oriented. Caveat: static attacks only; it doesn't learn from context.

What These Tools Define by Omission

No Integration

Quality, security, and ethics exist in silos. No single tool covers all three with scientific rigor.

No Traceability

When a tool gives you a score of 0.83, can you cite the paper that defines what 0.83 means?

Models Only

Every tool assumes you're evaluating an AI model. But intelligent behavior can come from humans too.

Gaussia starts from a different premise:

The unit of analysis is the behavior, not the architecture.

A behavior can come from an LLM, a voice agent in a call center, a human operator, or a hybrid system. If you want to measure quality, coherence, or bias in responses, it shouldn't matter who or what produced them.
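In code, that premise means a metric only ever sees the behavior itself. A minimal sketch in Python, assuming a hypothetical Coherence metric with a score() method:

# The metric never asks who produced the text, only what was said.
from gaussia.metrics import Coherence  # hypothetical import path

coherence = Coherence()
llm_score = coherence.score("Response generated by an LLM...")
human_score = coherence.score("Transcript of a human call-center operator...")
hybrid_score = coherence.score("Output of a human-in-the-loop pipeline...")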

Paper First

The Contract

Every metric in Gaussia comes with explicit scientific backing. When you use a metric, you know exactly what paper defined it, how it was validated, and how to cite it in your own work.

Paper First

No metric exists without its paper. Title, authors, year, venue, arXiv/DOI, implementation notes, validation datasets, and BibTeX entry. All included.

Reproducible

Every implementation follows the exact methodology described in the paper. Run the same validation the authors did. Trust the numbers because they come from rigorous science.

Citeable

When you use Gaussia in production or research, you can cite the underlying papers. Your evaluations become part of the scientific record, not just a number in a dashboard.
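A sketch of what that contract could look like in practice. Every name here (paper, validate, bibtex) is an illustrative assumption, not the published API; the paper metadata shown is the actual RAGAS reference (arXiv:2309.15217):

from gaussia.metrics import Faithfulness  # hypothetical import path

metric = Faithfulness()

# Paper First: the defining paper travels with the metric.
print(metric.paper.title)  # "RAGAS: Automated Evaluation of Retrieval Augmented Generation"
print(metric.paper.arxiv)  # "2309.15217"

# Reproducible: re-run the validation the authors ran.
report = metric.validate(dataset="wikieval")  # WikiEval, the RAGAS validation set
print(report.agreement_with_paper)

# Citeable: drop the BibTeX entry straight into your own bibliography.
print(metric.paper.bibtex)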

Why Gaussia is Different

Multi-Environment Native

Rust for low-level performance and WebAssembly. Python for deep analysis. JavaScript/TypeScript for edge and browser. Not wrappers. Native implementations.

Multimodal by Design

Text, audio, image, video. Intelligence doesn't communicate only with text. Neither should evaluation. Same scientific rigor across all modalities.

Infrastructure, Not Platform

MIT license. No telemetry. No lock-in. Build your own dashboards, auditing services, or compliance tools on top. Gaussia is the foundation, not the product.

"

Evaluating AI systems with metrics that no one can trace, cite, or reproduce is exactly the kind of magical thinking those systems taught us to avoid.

Gaussia Whitepaper
Four Modules

The Architecture

Gaussia is organized into four modules, each with well-defined responsibilities and environment-specific implementations.

Open Ecosystem

Papers First, SDKs Follow

Every metric starts as a paper. Once approved, the community builds official SDKs that implement the science in your language of choice.

1. Paper Repository (gaussia-labs/papers)

Scientific papers are proposed, reviewed, and approved by the community. Each paper defines a metric with mathematical rigor.

2. Implementation Specs (RFC Process)

Contributors translate papers into language-agnostic specifications. These specs define interfaces, edge cases, and test vectors; a sketch follows step 3.

3. Community SDKs

Official SDKs maintained by the community implement specs in every major language. Same science, native performance.
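As an example of step 2, a spec's test vector might look like the following. Field names and values are illustrative; the expected score follows the RAGAS definition of faithfulness as the fraction of claims in the answer supported by the context:

# One of the two claims in the answer is supported by the context,
# so the expected faithfulness score is 0.5. Every SDK must
# reproduce this value within the stated tolerance.
FAITHFULNESS_VECTOR = {
    "input": {
        "answer": "Paris is the capital of France and has 12 million residents.",
        "contexts": ["Paris is the capital of France."],
    },
    "expected": {"score": 0.5, "tolerance": 1e-6},
}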

Pick Your Language

Official SDKs

  • Python: stable
  • TypeScript: beta
  • Rust: planned
  • C++: planned
  • Swift: planned
  • Go: planned

Want to maintain an SDK for your favorite language?

Read the contribution guide
Developer Experience

Simple, Powerful API

Get started in minutes with our intuitive API. Same concepts, native implementations for each environment.
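A minimal sketch of what example.py might contain. The import path, metric name, and result fields are illustrative assumptions, not the final API:

# example.py
from gaussia.metrics import Faithfulness  # hypothetical import path

metric = Faithfulness()  # backed by the RAGAS faithfulness methodology
result = metric.score(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
    contexts=["Paris has been the capital of France since the sixth century."],
)

print(result.value)     # a score in [0, 1]
print(result.citation)  # BibTeX entry for the paper that defines the metric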

Install now

$ pip install gaussia
$ npm i gaussia (coming soon)
$ cargo add gaussia (coming soon)
Roadmap

Where We're Going

Gaussia's long-term vision is an auditor model: trained from scratch, on a fully public dataset, with a transparent methodology. The model's training will be as auditable as the metrics it implements.

In Progress

Foundation

  • Core Python library with Quality & RAG metrics
  • RAGAS paper implementations (faithfulness, relevance)
  • Basic red teaming: prompt injection, jailbreak detection
  • Paper index with BibTeX citations
  • RFC process for new metric proposals
Next

Multi-Environment

  • JavaScript/TypeScript SDK for edge deployment
  • Rust core for WebAssembly and high-performance use cases
  • Guardrails module: PII detection, toxicity, content policy
  • Helping module: prompt analysis and templates
  • Multimodal support: audio transcription evaluation
Vision

The Auditor Model

  • Training dataset: fully public, auditable line by line
  • Model architecture and hyperparameters: transparent
  • Validation against original paper benchmarks
  • Sectoral finetuning: medical, legal, technical
  • Community governance for model updates

About the Auditor Model

The destination of Gaussia, if the community validates it, is a model trained specifically for auditing AI responses. Not a general LLM fine-tuned for evaluation, but a model built from scratch with three distinguishing characteristics:

  1. Completely public training dataset, including annotations
  2. Transparent training process: architecture, hyperparameters, loss curves, known biases
  3. Sectoral finetuning: medical, legal, and technical domains with public or private weights

This model is not a substitute for human judgment. It's a tool that makes human judgment more scalable, consistent, and traceable.

Timing

Why Now

Three pressures are converging. Gaussia responds to all of them simultaneously.

Regulation is Here

The EU AI Act is in effect. The NIST AI RMF is increasingly expected in US government procurement. Companies deploying LLMs in healthcare, finance, and education are receiving questions they can't answer with a dashboard: What methodology defines that your system is 'fair'? What paper validates that your hallucination detector works?

Reproducibility Crisis

An ICSE 2025 study found that half of LLM research artifacts with ACM reproducibility badges failed to meet their requirements just one year later. If research can't reproduce itself, how can industry trust the metrics derived from it?

Infrastructure Migration

Deployment is moving away from Python servers: edge computing, browsers, mobile devices, embedded systems. The evaluation ecosystem hasn't migrated with it. There's no mature Rust evaluation library and no native JavaScript implementation. Gaussia fills that gap.

Traceable scientific rigor. Native multi-language implementations. Vendor independence. One answer to all three pressures.

Scientific Governance

Community First

Gaussia doesn't implement metrics because they sound good. Every metric goes through a public review process before a single line of code is written. The scientific community validates the science. Then we write the implementation.

How Metrics Get Added

01. Paper Proposal

Anyone can open an issue with a reference to a peer-reviewed paper. The discussion starts with the science, not the code.

02. Methodology Discussion

Open debate about the paper's assumptions, limitations, and how the implementation should map to the methodology.

03. RFC & Implementation

Only after consensus on interpretation does the RFC open. Every design decision is documented and traceable.

This means Gaussia has an advantage no commercial tool can easily replicate: its metrics are publicly audited before they exist. The debate is visible. Disagreements are recorded. Implementation is traceable to documented decisions.

Open to all researchers

Have a paper you'd like to see implemented?