Gaussia
Born at Alquimia AI Labs
Scientific metrics for intelligent behaviors.
If you can't trace it to a paper, it's not a metric. It's an opinion. Gaussia implements peer-reviewed research so your evaluations become part of the scientific record.
The Problem No One Wants to Admit
There's a thriving ecosystem of tools for evaluating language models. Most of them work. And yet, if you ask an engineering team why a faithfulness score of 0.83 is trustworthy, the most common answer is:
"Because that's what the dashboard says."
That's not a technical problem. It's an epistemological one. Evaluating AI systems with metrics that no one can trace, cite, or reproduce is exactly the kind of magical thinking those systems taught us to avoid.
The Current Landscape
DeepEval
Quality metrics. Strengths: 50+ metrics for RAG, agents, and conversation. Limitations: most metrics require LLM API calls; scientific backing is partial.
RAGAS
RAG evaluation. Strengths: academic origin (arXiv paper); no ground truth needed. Limitations: scope limited to retrieval and generation; no safety or fairness coverage.
Promptfoo
Red teaming. Strengths: 50+ attack types; adopted by OpenAI and Anthropic. Limitations: exclusively offensive; no quality or fairness metrics.
Garak
Vulnerability scanning. Strengths: 3000+ static probes; research-oriented. Limitations: static attacks only; doesn't learn from context.
What These Tools Define by Omission
No Integration
Quality, security, and ethics exist in silos. No single tool covers all three with scientific rigor.
No Traceability
When a tool gives you a score of 0.83, can you cite the paper that defines what 0.83 means?
Models Only
Every tool assumes you're evaluating an AI model. But intelligent behavior can come from humans too.
Gaussia starts from a different premise:
The unit of analysis is the behavior, not the architecture.
A behavior can come from an LLM, a voice agent in a call center, a human operator, or a hybrid system. If you want to measure quality, coherence, or bias in responses, it shouldn't matter who or what produced them.
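To make that premise concrete, here is a toy sketch in plain Python (not the Gaussia API): the metric only ever sees the behavior, so who produced it never enters the score.

```python
# Toy sketch, not the Gaussia API: the metric sees only the behavior itself,
# so an LLM output, a voice-agent transcript, and a human reply are all scored
# the same way. The producer is metadata, never an input to the computation.
from dataclasses import dataclass

@dataclass
class Behavior:
    response: str
    produced_by: str  # "llm", "voice-agent", "human", "hybrid" -- informational only

def term_coverage(behavior: Behavior, required_terms: list[str]) -> float:
    """Toy metric: fraction of required terms mentioned in the response."""
    if not required_terms:
        return 1.0
    text = behavior.response.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)

# Same metric, different producers, identical treatment.
print(term_coverage(Behavior("Your refund was issued today.", "llm"), ["refund", "issued"]))        # 1.0
print(term_coverage(Behavior("The refund will arrive next week.", "human"), ["refund", "issued"]))  # 0.5
```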
The Contract
Every metric in Gaussia comes with explicit scientific backing. When you use a metric, you know exactly what paper defined it, how it was validated, and how to cite it in your own work.
Paper First
No metric exists without its paper. Title, authors, year, venue, arXiv/DOI, implementation notes, validation datasets, and BibTeX entry. All included.
Reproducible
Every implementation follows the exact methodology described in the paper. Run the same validation the authors did. Trust the numbers because they come from rigorous science.
Citeable
When you use Gaussia in production or research, you can cite the underlying papers. Your evaluations become part of the scientific record, not just a number in a dashboard.
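As a sketch of what that backing looks like as data, here is a minimal paper-metadata record. The schema is illustrative, not Gaussia's actual data model; the bibliographic fields are filled in with the RAGAS arXiv reference as a real-world example.

```python
# Illustrative schema only, not Gaussia's real metadata format. The point is that
# a metric ships with its paper: title, authors, year, venue, arXiv ID, BibTeX,
# and implementation notes, exactly as the contract above describes.
from dataclasses import dataclass

@dataclass(frozen=True)
class PaperReference:
    title: str
    authors: tuple[str, ...]
    year: int
    venue: str
    arxiv_id: str
    bibtex: str
    implementation_notes: str = ""

RAGAS_FAITHFULNESS = PaperReference(
    title="RAGAS: Automated Evaluation of Retrieval Augmented Generation",
    authors=("Shahul Es", "Jithin James", "Luis Espinosa-Anke", "Steven Schockaert"),
    year=2023,
    venue="arXiv preprint",
    arxiv_id="2309.15217",
    bibtex=(
        "@misc{es2023ragas, title={RAGAS: Automated Evaluation of Retrieval Augmented "
        "Generation}, author={Es, Shahul and James, Jithin and Espinosa-Anke, Luis and "
        "Schockaert, Steven}, year={2023}, eprint={2309.15217}, archivePrefix={arXiv}}"
    ),
    implementation_notes="Faithfulness = supported claims / total claims in the answer.",
)
```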
Why Gaussia is Different
Multi-Environment Native
Rust for low-level performance and WebAssembly. Python for deep analysis. JavaScript/TypeScript for edge and browser. Not wrappers. Native implementations.
Multimodal by Design
Text, audio, image, video. Intelligence doesn't communicate only with text. Neither should evaluation. Same scientific rigor across all modalities.
Infrastructure, Not Platform
MIT license. No telemetry. No lock-in. Build your own dashboards, auditing services, or compliance tools on top. Gaussia is the foundation, not the product.
The Architecture
Gaussia is organized into four modules, each with well-defined responsibilities and environment-specific implementations.
Papers First, SDKs Follow
Every metric starts as a paper. Once approved, the community builds official SDKs that implement the science in your language of choice.
Paper Repository
Scientific papers are proposed, reviewed, and approved by the community in gaussia-labs/papers. Each paper defines a metric with mathematical rigor.
Implementation Specs
Contributors translate papers into language-agnostic specifications through the RFC process. These specs define interfaces, edge cases, and test vectors (see the sketch below).
Community SDKs
Official SDKs maintained by the community implement specs in every major language. Same science, native performance.
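To make the Implementation Specs step concrete, here is a toy test vector and conformance check. The field names and layout are assumptions, not the actual gaussia-labs spec format; the idea is that every SDK must reproduce the same expected scores.

```python
# Illustrative only: the field names below are assumptions, not the real spec format.
# A spec ships test vectors that every official SDK must reproduce, which is what
# makes "same science, native performance" checkable rather than aspirational.
SPEC_TEST_VECTOR = {
    "metric": "faithfulness",
    "input": {"supported_claims": 4, "total_claims": 5},
    "expected": {"score": 0.8, "tolerance": 1e-9},
}

def conforms(implementation, vector: dict) -> bool:
    """True if an implementation reproduces the expected score within tolerance."""
    score = implementation(**vector["input"])
    return abs(score - vector["expected"]["score"]) <= vector["expected"]["tolerance"]

# Example: check a candidate Python implementation against the vector.
print(conforms(lambda supported_claims, total_claims: supported_claims / total_claims,
               SPEC_TEST_VECTOR))  # True
```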
Official SDKs
Python: stable
TypeScript: beta
Rust: planned
C++: planned
Swift: planned
Go: planned
Want to maintain an SDK for your favorite language?
Read the contribution guide
Simple, Powerful API
Get started in minutes with our intuitive API. Same concepts, native implementations for each environment.
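As a sketch of what a call could look like in the Python SDK: the module path, class name, and result fields below are assumptions for illustration, not the published API; check the documentation for the real interface.

```python
# Hypothetical usage sketch. Every name here (gaussia.metrics, Faithfulness,
# result.score, result.paper) is illustrative, not the published Gaussia API.
from gaussia.metrics import Faithfulness  # hypothetical import path

metric = Faithfulness()  # per the contract, backed by a specific, citable paper
result = metric.measure(
    answer="The Eiffel Tower is in Paris.",
    context=["The Eiffel Tower is a wrought-iron tower on the Champ de Mars in Paris."],
)
print(result.score)         # a number you can trace to the paper's definition
print(result.paper.bibtex)  # and cite in your own work
```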
Where We're Going
Gaussia's long-term vision is an auditor model: trained from scratch, with a fully public dataset and transparent methodology. Its training will be as auditable as the metrics it implements.
Foundation
- Core Python library with Quality & RAG metrics
- RAGAS paper implementations (faithfulness, relevance; see the sketch after this list)
- Basic red teaming: prompt injection, jailbreak detection
- Paper index with BibTeX citations
- RFC process for new metric proposals
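As a concrete example of what the Foundation milestone implements, the faithfulness score defined in the RAGAS paper (arXiv:2309.15217) reduces to a simple ratio once claims have been extracted and verified; a minimal sketch:

```python
# Faithfulness as defined in the RAGAS paper: the fraction of claims in a generated
# answer that are supported by the retrieved context. In the paper, claim extraction
# and verification are delegated to an LLM; here they are passed in as counts so the
# arithmetic itself stays transparent.
def faithfulness(supported_claims: int, total_claims: int) -> float:
    if total_claims == 0:
        return 0.0
    return supported_claims / total_claims

# An answer with 5 extracted claims, 4 of them supported by the context, scores 0.8.
print(faithfulness(supported_claims=4, total_claims=5))  # 0.8
```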
Multi-Environment
- JavaScript/TypeScript SDK for edge deployment
- Rust core for WebAssembly and high-performance use cases
- Guardrails module: PII detection, toxicity, content policy (toy sketch after this list)
- Helping module: prompt analysis and templates
- Multimodal support: audio transcription evaluation
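As a toy illustration of what rule-based PII detection involves (not the planned Guardrails module, which will need far broader coverage and locale awareness):

```python
# Toy rule-based PII detection for emails and US-style phone numbers. Real guardrails
# need many more categories, locales, and validation than two regular expressions.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the text."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}

print(find_pii("Reach me at jane.doe@example.com or 555-867-5309."))
# {'email': ['jane.doe@example.com'], 'phone': ['555-867-5309']}
```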
The Auditor Model
- Training dataset: fully public, auditable line by line
- Model architecture and hyperparameters: transparent
- Validation against original paper benchmarks
- Sectoral finetuning: medical, legal, technical
- Community governance for model updates
About the Auditor Model
The destination of Gaussia, if the community validates it, is a model trained specifically for auditing AI responses. Not a general LLM fine-tuned for evaluation, but a model built from scratch with three distinguishing characteristics:
1. Completely public training dataset, including annotations
2. Transparent training process: architecture, hyperparameters, loss curves, known biases
3. Sectoral finetuning: medical, legal, technical domains with public or private weights
This model is not a substitute for human judgment. It's a tool that makes human judgment more scalable, consistent, and traceable.
Why Now
Three pressures are converging. Gaussia responds to all of them simultaneously.
Regulation is Here
The EU AI Act is in force. The NIST AI Risk Management Framework is increasingly referenced in US government procurement. Companies deploying LLMs in healthcare, finance, and education are receiving questions they can't answer with a dashboard: what methodology establishes that your system is 'fair'? What paper validates that your hallucination detector works?
Reproducibility Crisis
An ICSE 2025 study found that half of LLM research artifacts with ACM reproducibility badges failed to meet their requirements just one year later. If research can't reproduce itself, how can industry trust the metrics derived from it?
Infrastructure Migration
Deployment is moving away from Python servers: edge computing, browsers, mobile devices, embedded systems. The evaluation ecosystem didn't migrate with it. There's no Rust evaluation library. No native JavaScript implementation. Gaussia fills that gap.
Traceable scientific rigor, native multi-language implementations, and vendor independence: Gaussia's answer to all three pressures at once.
Community First
Gaussia doesn't implement metrics because they sound good. Every metric goes through a public review process before a single line of code is written. The scientific community validates the science. Then we write the implementation.
How Metrics Get Added
Paper Proposal
Anyone can open an issue with a reference to a peer-reviewed paper. The discussion starts with the science, not the code.
Methodology Discussion
Open debate about the paper's assumptions, limitations, and how the implementation should map to the methodology.
RFC & Implementation
Only after consensus on interpretation does the RFC open. Every design decision is documented and traceable.
This means Gaussia has an advantage no commercial tool can easily replicate: its metrics are publicly audited before they exist. The debate is visible. Disagreements are recorded. Implementation is traceable to documented decisions.
Documentation
Guides with mathematical foundations and implementation details.
Read the Docs
GitHub
Source code, issues, RFCs, and contribution guidelines.
View Repository
Discord
Discuss papers, implementations, and evaluation methodology.
Join Discussion
Paper Index
All referenced papers with BibTeX entries and summaries.
Explore Papers
Have a paper you'd like to see implemented?