Open source · paper-first · community-maintained

Metrics grounded in science, built by the community.

Open-source metrics that are paper-backed, reproducible, and language-agnostic. Evaluate, protect, and improve AI behaviour with open, auditable tools.

License: MIT
Lock-in: Zero

01 Manifesto

From metrics to evidence-backed evaluation.

“An evaluation metric is only as useful as the evidence behind it.”
— Gaussia manifesto
  • 01

    Scientific grounding

    The metric's definition, assumptions, and validation basis are directly connected to published research.

  • 02

    Reproducibility

    Every score can be independently verified, reproduced, and confidently cited in your own work.

  • 03

    Decision confidence

    See the methodological strength behind every number instead of trusting a polished dashboard.

Many AI teams work with dashboards full of scores (faithfulness 0.87, toxicity 0.03, bias 0.12) without enough context to understand what those numbers truly represent or how much confidence they deserve.

Gaussia was built to close that gap. By requiring every metric to be grounded in verifiable scientific evidence, it turns evaluation outputs into claims that are clearer, more reproducible, and more trustworthy.

02 Why Gaussia?

The four gaps we close.

The current evaluation ecosystem has four problems no single tool solves on its own, and all four point back to the same two pillars: paper-first and community-first.

Gap 01

Fragmented tooling

Today

Quality, safety, and ethics live in separate libraries with incompatible conventions.

Gaussia

A single, extensible library that houses all three concerns under one scientific contract.

Gap 02

Absence of scientific traceability

Today

Scores are just numbers; the originating paper, methodology, and validation data are rarely exposed.

Gaussia

Every metric ships with the paper title, authors, year, DOI/arXiv link, and a ready-to-cite BibTeX entry.

Gap 03

Lock-in to a single language

Today

Metric logic is tied to a specific runtime or stack, making it hard to apply consistently across environments.

Gaussia

The metric definition lives in the scientific source; implementations follow the same spec across languages.

Gap 04

Wrong unit of analysis

Today

Evaluation assumes the target is an “AI system” rather than the observable behaviour.

Gaussia

Gaussia's modules evaluate any behaviour — model output, human response, or hybrid interaction — by the same criteria.
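
To make the traceability promise of Gap 02 concrete, here is a minimal sketch of a citation record that carries the listed fields (title, authors, year, DOI/arXiv link) and renders a BibTeX entry. The `PaperReference` class and its field names are hypothetical and are not taken from the Gaussia SDK. The example values are placeholders, not a real paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperReference:
    """Hypothetical citation record; not part of the Gaussia API."""

    key: str                  # BibTeX citation key
    title: str
    authors: tuple[str, ...]
    year: int
    url: str                  # DOI or arXiv link

    def to_bibtex(self) -> str:
        """Render the record as a BibTeX @misc entry."""
        fields = {
            "title": self.title,
            "author": " and ".join(self.authors),
            "year": str(self.year),
            "url": self.url,
        }
        body = ",\n".join(
            f"  {name} = {{{value}}}" for name, value in fields.items()
        )
        return f"@misc{{{self.key},\n{body}\n}}"


# Placeholder values only; this does not describe a real publication.
example = PaperReference(
    key="doe2024example",
    title="An Example Metric",
    authors=("Jane Doe", "John Roe"),
    year=2024,
    url="https://example.org/paper",
)
print(example.to_bibtex())
```

Freezing the dataclass mirrors the idea that the link between a metric and its source should not change after publication. The real library may enforce that differently.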

03 The Gaussia Contract

Paper-first. Community-first.

Every metric in Gaussia is provably linked to a peer-reviewed source, and the link is immutable and visible to every user.

  1. Step 01

    Paper proposal

    Anyone opens a Discussion in the Proposals category, cites the peer-reviewed work, and explains the problem the metric solves.

    Discussion · DOI / arXiv · Motivation
  2. Step 02

    Community debate

    Reviewers examine the proposal publicly — the debate is visible, dissent is recorded, decisions are traceable.

    Novelty · Soundness · Clarity · Feasibility
  3. Step 03

    Implementation

    Any language can implement from the paper. The code must include a metadata block mapping the implementation to the methodology.

    Python · TypeScript · Rust · Go · C++
  4. Step 04

    Merge & credit

    After at least two reviewers approve, the PR is merged. The paper joins the official framework, and the authors are credited in the code and the citation list.

    PR merged · Authors credited · Immutable link
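
Step 03 asks for a metadata block that maps each implementation to its methodology. One way this could look in Python is a decorator that attaches the block to the metric function and records it in a registry. Everything here is an illustrative assumption rather than the Gaussia format: the `paper_backed` decorator, the `METRIC_REGISTRY` dictionary, the field names, and the toy metric.

```python
from typing import Callable

# Hypothetical registry; the real Gaussia metadata format may differ.
METRIC_REGISTRY: dict[str, dict[str, str]] = {}


def paper_backed(*, paper: str, section: str, spec_version: str) -> Callable:
    """Attach a metadata block to a metric and record it in the registry."""

    def decorator(func: Callable) -> Callable:
        metadata = {
            "paper": paper,              # DOI or arXiv identifier
            "section": section,          # where the paper defines the method
            "spec_version": spec_version,
        }
        func.metadata = metadata
        METRIC_REGISTRY[func.__name__] = metadata
        return func

    return decorator


@paper_backed(paper="doi:PLACEHOLDER", section="Sec. 3.2, Eq. 4", spec_version="0.1")
def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Toy metric: fraction of predictions equal to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


print(exact_match_rate(["a", "b"], ["a", "c"]))  # 0.5
print(exact_match_rate.metadata["section"])      # Sec. 3.2, Eq. 4
```

Keeping the metadata next to the function means a reviewer can check the code against the cited section without leaving the file.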

04 How it works

One spec. Many implementations.

All steps are public, version-controlled, and require no vendor approval. Self-hosted usage sends no outbound telemetry.

Scientific paper: arXiv / DOI
Open discussion: Review & interpretation
RFC: Formal specification
Implementations: Across environments
Validation scripts: Reproducible checks
Self-hosted usage: No telemetry

The methodology always precedes the code. A paper becomes a conversation, a conversation becomes a formal RFC, and only then does it become an implementation — every step in the open, every step traceable.
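
The "validation scripts" stage of this pipeline can be pictured as a small reproducibility check: run a metric several times under the same seed and confirm the scores agree. The sketch below is only an illustration under that assumption. `noisy_mean_score` is a toy stand-in for a metric with a random step, and `check_reproducible` is a hypothetical helper, not a Gaussia function.

```python
import math
import random
from typing import Callable


def noisy_mean_score(values: list[float], seed: int) -> float:
    """Toy stand-in for a metric with a stochastic step (one bootstrap resample)."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    resample = [rng.choice(values) for _ in values]
    return sum(resample) / len(resample)


def check_reproducible(
    metric: Callable[[list[float], int], float],
    values: list[float],
    seed: int,
    runs: int = 3,
    tol: float = 1e-12,
) -> bool:
    """Return True if repeated runs with the same seed agree within tol."""
    baseline = metric(values, seed)
    return all(
        math.isclose(metric(values, seed), baseline, abs_tol=tol)
        for _ in range(runs - 1)
    )


print(check_reproducible(noisy_mean_score, [0.1, 0.5, 0.9], seed=7))  # True
```

A fuller validation script would presumably also compare the score against a reference value published with the paper. That step is omitted here because it depends on data the source does not provide.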

05 Modules at a glance

Three modules. One scientific contract.

Every metric page lists the paper reference, the validation dataset, and a one-line SDK call. Pick your module; the lineage ships with every score.

06 Get started

Install locally. Run anywhere.

All SDKs are MIT-licensed, install locally, and run without any outbound telemetry.

from gaussia.metrics.toxicity import Toxicity
from gaussia.core.retriever import Retriever
from gaussia.schemas.common import Dataset, Batch

# Define a custom retriever to load your data
class MyRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="session-1",
                assistant_id="my-assistant",
                language="english",
                context="",
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="Tell me about AI safety",
                        assistant="AI safety is important...",
                    )
                ]
            )
        ]

# Run the toxicity metric
results = Toxicity.run(
    MyRetriever,
    group_prototypes={
        "gender": ["women", "men", "female", "male"],
        "race": ["Asian", "African", "European"],
    },
    verbose=True,
)

# Analyze results
for metric in results:
    print(f"DIDT Score: {metric.group_profiling.frequentist.DIDT}")

07 Contribute

A quick guide to shaping the library.

All contributions stay under the MIT licence. Reviewers receive permanent citation credit. The debate is public, and the record is immutable.

  1. 01

    Propose your idea

    Open a Discussion in Proposals. Include the paper citation, problem statement, and target SDKs.

  2. 02

    Write the paper

    Fork the repo, copy template/ to papers/YYYY-MM-your-title/, write in LaTeX, add figures and references.bib.

  3. 03

    Open a PR

    Target main, fill in the PR template, link the original discussion, and ensure the paper compiles.

  4. 04

    Community review

    At least two reviewers approve on novelty, soundness, clarity, and feasibility.

  5. 05

    Merge & implement

    An implementation issue opens in the relevant SDK repo; authors are credited in code and citation list.
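
The "Write the paper" step (copy template/ to papers/YYYY-MM-your-title/) could be scripted roughly as follows. This is a sketch under assumptions: `scaffold_paper` is a hypothetical helper that does not ship with the repository, and the rule for turning a title into a directory name is a guess, since the source only shows the YYYY-MM-your-title pattern.

```python
import re
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def scaffold_paper(repo_root: Path, title: str, today: Optional[date] = None) -> Path:
    """Copy template/ to papers/YYYY-MM-<slug>/ and return the new directory."""
    today = today or date.today()
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    target = repo_root / "papers" / f"{today:%Y-%m}-{slug}"
    # copytree creates missing parent directories and raises
    # FileExistsError if the target directory already exists.
    shutil.copytree(repo_root / "template", target)
    return target


# Usage, run from the root of your fork:
#   scaffold_paper(Path("."), "Your Title")
```

Raising when the target already exists is a deliberate choice in this sketch, so an earlier draft is never overwritten silently.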