Role
Process control
Makes dataset preparation explicit enough that the outputs can be audited.
Protects
Measurement clarity
Reduces avoidable noise from formatting, transcription, and corpus-boundary mistakes.
Output
Reviewable charts
Keeps public evidence visual and aggregated while the source audit remains private.
Evidence Frame
Data hygiene layers
1. Source collection
2. Cleaning and normalization
3. RMR and TQT transform generation
4. Segment and slice definition
5. Tokenizer-specific analysis
6. Metric generation
7. Summary workbook creation
8. Public-safe chart packaging
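The layer order above can be sketched as a small runner that applies each step in sequence and logs record counts, so a reviewer can see where records were dropped or changed. This is a minimal illustration, not the project's actual pipeline: the layer names mirror the list above, while `AuditLog`, `run_layers`, and the step functions are hypothetical.

```python
from dataclasses import dataclass, field

# Layer names follow the list above; all other names here are hypothetical.
LAYERS = [
    "source_collection",
    "cleaning_and_normalization",
    "rmr_tqt_transform_generation",
    "segment_and_slice_definition",
    "tokenizer_specific_analysis",
    "metric_generation",
    "summary_workbook_creation",
    "public_safe_chart_packaging",
]


@dataclass
class AuditLog:
    """Collects one entry per layer so the run can be reviewed afterwards."""
    entries: list = field(default_factory=list)

    def record(self, layer, n_in, n_out):
        self.entries.append({"layer": layer, "n_in": n_in, "n_out": n_out})


def run_layers(records, steps, log):
    """Apply each layer in the fixed order, logging record counts in and out."""
    for layer in LAYERS:
        # Layers without a registered step pass records through unchanged.
        step = steps.get(layer, lambda rs: rs)
        n_in = len(records)
        records = step(records)
        log.record(layer, n_in, len(records))
    return records


def drop_blank(records):
    """Assumed cleaning rule for illustration: trim whitespace, drop empties."""
    return [r.strip() for r in records if r.strip()]
```

A call such as `run_layers(raw, {"cleaning_and_normalization": drop_blank}, log)` leaves one log entry per layer, which is the kind of process evidence a private review could inspect.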
What hygiene protects
Hygiene protects the analysis from avoidable ambiguity. Consistent naming, transform rules, segmentation, tokenizer handling, and workbook outputs make it easier to explain what was measured.
It also protects the public boundary. Public charts can describe aggregate behavior while private manifests, source mappings, and protected text remain inside controlled review materials.
Hygiene practices
- title and metadata handling
- consistent transform rules
- segment naming discipline
- duplicate and collision review
- public-safe anonymization
- separation of public aggregate results from private source manifests
- versioned outputs and chart exports
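As one illustration of duplicate and collision review, the sketch below flags exact repeats and names that become identical after normalization. The normalization rule (Unicode NFKC, case folding, whitespace collapsing) is an assumption chosen for the example, and `normalize_name`, `find_duplicates`, and `find_collisions` are hypothetical helpers rather than the project's own tooling.

```python
import unicodedata
from collections import Counter, defaultdict


def normalize_name(name):
    """Assumed rule: NFKC-normalize, casefold, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(text.split())


def find_duplicates(names):
    """Return names that appear more than once, exactly as written."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def find_collisions(names):
    """Return groups of distinct raw names that share one normalized form."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize_name(name)].add(name)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}
```

Keeping the two checks separate distinguishes a repeated entry from two differently written names that would silently merge under the transform rules.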
Review checks
Good hygiene leaves evidence of its own process. The public site can show coverage dashboards, exception tracking, tokenizer robustness, and metric-index outputs. Private review can go deeper into manifests, mappings, and source-level documentation.
The important principle is separation: the public site should be useful without becoming a source dump, and the private review path should be detailed enough for serious inspection.
Why this matters
The credibility of PRM depends not only on the scores but also on whether the same procedure can be explained, repeated, and reviewed.
If the dataset layer is messy, the metric layer becomes harder to trust. If the dataset layer is disciplined, the findings read as outputs of a controlled process rather than a one-off presentation.
How to read the charts
Start with the coverage story dashboard to see the public-safe shape of the dataset evidence. Then use the deep metric coverage dashboard to understand the breadth of workbooks, metrics, tokenizers, segments, and slice sizes behind the site.
The public-safe gauntlet dashboard shows how the cleaned and packaged dataset enters comparison. Exception tracking shows the cases where the headline result does not rank first. The metric index dashboard shows how cleaned outputs become the EURE, LDI, and RACS presentation layer.
Public-safe limits
This page describes dataset preparation and hygiene practices. It does not publish raw writing, protected excerpts, private manifests, source-level mappings, third-party source labels, artist names, album titles, or song titles.
Public-safe boundary
Public pages show aggregate evidence, metric behavior, method provenance, and corpus structure. Protected text, identities, source titles, and reconstructable mappings stay private.
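One way to enforce this boundary in code is to build public outputs only from aggregates and to check them for private field names before export. The sketch below assumes hypothetical row fields (`tokenizer`, `source_title`, `author`, `text`) and a generic numeric metric column; it illustrates the separation principle and does not describe the site's actual export path.

```python
from statistics import mean

# Hypothetical names of fields that must never reach a public output.
PRIVATE_FIELDS = {"source_title", "author", "text"}


def public_summary(rows, metric):
    """Aggregate one numeric metric per tokenizer; no row-level fields are emitted."""
    by_tokenizer = {}
    for row in rows:
        by_tokenizer.setdefault(row["tokenizer"], []).append(row[metric])
    return [
        {"tokenizer": tok, "n_segments": len(vals), f"mean_{metric}": mean(vals)}
        for tok, vals in sorted(by_tokenizer.items())
    ]


def assert_public_safe(summary):
    """Raise if any private field name appears in the public output."""
    for item in summary:
        leaked = PRIVATE_FIELDS.intersection(item)
        if leaked:
            raise ValueError(f"private fields in public output: {sorted(leaked)}")
```

Running `assert_public_safe(public_summary(rows, "score"))` before packaging a chart makes the boundary a checked property of the export step rather than a convention.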
