Dataset Method

RMR vs TQT

RMR and TQT stay separate because they serve different jobs: human-readable review on one side, metric-clean text on the other.

RMR

Review-friendly

Preserves structure for human review, provenance checks, and source-facing QA.

TQT

Measurement-clean

Normalizes text for statistical work so metric behavior is not driven by formatting artifacts.

Separation

Review vs stats

Keeps public claims tied to the right layer of the pipeline.

Evidence Frame

RMR

RMR is the review-friendly transform.

It preserves more readable structure so a reviewer can inspect flow, segmentation, and preparation choices inside a controlled environment.

RMR is not a public source publication. It is the layer that helps explain how material moved through preparation without turning the site into a corpus dump.
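A minimal sketch of what a review-friendly view could look like, assuming segmentation follows blank lines. The names `Segment` and `rmr_like_segments` are hypothetical and are not the project's actual RMR implementation. The point is only that wording, punctuation, and line order stay intact while each segment gets a stable index a reviewer can refer to.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One reviewable block: its position and its original lines."""
    index: int
    lines: List[str]


def rmr_like_segments(text: str) -> List[Segment]:
    """Hypothetical review-oriented view (an assumption, not the real RMR).

    Splits text into blank-line-separated segments. Only trailing
    whitespace is trimmed; wording, case, and punctuation are untouched.
    """
    segments: List[Segment] = []
    current: List[str] = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the segment collected so far.
            segments.append(Segment(len(segments), current))
            current = []
    if current:
        segments.append(Segment(len(segments), current))
    return segments
```

Because nothing inside a line is rewritten, a reviewer sees the same flow and line breaks the preparation step saw.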

TQT

TQT is the measurement-friendly transform.

It is used for metric consistency, reduced punctuation interference, tokenizer analysis, and statistical comparison.

TQT makes the corpus easier to compare across tokenizers, slice sizes, and metric families by reducing formatting noise before quantitative analysis.
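The sketch below illustrates the kind of normalization a measurement transform might apply. Every step is an assumption chosen for illustration (Unicode NFKC folding, lowercasing, punctuation removal, whitespace collapsing). The actual TQT rules are not described on this page, and `tqt_like_normalize` is a hypothetical name.

```python
import re
import unicodedata


def tqt_like_normalize(text: str) -> str:
    """Hypothetical measurement-oriented normalization (not the real TQT)."""
    # Fold compatibility characters so visually equivalent forms compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Remove case differences.
    text = text.lower()
    # Replace every character that is neither a word character nor
    # whitespace with a space, so punctuation cannot glue tokens together.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(text.split())
```

Under these assumptions, two inputs that differ only in punctuation, case, or line layout map to the same string, so a downstream metric cannot react to those differences.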

Why both exist

RMR and TQT answer different questions.

RMR keeps the material readable for human review.

TQT helps stabilize quantitative analysis.

Together they let PRM be reviewed as both writing and measurable language data without merging those jobs into one unstable layer.

How they fit the pipeline

The pipeline uses transforms to keep review and measurement separate. RMR supports controlled human review. TQT supports repeatable metric work.

That separation matters because a public-safe website cannot expose protected source text, private manifests, or source-level mappings. The transform layer lets the project describe the process while raw material remains inside the controlled review boundary.

What the comparison does not mean

RMR and TQT are not competing claims about which version is “real.” They are different working views of the same underlying material.

The public page does not ask readers to reconstruct the corpus from either transform. It explains why two transform layers exist and how they support reviewability, measurement consistency, and public-safe reporting.

How to read the charts

Start with deep metric coverage to see why transform discipline matters: the analysis moves through many workbooks, metrics, tokenizers, segments, and slice sizes.

Tokenizer robustness is the most direct TQT-facing check. It asks whether the result survives different tokenizer lenses after the measurement transform is applied. Raw trends and the metric-slice heatmap show how aggregate outputs behave downstream from the transform layer.
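One way to picture a tokenizer robustness check is to compute the same metric under several tokenizers and ask whether a comparison keeps its direction. The sketch below is illustrative only: the three toy tokenizers, the type-token ratio metric, and all function names are assumptions, not the tokenizers or metrics used in the project.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokens(text: str) -> List[str]:
    return text.split()


def word_tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text)


def char_trigram_tokens(text: str) -> List[str]:
    s = text.replace(" ", "_")
    return [s[i:i + 3] for i in range(len(s) - 2)]


def type_token_ratio(tokens: List[str]) -> float:
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def direction_by_tokenizer(
    sample_a: str, sample_b: str, tokenizers: Dict[str, Tokenizer]
) -> Dict[str, bool]:
    """For each tokenizer, is sample_a's ratio higher than sample_b's?"""
    return {
        name: type_token_ratio(tok(sample_a)) > type_token_ratio(tok(sample_b))
        for name, tok in tokenizers.items()
    }


def is_robust(directions: Dict[str, bool]) -> bool:
    """The comparison is robust if every tokenizer agrees on its direction."""
    return len(set(directions.values())) == 1


TOKENIZERS: Dict[str, Tokenizer] = {
    "whitespace": whitespace_tokens,
    "word": word_tokens,
    "char_trigram": char_trigram_tokens,
}
```

If the direction flips under one tokenizer, the result depends on the lens rather than on the text, which is the situation this kind of check is meant to surface.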

The 65536 macro lens gives a large-window view of public-safe aggregate behavior. That matters because transform choices should support scale, not only small excerpts.
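As a rough illustration of a large-window view, the sketch below cuts a token sequence into fixed, non-overlapping windows. It assumes that 65536 is a window size counted in tokens and that incomplete trailing windows are dropped; neither detail is stated on this page, and `fixed_windows` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def fixed_windows(tokens: Sequence[str], size: int = 65536) -> Iterator[List[str]]:
    """Yield consecutive non-overlapping windows of exactly `size` tokens.

    Assumption: a trailing remainder shorter than `size` is discarded so
    that every window is measured at the same scale.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tokens) - size + 1, size):
        yield list(tokens[start:start + size])
```

Only a per-window aggregate (for example, one number per window) would need to leave the controlled environment; the windows themselves would stay private.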

Public-safe limits

This page describes transform roles and aggregate outputs. It does not publish raw writing, protected excerpts, private transform files, source-level mappings, third-party source labels, artist names, album titles, or song titles.

Public-safe boundary

Public pages show aggregate evidence, metric behavior, method provenance, and corpus structure. Protected text, identities, source titles, and reconstructable mappings stay private.
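A boundary like this can be pictured as an allow-list applied before anything is published. The sketch below is a hypothetical illustration: the field names and the `to_public_record` helper are assumptions, not the project's actual export schema.

```python
from typing import Any, Dict

# Assumed aggregate fields that are safe to publish (hypothetical names).
PUBLIC_FIELDS = frozenset({"metric", "tokenizer", "slice_size", "value"})


def to_public_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allow-listed aggregate fields; drop everything else.

    An allow-list fails closed: a new private field added upstream is
    excluded by default instead of leaking until someone blocks it.
    """
    return {key: val for key, val in record.items() if key in PUBLIC_FIELDS}
```

With this shape, a record that also carries protected text or a source title would be published with those fields removed rather than masked.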