Project Rocket Man / PRM
A rights-controlled, human-authored lyrical language corpus and benchmark environment documented through public-safe aggregate evidence.
Protected Corpus
The private source writing behind PRM. Public pages describe its behavior through aggregate outputs, not raw text.
Public-Safe Evidence
Charts, metric summaries, method notes, and boundaries that make the project legible without making protected material reconstructable.
Controlled Review
A non-public review path for technical, research, or licensing diligence under appropriate terms.
Aggregate Metrics
Measurements summarized across slices or corpus groups, so readers see behavior without source-level text.
Source Mapping
Private structure connecting source material to analysis inputs. Public pages do not expose it.
Reconstructable Manifest
Any listing, map, or label set that could help rebuild protected source structure. PRM keeps this out of the public site.
Entropy
A measure of how uncertain, and how evenly spread, the units of a text are. In PRM, entropy helps show how information is distributed across a measured slice.
Inverse Entropy Behavior
A way of reading how predictability, repetition, or compression pressure changes across a text set.
Shannon Entropy
The primary entropy measure for information dispersion. Higher Shannon entropy generally means the next token is harder to predict because token frequencies are more evenly distributed.
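As a minimal sketch of the definition above, Shannon entropy can be computed from the empirical frequency of each token in a slice. The function name and the whitespace-free token lists are illustrative assumptions; PRM's tokenizer and slice handling are not described here.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over every distinct token.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

even = ["a", "b", "c", "d"]    # four tokens, each equally likely
skewed = ["a", "a", "a", "b"]  # one token dominates

print(shannon_entropy(even))    # evenly spread, higher entropy
print(shannon_entropy(skewed))  # concentrated, lower entropy
```

The evenly spread list scores higher than the concentrated one, which matches the reading that higher entropy means less predictable text.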
Rényi Entropy
An entropy-family measure with a tunable order parameter. At orders above 1 it weights frequent tokens more heavily, so it can be more sensitive to concentration effects than Shannon alone. PRM uses it as a second view of distribution behavior.
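A short sketch of Rényi entropy over token frequencies follows. The order alpha = 2.0 is an assumption chosen for illustration; the order PRM uses is not stated. As alpha approaches 1 the measure converges to Shannon entropy, which the sketch handles as a special case.

```python
import math
from collections import Counter

def renyi_entropy(tokens, alpha=2.0):
    """Renyi entropy, in bits, of order alpha (alpha=2.0 is an assumed default)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # Limit case: Renyi entropy of order 1 is Shannon entropy.
        return sum(p * math.log2(1 / p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

even = ["a", "b", "c", "d"]
skewed = ["a", "a", "a", "b"]

print(renyi_entropy(even, alpha=2.0))
print(renyi_entropy(skewed, alpha=2.0))
print(renyi_entropy(skewed, alpha=1.0))  # Shannon value for comparison
```

On the concentrated list, the order-2 value falls below the Shannon value, which is the concentration sensitivity the entry describes.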
Tsallis Entropy
An entropy-family measure that generalizes Shannon entropy through a non-additive formulation with its own order parameter. It gives a further reading of diversity and concentration, helping test whether entropy strength is broad or metric-specific.
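For illustration, Tsallis entropy of order q can be sketched as below. The order q = 2.0 is an assumption, not a documented PRM setting. At q = 2 the value reduces to one minus the sum of squared token probabilities.

```python
from collections import Counter

def tsallis_entropy(tokens, q=2.0):
    """Tsallis entropy of order q (q=2.0 is an assumed default)."""
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; use a Shannon entropy function")
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    power_sum = sum((c / total) ** q for c in counts.values())
    return (1.0 - power_sum) / (q - 1.0)

even = ["a", "b", "c", "d"]
skewed = ["a", "a", "a", "b"]

print(tsallis_entropy(even))    # 1 - 4 * (1/4)^2
print(tsallis_entropy(skewed))  # 1 - ((3/4)^2 + (1/4)^2)
```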
Lexical Diversity
A vocabulary-range signal. It asks how much variety appears in the measured text relative to the amount of language being measured.
Unique Tokens
The count of distinct measured tokens in a slice or corpus group. It helps separate vocabulary reach from repeated reuse of the same terms.
MTLD
Measure of Textual Lexical Diversity. It estimates how long a text can sustain vocabulary variety before repetition lowers the diversity signal.
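The idea of "how long variety is sustained" can be sketched with the commonly published MTLD procedure: walk the text, count a factor each time the running type-token ratio drops to a threshold, then average a forward and a backward pass. The threshold of 0.72 is the conventional default and is assumed here; PRM's setting is not stated, and implementations differ on whether the comparison is strict.

```python
def _mtld_pass(tokens, threshold):
    """One directional pass: mean number of tokens per completed factor."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            # Variety has dropped to the threshold: close this factor and restart.
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Credit the unfinished segment with a partial factor.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # No repetition at all leaves zero factors; this sketch returns infinity.
    return len(tokens) / factors if factors > 0 else float("inf")

def mtld(tokens, threshold=0.72):
    """Average of forward and backward passes (threshold is an assumed default)."""
    tokens = list(tokens)
    return (_mtld_pass(tokens, threshold) + _mtld_pass(tokens[::-1], threshold)) / 2.0

repetitive = ["a", "b"] * 50
varied = [str(i % 40) for i in range(120)]

print(mtld(repetitive))
print(mtld(varied))
```

The repetitive list exhausts its variety every three tokens, so its score is far lower than the score of the list that cycles through forty distinct tokens.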
MSTTR50
Mean Segmental Type-Token Ratio over 50-token windows. It averages vocabulary variety across fixed-size chunks so one long passage does not dominate the result.
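A minimal sketch of the segment averaging, assuming non-overlapping windows and assuming that a trailing partial window is dropped (a common convention, not a documented PRM choice):

```python
def msttr(tokens, window=50):
    """Mean type-token ratio over consecutive, non-overlapping full windows."""
    segments = [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not segments:
        raise ValueError("need at least one full window of tokens")
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)

# One fully varied window, one fully repeated window, and a 10-token tail
# that is too short to form a window and is ignored.
tokens = [str(i) for i in range(50)] + ["x"] * 50 + ["tail"] * 10

print(msttr(tokens))  # mean of 1.0 and 0.02
```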
HDD42
Hypergeometric Distribution Diversity using PRM’s comparison setting. It estimates lexical variety through probability rather than a simple unique-word count.
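The probability-based reading can be sketched as follows. For each distinct token, the hypergeometric distribution gives the chance that it appears at least once in a random sample drawn from the text, and those chances are summed and scaled by the sample size. Treating the "42" in the name as a 42-token sample size follows the usual HD-D convention and is an assumption here; the entry itself only refers to PRM's comparison setting.

```python
import math
from collections import Counter

def hdd(tokens, sample_size=42):
    """HD-D sketch: sample_size=42 is an assumed reading of the metric name."""
    total = len(tokens)
    if total < sample_size:
        raise ValueError("text is shorter than the sample size")
    all_draws = math.comb(total, sample_size)
    score = 0.0
    for freq in Counter(tokens).values():
        # Probability that a random sample contains none of this token.
        p_absent = math.comb(total - freq, sample_size) / all_draws
        score += (1.0 - p_absent) / sample_size
    return score

all_unique = [str(i) for i in range(100)]
all_same = ["a"] * 100

print(hdd(all_unique))  # every sampled token is a new type
print(hdd(all_same))    # one type, however many tokens are sampled
```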
Herdan’s C
A vocabulary-growth metric: the ratio of the logarithm of distinct vocabulary size to the logarithm of total length. It helps show whether vocabulary keeps expanding as text volume grows.
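A small sketch of Herdan's C in its standard form, log(V) / log(N), where V is the number of distinct tokens and N is the total token count. The function name is illustrative.

```python
import math

def herdan_c(tokens):
    """Herdan's C: log(distinct tokens) / log(total tokens)."""
    total = len(tokens)
    if total < 2:
        raise ValueError("need at least two tokens, since log(1) is zero")
    return math.log(len(set(tokens))) / math.log(total)

# 100 tokens drawn from a 10-word vocabulary.
tokens = [str(i % 10) for i in range(100)]

print(herdan_c(tokens))  # log(10) / log(100)
```

Because both terms are logarithms, the ratio does not depend on the logarithm base.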
Yule’s K
A repetition and lexical-concentration metric. Lower raw Yule’s K usually means less repetition pressure, so PRM inverts it before index scoring.
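The raw value can be sketched with the standard formula K = 10^4 * (sum of squared token frequencies - N) / N^2, where N is the total token count. Only the raw metric is shown; how PRM inverts it before index scoring is not specified, so that step is left out.

```python
from collections import Counter

def yules_k(tokens):
    """Raw Yule's K: higher values mean more repetition."""
    total = len(tokens)
    if total == 0:
        raise ValueError("need at least one token")
    sum_sq = sum(freq * freq for freq in Counter(tokens).values())
    return 10_000 * (sum_sq - total) / (total * total)

no_repeats = [str(i) for i in range(10)]
all_repeats = ["a"] * 10

print(yules_k(no_repeats))   # no token recurs
print(yules_k(all_repeats))  # a single token fills the slice
```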
Maas
A lexical narrowness metric. Lower raw Maas generally indicates broader vocabulary behavior, so PRM inverts it before index scoring.
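As a sketch, the raw Maas value is usually written a^2 = (log N - log V) / (log N)^2, with N total tokens and V distinct tokens. Base-10 logarithms are assumed here; unlike a pure ratio of logarithms, this value changes with the base, and PRM's choice is not stated. The inversion step is also unspecified and is not shown.

```python
import math

def maas(tokens):
    """Raw Maas a^2 with base-10 logs (an assumed base): lower means broader vocabulary."""
    total = len(tokens)
    if total < 2:
        raise ValueError("need at least two tokens, since log(1) is zero")
    log_n = math.log10(total)
    log_v = math.log10(len(set(tokens)))
    return (log_n - log_v) / (log_n ** 2)

narrow = [str(i % 10) for i in range(100)]  # 10 distinct tokens in 100
broad = [str(i) for i in range(100)]        # 100 distinct tokens in 100

print(maas(narrow))
print(maas(broad))
```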
Tokenizer
A system that breaks text into units for analysis. Tokenizer behavior can change how language patterns appear.
Slice Size
The amount of text measured at once. PRM uses controlled slices so that texts being compared are measured at similar lengths, since many metrics shift with text length.
Slice Trend
A stability check across different measurement windows. It asks whether a signal survives changing slice sizes or depends on one favorable window.
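One way to picture this check is to compute the same metric at several slice sizes and compare the averages. The sketch below uses a plain type-token ratio as a stand-in metric and hypothetical slice sizes of 10 and 20 tokens; neither is a documented PRM choice.

```python
def type_token_ratio(tokens):
    """Stand-in metric: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def slice_trend(tokens, sizes, metric=type_token_ratio):
    """Mean metric value over non-overlapping slices, for each slice size."""
    trend = {}
    for size in sizes:
        slices = [
            tokens[i:i + size]
            for i in range(0, len(tokens) - size + 1, size)
        ]
        if slices:  # skip sizes larger than the text
            trend[size] = sum(metric(s) for s in slices) / len(slices)
    return trend

tokens = ["a", "b"] * 100

for size, value in slice_trend(tokens, sizes=[10, 20]).items():
    print(size, value)
```

Here the stand-in metric halves when the slice size doubles, which shows why a signal that looks strong at one window should be checked at others.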
Transform
A repeatable preparation step that converts protected source material into analysis-ready form without publishing the source.
RMR
A review-facing layer used for human-readable checking, provenance work, and source-facing quality control.
TQT
A measurement-clean layer used for statistical analysis after formatting and handling choices are controlled.
EURE
Efficient Complexity. A PRM index combining entropy strength, unique-token behavior, and repetition control into one public-safe read.
LDI
Lexical Discipline Index. A PRM index for vocabulary control and sustained word-choice behavior outside the entropy formula.
RACS
Repetition-Adjusted Complexity Score. A PRM index for complexity that remains after repetition pressure and lexical narrowness are discounted.
Public-Safe Gauntlet
A set of aggregate checks that stress the claim without exposing source text, titles, or private mappings.
Reference Baseline
A comparison coordinate used to interpret metric behavior. It is not an artistic ranking.
Adversarial Writing Set
A difficult comparison set used to test whether the metric stack stays coherent under pressure.
Provenance
The record of where inputs, transforms, versions, and outputs came from. The open site presents it only in public-safe form.
Reproducibility Package
Controlled materials that may support independent review without turning the protected corpus into a public download.
