Filtering Heuristics
How AI developers filter and select training data from Common Crawl.
Coming Soon
Detailed filtering heuristics documentation is being prepared.
Overview
Developers of foundation models apply quality filters to Common Crawl data before using it for training.
Key Filters
- Quality scoring - Score content quality and drop low-scoring pages
- Deduplication - Remove duplicate content
- Safety filtering - Remove harmful content
- Language detection - Filter by language
- Domain reputation - Trust signals about the source domain
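
The filters listed above can be chained into a single pass over crawled pages. The following is a minimal sketch, not a description of any particular model's pipeline. The `Page` type, the thresholds, the blocklists, and every function name are hypothetical and chosen for illustration. Language detection is assumed to happen upstream and is represented only by a `lang` field, and deduplication here is exact-match on normalized text rather than near-duplicate detection.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical values, chosen only for illustration.
MIN_WORDS = 5
MIN_ALPHA_FRACTION = 0.6
BLOCKED_DOMAINS = {"spam.example"}
BLOCKED_TERMS = {"badword"}


@dataclass
class Page:
    url: str
    text: str
    lang: str  # assumed to be set by an upstream language detector


def quality_ok(text: str) -> bool:
    """Crude quality check: enough words, mostly alphabetic characters."""
    if len(text.split()) < MIN_WORDS:
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= MIN_ALPHA_FRACTION


def is_safe(text: str) -> bool:
    """Toy safety filter: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def domain_ok(url: str) -> bool:
    """Toy domain reputation check: reject hosts on a blocklist."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def fingerprint(text: str) -> str:
    """Hash of case-folded, whitespace-normalized text for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_pages(pages, keep_lang="en"):
    """Apply language, domain, safety, quality, and dedup filters in order."""
    seen = set()
    kept = []
    for page in pages:
        if page.lang != keep_lang:
            continue
        if not domain_ok(page.url):
            continue
        if not is_safe(page.text):
            continue
        if not quality_ok(page.text):
            continue
        fp = fingerprint(page.text)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(page)
    return kept
```

In this sketch the cheap checks (language, domain) run first so that the more expensive text scans and hashing only touch pages that are still candidates. A real pipeline would likely replace each toy check with a trained classifier or a dedicated library.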
For more on training data, see Common Crawl.