
Filtering Heuristics

How AI model developers filter and select training data from Common Crawl.

Coming Soon

Detailed filtering heuristics documentation is being prepared.

Overview

Developers of foundation models typically apply quality filters to Common Crawl data before using it for training.

Key Filters

  1. Quality scoring - Assess the quality of each page's content
  2. Deduplication - Remove duplicate content
  3. Safety filtering - Remove harmful content
  4. Language detection - Identify each page's language and filter on it
  5. Domain reputation - Use trust signals about the source domain
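
As a rough illustration of how the five filters above might be chained, here is a minimal Python sketch. It is not taken from any specific model's pipeline. Every name, threshold, and word list in it (MIN_WORDS, MIN_ALPHA_RATIO, BLOCKED_TERMS, BLOCKED_DOMAINS, ENGLISH_STOPWORDS, and the helper functions) is a hypothetical placeholder. The stopword check is a crude stand-in for a trained language identifier, and the hash-based step catches only exact duplicates after whitespace and case normalization.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

# Illustrative assumptions only: real pipelines tune these values and
# usually replace the word lists with trained classifiers.
MIN_WORDS = 5
MIN_ALPHA_RATIO = 0.6
ENGLISH_RATIO = 0.1
BLOCKED_TERMS = {"badword"}          # placeholder safety list
BLOCKED_DOMAINS = {"spam.example"}   # placeholder reputation list
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "a"}


@dataclass
class Document:
    url: str
    text: str


def tokens(text: str) -> list[str]:
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(".,!?;:").lower() for w in text.split()]


def quality_ok(text: str) -> bool:
    """Quality scoring: enough words, and mostly letters or spaces."""
    if len(text.split()) < MIN_WORDS:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= MIN_ALPHA_RATIO


def looks_english(text: str) -> bool:
    """Language detection stand-in: share of common English stopwords."""
    words = tokens(text)
    if not words:
        return False
    hits = sum(w in ENGLISH_STOPWORDS for w in words)
    return hits / len(words) >= ENGLISH_RATIO


def is_safe(text: str) -> bool:
    """Safety filtering: reject text containing any blocked term."""
    return not (set(tokens(text)) & BLOCKED_TERMS)


def domain_ok(url: str) -> bool:
    """Domain reputation: reject hosts on the blocklist."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def fingerprint(text: str) -> str:
    """Deduplication key: hash of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_documents(docs: list[Document]) -> list[Document]:
    """Apply the per-document filters, then drop exact duplicates."""
    seen: set[str] = set()
    kept: list[Document] = []
    for doc in docs:
        if not (domain_ok(doc.url) and looks_english(doc.text)
                and quality_ok(doc.text) and is_safe(doc.text)):
            continue
        fp = fingerprint(doc.text)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

The cheap per-document checks run first so that only documents passing them are hashed and tracked for deduplication. Detecting near-duplicates would need a different technique than the exact hash used here.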

For more on training data, see Common Crawl.