Duplicate Image Detection

Created by: whittenator

📄 Description

Implement an automated duplicate image detection system within Intel Geti to identify and flag identical or near-identical images in datasets before model training. This feature should analyze uploaded images using perceptual hashing algorithms (such as pHash, dHash, or aHash) and structural similarity measures to detect:

  • Exact duplicates: Pixel-perfect identical images, including those with different file formats or compression levels

  • Near duplicates: Visually similar images with minor variations such as slight cropping, rotation, brightness adjustments, or JPEG compression artifacts

  • Scaled duplicates: Same image content at different resolutions

The system should provide a configurable similarity threshold (e.g., 90-99% similarity) and present detected duplicates in an intuitive interface where users can review and compare them, then decide whether to keep them, merge their annotations, or remove them. The feature should integrate seamlessly with the existing annotation workflow and provide batch processing capabilities for large datasets.
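
As a rough illustration (not Geti's actual implementation), the detection step could build on the open-source Pillow and imagehash packages; the helper names and the 0.95 default threshold below are assumptions for the sketch.

```python
# Illustrative sketch only: uses the open-source Pillow and imagehash packages,
# not any Geti-internal API. Helper names and the default threshold are assumptions.
from pathlib import Path

import imagehash
from PIL import Image


def compute_hashes(image_path: Path) -> dict[str, imagehash.ImageHash]:
    """Compute several perceptual hashes so different duplicate types can be caught."""
    with Image.open(image_path) as img:
        return {
            "phash": imagehash.phash(img),         # DCT-based; tolerant of compression and brightness shifts
            "dhash": imagehash.dhash(img),         # gradient-based; cheap to compute
            "ahash": imagehash.average_hash(img),  # mean-based; catches exact and rescaled copies
        }


def similarity(h1: imagehash.ImageHash, h2: imagehash.ImageHash) -> float:
    """Map the Hamming distance between two hashes to a 0-1 similarity score."""
    return 1.0 - (h1 - h2) / h1.hash.size


def is_duplicate_pair(a: dict, b: dict, threshold: float = 0.95) -> bool:
    """Flag a pair when any hash type meets the configured similarity threshold."""
    return any(similarity(a[k], b[k]) >= threshold for k in a)
```

Because perceptual hashes are computed on a small fixed-size downscaled copy of each image, scaled duplicates naturally collapse to near-identical hashes. With the default 64-bit hashes, a 95% similarity threshold corresponds to a Hamming distance of at most 3 bits, which is roughly where re-compressed or mildly brightness-adjusted copies tend to fall.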

Key technical requirements:

  • Configurable similarity thresholds with visual preview

  • Batch processing for efficient analysis of large datasets (a parallel hashing sketch follows this list)

  • Integration with existing project management and annotation workflows

  • Performance optimization to handle enterprise-scale datasets
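
A minimal batching sketch under the same assumed imagehash dependency is shown below; the directory-scanning helper, worker count, and exact-collision grouping are illustrative simplifications. Near-duplicate search at enterprise scale would additionally need an index (e.g., a BK-tree or locality-sensitive hashing) rather than all-pairs comparison.

```python
# Hypothetical batch-processing sketch: hash images in parallel worker processes,
# then group files whose perceptual hash collides exactly. Not Geti's API.
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import imagehash
from PIL import Image

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def hash_one(path: Path) -> tuple[str, str]:
    """Return (hex digest of the pHash, file path) for a single image."""
    with Image.open(path) as img:
        return str(imagehash.phash(img)), str(path)


def find_duplicate_groups(image_dir: Path, workers: int = 8) -> list[list[str]]:
    """Hash all images in parallel and bucket files with identical pHash digests."""
    paths = [p for p in image_dir.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES]
    buckets: dict[str, list[str]] = defaultdict(list)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for digest, path in pool.map(hash_one, paths, chunksize=64):
            buckets[digest].append(path)
    # Only buckets with more than one member are duplicate candidates for review.
    return [group for group in buckets.values() if len(group) > 1]
```

The resulting candidate groups could then feed the review interface described above, where users confirm, merge annotations, or discard flagged images.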

🎯 Objective

Primary Goal: Prevent model overfitting and training degradation caused by duplicate images in training datasets, thereby improving model generalization and performance.

Secondary Goals:

  • Reduce dataset storage requirements and processing overhead

  • Improve data quality and curation efficiency for machine learning workflows

  • Provide data scientists and annotators with better visibility into dataset composition

  • Minimize manual effort required for dataset cleaning and preparation

  • Ensure compliance with data governance standards that may require deduplication

Success Metrics:

  • Reduction in training dataset redundancy (target: identify 95%+ of duplicate pairs)

  • Improved model performance metrics on validation sets

  • Decreased storage footprint of curated datasets

  • Reduced time spent on manual dataset review and cleaning