Duplicate Image Detection

Created by: whittenator

📄 Description

Implement an automated duplicate image detection system within Intel Geti to identify and flag identical or near-identical images in datasets before model training. This feature should analyze uploaded images using perceptual hashing algorithms (such as pHash, dHash, or aHash) and structural similarity measures to detect:

  • Exact duplicates: Pixel-perfect identical images, including those with different file formats or compression levels

  • Near duplicates: Visually similar images with minor variations such as slight cropping, rotation, brightness adjustments, or JPEG compression artifacts

  • Scaled duplicates: Same image content at different resolutions

The system should provide a configurable similarity threshold (e.g., 90-99% similarity) and present detected duplicates in an intuitive interface where users can review and compare them, then decide whether to keep them, merge their annotations, or remove them. The feature should integrate seamlessly with the existing annotation workflow and provide batch processing capabilities for large datasets.
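
As a rough illustration (not Geti's actual implementation), the detection step could build on the open-source Pillow and imagehash packages; the helper names and the 0.95 default threshold below are assumptions for the sketch.

```python
# Illustrative sketch only: uses the open-source Pillow and imagehash packages,
# not any Geti-internal API. Helper names and the default threshold are assumptions.
from pathlib import Path

import imagehash
from PIL import Image


def compute_hashes(image_path: Path) -> dict[str, imagehash.ImageHash]:
    """Compute several perceptual hashes so different duplicate types can be caught."""
    with Image.open(image_path) as img:
        return {
            "phash": imagehash.phash(img),         # DCT-based; tolerant of compression and brightness shifts
            "dhash": imagehash.dhash(img),         # gradient-based; cheap to compute
            "ahash": imagehash.average_hash(img),  # mean-based; catches exact and rescaled copies
        }


def similarity(h1: imagehash.ImageHash, h2: imagehash.ImageHash) -> float:
    """Map the Hamming distance between two hashes to a 0-1 similarity score."""
    return 1.0 - (h1 - h2) / h1.hash.size


def is_duplicate_pair(a: dict, b: dict, threshold: float = 0.95) -> bool:
    """Flag a pair when any hash type meets the configured similarity threshold."""
    return any(similarity(a[k], b[k]) >= threshold for k in a)
```

Because perceptual hashes are computed on a small fixed-size downscaled copy of each image, scaled duplicates naturally collapse to near-identical hashes. With the default 64-bit hashes, a 95% similarity threshold corresponds to a Hamming distance of at most 3 bits, which is roughly where re-compressed or mildly brightness-adjusted copies tend to fall.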

Key technical requirements:

  • Configurable similarity thresholds with visual preview

  • Batch processing for efficient analysis of large datasets (a parallel hashing sketch follows this list)

  • Integration with existing project management and annotation workflows

  • Performance optimization to handle enterprise-scale datasets
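
A minimal batching sketch under the same assumed imagehash dependency is shown below; the directory-scanning helper, worker count, and exact-collision grouping are illustrative simplifications. Near-duplicate search at enterprise scale would additionally need an index (e.g., a BK-tree or locality-sensitive hashing) rather than all-pairs comparison.

```python
# Hypothetical batch-processing sketch: hash images in parallel worker processes,
# then group files whose perceptual hash collides exactly. Not Geti's API.
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import imagehash
from PIL import Image

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def hash_one(path: Path) -> tuple[str, str]:
    """Return (hex digest of the pHash, file path) for a single image."""
    with Image.open(path) as img:
        return str(imagehash.phash(img)), str(path)


def find_duplicate_groups(image_dir: Path, workers: int = 8) -> list[list[str]]:
    """Hash all images in parallel and bucket files with identical pHash digests."""
    paths = [p for p in image_dir.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES]
    buckets: dict[str, list[str]] = defaultdict(list)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for digest, path in pool.map(hash_one, paths, chunksize=64):
            buckets[digest].append(path)
    # Only buckets with more than one member are duplicate candidates for review.
    return [group for group in buckets.values() if len(group) > 1]
```

The resulting candidate groups could then feed the review interface described above, where users confirm, merge annotations, or discard flagged images.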

🎯 Objective

Primary Goal: Prevent model overfitting and training degradation caused by duplicate images in training datasets, thereby improving model generalization and performance.

Secondary Goals:

  • Reduce dataset storage requirements and processing overhead

  • Improve data quality and curation efficiency for machine learning workflows

  • Provide data scientists and annotators with better visibility into dataset composition

  • Minimize manual effort required for dataset cleaning and preparation

  • Ensure compliance with data governance standards that may require deduplication

Success Metrics:

  • Reduction in training dataset redundancy (target: identify 95%+ of duplicate pairs)

  • Improved model performance metrics on validation sets

  • Decreased storage footprint of curated datasets

  • Reduced time spent on manual dataset review and cleaning