
Sharing Performance Metrics for Degenerative AI Models

As part of our ongoing commitment to transparency and responsible AI development, we are sharing the performance metrics of our degenerative AI models.

We believe that open communication around model capabilities and limitations is essential not only for building trust with the broader community, but also for fostering collaboration, accountability, and innovation within the field of artificial intelligence. By publicly reporting these performance indicators, we aim to demonstrate the progress we are making in continuously improving our models.

Performance indicators

These metrics, benchmarked against a diverse dataset comprising both academic and proprietary (privately generated) samples, allow us—and others—to assess how well our models perform across various tasks and use cases, and provide a clear baseline for future enhancements. This transparency is especially important in the context of degenerative AI, where outputs are often complex, nuanced, and subject to interpretation.

To offer a consistent and interpretable view of model performance, we will report the following four standard evaluation metrics: accuracy, precision, recall, and F1 score.

Each of these metrics offers unique insight into model behavior, and together they provide a robust framework for evaluating and comparing degenerative model performance over time. By making these results available, we invite constructive feedback from the research community, foster shared learning, and ultimately strive for more responsible and effective deployment of degenerative AI technologies.

Accuracy

This metric measures the overall correctness of the model's outputs by calculating the proportion of total predictions that are correct. It is a broad indicator of performance but can be misleading in the presence of class imbalance.
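In standard confusion-matrix notation (true positives TP, true negatives TN, false positives FP, false negatives FN), accuracy can be written as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$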

Precision

Precision evaluates the proportion of correct positive predictions relative to the total number of positive predictions made. It answers the question: Of all the outputs labeled as positive by the model, how many were actually correct? This is particularly important in scenarios where false positives carry a high cost.
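Using the same confusion-matrix notation:

$$\text{Precision} = \frac{TP}{TP + FP}$$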

Recall

Also known as sensitivity, recall measures the proportion of actual positive cases that the model correctly identified. It reflects the model’s ability to capture all relevant cases and is critical in contexts where missing a true positive is particularly undesirable.
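In the same notation:

$$\text{Recall} = \frac{TP}{TP + FN}$$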

F1 Score

The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is especially useful when there is an uneven class distribution or when a balance between precision and recall is essential.
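As a formula:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The short Python sketch below is illustrative only: the confusion-matrix counts are made up, not taken from our evaluations, and it simply shows how the four metrics relate to a single confusion matrix.

```python
# Illustrative sketch only: hypothetical confusion-matrix counts,
# with the "artificial" class treated as the positive class.
tp, fp = 90, 10   # correctly flagged artificial / human wrongly flagged as artificial
fn, tn = 15, 85   # artificial missed (labeled human) / human correctly labeled human

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  "
      f"recall={recall:.2%}  f1={f1:.2%}")
```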

Model Statistics

For each model we report Accuracy, Human Precision, Artificial Precision, Human Recall, Artificial Recall, Human F1 Score, and Artificial F1 Score. A short sketch of how the per-class (Human/Artificial) values can be computed follows the table below.

Model          Release date   Model type
V01            Aug 2024       Images/Video
V012           Aug 2024       Images/Video
BrokenWand 2   Aug 2024       Images/Video
Revelio II     Jan 2025       Images/Video
Zebra-Ita      Feb 2025       Speech
Zebra-Comb     Feb 2025       Speech
Revelio III    May 2025       Images/Video
Phantom        July 2025      Speech
Zebra II       July 2025      Speech
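Here, Human and Artificial refer to the two classes a detector distinguishes. The sketch below is a minimal illustration, not our evaluation pipeline, of how per-class precision, recall, and F1 can be computed with scikit-learn; the labels are made-up placeholders rather than our benchmark data.

```python
# Minimal sketch: per-class ("Human" vs "Artificial") metrics from labels.
# y_true / y_pred are hypothetical placeholders, for illustration only.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["human", "artificial", "artificial", "human", "artificial", "human"]
y_pred = ["human", "artificial", "human",      "human", "artificial", "artificial"]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one value per label, in the order given by `labels`.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["human", "artificial"], average=None
)

print(f"Accuracy:         {accuracy:.2%}")
print(f"Human precision:  {precision[0]:.2%}   Artificial precision: {precision[1]:.2%}")
print(f"Human recall:     {recall[0]:.2%}   Artificial recall:    {recall[1]:.2%}")
print(f"Human F1 score:   {f1[0]:.2%}   Artificial F1 score:  {f1[1]:.2%}")
```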

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Deepfake-Eval-2024 is an in-the-wild deepfake dataset containing 44 hours of video, 56.5 hours of audio, and 1,975 images, spanning contemporary manipulation technologies, diverse media content, 88 different website sources, and 52 different languages. The media are manually labeled as real or fake, and the dataset is designed to facilitate deepfake detection research. It was created by a team from TrueMedia.org, the University of Washington, Miraflow AI, Georgetown University, Chung-Ang University, and Yonsei University.

Revelio III (images) performance on Deepfake-Eval-2024:

- Accuracy: 84.61%
- Human precision: 77.80%
- Artificial precision: 91.55%
- Human recall: 90.36%
- Artificial recall: 80.20%
- Human F1 score: 83.61%
- Artificial F1 score: 85.50%
- AUC score: 85.28%
Original paper
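The AUC score reported above is the area under the ROC curve, which summarizes detection quality across all decision thresholds rather than at a single cutoff. A minimal sketch of how such a value is computed, assuming the model outputs a continuous "artificial" score per sample; the labels and scores below are placeholders, not Deepfake-Eval-2024 data.

```python
# Minimal illustration: AUC from continuous scores rather than hard labels.
# Labels and scores are made-up placeholders, not Deepfake-Eval-2024 results.
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = artificial, 0 = human (assumed convention)
scores = [0.91, 0.12, 0.67, 0.80, 0.40, 0.05, 0.58, 0.33]  # model's "artificial" probability

print(f"AUC score: {roc_auc_score(labels, scores):.2%}")
```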
