Currently, there are no standardized requirements for deploying AI algorithms in production, and there is a lack of transparency for users and clients. As a result, individuals can become unsuspecting targets of discrimination. This issue affects various sectors, including finance and healthcare. To address this, we standardize existing metrics in the AI field for evaluating fairness, accuracy, toxicity, and data quality. Our goal is to enhance confidence and reliability in AI models, raise awareness, assist model creators in improving fairness, and provide users and clients with tools to participate in model validation.
We evaluate models using blockchain technology and publish a leaderboard that lets users choose the best models. The leaderboard also provides access to detailed metric reports on model quality, ensuring a transparent and informed selection process.
How it works
1. Submit your model: Submit the model's code, weights, test data, and targets. We use zero-knowledge proofs to attest to the model and its execution, producing the output data (see the sketch after this list). We register the model's verifier on chain, and then compute the scores on chain from the input, predicted, and target values.
2. Obtain a report: A report for the model is generated on chain. You will see our analysis of the metrics, data quality, data distribution, and suggested dimensions and techniques for improvement.
3. Be part of the leaderboard: Models are rated on a leaderboard, where anybody can publicly view their evaluation reports. This helps promote your model, your code, or you as a scientist.
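A minimal sketch of this flow in Python, assuming hypothetical `prover` and `registry` objects standing in for the zero-knowledge prover and the on-chain verifier/score contract (neither interface comes from a real library):

```python
import hashlib
from pathlib import Path

def submit_model(code_path, weights_path, inputs, targets, prover, registry):
    """Sketch of steps 1-3: prove execution, register a verifier, score on chain."""
    # Commit to the exact artifacts being evaluated.
    code_hash = hashlib.sha256(Path(code_path).read_bytes()).hexdigest()
    weights_hash = hashlib.sha256(Path(weights_path).read_bytes()).hexdigest()

    # Hypothetical prover: produces predictions plus a ZK proof that the
    # committed model actually generated them from `inputs`.
    predictions, proof = prover.prove_execution(weights_path, inputs)

    # Hypothetical registry contract: stores the verifier, checks the proof,
    # then computes metric scores on chain from inputs, predictions, targets.
    model_id = registry.register_verifier(code_hash, weights_hash, proof)
    scores = registry.compute_scores(model_id, inputs, predictions, targets)
    return model_id, scores
```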
Metrics
Fairness: AI systems should make decisions impartially and without bias. It encompasses ensuring that the outcomes are equitable across different demographic groups, such as gender, race, and age.
Naive approaches, like simply balancing datasets, can overlook deeper issues such as systemic biases and intersectionality. Common metrics include demographic parity, equal opportunity, equalized odds, and disparate impact ratio. These metrics should be computed and reported for various subgroups.
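As an illustration of how two of these could be computed, the sketch below reports the positive-prediction rate per subgroup (demographic parity) and the disparate impact ratio; the 0.8 threshold in the comment follows the common four-fifths rule of thumb:

```python
import numpy as np

def demographic_parity(y_pred, group):
    """Positive-prediction rate per subgroup plus the disparate impact ratio."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rates = {g: float(y_pred[group == g].mean()) for g in np.unique(group)}
    # Disparate impact ratio: lowest selection rate over the highest.
    # The "four-fifths rule" commonly flags ratios below 0.8.
    ratio = min(rates.values()) / max(rates.values())
    return rates, ratio

rates, di = demographic_parity([1, 0, 1, 1, 0, 0], ["f", "f", "f", "m", "m", "m"])
# rates ~ {'f': 0.67, 'm': 0.33}; di == 0.5, below the 0.8 threshold
```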
Accuracy: A measure of the correctness of a model's predictions.
It is typically evaluated using metrics such as precision, recall, F1 score, and overall accuracy.
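A minimal example using scikit-learn's implementations of these metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

report = {
    "accuracy":  accuracy_score(y_true, y_pred),   # 0.83
    "precision": precision_score(y_true, y_pred),  # 1.0
    "recall":    recall_score(y_true, y_pred),     # 0.75
    "f1":        f1_score(y_true, y_pred),         # 0.86
}
print(report)
```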
Toxicity: Toxicity refers to harmful, abusive, or inappropriate content generated by AI systems, which can degrade user experience and lead to negative societal impacts. This metric focuses on generative AI.
We use NLP techniques to evaluate sentiment and flag potentially harmful language.
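One possible sketch using the Hugging Face `transformers` pipeline; `unitary/toxic-bert` is one publicly available toxicity classifier, named here as an example rather than a fixed choice:

```python
from transformers import pipeline

# Example checkpoint (an assumption, not a mandated choice); any
# comparable toxicity classifier could be substituted.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def toxicity_report(texts, threshold=0.5):
    """Score each generated text and flag those above the threshold."""
    results = toxicity(texts)
    return [
        {"text": t, "label": r["label"], "score": r["score"],
         "flagged": r["score"] >= threshold}
        for t, r in zip(texts, results)
    ]
```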
Data Quality: Data quality refers to the accuracy, completeness, and reliability of the data used to train and evaluate AI models.
We provide a report covering missing values, variable distributions, outliers, inconsistencies, and variability in the data.
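A sketch of such a report with pandas, using the common 1.5 × IQR rule for outliers (the specific checks and thresholds here are illustrative):

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, distributions, and IQR-based outliers."""
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "distribution": numeric.describe().to_dict(),   # mean, std, quartiles
        "outliers_per_column": outliers.to_dict(),      # 1.5 * IQR rule
    }
```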
Age and Contextual Relevance: The relevance of data over time, acknowledging that some data becomes outdated and less useful. This includes ensuring the data is relevant to the application domain and recognizing historical context, such as past voting restrictions for women, that may affect data interpretation.
Data Updates: Certain algorithms need retraining on new data to remain reliable. In such cases the time of last verification is crucial, and the metrics must be recomputed whenever new datasets update the model weights. We define a next suggested update date and an expiration date after which the model's reliability can no longer be assumed.
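A minimal sketch of what a verification record with a suggested-update and expiration check could look like; the fields and policy are illustrative assumptions, not a fixed part of the protocol:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class VerificationRecord:
    """Illustrative staleness policy for a verified model."""
    last_verified: date
    revalidation_interval: timedelta  # suggested cadence for re-scoring

    @property
    def next_suggested_update(self) -> date:
        return self.last_verified + self.revalidation_interval

    def is_expired(self, today: Optional[date] = None) -> bool:
        """True once the model has passed its suggested revalidation date."""
        return (today or date.today()) > self.next_suggested_update

rec = VerificationRecord(date(2024, 1, 1), timedelta(days=180))
rec.is_expired(date(2024, 8, 1))  # True: past the suggested update of 2024-06-29
```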
Data Privacy and Copyright Compliance: Ensuring the privacy and confidentiality of the data used, along with source transparency, copyright compliance, and anonymization.
Fakeness: The authenticity and reliability of the data used.
This covers source verification, ensuring that data comes from credible and verified sources, and the implementation of algorithms to detect fake or manipulated data.
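As one illustrative detection approach (an assumption on our part, not a fixed pipeline), an off-the-shelf anomaly detector can flag records that look unlikely under the bulk of the data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def flag_suspect_rows(X: np.ndarray, contamination: float = 0.01) -> np.ndarray:
    """Return indices of rows an anomaly detector marks as unlikely."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    labels = detector.fit_predict(X)   # -1 = anomaly, 1 = inlier
    return np.where(labels == -1)[0]   # indices of suspect rows
```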
Code Verification
Ensuring good practices in the code. We audit source code for AI good practices, flagging patterns that can lead to overfitting, bias, or misleading results, such as applying data augmentation to the test set or splitting datasets incorrectly.
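A small example of the correct ordering with scikit-learn, using a fitted scaler as a stand-in for any preprocessing or augmentation step:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

# Correct order: split first, then fit preprocessing (or augmentation)
# on the training portion only. Fitting on the full dataset leaks
# test-set information into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)  # fit on the training split only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
```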
Replication
The ability to reproduce results is crucial for evaluating AI models. For that reason, our solution allows multiple submissions to verify an already submitted model. One limitation: to guarantee the model is the same, the source code must match exactly.
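A sketch of how such an exact-match check could work, hashing each submission's source tree so two submissions can be compared byte for byte (the helper is illustrative):

```python
import hashlib
from pathlib import Path

def tree_digest(root: str) -> str:
    """SHA-256 over a submission's source files, visited in a stable order."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*.py")):
        h.update(path.relative_to(root).as_posix().encode())  # stable names
        h.update(path.read_bytes())                           # file contents
    return h.hexdigest()

def same_source(submission_a: str, submission_b: str) -> bool:
    """Two submissions count as the same model only if their code matches exactly."""
    return tree_digest(submission_a) == tree_digest(submission_b)
```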