Is your AI system fair? Can you explain AI decisions?
We test fairness, explainability, and data quality using scientifically validated methods, and we show you concrete ways to improve.
Bias in AI systems is real and measurable: gender discrimination in recruiting, ethnic bias in credit scoring, age discrimination in insurance. The EU AI Act calls for fairness testing and transparency; we do both.
In 4-6 weeks, you will receive detailed analyses, concrete bias metrics, and practical mitigation strategies. From “Do we have a problem?” to “Here’s how we solve it.”
4-6 weeks | 80-120 hours / 1-3 AI systems | €8,000 – €20,000 | Compliance: EU AI Act Art. 10, 13, 15
✓ 70+ fairness metrics (IBM AIF360, Fairlearn)
✓ Gold-standard explainability (SHAP)
✓ Production-ready mitigation strategies
✓ Scientifically validated methodology
Why AI Bias & Transparency Testing?
Regulatory: The EU AI Act requires it
Key articles:
Art. 10 (Data Governance):
Training data must be “representative” → fairness testing required
Art. 13 (Transparency):
High-risk systems must be “interpretable” → explainability testing required
Art. 15 (Accuracy & Robustness):
Performance must be consistent across subgroups → subgroup analyses required
Consequences of non-compliance:
Up to €15 million or 3% of global annual revenue
When it becomes mandatory:
High-risk systems from August 2026
Business: Bias costs money and reputation
Real consequences:
Financial risks:
- Apple Card (2019): gender bias → regulatory investigation
- Amazon HR tool (2018): gender bias → tool discontinued, millions lost
Reputational damage:
- Public scandals = loss of trust
- Backlash on social media
- Customer churn
Legal risks:
- AGG lawsuits (German General Equal Treatment Act)
- Class action lawsuits (USA)
- EU AI Act fines
Operational issues:
- Wrong decisions = poor business outcomes
- Rejecting diverse talent = talent shortage
- Rejecting good borrowers = loss of revenue
AI bias and transparency testing is risk management: it is better to identify problems early, before they become costly.
The 3 test dimensions
We test your system from three complementary perspectives:
Fairness & Bias Detection
What we test:
- Demographic parity (equal approval rates across groups?)
- Equal Opportunity (equal true positive rates?)
- Equalized Odds (equal error rates?)
- Individual fairness (similar people = similar outcomes?)
- Intersectional fairness (gender × age × ethnicity)
Metrics (70+ available):
- Disparate Impact Ratio (benchmark: ≥ 0.80)
- Statistical Parity Difference
- Average Odds Difference
- Calibration across groups
Output:
- Bias heat map (which groups, how strong?)
- Root cause analysis (which features drive bias?)
- Severity rating (Critical/High/Medium/Low)
Example finding:
“Gender Disparate Impact: 0.65 (critical)
→ Women are 35% less likely to be approved than men
→ Root cause: the feature ‘Years_Experience’ correlates with career breaks”
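To make this concrete, here is a minimal sketch of how such a disparate impact check can be computed with Fairlearn; the file name and the columns `approved`, `repaid`, and `gender` are hypothetical placeholders, not part of any client system.

```python
import pandas as pd
from fairlearn.metrics import (
    MetricFrame,
    selection_rate,
    demographic_parity_difference,
    demographic_parity_ratio,
)

# Hypothetical scored dataset: one row per application
df = pd.read_csv("scored_applications.csv")
y_pred = df["approved"]   # model decision: 1 = approved, 0 = rejected
y_true = df["repaid"]     # ground-truth label (required by the API, not used by parity metrics)
gender = df["gender"]     # protected attribute

# Approval rate per group (the quantity behind demographic parity)
mf = MetricFrame(metrics=selection_rate, y_true=y_true, y_pred=y_pred,
                 sensitive_features=gender)
print(mf.by_group)

# Disparate impact ratio = lowest group approval rate / highest group approval rate
dir_ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=gender)
spd = demographic_parity_difference(y_true, y_pred, sensitive_features=gender)
print(f"Disparate impact ratio: {dir_ratio:.2f} (flag if below 0.80)")
print(f"Statistical parity difference: {spd:.2f}")
```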
Explainability & Transparency
What we test:
- Global explainability (what is important for the model as a whole?)
- Local explainability (why this specific decision?)
- Feature interactions (how do features interact with each other?)
- Counterfactual analysis ("What would need to change for a different outcome?")
Methods:
- SHAP (SHapley Additive exPlanations) – the gold standard
- LIME (Local Interpretable Model-agnostic Explanations)
- Feature Importance Rankings
- Partial dependence plots
Output:
- SHAP Summary Plots (Global: which features are most important)
- SHAP Force Plots (Local: how was this decision made)
- Model Card (compliant with EU AI Act Art. 13)
Example explanation:
“Loan denied for applicant #12345:
→ Income too low: −15 points
→ Age <30: −8 points
→ No home ownership: −5 points
→ Total score: 62 (threshold: 70)”
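For illustration, a minimal SHAP sketch, assuming a fitted tree-based model `model` and its feature DataFrame `X` (both placeholders); the modern `shap.plots` API is used for the global and local views.

```python
import shap

# Assumption: `model` is a fitted tree-based classifier (XGBoost, LightGBM,
# scikit-learn gradient boosting, ...) and `X` is its feature DataFrame.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)            # shap.Explanation object

# Global explainability: which features drive the model overall
shap.plots.beeswarm(shap_values)      # the "summary plot" view

# Local explainability: why one specific application (row i) was scored this way
i = 0
shap.plots.waterfall(shap_values[i])  # for some binary classifiers, select the
                                      # positive class first, e.g. shap_values[i, :, 1]
```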
Data Quality & Representativeness
What we test:
- Representativeness (does the training data represent the target population?)
- Completeness (systematic missing values?)
- Consistency (contradictions in data?)
- Bias in training data (label bias, selection bias)
Analyses:
- Distribution matching (training vs. production)
- Subgroup coverage (are all groups adequately represented?)
- Label quality (are labels themselves biased?)
- Feature Correlation Analysis
Output:
- Data Quality Scorecard (0-100)
- Coverage matrix (which subgroups are missing?)
- Label Bias Report
- Recommendations for data collection
Example finding:
“Training data contains only 12% women, but the target population is 48% women
→ The model learns mainly from male examples
→ Recommendation: re-sampling or collecting additional data on women.”
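A check like the one in this finding needs no heavy tooling; here is a minimal pandas sketch, where the file name, the column `gender`, the reference shares, and the 10-percentage-point threshold are all placeholder assumptions.

```python
import pandas as pd

train = pd.read_csv("training_data.csv")             # hypothetical training set
target_population = {"female": 0.48, "male": 0.52}   # e.g. from census or CRM data

train_share = train["gender"].value_counts(normalize=True)

for group, expected in target_population.items():
    observed = train_share.get(group, 0.0)
    gap = observed - expected
    flag = "UNDER-REPRESENTED" if gap < -0.10 else "ok"   # threshold is an assumption
    print(f"{group:>8}: training {observed:.1%} vs. population {expected:.1%} ({flag})")
```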
Our methodology & tools
We combine best-in-class open-source tools with our own expertise. All methods are scientifically validated and have been tested in hundreds of projects.
Core tools:
IBM AI Fairness 360 (AIF360)
- 70+ fairness metrics
- 10+ mitigation algorithms (pre/in/post-processing)
- Intersectional analyses
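A minimal sketch of how a dataset-level bias metric looks in AIF360, assuming a pandas DataFrame `df` with a binary label `approved` and a binary-encoded protected attribute `gender` (1 = privileged group); all names are placeholders.

```python
from aif360.datasets import StandardDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Wrap the raw DataFrame in AIF360's dataset abstraction
dataset = StandardDataset(
    df,
    label_name="approved",
    favorable_classes=[1],
    protected_attribute_names=["gender"],
    privileged_classes=[[1]],
)

metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{"gender": 1}],
    unprivileged_groups=[{"gender": 0}],
)
print("Disparate impact:", metric.disparate_impact())                # benchmark: >= 0.80
print("Statistical parity difference:", metric.statistical_parity_difference())
```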
Microsoft Fairlearn
- Constraint-based fairness optimization
- Trade-off analyses (fairness vs. accuracy)
- Better visualizations than AIF360
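As a sketch of what constraint-based optimization means in practice, Fairlearn's reductions API wraps a standard scikit-learn estimator; `X`, `y`, and `gender` are placeholders, and the demographic-parity constraint is only one of several options.

```python
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Train a model subject to a fairness constraint instead of fixing predictions afterwards
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=gender)
y_pred_fair = mitigator.predict(X)
```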
SHAP (SHapley Additive exPlanations)
- Model-agnostic explainability
- Global + Local Explanations
- Feature Interactions
Great Expectations
- Data Quality Framework
- Automated testing pipelines
- Audit trail for data governance
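A minimal sketch of such rule-based data-quality checks, written against the classic pandas-based Great Expectations API (newer releases organize the same expectations into suites and checkpoints); column names and thresholds are assumptions.

```python
import great_expectations as ge

gdf = ge.from_pandas(train)   # `train` = training DataFrame (placeholder)

checks = {
    "income has no missing values": gdf.expect_column_values_to_not_be_null("income"),
    "age within 18-100": gdf.expect_column_values_to_be_between("age", min_value=18, max_value=100),
    "gender uses known categories": gdf.expect_column_values_to_be_in_set("gender", ["female", "male", "diverse"]),
}
for name, result in checks.items():
    print(f"{name}: {'PASS' if result.success else 'FAIL'}")
```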
Custom German NLP Probes (for chatbots/text systems)
- Bias detection in German language models
- Detection of gendered language
- Name-based bias testing
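A name-swap probe for a German-language system can be sketched as follows; the template sentence, the name lists, and `score_application_text` are purely illustrative assumptions, not a real API.

```python
# Template that differs only in the applicant's name
TEMPLATE = "Bewerbung von {name}: 5 Jahre Berufserfahrung in der Softwareentwicklung."

name_groups = {
    "German-coded names": ["Lukas Müller", "Anna Schmidt"],
    "Turkish-coded names": ["Emre Yılmaz", "Ayşe Demir"],
}

def probe(score_fn):
    """Compare mean scores across name groups; large gaps indicate name bias."""
    for group, names in name_groups.items():
        scores = [score_fn(TEMPLATE.format(name=n)) for n in names]
        print(f"{group}: mean score = {sum(scores) / len(scores):.2f}")

# probe(score_application_text)   # plug in the system's own scoring function here
```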
What you will receive
Executive Summary (5-10 pages)
- High-level findings
- Critical Issues (prioritized by severity)
- Recommended actions (quick wins + long term)
- Business Impact Assessment
Technical Deep-Dive Report (30-50 pages)
- All fairness metrics (detailed)
- Explainability analyses (SHAP plots, feature importance)
- Data Quality Assessment
- Subgroup performance matrices
- Root cause analyses
Mitigation strategies (actionable)
- Pre-processing options (data level)
- In-processing options (model level)
- Post-processing options (prediction level)
- Trade-off analyses (fairness vs. accuracy, business impact)
- Implementation guidance (specific steps)
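To make one of the post-processing options concrete, here is a hedged sketch using Fairlearn's ThresholdOptimizer, which adjusts decision thresholds per group on top of an already-trained model; `model`, `X`, `y`, and `gender` are placeholders, and equalized odds is just one possible constraint.

```python
from fairlearn.postprocessing import ThresholdOptimizer

postprocessor = ThresholdOptimizer(
    estimator=model,               # already-trained classifier (placeholder)
    constraints="equalized_odds",  # equal true/false positive rates across groups
    prefit=True,                   # do not refit the underlying model
)
postprocessor.fit(X, y, sensitive_features=gender)
y_pred_adjusted = postprocessor.predict(X, sensitive_features=gender)
```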
Model Card (compliant with EU AI Act Art. 13)
- Intended Use
- Training Data Documentation
- Performance metrics (overall + per subgroup)
- Fairness Metrics
- Limitations & Risks
Code & Notebooks (optional)
- Jupyter notebooks with all analyses
- Reproducible results
- For your data scientists (knowledge transfer)
Presentation & Workshop (4 hours)
- Present findings (for management and tech teams)
- Q&A session
- Discuss mitigation strategies
- Define next steps
Additionally (optional):
- Implementation support (after testing): we help implement the mitigation strategies
- Monitoring setup: continuous fairness monitoring after go-live
- Follow-up assessment (after 6-12 months): verify improvements, re-test after changes
Who needs AI bias and transparency testing?
| Use case | Common issue | EU AI Act |
|---|---|---|
| Credit scoring (finance) | Gender/age/ZIP-code bias, discrimination risk | Annex III.5b – High risk |
| HR recruiting (all industries) | Gender/name bias, diversity issues | Annex III.4 – High risk |
| Insurance / underwriting | Age/gender bias, risk of discrimination | Annex III.5a – High risk |
| Fraud detection (finance/e-commerce) | False-positive bias, subgroup impact | Article 15 – Accuracy requirements |
| Chatbots (customer service) | Language bias, response fairness | Article 13 – Transparency |
| Medical AI (healthcare) | Racial/gender bias, life-critical decisions | Annex III.1 – High risk |
Checklist: Do you need testing?
You MUST test if:
- High-risk system (EU AI Act Annex III)
- Decisions about people (HR, credit, insurance)
- Protected attributes involved (gender, age, ethnicity)
You SHOULD test if:
- Fairness concerns within the team
- System has been running for >6 months (drift risk)
- New regulation pending (EU AI Act)
- "Black box" models (deep learning, ensembles)
Testing may be optional for:
- Non-critical systems (recommendations, content ranking)
- No people affected (purely technical systems)
- Very simple models (logistic regression, decision trees)
Process: 4-6 weeks from kickoff to report
Week 1: Discovery & Data Access
- Kickoff call (2 hours): Understanding the system, use case, goals
- Data Access Setup: Access to training data, predictions, metadata
- Scope Finalization: Which fairness metrics, which subgroups
- NDA & Data Protection Agreement
Weeks 2-3: Testing & Analysis
- Fairness testing: Calculate 70+ metrics, subgroup analyses
- Explainability testing: SHAP analyses, feature importance
- Data quality testing: representativeness, completeness
- Root cause analyses: why does bias occur, and which features drive it?
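One simple root-cause check, sketched here with placeholder file and column names: rank features by how strongly they are associated with the protected attribute, which surfaces proxy features such as the ‘Years_Experience’ example above.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")                 # hypothetical input
protected = (df["gender"] == "female").astype(int)    # binary-encoded protected attribute

# Correlation of each numeric feature with the protected attribute
numeric_features = df.select_dtypes("number").drop(columns=["approved"], errors="ignore")
proxy_strength = numeric_features.corrwith(protected).abs().sort_values(ascending=False)
print(proxy_strength.head(10))   # top candidate proxy features
```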
Week 4: Mitigation & Reporting
- Develop mitigation strategies (3-5 options)
- Trade-off analyses (fairness vs. accuracy)
- Create report (executive + technical)
- Create Model Card (EU AI Act compliant)
Weeks 5-6: Presentation & Follow-up
- Draft Report Review (Your Feedback)
- Final Report Delivery
- Presentation Workshop (4-8 hours with your teams)
- Q&A, define next steps
- Handoff (code, notebooks, documentation)
Your effort
- Week 1: 4-6 hours (kickoff, data access)
- Weeks 2-3: 2-4 hours (check-ins, answering questions)
- Weeks 4-6: 4-8 hours (review, workshop)
Total: ~12-20 hours over 4-6 weeks
Frequently asked questions
Start your testing
Free initial consultation (30 min)
We analyze your system and estimate the effort involved.
COMING SOON: View sample report
See an anonymized example of our reports.
