Is your AI system fair? Can you explain AI decisions?

We test fairness, explainability, and data quality using scientifically validated methods – and show concrete ways to improve.

Bias in AI systems is real and measurable: gender discrimination in recruiting, ethnic bias in credit scoring, age discrimination in insurance. The EU AI Act calls for fairness testing and transparency – we do both.

In 4-6 weeks, you will receive detailed analyses, concrete bias metrics, and practical mitigation strategies. From “Do we have a problem?” to “Here’s how we solve it.”

⏱️ 4-6 weeks | 📊 80-120 hours / 1-3 AI systems | 💰 €8,000 – €20,000 | ⚖️ Compliance: EU AI Act Art. 10, 13, 15

✓ 70+ fairness metrics (IBM AIF360, Fairlearn)
✓ Gold standard explainability (SHAP)
✓ Production-ready mitigation strategies
✓ Scientifically validated methodology

Why AI Bias & Transparency Testing?

Regulatory: The EU AI Act requires it

📜 Key articles:

Art. 10 (Data Governance):
Training data must be “representative” → Fairness testing required

Art. 13 (Transparency):
High-risk systems must be “interpretable” → Explainability testing

Art. 15 (Accuracy & Robustness):
Performance must be consistent across subgroups → Subgroup analyses

Consequences of non-compliance:
Up to €15 million or 3% of global annual revenue

When it becomes mandatory:
High-risk systems from August 2026

Business: Bias costs money and reputation

💼 Real consequences:

Financial risks:

  • Apple Card (2019): Gender Bias → Regulatory Investigation
  • Amazon HR Tool (2018): Gender Bias → Tool discontinued, millions lost

Reputational damage:

  • Public scandals = loss of trust
  • Backlash on social media
  • Customers are leaving

Legal risks:

  • AGG lawsuits (German General Equal Treatment Act)
  • Class action lawsuits (USA)
  • EU AI Act fines

Operational Issues:

  • Wrong decisions = poor business outcomes
  • Rejecting diverse talent = talent shortage
  • Rejecting good borrowers = loss of revenue

💡 AI bias and transparency testing is risk management: it’s better to identify problems early on, before they become costly.

The 3 test dimensions

We test your system from three complementary perspectives:

⚖️ Fairness & Bias Detection

What we test:

  • Demographic parity (equal approval rates across groups?)
  • Equal Opportunity (equal true positive rates?)
  • Equalized Odds (equal error rates?)
  • Individual fairness (similar people = similar outcomes?)
  • Intersectional fairness (gender × age × ethnicity)

Metrics (70+ available):

  • Disparate Impact Ratio (Benchmark: ≥0.80)
  • Statistical Parity Difference
  • Average Odds Difference
  • Calibration across groups

Output:

  • Bias heat map (which groups, how strong?)
  • Root cause analysis (which features drive bias?)
  • Severity rating (Critical/High/Medium/Low)

Example finding:
“Gender Disparate Impact: 0.65 (critical)
→ Women are 35% less likely to be approved than men
→ Root Cause: Feature ‘Years_Experience’ correlates with career breaks”
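To illustrate how such metrics are computed in practice, here is a minimal sketch using Fairlearn on toy data; the labels, predictions, and the `gender` series are placeholders for your own model output and protected attributes.

```python
# Minimal sketch (toy data): group fairness metrics with Fairlearn.
# y_true, y_pred, and "gender" are placeholders for your own data.
import pandas as pd
from fairlearn.metrics import (
    MetricFrame,
    demographic_parity_ratio,
    selection_rate,
    true_positive_rate,
)

y_true = pd.Series([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = pd.Series([1, 0, 1, 0, 0, 1, 0, 0])
gender = pd.Series(["f", "f", "m", "f", "m", "m", "m", "f"])

# Per-group selection rate (demographic parity) and true positive rate
# (equal opportunity), broken down by the protected attribute.
frame = MetricFrame(
    metrics={"selection_rate": selection_rate, "tpr": true_positive_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=gender,
)
print(frame.by_group)

# Disparate impact ratio: lowest group selection rate divided by the
# highest. Values below 0.80 are commonly flagged ("80% rule").
ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=gender)
print(f"Disparate impact ratio: {ratio:.2f}")
```

For intersectional analyses, `sensitive_features` can be a DataFrame with several columns (e.g. gender and age band), so the metrics are broken down by the combined subgroups.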

💡 Explainability & Transparency

What we test:

  • Global explainability (what is important for the model as a whole?)
  • Local explainability (why this specific decision?)
  • Feature Interactions (which features interact how?)
  • Counterfactual analysis ("What would need to change for a different outcome?")

Methods:

  • SHAP (SHapley Additive exPlanations) – Gold Standard
  • LIME (Local Interpretable Model-agnostic Explanations)
  • Feature Importance Rankings
  • Partial dependence plots

Output:

  • SHAP Summary Plots (Global: which features are most important)
  • SHAP Force Plots (Local: how was this decision made)
  • Model Card (compliant with EU AI Act Art. 13)

Example explanation:
“Loan denied for applicant #12345:
→ Income too low: −15 points
→ Age <30: −8 points
→ No home ownership: −5 points
→ Total score: 62 (threshold: 70)”
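A per-decision breakdown like the one above is what SHAP produces. A minimal sketch, here on a small synthetic regression model standing in for your scoring model:

```python
# Minimal sketch (toy model): global and local explanations with SHAP.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=0.1, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.Explainer(model, X)   # model-agnostic entry point
shap_values = explainer(X)

# Global: which features drive predictions overall (summary plot)
shap.plots.beeswarm(shap_values)

# Local: why did the model score this one case the way it did?
shap.plots.waterfall(shap_values[0])
```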

📊 Data Quality & Representativeness

What we test:

  • Representativeness (does the training data represent the target population?)
  • Completeness (systematic missing values?)
  • Consistency (contradictions in data?)
  • Bias in training data (label bias, selection bias)

Analyses:

  • Distribution matching (training vs. production)
  • Subgroup coverage (are all groups adequately represented?)
  • Label quality (are labels themselves biased?)
  • Feature Correlation Analysis

Output:

  • Data Quality Scorecard (0-100)
  • Coverage matrix (which subgroups are missing?)
  • Label Bias Report
  • Recommendations for data collection

Example finding:
“Training data only has 12% women, but target population is 48% women
→ Model learns mainly from male examples
→ Recommendation: re-sampling or collecting more data from women.”
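A minimal sketch of the representativeness check behind a finding like this, comparing training-data group shares against reference shares for the target population (the reference numbers below are illustrative assumptions):

```python
# Minimal sketch: subgroup shares in training data vs. target population.
# The reference shares are illustrative assumptions, not real figures.
import pandas as pd

train = pd.DataFrame({"gender": ["m"] * 88 + ["f"] * 12})
population_share = pd.Series({"m": 0.52, "f": 0.48})

report = pd.DataFrame({
    "training_share": train["gender"].value_counts(normalize=True),
    "population_share": population_share,
})
report["gap"] = report["training_share"] - report["population_share"]
print(report)
# A large negative gap (e.g. 12% women in training vs. 48% in the
# population) signals under-representation and a likely bias source.
```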

Our methodology & tools

We combine best-in-class open-source tools with our own expertise. All methods are scientifically validated and have been tested in hundreds of projects.

🔧 Core tools:

IBM AI Fairness 360 (AIF360)

  • 70+ fairness metrics
  • 10+ mitigation algorithms (pre/in/post-processing)
  • Intersectional analyses

Microsoft Fairlearn

  • Constraint-based fairness optimization
  • Trade-off analyses (fairness vs. accuracy)
  • Better visualizations than AIF360

SHAP (SHapley Additive exPlanations)

  • Model-agnostic explainability
  • Global + Local Explanations
  • Feature Interactions

Great Expectations

  • Data Quality Framework
  • Automated testing pipelines
  • Audit trail for data governance

Custom German NLP Probes (for chatbots/text systems)

  • Bias detection in German language models
  • Detection of gendered language
  • Name-based bias testing
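To make the name-based probing idea concrete, a minimal sketch: `score_application` is a hypothetical stand-in for your model or API, and the probe compares average scores on otherwise identical German texts in which only the name is swapped.

```python
# Minimal sketch of a name-swap bias probe for a German-language system.
# score_application() is a hypothetical placeholder for your model/API.
from statistics import mean

# "{name} applies with 5 years of professional experience as an accountant."
TEMPLATE = "{name} bewirbt sich mit 5 Jahren Berufserfahrung als Buchhalter/in."
MALE_NAMES = ["Thomas Müller", "Lukas Schmidt"]
FEMALE_NAMES = ["Anna Müller", "Lena Schmidt"]

def score_application(text: str) -> float:
    # Placeholder scorer so the sketch runs end to end;
    # replace this with a call to your model or API.
    return float(len(text) % 10)

def average_score(names: list[str]) -> float:
    return mean(score_application(TEMPLATE.format(name=n)) for n in names)

# A systematic gap on otherwise identical texts indicates name-based
# (here: gender-correlated) bias.
gap = average_score(MALE_NAMES) - average_score(FEMALE_NAMES)
print(f"Average score gap (male - female names): {gap:.3f}")
```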

What you will receive specifically

📄 Executive Summary (5-10 pages)

  • High-level findings
  • Critical Issues (prioritized by severity)
  • Recommended actions (quick wins + long term)
  • Business Impact Assessment

📊 Technical Deep-Dive Report (30-50 pages)

  • All fairness metrics (detailed)
  • Explainability analyses (SHAP plots, feature importance)
  • Data Quality Assessment
  • Subgroup performance matrices
  • Root cause analyses

Mitigation strategies (actionable)

  • Pre-processing options (data level)
  • In-processing options (model level)
  • Post-processing options (prediction level; see the sketch after this list)
  • Trade-off analyses (fairness vs. accuracy, business impact)
  • Implementation guidance (specific steps)
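As a concrete illustration of the post-processing option, a minimal sketch with Fairlearn's ThresholdOptimizer on toy data; your own trained model would take the place of the toy logistic regression, and the constraint (demographic parity here) is an assumption to adapt to your case.

```python
# Minimal sketch (toy data): post-processing mitigation with Fairlearn's
# ThresholdOptimizer. Your own trained model replaces the toy estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
gender = rng.choice(["f", "m"], size=500)
y = (X[:, 0] + 0.5 * (gender == "m") + rng.normal(0, 0.5, 500) > 0).astype(int)

base_model = LogisticRegression().fit(X, y)

# Chooses group-specific decision thresholds on the model's scores so
# that selection rates are equalized, without retraining the model.
mitigator = ThresholdOptimizer(
    estimator=base_model,
    constraints="demographic_parity",
    predict_method="predict_proba",
    prefit=True,
)
mitigator.fit(X, y, sensitive_features=gender)
y_fair = mitigator.predict(X, sensitive_features=gender)
```

Because the thresholds differ per group, some accuracy is typically traded for fairness; that trade-off is exactly what the analyses above quantify.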

📋 Model Card (compliant with EU AI Act Art. 13)

  • Intended Use
  • Training Data Documentation
  • Performance metrics (overall + per subgroup)
  • Fairness Metrics
  • Limitations & Risks

💻 Code & Notebooks (optional)

  • Jupyter notebooks with all analyses
  • Reproducible results
  • For your data scientists (knowledge transfer)

🎤 Presentation & Workshop (4 hours)

  • Present findings (for management and tech teams)
  • Q&A session
  • Discuss mitigation strategies
  • Define next steps

Additionally (optional):

☑ Implementation support (after testing): we help you implement the mitigation strategies.

☑ Monitoring setup: continuous fairness monitoring after go-live.

☑ Follow-up assessment (after 6-12 months): verify improvements, retest after changes.

Who needs AI bias and transparency testing?

Use case | Common issue | EU AI Act
Credit scoring (finance) | Gender/age/ZIP bias, discrimination risk | Annex III.5b – high risk
HR recruiting (all industries) | Gender/name bias, diversity issues | Annex III.4 – high risk
Insurance / underwriting | Age/gender bias, discrimination risk | Annex III.5a – high risk
Fraud detection (finance/e-commerce) | False-positive bias, subgroup impact | Article 15 – accuracy requirements
Chatbots (customer service) | Language bias, response fairness | Article 13 – transparency
Medical AI (healthcare) | Racial/gender bias, life-critical | Annex III.1 – high risk

Checklist: Do you need testing?

✅ You MUST test if:

  • High-risk system (EU AI Act Annex III)
  • Decisions about people (HR, credit, insurance)
  • Protected attributes involved (gender, age, ethnicity)

✅ You SHOULD test if:

  • Fairness concerns within the team
  • System has been running for >6 months (drift risk)
  • New regulation pending (EU AI Act)
  • "Black box" models (deep learning, ensembles)

❓ Testing may be optional if:

  • Non-critical systems (recommendations, content ranking)
  • No people affected (purely technical systems)
  • Very simple models (logistic regression, decision trees)

Process: 4-6 weeks from kickoff to report

📅 Week 1: Discovery & Data Access

  • Kickoff call (2 hours): Understanding the system, use case, goals
  • Data Access Setup: Access to training data, predictions, metadata
  • Scope Finalization: Which fairness metrics, which subgroups
  • NDA & Data Protection Agreement

📅 Weeks 2-3: Testing & Analysis

  • Fairness testing: Calculate 70+ metrics, subgroup analyses
  • Explainability testing: SHAP analyses, feature importance
  • Data quality testing: representativeness, completeness
  • Root cause analyses: Why bias? Which features?

📅 Week 4: Mitigation & Reporting

  • Develop mitigation strategies (3-5 options)
  • Trade-off analyses (fairness vs. accuracy)
  • Create report (executive + technical)
  • Create Model Card (EU AI Act compliant)

📅 Weeks 5-6: Presentation & Follow-up

  • Draft Report Review (Your Feedback)
  • Final Report Delivery
  • Presentation Workshop (4-8 hours with your teams)
  • Q&A, Define next steps
  • Handoff (code, notebooks, documentation)

Your effort
  • Week 1: 4-6 hours (kickoff, data access)
  • Weeks 2-3: 2-4 hours (check-ins, answering questions)
  • Weeks 4-6: 4-8 hours (review, workshop)

Total: ~12-20 hours over 4-6 weeks

Frequently asked questions

What data do you need from us?

Minimum:

  • Training data (or sample, min. 10,000 rows)
  • Model Predictions (on test set, min 1,000)
  • Metadata (feature descriptions, protected attributes)

Ideal:

  • Complete training data
  • Production Predictions (last 3-6 months)
  • Model artifacts (for SHAP: model file)

Data protection:

  • GDPR-compliant processing
  • Standard NDA
  • On-premise analysis possible (if sensitive)

What happens if you find bias?

Typical flow:

  1. Document findings (severity, impact)
  2. Root cause analysis (why?)
  3. Mitigation strategies (3-5 options)
  4. Trade-off analysis (costs, benefits)
  5. You decide the next steps

Options in case of critical bias:

  • Quick fix (post-processing, fast)
  • Model retraining (takes time, but better)
  • Data collection (more diverse data)
  • Feature engineering (removing problematic features)

We provide implementation support (optional, $150-200/hour)

Can we do the testing ourselves?

Yes, if you have:

  • In-house ML/fairness expertise
  • Time (2-4 weeks of a full-time data scientist)
  • Tools (AIF360, SHAP, etc. setup)
  • Methodological knowledge (which metrics when?)

Our advantages:

  • 25+ years of experience
  • Know which metrics are critical
  • Identify root causes faster
  • Know best practice mitigations
  • Audit-proof documentation
  • External validation (more credibility)

A typical split: your in-house team does the basic testing, and we do the deep dives for critical systems.

Start your testing

💬 Free initial consultation (30 min)

We analyze your system and estimate the effort involved.

Request a sample