
Why Vaultless Tokenization is the Engineering Response to Modern Data Risk

[Image: vaultless tokenization example showing personal data transformed into protected tokenized values]

Data and AI risk are no longer abstract boardroom concerns. As AI adoption accelerates and regulatory expectations expand in every major jurisdiction, organisations are being forced to rethink how they protect sensitive data. For organisations operating in regulated industries - whether financial institutions contending with GDPR and DORA obligations, healthcare institutions bound by longstanding privacy and compliance requirements, or other businesses subject to the most intense oversight - the challenge is especially acute. Organisations need not only to understand the changing data protection landscape but also to have modern controls in place to protect sensitive data.


To help, we’ve compiled statistics that map the current state of the landscape, covering breach costs, sensitive data exposure across cloud and analytics environments, AI-related risk, and the regulatory consequences of falling short. A practical framework for building a data protection plan follows, along with an explanation of how SecureToken Vaultless Tokenization supports the implementation work within Cloudera.


Data Protection Statistics

The statistics below provide more insight into the data protection landscape, including how sensitive data risk is changing across the AI era.

 

AI Adoption (GenAI and agentic AI)

Data Challenges and Shadow AI

  • More than 68% of companies believe that their staff use unapproved AI tools, so-called “Shadow AI” (SAP, The Value of AI in the UK).

  • 60% of organisations are unable to track specific requests employees make to GenAI tools or detect unapproved "Shadow AI" deployments (Cisco Cybersecurity Readiness Index, 2025).

  • In an ongoing research study of over 4,000 full-time employees, nearly three-quarters (73%) believe that generative AI introduces new security risks. (Salesforce Generative AI Ethics Survey)

  • Among 3,470 security and C-suite business leaders with firsthand knowledge of the data breach incidents at their organisations, nearly all (97%) stated that their organisations had suffered an AI-related incident due to a lack of proper AI-specific access controls (IBM Cost of a Data Breach Report, 2025).

  • Shadow AI was a factor in 20% of all data breaches in the past year, adding an average of $670,000 to breach costs per incident. (IBM Cost of a Data Breach Report, 2025)

  • 86% of organisations experienced AI-related security incidents in the past year, yet only 49% of respondents believe their employees fully understand AI-related threats. (Cisco Cybersecurity Readiness Index, 2025).

  • Despite the exposure risks, 63% of breached organisations lacked a formal AI governance policy (IBM Cost of a Data Breach Report, 2025).

  • Security concerns represent another major barrier, with 43% of organisations identifying security as a challenge when implementing AI agents.

  • In a recent Forrester survey of 1,524 AI decision-makers, security and risk were the most frequently cited concerns regarding AI usage in organisations. Key concerns included data privacy leaks (10%), regulatory compliance risks (7%), unsanctioned “bring-your-own-AI” usage by employees (6%), and insecure AI-generated software assets (6%). Other concerns included unpredictable AI outcomes (22%), internal readiness and governance challenges (22%), and difficulties obtaining high-quality training data (7%). (Forrester State of AI Survey)

  • An estimated 30% of generative AI projects will be abandoned after proof of concept (Gartner Technology Press Release), often due to:

    • Poor data quality

    • Inadequate risk controls

    • Escalating costs

    • Unclear business value

 

AI Workforce Readiness and Skills

  • By 2026, 90% of global enterprises are projected to face critical AI skills shortages — potentially costing the global economy $5.5 trillion in delayed products, lost revenue, and reduced competitiveness. (IDC Closing the Gap: Verifying AI Skills in the Enterprise)

  • Only about 40% of companies believe their employees are sufficiently trained to use AI tools. (SAP, The Value of AI in the UK: Growth, People & Data report)

  • Key barriers to AI workforce readiness include (IDC Closing the Gap: Verifying AI Skills in the Enterprise):

    • Lack of talent (46%)

    • Data privacy concerns (43%)

    • Poor data quality (40%)

    • High implementation costs (40%)

    • Unclear ROI on AI programmes (26%)

  • A majority of UK employees who use artificial intelligence at work do so without any formal training: 70% experiment with tools on the job, while only 19% have actively taken AI courses (The Access Group).

  • More than a quarter of employees (26%) say they have no plans to use AI at all, while a further 8% are only in AI pilot programmes not yet accessible to them — representing a significant population at risk of being left behind. (Staffing Industry Analysis)

  • 62% of finance workers feel increased pressure to perform due to AI-driven data collection in the workplace (OECD, 2025).

  • According to a 2025 IBM CEO study, roughly 31% of the workforce may require retraining or reskilling over the next three years due to AI adoption. (IBM Study: CEOs Double Down on AI While Navigating Enterprise Hurdles)

  • Gartner predicts 80% of the engineering workforce will need significant upskilling by 2027 due to AI. By 2030, CIOs expect that no IT work will be done without AI involvement in some form. (Gartner Press Release, 2025)


AI Governance, Visibility and Risk

  • ERM decision-makers ranked their top five enterprise risks today as (Forrester Business Risk Survey, 2025):

    • Information security / cyber risk

    • Financial risk

    • AI risk

    • Data governance risk

    • Operational risk

  • Data Integrity Fears: 65% of security leaders cite the corruption of AI models and data integrity as a top security concern (Thales Data Threat Report, 2025).

  • New "model inversion" attacks can now infer sensitive personal data used to train AI systems by observing model outputs (UK ICO, 2025).

  • Attacks on the AI supply chain, including malicious models in public stores, are on the rise (ENISA ETL, 2025).

  • 44% of executives rank AI and data regulations in their top 3 factors driving them to rethink their company's short-term strategy, with 18% ranking it the single most important factor. (PwC Pulse Survey, 2025)


AI Investment


AI Use Cases

  • The top AI use case priorities for the next 12 months, ranked by the share of surveyed organisations including each in their top three (Informatica CDO Insights, 2026):

    • 29% enhancing customer experience and loyalty

    • 28% improving business‑intelligence analytics and decision‑making

    • 27% complying with regulatory and ESG standards

    • 26% enhancing employee collaboration and workflows

    • 26% optimising employee education / HR support

    • 25% optimising post‑sale customer support

    • 25% improving risk management / fraud prevention

    • 25% optimising internal business‑process efficiency

  • Looking ahead, agentic AI is expected to have the highest impact in customer support, supply chain management, R&D, knowledge management, and cybersecurity. (Deloitte State of AI in the Enterprise, 2026)

 

How to Create a Data Protection Plan


1. Define the scope of sensitive data

Start by clearly stating what data needs to be protected. This step should include collating and identifying the existing data types, associated with your business or a specific project, that require protection. Document the key data items generated, along with any copies of them that flow through different systems and functions. Before any controls can be designed, you first need to identify where exposure exists.
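
To make this concrete, here is a minimal, illustrative sketch of a discovery scan over sample records. The regex patterns and field names are assumptions for the example; production discovery tooling uses far more robust detection (checksums, context, ML classifiers).

```python
import re

# Illustrative patterns for a few common sensitive data types. These are
# simplified assumptions for the sketch, not production-grade detectors.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "uk_nino": re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
}

def scan_records(records):
    """Return a map of field name -> detected sensitive data types."""
    findings = {}
    for record in records:
        for field, value in record.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    findings.setdefault(field, set()).add(pii_type)
    return findings

# Scan a sample of rows exported from one system.
sample = [{"id": "1001", "contact": "jane.doe@example.com",
           "card": "4111 1111 1111 1111"}]
print(scan_records(sample))  # {'contact': {'email'}, 'card': {'card_number'}}
```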


2. Define the data protection risk assessment process

Now it’s time to define how data protection risks will be assessed. Explain how you will evaluate the likelihood and impact of sensitive data exposure, misuse, or unauthorised access.

For higher-risk processing activities, this process may include completing a formal Data Protection Impact Assessment (DPIA) or an equivalent risk analysis, such as a CIA score (Confidentiality, Integrity, and Availability).
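
As an illustration, a simple likelihood × impact scoring approach might look like the sketch below. The scales, bands, and thresholds are assumptions for the example, not a prescribed standard.

```python
# Illustrative risk scoring: likelihood (1-5) x impact (1-5), where impact
# is the highest of the Confidentiality, Integrity and Availability ratings.
def risk_score(likelihood, confidentiality, integrity, availability):
    impact = max(confidentiality, integrity, availability)
    score = likelihood * impact
    if score >= 15:
        band = "high: DPIA required; tokenize or encrypt"
    elif score >= 8:
        band = "medium: compensating controls and periodic review"
    else:
        band = "low: standard controls"
    return score, band

# Example: payment card data exposed to a broad analytics audience.
print(risk_score(likelihood=4, confidentiality=5, integrity=3, availability=2))
# -> (20, 'high: DPIA required; tokenize or encrypt')
```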


3. Define data access and authorisation rules

Not every user, system, or process needs access to sensitive data. Access to sensitive data must always follow the principle of "least privilege". Generally speaking, the C-suite should have full visibility into who can access which data, under what conditions, and for what purposes. This step turns data protection from an abstract goal into enforceable rules, a vital step in protecting an organisation from data breaches.
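
A minimal sketch of how such rules can be made enforceable is shown below. The roles, fields, and purposes are hypothetical; in practice, these rules would live in a central policy engine (such as Apache Ranger) rather than in application code.

```python
# Hypothetical least-privilege rules keyed by (role, field). In practice
# these would be defined in a central policy engine, not in code.
ACCESS_POLICY = {
    ("fraud_analyst", "card_number"): {"purpose": "fraud_investigation", "detokenize": True},
    ("marketing", "card_number"): {"purpose": "campaign_analytics", "detokenize": False},
}

def can_detokenize(role, field, purpose):
    """Allow detokenization only when role, field AND purpose all match."""
    rule = ACCESS_POLICY.get((role, field))
    return bool(rule and rule["detokenize"] and rule["purpose"] == purpose)

print(can_detokenize("fraud_analyst", "card_number", "fraud_investigation"))  # True
print(can_detokenize("marketing", "card_number", "campaign_analytics"))       # False
```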


4. Develop encryption and data security architecture

Protecting data both at rest and in transit is a no-brainer. This architectural step should involve designing infrastructure to use strong protection for stored data (e.g., AES-256 encryption, data masking, or vaultless tokenization), implementing secure transfer protocols (such as TLS) for data in motion, and ensuring that keys are managed securely.
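
For illustration, here is a minimal sketch of authenticated encryption at rest with AES-256-GCM using the Python cryptography package. In a real deployment, the key would come from a KMS or HSM rather than being generated inline.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Minimal AES-256-GCM sketch. In production the key comes from a KMS/HSM;
# generating and holding it inline like this is for illustration only.
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

nonce = os.urandom(12)                     # must be unique per message per key
plaintext = b"date_of_birth=1984-02-29"
associated_data = b"customer-record-1001"  # authenticated but not encrypted

ciphertext = aesgcm.encrypt(nonce, plaintext, associated_data)
recovered = aesgcm.decrypt(nonce, ciphertext, associated_data)
assert recovered == plaintext
```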


For organisations handling sensitive data in live production environments, Vaultless Tokenization is worth considering alongside or instead of data masking. The FAQs at the end of this article explain why.


5. Define monitoring and incident response processes

Finally, establish how data access and usage will be monitored over time. This should include defining logging, audit, and reporting requirements, as well as the steps to follow in the event of a data protection incident. As a bare minimum, your reporting or dashboard should address the “Five Ws”: Who did What to Whom, Where, and When. This practice should carry through the initial incident report, subsequent updates, and the post-incident report.
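
A minimal sketch of a structured “Five Ws” audit event is shown below. The field names and logger configuration are illustrative assumptions; align them with your own SIEM schema.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_access_audit")

def audit_event(who, what, whom, where):
    """Emit a structured log entry covering the Five Ws; 'when' is added
    automatically as a UTC timestamp."""
    event = {
        "who": who,        # actor: user or service identity
        "what": what,      # action, e.g. detokenize, export, query
        "whom": whom,      # data subject or data asset affected
        "where": where,    # system, table, or environment
        "when": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(event))

audit_event("svc-analytics", "detokenize:card_number",
            "customers.payments", "prod-cluster")
```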


6. Review and update the data protection plan regularly

Business use cases change over time. New AI risks will emerge, and existing data risks may evolve, so it’s important to review and update the data protection plan regularly to reflect changes in the organisation’s data environment and regulatory landscape. This includes how data is ingested, transformed, queried, shared, and used in analytics or AI models.


Understanding how data is actually used helps identify where exposure risk exists and where controls are required.


How SecureToken Vaultless Tokenization delivers strengthened data protection

SecureToken Vaultless Tokenization makes protecting sensitive data easier without disrupting how data and security teams work. With our solution in Cloudera, you can:


  • Protect sensitive data automatically: Instead of exposing plaintext data during queries or processing, tokenize sensitive fields in real time as data is accessed or transformed. Exposure risk is typically highest during operational workloads, and this ensures sensitive values are never shown to users who don’t need to see them.


  • Standardise protection across all data workloads: Enforce the same tokenization and detokenization controls across open-source processing and analytics engines, such as Apache Spark, Hive, and Impala, for batch, interactive SQL, and analytical use cases. Because these fast, large-scale computing engines are covered uniformly, your security scales automatically with the platform without moving data or introducing performance bottlenecks (a generic PySpark sketch follows this list).


  • Control who can see sensitive data: Decide which users, services, systems, or departments are allowed to access or reverse tokenized/redacted data, using encryption keys with Ranger KMS and centralised policy-based or role-based authorisation.


  • Preserve existing data usability for analytics: Format-preserving, reversible tokenization keeps data usable in its existing structure, so the analytics lifecycle - dashboards, reports, models, and pipelines - keeps working as expected. Your teams do not need to redesign data storage formats or application logic, and can move faster without adding compliance risk.


  • Gain immutable auditability across your security operations: SecureToken provides a management server that acts as the centralised service for managing audit trail entries related to data security operations. Every action, from tokenization/detokenization operations and policy governance to asset and key management, leaves an auditable trail. Combined with centralised key usage logs, you gain end-to-end visibility into who tokenizes or detokenizes which data asset, when, and for what purpose: a critical requirement for regulatory reporting and internal compliance.
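
As a rough illustration of the Spark-side pattern (not SecureToken’s actual integration), the sketch below applies a tokenization function to a sensitive column with a PySpark UDF. The tokenize_value function is a hypothetical placeholder standing in for a real vaultless tokenization call.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("tokenize-example").getOrCreate()

def tokenize_value(value):
    """Placeholder for a real vaultless tokenization call. It simply
    reverses the string so the example runs end to end."""
    return value[::-1] if value else value

tokenize_udf = udf(tokenize_value, StringType())

df = spark.createDataFrame(
    [("1001", "4111111111111111"), ("1002", "5500005555555559")],
    ["customer_id", "card_number"],
)

# Replace the sensitive column with its tokenized form before the data
# reaches downstream analytics users.
protected = df.withColumn("card_number", tokenize_udf("card_number"))
protected.show(truncate=False)
```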


Learn more about how SecureToken can help you strengthen data protection and enable secure analytics and AI by engaging with our team today.


FAQs


What is Vaultless Tokenization?

Vaultless Tokenization is a data protection method that replaces sensitive data values with non-sensitive tokens. Because the tokens are generated cryptographically, they do not need to be stored in a token database or vault. Since tokenized data carries no intrinsic value, it is typically not subject to the same handling requirements as the original sensitive data, which makes Vaultless Tokenization well suited to compliance-focused organisations responsible for handling sensitive data under frameworks such as the General Data Protection Regulation (GDPR), the Digital Operational Resilience Act (DORA), the EU Artificial Intelligence Act, and the Health Insurance Portability and Accountability Act (HIPAA).


What is Format-Preserving Encryption?

Format-Preserving Encryption (FPE) is a form of tokenization that maintains both the length and data type of the original sensitive value. A 16-digit Primary Account Number (PAN), for example, is tokenized as another 16-digit numeric value, and a date of birth is tokenized as a random value in the same date format.
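
To make the reversibility and format preservation concrete, here is a toy sketch of a format-preserving transform over even-length digit strings using a keyed Feistel network. This is illustrative only; production FPE uses the NIST-standardised FF1/FF3-1 algorithms with KMS-managed keys.

```python
import hashlib
import hmac

KEY = b"demo-key: use a real KMS-managed key in production"

def _feistel_round(half, round_no, mod):
    """Keyed pseudo-random function for one Feistel round."""
    digest = hmac.new(KEY, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % mod

def fpe_encrypt(digits, rounds=10):
    """Toy reversible, format-preserving transform over an even-length
    digit string. Illustrative only; NOT NIST FF1/FF3-1."""
    mid = len(digits) // 2
    mod = 10 ** mid
    left, right = digits[:mid], digits[mid:]
    for r in range(rounds):
        f = _feistel_round(right, r, mod)
        left, right = right, f"{(int(left) + f) % mod:0{mid}d}"
    return left + right

def fpe_decrypt(token, rounds=10):
    """Invert fpe_encrypt by running the rounds in reverse."""
    mid = len(token) // 2
    mod = 10 ** mid
    left, right = token[:mid], token[mid:]
    for r in reversed(range(rounds)):
        f = _feistel_round(left, r, mod)
        left, right = f"{(int(right) - f) % mod:0{mid}d}", left
    return left + right

pan = "4111111111111111"                              # 16-digit PAN
token = fpe_encrypt(pan)
assert token.isdigit() and len(token) == len(pan)     # format preserved
assert fpe_decrypt(token) == pan                      # reversible
print(pan, "->", token)
```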


FPE also supports partial tokenization: part of a string can be tokenized while another part remains in clear text. Common use cases include:


  • Reducing compliance scope and supporting broader compliance initiatives

  • Providing a structurally less disruptive alternative to standard AES-256 encryption

  • De-identifying data in development, test, cloud, and AI environments

  • Preventing unauthorised access to sensitive data by administrators, external actors, and other users without a business need


Why use tokenization instead of data masking?

Tokenization is the appropriate control when you need to protect highly sensitive data, including Personally Identifiable Information (PII), Protected Health Information (PHI), and payment card data, while retaining the ability to recover the original value. This is particularly important for AI systems running in live production within regulated environments.


Data masking works differently. Masked data is typically irreversible: the original value is permanently altered or destroyed. That makes masking suitable for testing and analytics use cases, but not for production workloads that depend on real transactions and decisions.
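
A tiny sketch of the difference, using an illustrative masking function alongside the toy FPE functions from the previous answer:

```python
def mask_pan(pan):
    """Irreversible masking: all but the last four digits are destroyed."""
    return "*" * (len(pan) - 4) + pan[-4:]

pan = "4111111111111111"
print(mask_pan(pan))  # ************1111 -- no function recovers the PAN

# Tokenization keeps a reversible, key-controlled mapping, e.g. with the
# toy FPE sketch above:
#   token = fpe_encrypt(pan)
#   assert fpe_decrypt(token) == pan  # recoverable by authorised key holders
```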


For AI use cases, the distinction is significant. AI systems operating in production may process or surface sensitive data. For regulated enterprises, even limited data exposure can carry serious compliance and reputational consequences. Vaultless Tokenization preserves the original value securely, recoverable only by authorised users, which means production AI workloads can operate without exposing the underlying data.


Is Vaultless Tokenization part of data protection and compliance requirements?

Vaultless Tokenization is not explicitly mandated by regulation. However, modern data protection requirements centre on minimising the presence of sensitive data wherever it is not operationally required. Vaultless Tokenization supports that principle across four practical areas.


Testing internal IT systems

By using Vaultless Tokenization to create tokenized replicas of production data, organisations can build a consistent testing environment across internal systems. Testing environments are typically less secure than production systems, and maintaining many separate environments, each containing sensitive data, multiplies risk and management overhead. A single tokenized testing environment reduces that exposure by providing the following benefits:


  • All systems can be tested against data that reflects real production structures, supporting accurate model training

  • There is no requirement to generate or maintain synthetic test data sets

  • Test data management is simplified

  • Performance testing reflects real-world conditions more accurately

  • Security exposure is reduced because fewer environments contain sensitive data, and fewer people require access to it


Data security and data usage

All critical data can be tokenized to maximise protection while continuing to support data usage requirements under regulations such as GDPR.


Critical incident reporting

Regulations frequently require critical incidents to be reported within short timeframes, in some cases within four hours. Where the data involved in an incident is tokenized, incident assessment and reporting are faster and simpler, reducing response time and regulatory risk.


Operational resilience

Tokenized data simplifies the creation of backup systems for critical operational infrastructure. The following table summarises recovery time expectations across key standards and frameworks.

| Standard / Framework | Recovery Time Expectation | Notes |
| --- | --- | --- |
| ISO/IEC 27001 | Defined by organisation (often ≤ 24 hours for critical systems) | Requires documented RTOs and DR testing; no fixed time mandated |
| ISO 22301 | 4–24 hours (critical services) | Strong emphasis on BIA-driven RTOs |
| NIST (SP 800-53 / 800-61) | As defined in IR/BCP plans (often ≤ 24 hours) | Focus on rapid containment and restoration |
| GDPR | No fixed RTO; availability must be restored “in a timely manner” | Focus on risk to data subjects rather than uptime |
| HIPAA | Reasonable and appropriate (often ≤ 24–72 hours) | Requires contingency and emergency mode plans |
| SOX | Typically ≤ 24 hours | Emphasis on integrity and auditability |
| FISMA | Defined in system security plans (often ≤ 24 hours) | Mandatory DR and continuity documentation |
| FFIEC | 2–4 hours (critical); ≤ 24 hours (important) | One of the strictest recovery expectations |
| DORA | “Near-zero disruption” for critical services | Requires resilience testing and rapid recovery |

 

How is SecureToken Vaultless Tokenization different from other vaultless tokenization, vaulted tokenization, or masking solutions?

SecureToken is a reversible, end-to-end vaultless tokenization solution that adds auditability and control. Unlike architectures that require sensitive data to leave your infrastructure for tokenization and then be returned for detokenization or further processing, SecureToken processes data within Cloudera, which means your data never needs to be moved or duplicated. In addition to being reversible, every tokenization and detokenization action is logged, and access is governed via role-based permissions. This creates a win-win for organisations that need strong data protection while continuing to comply with regulatory requirements in production environments.


