Detecting and Protecting PII in AWS in 2025

Written by Patrick Davis, Principal Security Consultant

Cloud services offer on-demand, scalable computing resources, making them invaluable for handling large datasets, including sensitive information like Personally Identifiable Information (PII) and Protected Health Information (PHI). However, this convenience comes with risks, as these environments are prime targets for cybercriminals. Without adequate protection, these criminals can exploit PII to steal identities, commit fraud, or engage in other malicious activities. Therefore, protecting PII and PHI must be a paramount consideration when planning to store data in the cloud.

Thankfully, Amazon Web Services (AWS), as one of the world’s leading cloud providers, offers a wide range of tools for detecting and protecting PII and PHI. Many of these services leverage Artificial Intelligence (AI) to automate the detection and protection of sensitive data. Whether you’re using Amazon Macie to discover PII in S3 buckets, AWS Glue DataBrew to transform your datasets, or Amazon Comprehend to detect PII in text documents, AWS provides robust, native capabilities that go beyond encryption mechanisms. In 2025, a defense-in-depth approach to data security is more important than ever. In this blog, I’ll walk you through some basic compliance requirements and some practical ways to detect, redact, mask, and otherwise protect sensitive data using AWS’s native services.

PII and Compliance Requirements

So what is PII, and why should you care? The National Institute of Standards and Technology (NIST) defines PII in Special Publication (SP) 800-122 as “(1) any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information.” In short, PII is any information that can be used to identify a person, along with personal details that, if exposed, could be used to cause harm to that person.

While we have a duty to safeguard data under our care, there are also legal and industry regulations that require it. The General Data Protection Regulation (GDPR) protects the privacy of individuals in the European Union by governing how personal data is collected, processed, and stored, with heavy penalties for non-compliance. Like the GDPR, the California Consumer Privacy Act (CCPA) focuses on consumer privacy rights, in this case in the state of California; it gives individuals more control over their personal data and imposes obligations on businesses to ensure transparency and safeguard personal information. At the federal level in the United States, the Health Insurance Portability and Accountability Act (HIPAA) applies to healthcare organizations, requiring stringent safeguards to protect Protected Health Information (PHI).

Exposing PII carries severe consequences for both individuals and organizations. For the individual, a PII leak can result in identity theft, financial fraud, and personal harm, such as harassment or discrimination. For organizations, the exposure of PII can lead to substantial financial penalties, including fines and punitive damages due to non-compliance with privacy regulations, along with reputational damage and loss of customer trust. Additionally, organizations may face operational disruptions as they work to contain and mitigate breaches, as well as the risk of class-action lawsuits from affected individuals. The tangible and intangible costs of data breaches underscore the critical necessity of implementing strong measures to safeguard PII. One blog can’t possibly hope to cover all of the potential requirements that you will have in your organization, but this list gives us a great place to start with our walkthrough of AWS AI and Data Security services.

Key Services

Amazon Comprehend

Amazon Comprehend is a tool designed to help organizations make sense of unstructured text using natural language processing (NLP). One of its standout features is the ability to automatically identify and classify PII, such as names, addresses, or credit card numbers, in documents. It also supports multiple languages, which makes it especially useful for businesses managing global operations. By automating the process of detecting PII, Comprehend reduces the chance of human error and ensures that critical data is flagged for protection before it’s stored or shared. Whether you’re dealing with customer support logs, legal documents, or other large volumes of text, Comprehend helps you analyze the information quickly and accurately.
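As a rough illustration of what that detection looks like from code, here is a minimal sketch using Boto3's `detect_pii_entities` call. The helper names (`detect_pii`, `confident_entities`, `describe`) and the 0.8 score threshold are assumptions made for this example, not part of any AWS API. Comprehend returns each entity with a type, a confidence score, and begin and end offsets into the submitted text.

```python
def detect_pii(text, language_code="en", client=None):
    """Call Comprehend's DetectPiiEntities and return the list of entities."""
    if client is None:
        import boto3  # imported lazily so the helpers below work without AWS access
        client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]


def confident_entities(entities, min_score=0.8):
    """Keep only entities whose confidence score meets the (assumed) threshold."""
    return [e for e in entities if e["Score"] >= min_score]


def describe(text, entities):
    """Pair each entity type with the text it covers, using the returned offsets."""
    return [(e["Type"], text[e["BeginOffset"]:e["EndOffset"]]) for e in entities]
```

Filtering on the score is a design choice: a lower threshold redacts more aggressively at the cost of more false positives, which is usually the safer side to err on for PII.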

Amazon Macie

Amazon Macie is AWS’s fully managed data monitoring solution that uses AI to automatically discover, classify, and protect sensitive data in S3. It can identify a wide range of data types including PII, financial data, credentials, and even custom data defined by the customer. By continuously monitoring S3 buckets, Macie helps detect misconfigurations or unauthorized data exposure and alerts security teams to take immediate action.
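For a sense of how a Macie scan is started programmatically, the sketch below builds a one-time classification job for a single bucket and submits it with Boto3's `macie2` client. It assumes Macie is already enabled in the account; the job name, account ID, and bucket name are placeholders, and `build_macie_job` and `start_macie_job` are hypothetical helper names.

```python
def build_macie_job(account_id, bucket, name="pii-discovery-one-time"):
    """Build the request for a one-time sensitive data discovery job on one bucket."""
    return {
        "jobType": "ONE_TIME",
        "name": name,  # placeholder job name
        "s3JobDefinition": {
            "bucketDefinitions": [{"accountId": account_id, "buckets": [bucket]}]
        },
    }


def start_macie_job(account_id, bucket):
    """Submit the job; requires Macie to be enabled and suitable IAM permissions."""
    import uuid
    import boto3  # lazy import so build_macie_job can be used without AWS access

    macie = boto3.client("macie2")
    response = macie.create_classification_job(
        clientToken=str(uuid.uuid4()),  # idempotency token
        **build_macie_job(account_id, bucket),
    )
    return response["jobId"]
```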

AWS Glue DataBrew

AWS Glue DataBrew is a visual data preparation tool that helps you clean, transform, and enrich data without the need to write code. It allows you to identify and handle sensitive data, including PII, by applying transformations such as masking, redaction, or tokenization. With over 250 pre-built transformations, DataBrew simplifies ETL (extract, transform, load) processes and ensures sensitive data is protected before it reaches your analytics or machine learning pipelines. This makes it a valuable tool for organizations looking to secure data at scale without relying solely on manual processes.

[Figure: Lambda function chart]

A Simple Detection and Redaction Pipeline

There are several options for detection pipelines, ranging from simple pipelines like the one I’m going to demonstrate to more elaborate configurations that require advanced data transformation. We’ll make a couple of assumptions for this demonstration:

  1. The bucket is known to store PII data.
  2. Adequate access controls and encryption are applied to the source bucket.
  3. The data must be redacted for use elsewhere.

Setup

  1. Infrastructure
    • The pipeline infrastructure is deployed using Terraform (Infrastructure as Code) to allow for easy updates and destruction when complete.
    • In this case, only a single bucket is used, but raw and processed data are split into unredacted/ and redacted/ folders.
  2. Data
    • Data is generated using Faker.
    • 10,000 total fake records, 100 CSVs with 100 records each, containing SSNs, birthdays, bank accounts, and other information.
  3. Lambda Code
    • Runtime: Python 3.12
    • Workflow
      • An event triggers the function.
      • The function retrieves the S3 object and text.
      • The function submits the text to Comprehend to detect PII.
      • The function uses the returned entities to locate and redact the detected PII in the text.
      • The function writes the redacted file to a new location.
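The Lambda workflow listed above can be sketched roughly as follows. This is not the code from the linked Gist; it is a minimal illustration that assumes UTF-8 CSV text, English-language content, and same-length masking with asterisks. The function names `redact_text` and `target_key` are made up for this example, and the optional `s3` and `comprehend` parameters exist only so the handler can be exercised with stand-in clients.

```python
import urllib.parse

SOURCE_PREFIX = "unredacted/"  # folder names taken from the setup above
TARGET_PREFIX = "redacted/"


def redact_text(text, entities, mask_char="*"):
    """Replace each detected span with mask characters of the same length."""
    chars = list(text)
    for entity in entities:
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask_char
    return "".join(chars)


def target_key(source_key):
    """Map unredacted/<name> to redacted/<name>."""
    if not source_key.startswith(SOURCE_PREFIX):
        raise ValueError(f"unexpected key outside {SOURCE_PREFIX}: {source_key}")
    return TARGET_PREFIX + source_key[len(SOURCE_PREFIX):]


def lambda_handler(event, context, s3=None, comprehend=None):
    if s3 is None or comprehend is None:
        import boto3  # lazy import keeps the helpers usable without AWS access
        s3 = s3 or boto3.client("s3")
        comprehend = comprehend or boto3.client("comprehend")

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        entities = comprehend.detect_pii_entities(
            Text=body, LanguageCode="en"
        )["Entities"]
        redacted = redact_text(body, entities)
        s3.put_object(
            Bucket=bucket, Key=target_key(key), Body=redacted.encode("utf-8")
        )
```

Masking with the same number of characters keeps the CSV layout and the remaining offsets intact, so the entities can be applied in any order.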

Pipeline Walkthrough

1. The first step for any pipeline is to identify files requiring redaction. In this case, we assume all files in the source directory contain PII, so they will require redaction. For this use case, we will configure an S3 Bucket Notification targeting the Lambda function on Object creation.

2. When a notification triggers the function, the text is submitted to Comprehend for PII detection using Boto3 (AWS Python SDK). Comprehend uses a built-in AI model to detect PII and return Entities that are used to determine which text to redact.

3. The function then redacts the Entities and saves the new file to the new bucket (or, in this case, the redacted/ folder).
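The bucket notification from step 1 is deployed with Terraform in this walkthrough, but the same configuration can be sketched in Boto3 to show its shape. Because the raw and redacted files share one bucket, the sketch filters on the unredacted/ prefix so that writing a redacted file does not trigger the function again. The `.csv` suffix filter and the helper names are assumptions for illustration.

```python
def build_notification_config(function_arn, prefix="unredacted/", suffix=".csv"):
    """Notify the Lambda function only for new objects under the raw-data prefix."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }


def apply_notification_config(bucket, function_arn):
    """Apply the configuration. This replaces the bucket's existing notifications."""
    import boto3  # lazy import so the builder can be used without AWS access

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(function_arn),
    )
```

S3 also needs permission to invoke the function (a resource-based policy on the Lambda), which is not shown here.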

While this is a simple example, it shows what’s possible using AWS services like Lambda and Comprehend to detect PII in your data. Pipeline infrastructure and code: Gist
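One thing a simple pipeline like this can run into is input size: Comprehend's synchronous PII call caps the size of the text it accepts (the API reference lists 100 KB, but treat the exact figure as an assumption and check the current documentation). A hedged sketch of one way around it is to split the file on line boundaries, submit each chunk, and shift the returned offsets back into positions in the full text. The 90,000-byte budget and the function names are illustrative choices, and offsets are assumed to be character-based.

```python
def chunk_lines(text, max_bytes=90_000):
    """Yield (start_offset, chunk) pairs, splitting only on line boundaries.

    A single line larger than max_bytes is still emitted as its own chunk.
    """
    start = 0
    chunk, size = [], 0
    for line in text.splitlines(keepends=True):
        line_bytes = len(line.encode("utf-8"))
        if chunk and size + line_bytes > max_bytes:
            yield start, "".join(chunk)
            start += sum(len(part) for part in chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield start, "".join(chunk)


def detect_pii_chunked(text, client, language_code="en", max_bytes=90_000):
    """Run PII detection chunk by chunk and return entities with global offsets."""
    entities = []
    for start, chunk in chunk_lines(text, max_bytes):
        response = client.detect_pii_entities(Text=chunk, LanguageCode=language_code)
        for entity in response["Entities"]:
            # Shift chunk-relative offsets back to positions in the full text.
            entities.append({
                **entity,
                "BeginOffset": entity["BeginOffset"] + start,
                "EndOffset": entity["EndOffset"] + start,
            })
    return entities
```

Splitting on line boundaries keeps each CSV record whole, so an SSN or account number is never cut in half between two requests.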

Data Security Best Practices

Encryption

Data stored in the cloud should always be encrypted in transit and at rest. AWS Key Management Service (KMS) and AWS Certificate Manager (ACM) help with this and integrate with most of AWS’s service offerings. If you need dedicated hardware security modules, AWS CloudHSM is also an option.
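For an S3 bucket like the one in the pipeline, both halves of that advice can be sketched in a few lines: default SSE-KMS encryption for data at rest, and a bucket policy that rejects requests not sent over TLS for data in transit. The bucket name and key ARN are placeholders, and the helper names are assumptions for this example.

```python
import json


def default_encryption_config(kms_key_arn):
    """Default SSE-KMS encryption for new objects written to a bucket."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                "BucketKeyEnabled": True,
            }
        ]
    }


def tls_only_policy(bucket):
    """Bucket policy that denies any request not sent over TLS."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [arn, f"{arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def apply_encryption_controls(bucket, kms_key_arn):
    import boto3  # lazy import: the builders above need no AWS access

    s3 = boto3.client("s3")
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_arn),
    )
    # Note: put_bucket_policy replaces any existing bucket policy.
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(tls_only_policy(bucket)))
```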

Access Control

Use role-based access control, and implement least-privilege permissions across your environment. This ensures that only the roles that need a dataset can reach it, and that they can perform only the actions they require. AWS Identity and Access Management (IAM) lets you attach granular policies to roles and assign those roles to users and resources.
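As an example of what least privilege could look like for the redaction function's execution role, the sketch below generates a policy that allows reading only from unredacted/, writing only to redacted/, and calling only Comprehend's PII detection action. The bucket name is a placeholder and `redaction_role_policy` is a hypothetical helper. CloudWatch Logs permissions and any KMS key permissions are deliberately left out and would need to be added.

```python
import json


def redaction_role_policy(bucket, source_prefix="unredacted/", target_prefix="redacted/"):
    """Build a minimal IAM policy document for the redaction Lambda's role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadRawObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{source_prefix}*",
            },
            {
                "Sid": "WriteRedactedObjects",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{target_prefix}*",
            },
            {
                "Sid": "DetectPii",
                "Effect": "Allow",
                "Action": "comprehend:DetectPiiEntities",
                # This action is not scoped to a specific resource.
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(redaction_role_policy("example-pii-bucket"), indent=2))
```

Note that the role can read raw data but cannot overwrite it, and can write redacted data but has no delete permission on either prefix.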

Regular Reviews

Perform regular audits and compliance checks, preferably continuously, to ensure your data stays secure and your organization remains compliant. AWS offers many services that help in this respect, including but not limited to AWS CloudTrail for audit logging, AWS Config for continuous configuration compliance checks, and AWS Audit Manager for collecting evidence and performing audits.

Future Trends

Looking ahead, we can expect AWS to continue to innovate in data protection by integrating more AI/ML capabilities like those in Amazon Comprehend, including support for more languages and sensitive data types. You can also expect AWS to add PII detection to more of its services, as it already has with Amazon SNS. All in all, we will see more support for protecting data end-to-end and for ensuring compliance with industry and regulatory requirements like PCI DSS and HIPAA as those are updated.

Conclusion

With services like Amazon Comprehend, AWS aims to make the task of detecting and protecting PII easier for you than ever. With data breaches becoming increasingly prevalent and larger in scope every year, having native solutions like this in our pocket is crucial. AWS provides the tools you need to classify and protect data–the only thing left is to do it. With so many options, it’s great to have someone in your corner who can assess your environment and provide guidance on how to best protect your data. At HanaByte, we have decades of experience building and securing infrastructure. If you’re interested in learning more, contact us today!
