High-Risk Data Scanning Tools

Proposed By

Michael Olson

Number of Attendees

Where will the conversation continue?

Not likely

Summary

Discussed existing tools used to scan large volumes of data for restricted/sensitive information. Most are based on regular expressions and patterns. The goal is to balance sensitivity against false positives and to manage the adjudication of hits.

Notes

Michael Olson - Stanford Libraries, leading the session

Libraries contain sensitive information
- Labs (2009+)
  - RWC
  - Green
Scanning for High Risk data
How to document results of scans?
- images from disks
Problem: Want to archive data but limit access to sensitive data
Tools
- see links below
Volume
- 190TB on server sanctioned for high risk

UIT - using Proofpoint to actively scan

scans emails already
On-prem - called Data Discovery
- scans UIT servers, etc.
Cloud-based
- scans Google / Slack / Amazon / and a bunch more
- lots of rules for common API keys and tokens for cloud resources
How does it work?
- pattern matching with scoring
- If score is high, can automate owner notification
Exact data matching
- matching MRNs (medical record numbers) next to data...
Tool is paid for - if others want to use it, contact UIT for information
- INTERESTED IN PROOFPOINT - CONTACT UIT!
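
The "pattern matching with scoring" and "exact data matching" ideas noted above can be sketched roughly as follows. This is a minimal illustration, not how the UIT tool is implemented: the rule names, weights, the `NOTIFY_THRESHOLD` cutoff, and the `scan_text` function are all hypothetical, and exact data matching is assumed here to mean comparing numeric tokens against a set of known real values (such as MRNs).

```python
import re
from dataclasses import dataclass

# Hypothetical rules (name, pattern, weight); not taken from any real product.
RULES = [
    ("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 5),
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), 8),
    ("keyword", re.compile(r"\b(?:ssn|social security|mrn|patient)\b", re.I), 2),
]
NOTIFY_THRESHOLD = 10  # assumed cutoff for automated owner notification


@dataclass
class ScanResult:
    score: int
    hits: dict
    exact_matches: set
    notify_owner: bool


def scan_text(text, known_values=frozenset(), exact_weight=10):
    """Score text by weighted pattern hits plus exact matches of known values."""
    hits = {}
    score = 0
    for name, pattern, weight in RULES:
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
            score += weight * count

    # Exact data matching (assumed form): numeric tokens that equal a known
    # real value, e.g. an MRN from a reference list, add a larger weight.
    tokens = set(re.findall(r"\b\d{6,10}\b", text))
    exact = tokens & set(known_values)
    score += exact_weight * len(exact)

    return ScanResult(score, hits, exact, score >= NOTIFY_THRESHOLD)


if __name__ == "__main__":
    result = scan_text(
        "Patient MRN 12345678, SSN 123-45-6789",
        known_values={"12345678"},
    )
    print(result.score, result.hits, result.notify_owner)
```

Raising the weights or lowering the threshold catches more sensitive content at the cost of more false positives, which is the trade-off described in the summary.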

Other tools:

Spirion (formerly Identity Finder)
- Similar to Proofpoint
- On hit, encrypts and leaves a text pointer with instructions.

Questions / Discussions:

Do any tools offer API for incorporation into other tools?
- Doesn't seem like it.
Working with students is a challenge as they are more likely to share sensitive information
A common issue is employees placing personal information (such as tax documents) on Stanford Google Drive and then sharing it outside of Stanford (e.g., with an accountant)
Are there tools for finding faces to blur/fade out prior to sharing?
- Not sure...
- must be some ML tools out there for this purpose
Is it possible to create 'guard rails' to encourage safer handling of data?
- goal of Proofpoint
- Dashlane can help get around post-it notes for passwords
  - no audit data, no SSO, no expiration on shares
Image scanning for text?
- On roadmap for Proofpoint
- DICOM images have been largely scrubbed for identifiers as part of 2019 project in ResearchIT (TDS)
- Google has multiple ML models for this

Stanford Libraries High Risk scanning tools:

BitCurator
Bulk Extractor
- multithreaded / customizable / pretty powerful / limited GUI
AccessData Forensic Toolkit
ePADD
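
On the question of documenting scan results, one option is to summarize a scanner's output into counts that a reviewer can adjudicate. The sketch below assumes Bulk Extractor-style feature files, i.e. tab-separated lines of offset, feature, and context, with comment lines starting with `#`; that layout should be checked against the tool's own documentation. The function name `summarize_feature_lines` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_feature_lines(lines):
    """Count how often each feature appears in feature-file style lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # not an offset/feature(/context) record
        counts[parts[1]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        "# Feature-Recorder: email",
        "1024\talice@example.org\tcontact alice@example.org today",
        "2048\tbob@example.org\tcc bob@example.org",
        "4096\talice@example.org\tfrom alice@example.org",
    ]
    for feature, count in summarize_feature_lines(sample).most_common():
        print(count, feature)
```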

Tools the Libraries want to Test:

Bulk Reviewer

Record your name and contact info HERE if you're interested in a 1.5 hour tour of the Libraries Digital Archeology Lab:

Name, email, slack.

Year

2019