High-Risk Data Scanning Tools

Proposed By
Michael Olson
Number of Attendees
33
Where will the conversation continue?
Not likely
Summary
Discussed existing tools used to scan large volumes of data for restricted/sensitive information. Most are based on regular expressions and patterns. The goal is to balance sensitivity against false positives and to manage the adjudication of hits.
Notes

Michael Olson - Stanford Libraries, leading the session

  • Libraries contain sensitive information
    • Labs (2009+)
      • RWC
      • Green
  • Scanning for High Risk data
  • How to document results of scans?
    • images from disks
  • Problem:  Want to archive data but limit access to sensitive data
  • Tools
    • see links below
  • Volume
    • 190TB on server sanctioned for high risk

 

UIT - using Proofpoint to actively scan

  • scans emails already
  • OnPrem - called Data Discovery
    • scans UIT servers, etc.
  • Cloud Based
    • scans Google, Slack, Amazon, and a number of other services
    • lots of rules for common API keys and tokens for cloud resources
  • How does it work?
    • pattern matching with scoring
    • If score is high, can automate owner notification
  • Exact data matching
    • matching MRNs (medical record numbers) next to data...
  • Tool is paid for - if others want to use it, contact UIT for information
    • INTERESTED IN PROOFPOINT - CONTACT UIT!
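
The "pattern matching with scoring" idea noted above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the rule names, regular expressions, weights, the threshold, and the helper names `score_text` and `should_notify` are hypothetical and are not the vendor's actual rules or API. The sketch adds a weight for each pattern hit, adds a larger boost when a number exactly matches a known identifier (a rough stand-in for exact data matching), and flags the text for owner notification when the total crosses a cutoff.

```python
import re

# Hypothetical rules: (name, compiled regex, weight). Illustrative only.
RULES = [
    ("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 5),
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), 8),
    ("mrn_label", re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE), 4),
]

EXACT_MATCH_WEIGHT = 10  # assumed boost for a value found in a known-ID list
NOTIFY_THRESHOLD = 10    # assumed cutoff for automated owner notification


def score_text(text, known_mrns=frozenset()):
    """Return (score, hits) for one piece of text.

    hits is a list of (rule_name, matched_text) pairs that a human
    reviewer could use when adjudicating the result.
    """
    score = 0
    hits = []
    # Pattern matching: every regex hit adds that rule's weight.
    for name, pattern, weight in RULES:
        for match in pattern.finditer(text):
            score += weight
            hits.append((name, match.group(0)))
    # Exact data matching: a number that equals a known identifier
    # is much stronger evidence than a number that merely looks like one.
    for token in re.findall(r"\b\d{6,10}\b", text):
        if token in known_mrns:
            score += EXACT_MATCH_WEIGHT
            hits.append(("exact_mrn", token))
    return score, hits


def should_notify(score):
    """True when the score is high enough to notify the data owner."""
    return score >= NOTIFY_THRESHOLD


if __name__ == "__main__":
    sample = "Patient MRN: 12345678, SSN 123-45-6789"
    score, hits = score_text(sample, known_mrns={"12345678"})
    print(score, should_notify(score), hits)
```

Raising the weights or lowering the threshold makes the scan more sensitive at the cost of more false positives, which is the trade-off the session summary describes.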

 

Other tools:

  • Spirion (formerly Identity Finder)
    • Similar to Proofpoint
    • On hit, encrypts and leaves text pointer with instructions.

Questions / Discussions:

  • Do any tools offer API for incorporation into other tools?
    • Doesn't seem like it.
  • Working with students is a challenge as they are more likely to share sensitive information
  • A common issue is employees placing personal information (such as tax documents) on Stanford Google Drive and then sharing it outside of Stanford (e.g. with an accountant)
  • Are there tools for finding faces to blur/fade out prior to sharing?
    • Not sure...
    • must be some ML tools out there for this purpose
  • Is it possible to create 'guard rails' to encourage people?
    • goal of Proofpoint
    • Dashlane can help get around post-it notes for passwords
      • no audit data, no SSO, no expiration on shares
  • Image scanning for text?
    • On roadmap for Proofpoint
    • DICOM images have been largely scrubbed for identifiers as part of a 2019 project in ResearchIT (TDS)
    • Google has multiple ML models for this

 

Stanford Libraries High Risk scanning tools:

Tools the Libraries want to Test:

 Record your name and contact info HERE if you're interested in a 1.5-hour tour of the Libraries Digital Archeology Lab:

  • Name, email, slack.

Year
2019