Elasticsearch/ELK@Scale: 1.5 TB/day

Elasticsearch usage by ISO: use cases, implementation, and challenges so far. The session goes over the current architecture and future plans for scaling it horizontally and vertically.

Notes

Todd Simpson, Satish Nair

ISO/SecOps department

Current use:

52 nodes, 100 machines for search

Upstream side:

- Net4 data, duplicates

- 1.5 TB of log data daily

- how fast the pipeline can be

o on an average day, 25 different types of logs, searchable in near real time (within 30 seconds)

o search through 10 billion records in about 5 seconds

- 3x the data that Splunk allowed
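A rough sanity check of the figures above, as a sketch only. It assumes decimal units (1 TB = 10**12 bytes) and a load spread evenly over the day, neither of which the notes state; real traffic is likely burstier.

```python
# Back-of-the-envelope rates from the figures in these notes.
# Assumptions: decimal units and an evenly spread daily load.
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 1.5e12        # 1.5 TB of log data per day
records_searched = 10e9     # 10 billion records
search_seconds = 5          # "about 5 seconds"

avg_ingest_mb_per_s = daily_bytes / SECONDS_PER_DAY / 1e6
scan_rate_per_s = records_searched / search_seconds

print(f"average ingest: {avg_ingest_mb_per_s:.1f} MB/s")
print(f"search scan rate: {scan_rate_per_s:.1e} records/s")
```

The average is only a floor for sizing; the pipeline has to absorb peaks well above it.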

Architecture:

Annui -> Bro -> Kafka -> Logstash nodes -> Elasticsearch cluster

Hardware

Annui – used by networking team (tcp/udp layer)

Bro – Bro box

- network stream data

- protocol analyzer – understands protocols at a higher level

- can handle up to a 10 Gb limit of data

Kafka – virtual nodes

- queue to hold the data

- 2 copies of each data

- running on Hyper-V, upgrading to ESXi

- amount of RAM and JVM size are the most important factors

- limiting factor: JVM size

- suggestion: JVM size of about 31 GB; going higher doesn’t help much

- Kafka – 100 GB

- 70-90 GB of RAM currently in use

o 31 GB for the JVM

o rest for filesystem cache

- 7 TB of data storage per node

- 18 months of network data, a lot of data to store
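The memory split and retention numbers above can be worked through as follows. This is a sketch under assumptions: the 70-90 GB figure is read as per-node RAM, a month is taken as about 30.4 days, and the retained volume ignores replicas, compression, and index overhead. A commonly cited reason for staying near 31 GB is that JVM heaps below roughly 32 GB can still use compressed object pointers.

```python
# Assumed reading of the notes: 70-90 GB of RAM in use per node,
# 31 GB of it JVM heap, the remainder left to the OS filesystem cache.
JVM_HEAP_GB = 31


def page_cache_gb(ram_gb: float, heap_gb: float = JVM_HEAP_GB) -> float:
    """RAM left for the filesystem cache after the JVM heap is carved out."""
    if ram_gb < heap_gb:
        raise ValueError("RAM is smaller than the JVM heap")
    return ram_gb - heap_gb


for ram in (70, 90):
    print(f"{ram} GB RAM -> {page_cache_gb(ram)} GB for filesystem cache")

# Retention: 18 months at 1.5 TB/day, single copy, no compression
# or indexing overhead (all simplifying assumptions).
DAYS_PER_MONTH = 30.4
retained_tb = 1.5 * 18 * DAYS_PER_MONTH
print(f"raw retained volume: about {retained_tb:.0f} TB")
```

The result lands in the high hundreds of terabytes, which is in line with the "close to a petabyte" estimate given later in these notes.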

Logstash

- pulls data from Kafka

- subscription model

- input – filter – output

- input -> which Kafka topics to subscribe to

- not all nodes might have the same filters
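The input, filter, output model can be illustrated with a small Python sketch. This is not Logstash configuration and not the team's pipeline: the topic names (`conn`, `dns`, `http`), the `uid` field, and all function names are hypothetical, and an in-memory list stands in for Kafka.

```python
import json
from typing import Callable, Iterable, Iterator, Optional

Event = dict
Filter = Callable[[Event], Optional[Event]]


def read_input(messages: Iterable[tuple], topics: set) -> Iterator[Event]:
    """Input stage: keep only messages from the subscribed topics."""
    for topic, payload in messages:
        if topic in topics:
            event = json.loads(payload)
            event["topic"] = topic
            yield event


def drop_duplicates() -> Filter:
    """Filter stage: drop events whose (hypothetical) 'uid' was already seen."""
    seen = set()

    def _filter(event: Event) -> Optional[Event]:
        uid = event.get("uid")
        if uid in seen:
            return None
        seen.add(uid)
        return event

    return _filter


def tag_pipeline(event: Event) -> Event:
    """Filter stage: enrich the event with a static field."""
    event["pipeline"] = "sketch"
    return event


def run_pipeline(messages, topics, filters, output) -> None:
    """Pass each subscribed event through the filters, then to the output."""
    for event in read_input(messages, topics):
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:      # a filter dropped the event
                break
        else:
            output(event)


messages = [
    ("conn", '{"uid": "a1", "proto": "tcp"}'),
    ("conn", '{"uid": "a1", "proto": "tcp"}'),          # duplicate
    ("dns", '{"uid": "b2", "query": "example.org"}'),
    ("http", '{"uid": "c3"}'),                          # not subscribed
]
sink = []
run_pipeline(messages, {"conn", "dns"}, [drop_duplicates(), tag_pipeline], sink.append)
print(sink)
```

Because each node gets its own topic set and filter list, different nodes can run different filters, as noted above.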

Hot Node - hypervisor

- Most analytics will require recent data

- SSD for recent data

- 1 replica of the data for redundancy

- hold 1 month of data on hot nodes

- Anything beyond 1 month is moved to warm node

Warm node - hypervisor

- Slower than hot node

- Data older than 1 month
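The hot/warm split can be sketched as an age check plus an index settings body. Assumptions: "1 month" is approximated as 30 days, indices are daily, and nodes are tagged with a custom attribute named `box_type` (the name is a convention, not something these notes confirm). The setting key follows Elasticsearch's index-level shard allocation filtering, `index.routing.allocation.require.<attribute>`.

```python
from datetime import date, timedelta

HOT_RETENTION = timedelta(days=30)   # "1 month", approximated as 30 days


def tier_for(index_day: date, today: date) -> str:
    """Return 'hot' for indices within the retention window, else 'warm'."""
    return "hot" if today - index_day <= HOT_RETENTION else "warm"


def allocation_settings(tier: str) -> dict:
    """Settings body that pins an index to nodes carrying the matching
    custom `box_type` attribute (assumed attribute name)."""
    return {"index.routing.allocation.require.box_type": tier}


today = date(2018, 1, 20)
for index_day in (date(2018, 1, 1), date(2017, 11, 1)):
    tier = tier_for(index_day, today)
    print(index_day.isoformat(), tier, allocation_settings(tier))
```

Applying the "warm" settings to an index older than the window is what would cause its shards to relocate from the SSD-backed nodes to the slower ones.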

Kibana

Storage

- Centralized storage

- Not ideal, but easier management

- With 100 nodes each having its own disk, it is difficult to manage when scaling up

How to calculate scaling for Kafka?

- Black box, still need to explore

- Know when things get bad

- Don’t go up to that point
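One way to make "know when things get bad, don't go up to that point" concrete is to record throughput against consumer lag and keep an operating ceiling below the first bad point. A minimal sketch; the sample numbers, the lag threshold, and the 75% headroom factor are all made up for illustration.

```python
from typing import Optional


def saturation_point(samples, max_lag: int) -> Optional[float]:
    """samples: (throughput_mb_s, consumer_lag_messages) pairs, any order.
    Return the lowest throughput at which lag exceeded max_lag, or None."""
    bad = [throughput for throughput, lag in samples if lag > max_lag]
    return min(bad) if bad else None


def safe_ceiling(samples, max_lag: int, headroom: float = 0.75) -> Optional[float]:
    """Operating ceiling: a fraction of the observed saturation point."""
    point = saturation_point(samples, max_lag)
    return None if point is None else point * headroom


# Hypothetical observations from a load test.
samples = [(10, 0), (20, 50), (40, 400), (60, 90_000), (80, 2_000_000)]
print("saturates at:", saturation_point(samples, max_lag=10_000), "MB/s")
print("stay below:", safe_ceiling(samples, max_lag=10_000), "MB/s")
```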

Why not use the Google Elastic platform?

- 18 months of data, close to a petabyte

- cost is huge, about 3x that of locally owned hardware

- GSB quantum vs Google Cloud

o Google Cloud had better pricing

- ESXi cluster maintenance is very minimal, not enough to justify moving to Google or AWS

GSB

- Research analytics/research support

- On-premises and AWS computing. Moving to Google Cloud in the future

- Looking at Metricbeat; set up an instance on AWS, 1 TB of data

o Structured metric data

o InfluxDB supports time series data

o Ganglia db

- How functional the new systems are for users

- Are people running batch jobs in environments that they shouldn’t be in

- Wrong sizes of EC2 instances that people have access to

How has the experience been so far?

- InfluxDB -> TICK stack

o Purpose-built for time series data

- Easier to work with, query language modeled on SQL

- Cleaner dashboard than Kibana

- JSON-based database vs a time series database built specifically for time series
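To illustrate the JSON-versus-time-series contrast, the same metric can be rendered as a JSON document and as an InfluxDB line protocol point. A simplified sketch: the measurement, tag, and field names are hypothetical, the JSON shape is just one plausible layout, and the line protocol helper handles only float fields with no escaping.

```python
import json


def as_json_document(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """One plausible JSON shape for a document store (hypothetical layout)."""
    doc = {"measurement": measurement, "timestamp_ns": ts_ns, **tags, **fields}
    return json.dumps(doc, sort_keys=True)


def as_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """InfluxDB line protocol: measurement[,tags] fields timestamp.
    Simplified: float fields only, no escaping of spaces or commas."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


point = ("cpu", {"host": "server01"}, {"usage_idle": 92.5}, 1500000000000000000)
print(as_json_document(*point))
print(as_line_protocol(*point))
```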

Problems and solutions

- ISO use only for data analysis

- Currently ISO reaches out to other departments for their data

o If further funding and policy work out, this feature will be opened to others

o Maybe allow departments to tap into the feature

- Size of the index at rest is important

- Early 2018 – maintenance mode

- No backups

o Data already on NetApp

o Replicated data

o Can rebuild a system in 10 minutes

o Snapshot option to store the index on cheap disk space

- Using Puppet and PowerShell scripts to spin up new nodes
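The snapshot option mentioned under "No backups" could look roughly like the following. The sketch only builds the two Elasticsearch snapshot API requests (register a shared-filesystem repository, then snapshot a set of indices) and sends nothing. The repository name, mount path, snapshot name, and index pattern are all hypothetical, and an `fs` repository location also has to be allowed via `path.repo` on the nodes.

```python
import json


def register_repo_request(repo: str, location: str) -> tuple:
    """Request that registers a shared-filesystem snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", body)


def snapshot_request(repo: str, snapshot: str, index_pattern: str) -> tuple:
    """Request that snapshots the indices matching index_pattern."""
    body = {"indices": index_pattern, "include_global_state": False}
    return ("PUT", f"/_snapshot/{repo}/{snapshot}", body)


# Hypothetical names; the location stands in for the cheap disk space.
requests_to_send = [
    register_repo_request("cheap_disk", "/mnt/cheap_disk/es_snapshots"),
    snapshot_request("cheap_disk", "logs-2018.01", "logs-2018.01.*"),
]
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body))
```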