Elasticsearch / ELK @ Scale: 1.5 TB/Day
Todd Simpson, Satish Nair
ISO/SecOps department
Current use:
52 nodes, 100 machines for search
Upstream side:
- Net4 data, duplicates
- 1.5 TB of log data daily
- pipeline speed:
o on an average day, 25 different types of logs, searchable in real time within 30 seconds of arrival
o search through 10 billion records in about 5 seconds
- 3x the data that Splunk allowed
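To put the daily volume in perspective, the figures above can be turned into average rates. This is only a back-of-the-envelope sketch: it assumes decimal terabytes and a perfectly even load across the day, neither of which the talk stated, and real traffic will peak well above the average.

```python
# Rough averages derived from the figures in the notes.
# Assumptions (not from the talk): 1 TB = 10**12 bytes, load spread evenly over 24 h.
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_mb_per_s(bytes_per_day: float) -> float:
    """Average sustained ingest rate in MB/s for a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY / 10**6


ingest_mb_s = average_rate_mb_per_s(1.5 * TB)   # about 17.4 MB/s
records_per_s = 10_000_000_000 / 5              # 10 billion records in ~5 s

print(f"average ingest: {ingest_mb_s:.1f} MB/s")
print(f"search scan rate: {records_per_s:.1e} records/s")
```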
Architecture:
Annui -> Bro -> Kafka -> Logstash nodes -> Elastic cluster
Hardware
Annui – used by networking team (tcp/udp layer)
Bros - bro box
- network stream data
- protocol analyzer – understands protocol at a higher level
- can take data up to a 10gb limit
Kafka – virtual nodes
- queue to hold the data
- 2 copies of each piece of data
- running on Hyper-V, upgrading to ESXi
- amount of RAM and JVM size matter most
- limiting factor: JVM size
- suggestion: JVM size of about 31 GB, going higher doesn’t help much
- Kafka – 100 GB
- 70-90 GB of RAM currently in use
o 31 GB for the JVM
o rest for filesystem cache
- 7 TB of data storage per node
- 18 months of network data, a lot of data to store
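The memory split described above (a fixed JVM size, everything else left to the filesystem cache) can be sketched as a small helper. This is an illustration only: the function name is hypothetical, and it assumes the 100 GB figure is the RAM available on a node, which the notes do not state explicitly.

```python
# Sketch of the "31 GB for the JVM, rest for filesystem cache" split.
# JVM_CAP_GB comes from the suggestion in the notes; split_memory is a
# hypothetical helper, not part of any tool.
JVM_CAP_GB = 31.0


def split_memory(total_ram_gb: float, jvm_cap_gb: float = JVM_CAP_GB) -> dict:
    """Return how much RAM goes to the JVM and how much is left for cache."""
    if total_ram_gb < 0:
        raise ValueError("total_ram_gb must be non-negative")
    jvm_gb = min(jvm_cap_gb, total_ram_gb)  # never give the JVM more than the cap
    return {"jvm_gb": jvm_gb, "fs_cache_gb": total_ram_gb - jvm_gb}


# Assumed example: a node with 100 GB of RAM.
print(split_memory(100))
```

Adding RAM beyond the cap only grows the cache share, which matches the point that a larger JVM does not help much.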
Logstash
- pulls data from Kafka
- subscription model
- input – filter – output
- input -> which Kafka topics the node subscribes to
- not all nodes necessarily have the same filters
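The input – filter – output model can be sketched in plain Python. This is a toy simulation, not Logstash configuration and not a Kafka client: the topic names, the filter functions, and run_pipeline are all made up for illustration. It shows a node that only sees the topics it subscribes to and applies its own list of filters.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Event = Dict[str, str]
# A filter returns a (possibly modified) event, or None to drop it.
Filter = Callable[[Event], Optional[Event]]


def run_pipeline(
    messages: Iterable[Tuple[str, Event]],
    subscribed_topics: Set[str],
    filters: List[Filter],
) -> List[Event]:
    """Input: keep subscribed topics. Filter: apply each filter in order.
    Output: collect surviving events (a real node would ship them onward)."""
    output: List[Event] = []
    for topic, event in messages:
        if topic not in subscribed_topics:
            continue  # this node never subscribed to the topic
        current: Optional[Event] = dict(event)
        for apply_filter in filters:
            current = apply_filter(current)
            if current is None:
                break  # event dropped by a filter
        if current is not None:
            output.append(current)
    return output


def drop_debug(event: Event) -> Optional[Event]:
    """Hypothetical filter: discard debug-level events."""
    return None if event.get("level") == "debug" else event


def tag_node(event: Event) -> Optional[Event]:
    """Hypothetical filter: record which node processed the event."""
    return {**event, "node": "node-a"}


messages = [
    ("dns", {"msg": "query"}),
    ("http", {"msg": "GET /"}),
    ("dns", {"msg": "noise", "level": "debug"}),
]
result = run_pipeline(messages, {"dns"}, [drop_debug, tag_node])
print(result)
```

A second node could call run_pipeline with a different topic set and filter list, which is what "not all nodes have the same filters" amounts to.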
Hot Node - hypervisor
- Most analytics will require recent data
- SSD for recent data
- 1 replica of the data for redundancy
- hold 1 month of data on hot nodes
- Anything beyond 1 month is moved to warm node
Warm node - hypervisor
- Slower than hot node
- Data older than 1 month
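The hot/warm rule above boils down to an age check per index. A minimal sketch, assuming one index per day and approximating "1 month" as 30 days (both assumptions, the talk did not give the exact cutoff or index layout):

```python
from datetime import date, timedelta

# "1 month" approximated as 30 days; an assumption, not a figure from the talk.
HOT_RETENTION = timedelta(days=30)


def tier_for(index_date: date, today: date) -> str:
    """Return which tier an index created on index_date belongs to."""
    age = today - index_date
    return "hot" if age <= HOT_RETENTION else "warm"


today = date(2018, 1, 15)
print(tier_for(date(2018, 1, 1), today))    # recent data stays on SSD-backed hot nodes
print(tier_for(date(2017, 11, 1), today))   # older data belongs on warm nodes
```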
Kibana
Storage
- Centralized storage
- Not ideal, but easier management
- With 100 nodes each having its own disk, storage becomes difficult to handle when scaling up
How to calculate scaling for kafka?
- Blackbox, still need to explore
- Know when things get bad
- Don’t go up to that point
Why not use the Google Elastic platform?
- 18 months of data, close to a petabyte
- cost is huge, 3x that of locally owned hardware
- GSB quantum vs Google Cloud
o Google Cloud had better pricing
- ESXi cluster maintenance is minimal, not enough to justify moving to Google or AWS
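The "close to a petabyte" figure is consistent with the daily volume quoted earlier. A rough check, assuming raw log volume (index size at rest and replicas will differ) and an average month of 365/12 days:

```python
# Sanity check of the retention volume from the notes' own numbers.
# Assumptions: 1.5 TB/day of raw data, average month = 365 / 12 days,
# no compression, indexing overhead, or replicas included.
DAILY_TB = 1.5
RETENTION_MONTHS = 18
DAYS_PER_MONTH = 365 / 12

retention_days = RETENTION_MONTHS * DAYS_PER_MONTH
raw_tb = DAILY_TB * retention_days
raw_pb = raw_tb / 1000

print(f"{retention_days:.1f} days -> {raw_tb:.2f} TB (~{raw_pb:.2f} PB)")
```

At roughly 0.8 PB of raw data, any per-terabyte price difference between cloud and local hardware is multiplied accordingly.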
GSB
- Research analytics/research support
- On-premises and AWS computing. Moving to Google Cloud in the future
- Looking at Metricbeat; set up an instance on AWS, 1 TB of data
o Structured metric data
o InfluxDB supports time series data
o Ganglia DB
- How functional the new systems are for users
- Are people running batch jobs in environments that they shouldn’t be in
- Wrong sizes of EC2 instances that people have access to
How has the experience been so far?
- InfluxDB -> TICK stack
o Purpose-built for time series data
- Easier to work with, query language modeled on SQL
- Cleaner dashboards than Kibana
- JSON-based database vs a database built specifically for time series
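To make the last comparison concrete, here is the same metric sample written as a JSON document (the shape a JSON-based store would index) and as a line in InfluxDB's line protocol. This is a simplified sketch: the measurement, tag, and field names are invented, only float fields are handled, and escaping of special characters is ignored.

```python
import json
from typing import Dict


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, float],
    timestamp_ns: int,
) -> str:
    """Format one point as 'measurement,tag=value field=value timestamp'.
    Simplified: float fields only, no escaping of spaces or commas."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(
        f"{key}={float(value)}" for key, value in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample: CPU usage reported by one host.
timestamp_ns = 1514764800000000000  # 2018-01-01T00:00:00Z in nanoseconds
line = to_line_protocol("cpu", {"host": "node1"}, {"usage": 0.64}, timestamp_ns)
doc = json.dumps(
    {"@timestamp": "2018-01-01T00:00:00Z", "host": "node1", "cpu": {"usage": 0.64}},
    sort_keys=True,
)

print(line)
print(doc)
```

The time series form has a fixed shape (measurement, tags, fields, timestamp), while the JSON document can nest arbitrarily, which is the trade-off the notes point at.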
Problems and solutions
- ISO use only for data analysis
- Currently ISO reaches out to other departments for their data
o If further funding, and policy works out, this feature will be open to others
o Maybe allow departments to tap into the feature
- Size of the index at rest is an important consideration
- Early 2018 – maintenance mode
- No backups
o Data already on NetApp
o Replicated data
o Can rebuild a system in 10 minutes
o Snapshot option to store the index on cheap disk space
- Using Puppet and PowerShell scripts to spin up new nodes

