The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
Nisha Talagala, Parallel Machines
INFLOW 2016


TRANSCRIPT

Page 1: Nisha talagala keynote_inflow_2016

The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence

Nisha Talagala, Parallel Machines
INFLOW 2016

Page 2

Storage Evolution & Application Evolution Combined

Storage media: Disk & Tape, Flash, DRAM, Persistent Memory

Deployment: Geographically Distributed, Clustered, Local

Access interfaces: Key-Value; File, Object; Block

Data Management: Classic Enterprise (Transactions, Business Intelligence, Search, etc.) and Advanced Analytics (Machine Learning, Cognitive Functions)

Page 3

In this talk

• What are the new data apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning

• What are their salient characteristics when it comes to storage and memory?

• How is storage optimized for these apps today?

• Opportunities for the storage stack?

Page 4

Growing Sources of Data

Teaching Assistants
Elderly Companions
Service Robots
Personal Social Robots
Smart Cities
Robot Drones
Smart Homes
Intelligent Vehicles
Personal Assistants (bots)
Smart Enterprise

Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA

Page 5

The Application Evolution

"Killer" use cases
  Classic Enterprise (Transactions, Business Intelligence): OLTP, ERP, Email, eCommerce, Messaging, Social Networks, Content Delivery
  Advanced Analytics: Discovery of solutions and capabilities, Risk Assessment, Improving customer experience, Comprehending sensory data

Key functions
  Classic Enterprise: RDBMS, BI, Fraud detection, Databases, Social Graphs
  Advanced Analytics: SQL and ML Analytics, Streaming, Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics

Data Types
  Classic Enterprise: Structured, Unstructured, Transactional
  Advanced Analytics: Streaming, Mixed, Graphs, Matrices

Storage Types
  Classic Enterprise: Enterprise scale, standards driven (SAN/NAS, etc.); Cloud scale, open source (File/Object)
  Advanced Analytics: ???

Edited version of slide from Balint Fleischer's talk: Flash Memory Summit 2016, Santa Clara, CA

Page 6

A Sample Analytics Stack

Language Bindings, APIs (Python, Scala, Java, etc.)
Libraries (Machine Learning, Deep Learning, SQL, Graph, CEP, etc.)
Optimizers/Schedulers
Processing Engine (frequently in memory)
Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams

Page 7

Machine Learning Software Ecosystem – a Partial View

Layered API Providers: Beam (Dataflow), StreamSQL, Keras
Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
Stream Processing Engines: Flink / Apex, Spark Streaming, Storm / Samza / NiFi
Batch Processing Engines: Hadoop / Spark, Flink
Domain-focused back-end engines: Caffe, Theano, TensorFlow
Data from Repositories or Live Streams: Data Lake, Data Repositories (SQL, NoSQL), Data Streams

Page 8

In this talk

• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning

• What are their salient characteristics when it comes to storage and memory?

• How is storage optimized for these apps today?

• Opportunities?

Page 9

How ML/DL Workloads Think About Data – Part 1

• Data Sizes
  • Incoming datasets can range from MB to TB
  • Models are typically small; the largest tend to be deep neural networks, ranging from tens of MB to single-digit GB
• Common Data Types
  • Time series and Streams
  • Multi-dimensional Arrays, Matrices and Vectors
  • DataFrames
• Common distributed patterns
  • Data Parallel with periodic synchronization
  • Model Parallel
• Network sensitivity varies between algorithms; straggler performance issues can be significant
  • 2x performance difference between InfiniBand and 40 Gbit Ethernet for some algorithms, like KMeans and SVM
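The "data parallel with periodic synchronization" pattern above can be sketched concretely. In this toy sketch (the shard layout and the one-parameter linear model are illustrative, not from the talk), each worker takes local gradient steps on its own data shard, and the model replicas are averaged every few steps:

```python
# Toy sketch of data-parallel training with periodic synchronization.
# The one-parameter linear model and shard layout are illustrative.

def local_gradient(w, shard):
    """Mean gradient of squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_sgd(shards, steps=100, sync_every=10, lr=0.01):
    """One model replica per worker; replicas are averaged every
    `sync_every` steps (the periodic-synchronization pattern)."""
    replicas = [0.0] * len(shards)
    for step in range(steps):
        # Each "worker" takes one local gradient step on its shard.
        replicas = [w - lr * local_gradient(w, s)
                    for w, s in zip(replicas, shards)]
        if (step + 1) % sync_every == 0:
            # Periodic synchronization: average the replicas.
            avg = sum(replicas) / len(replicas)
            replicas = [avg] * len(replicas)
    return sum(replicas) / len(replicas)

# Two shards sampled from y = 3x; the averaged model converges toward w = 3.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = data_parallel_sgd(shards)
```

Less frequent synchronization lowers network traffic at the cost of slower agreement between replicas, which is one reason network sensitivity varies so much between algorithms.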

Page 10

The Growth of Streaming Data

• Continuous data flows and continuous processing
  • Enabled & driven by sensor data and real-time information feeds
• Enables a native time component, "event time"
  • Allows complex computations that combine new and old data in deterministic ways
• Several variants with varied functionality
  • True Streams, Micro-Batch (an incremental batch emulation)
• Possible with existing models like SQL; supported natively by models like Google Dataflow / Apache Beam
• The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
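The determinism mentioned above comes from keying computation on event time rather than arrival time. A minimal sketch of event-time tumbling windows (function and variable names are illustrative, not from any particular engine):

```python
# Minimal sketch of event-time tumbling windows: records are grouped
# by the timestamp they carry, not by when they arrive, so the result
# is deterministic regardless of arrival order. Names are illustrative.

from collections import defaultdict

def tumbling_window_sums(records, window_size):
    """records: (event_time, value) pairs in arbitrary arrival order.
    Returns {window_start: sum of values} keyed by event time."""
    windows = defaultdict(float)
    for event_time, value in records:
        window_start = (event_time // window_size) * window_size
        windows[window_start] += value
    return dict(windows)

# The record with event_time=3 arrives last but still lands in the
# [0, 10) window, combining "new and old data in deterministic ways".
arrived = [(0, 1.0), (5, 2.0), (12, 4.0), (3, 1.5)]
result = tumbling_window_sums(arrived, window_size=10)  # {0: 4.5, 10: 4.0}
```

A real engine adds watermarks and triggers to decide when a window may be emitted despite possible late data; the grouping rule itself stays this simple.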

Page 11

Convergence of RDBMS and Analytics

• In-Memory DBs are moving to continuous queries
  • Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
  • Ex: SQL support on Spark, Flink
  • SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics, but extensions are needed to support machine learning and deep learning
  • Joins, sorts, etc. are good for feature engineering and data cleansing
  • Many core machine & deep learning operations require linear algebra ops

If the idea of a standard database is "durable data, ephemeral queries", the idea of a streaming database is "durable queries, ephemeral data".

http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html

Page 12

The Growing Role of the Edge

• Closest to data ingest, lowest latency
  • Benefits real-time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and resource constraints
• Differs from geographically distributed data center architectures
  • Asymmetry of hardware
  • Unpredictable connectivity
  • Unpredictable device uptime

[Figure: IoT Reference Model]

Page 13

How ML/DL Workloads Think About Data – Part 2

• The older data gets, the more its "role" changes
  • Older data is used for batch/historical analytics and model reboots
  • Used for model training (sort of), not for inference
• Guarantees can be "flexible" on older data
  • Availability can be reduced (most algorithms can deal with some data loss)
  • A few data corruptions don't really hurt :)
  • Data is evaluated in aggregate and algorithms are tolerant of outliers
  • Holes are a fact of real-life data – algorithms deal with it
• Quality of service exists but is different
  • Random access is very rare
  • Heavily patterned access (most operations are some form of array/matrix)
  • Shuffle phase in some analytic engines

Page 14

Correctness, Determinism, Accuracy and Speed

• More complex evaluation metrics than traditional transactional workloads
• Correctness is hard to measure
  • Even two implementations of the "same algorithm" can generate different results
• Determinism/Repeatability is not always present for streaming data
  • Ex: Micro-batch processing can produce different results depending on arrival time vs. event time
• The accuracy-to-time tradeoff is non-linear
• Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search)

[Plots: Error vs. Time curves for two implementations, SVM V1 and SVM V2]
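The hyper-parameter search mentioned above is what turns one data set into massive parallelism: every configuration re-reads identical input. A toy sketch (the scoring function and parameter names are invented for illustration, so the search has a known optimum):

```python
# Toy sketch of a hyper-parameter grid search: the same dataset is
# evaluated under every (lr, reg) combination, and each cell could run
# on a separate worker sharing one read-only copy of the data.
# The scoring function is invented so the search has a known optimum.

from itertools import product

def train_and_score(lr, reg, data):
    """Stand-in for training a model and returning its error;
    deliberately minimized at lr=0.1, reg=0.01."""
    return abs(lr - 0.1) + abs(reg - 0.01) + 0.001 * len(data)

def grid_search(data, lrs, regs):
    # Every cell reuses the same `data` -- a storage system that knows
    # this can cache or replicate the input once for all workers.
    scores = {(lr, reg): train_and_score(lr, reg, data)
              for lr, reg in product(lrs, regs)}
    return min(scores, key=scores.get)

best = grid_search(data=[0] * 10,
                   lrs=[0.01, 0.1, 1.0],
                   regs=[0.001, 0.01, 0.1])  # -> (0.1, 0.01)
```

The repeated-read pattern is exactly the kind of workload intelligence a storage layer could exploit, as the later slides argue.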

Page 15

The Role of Persistence

• For ML functions, most computations today are in-memory
  • Data flows from the data lake to the analytic engine and results flow back
  • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.)
  • Persistent message storage to enforce exactly-once semantics and determinism; latency-sensitive write traffic
• For in-memory databases, persistence is part of the core engine
  • Log-based persistence is common
• Loading & cleaning of data is still a very large fraction of pipeline time
  • Most of this involves manipulating stored data
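The checkpoint write traffic described above usually follows a simple pattern: persist the model state every N steps so a long-running job can resume after a failure. A sketch, with an illustrative JSON file format and invented names (real systems checkpoint far larger state, which is where the write pressure comes from):

```python
# Sketch of periodic checkpointing for a long-running computation.
# The JSON format and file names are illustrative only.

import json
import os
import tempfile

def save_checkpoint(path, step, state):
    # Write to a temp file, then rename: a crash mid-write never
    # leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, {"w": 0.0}          # fresh start
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps=100, ckpt_every=10):
    step, state = load_checkpoint(path)   # resume if a checkpoint exists
    while step < total_steps:
        state["w"] += 0.01                # stand-in for one training step
        step += 1
        if step % ckpt_every == 0:        # periodic persistent checkpoint
            save_checkpoint(path, step, state)
    return step, state

path = os.path.join(tempfile.mkdtemp(), "model.ckpt")
step, state = train(path)
```

Each checkpoint is a full rewrite of the state, so frequent checkpointing of large models translates directly into the write-endurance pressure noted on the "Opportunities" slide.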

Page 16

In this talk

• What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning

• What are their salient characteristics when it comes to storage and memory?

• How is storage/memory optimized for these apps today?

• Opportunities?

Page 17

Abstractions and the Stack

• ML/DL applications use common abstractions that combine linear algebra, tables, streams, etc.
• These are stored as independent entities inside Key-Value pairs, Objects or Files
• The file system is used as a common namespace
• Information is lost at each level down, along with opportunities to optimize layout, tiering, caching, etc.

Data copies (or transfers, denoted by red lines on the slide) occur frequently, sometimes more than once!

Stack, top to bottom: Matrices, Tables, Streams, etc. → Key-Value and Object → File → Block

Page 18

Optimizing Storage: Some Examples

• Time-series optimized databases
  • Examples: BTrDB (FAST 2016) and Gorilla (Facebook, VLDB 2015)
  • Streamlined data types, specialized indexing, tiering optimized for access patterns
• API pushdown techniques
  • Iguazio.io
  • Streams and Spark RDDs as native access APIs
• Lineage
  • Alluxio (formerly Tachyon)
  • Links data history & compute history; caches intermediate stages in machine learning pipelines
• Memory expansion
  • Many studies on DRAM/Persistent Memory/Flash tiering for analytics

Page 19

Opportunities: Places to Start

• Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency
• Fast/frequent checkpointing for long-running jobs
  • Note: will put pressure on write endurance
• Low-latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
  • Exploit the highly predictable access patterns of ML algorithms
• Accelerate the data load/save stages of ML/DL pipelines

Page 20

Opportunities – More Fundamental Shifts

• Role of storage types in analytics optimizers and schedulers – superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
  • Even correctness can be relaxed
  • Example from the compute side: flexibility in synchronization (the Hogwild! approach to SGD, plus Asynchronous SGD, etc.)
• Leverage Persistent Memory to unify low-latency streaming data requirements and high-throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross-stack efficiency
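The Hogwild!-style relaxation referenced above can be shown in miniature: workers update a shared model with no locking, accepting races and occasional lost updates in exchange for throughput. This is only a loose single-process sketch with invented data and model (real Hogwild! targets sparse gradient updates in shared memory):

```python
# Loose sketch of lock-free ("Hogwild!"-style) SGD: several threads
# update one shared parameter with no lock. Occasional stale reads or
# lost updates are tolerated, yet the model still converges.

import threading

w = [0.0]  # shared model parameter, updated without a lock

def worker(samples, lr=0.05):
    for x, y in samples:
        g = 2 * (w[0] * x - y) * x  # gradient from a possibly stale read
        w[0] -= lr * g              # racy write; no synchronization

data = [(1.0, 2.0)] * 200  # samples from y = 2x, so w should approach 2
threads = [threading.Thread(target=worker, args=(data,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The storage analogue of this relaxation is the slide's point: if the algorithm tolerates imperfect synchronization, persistence guarantees can be weakened too.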

Page 21

Takeaways

• The use of ML/DL in the enterprise is in its infancy and expanding furiously
• These apps put ever larger pressure on data management, latency, and throughput requirements
• They also introduce another layer of abstraction and another layer of workload intelligence
  • Further away from block and file
• Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data

Page 22

Thank You