Monday, October 12, 2015

Newcomers of Big Data

Druid

http://druid.io
Druid (originally developed at Metamarkets) is a column-oriented open-source distributed data store designed to quickly ingest massive quantities of event data and make that data immediately available to queries - real-time data.
Fully deployed, Druid runs as a cluster of specialized nodes to support a fault-tolerant architecture where data is stored redundantly and there are multiple members of each node type. In addition, the cluster includes external dependencies for coordination (Apache ZooKeeper), metadata storage (MySQL), and a deep storage facility (e.g., HDFS, Amazon S3, or Apache Cassandra).

Druid's native query language is JSON over HTTP, although the community has contributed query libraries in numerous languages, including SQL.
Druid has limited power compared to an RDBMS/SQL: it supports neither joins nor exact distinct counts (it was inspired by Google Dremel). It is, however, very quick at answering predefined questions.
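A native Druid query is just a JSON document posted over HTTP to a broker. A minimal sketch in Python, where the datasource name, interval, and aggregation fields are illustrative rather than taken from a real deployment:

```python
import json

# A minimal Druid "timeseries" query expressed as a plain dict.
# Datasource "events" and the metric "clicks" are hypothetical.
query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2015-10-01/2015-10-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "name": "clicks", "fieldName": "clicks"},
    ],
}

# Serialize; in a real deployment this body would be POSTed to the
# broker's /druid/v2/ endpoint over HTTP.
payload = json.dumps(query)
print(payload)
```

The whole query surface works this way: each query type (timeseries, topN, groupBy, ...) is a different JSON shape.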

Presto 

https://prestodb.io
Presto (from Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
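Cross-source querying works because Presto addresses every table as catalog.schema.table, so one statement can mix connectors. A sketch of such a federated query, with catalog, schema, and column names invented for illustration:

```python
# Hypothetical federated Presto query: joins clickstream data in Hive
# with a users table in MySQL. All names are illustrative.
federated_query = """
SELECT u.country, count(*) AS orders
FROM hive.web.clickstream c
JOIN mysql.crm.users u ON c.user_id = u.id
GROUP BY u.country
ORDER BY orders DESC
"""
print(federated_query)
```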

Samza

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

Drill

https://drill.apache.org
Schema-free SQL query engine for Hadoop, NoSQL and cloud storage. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data (including JSON) coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL.
Dynamic queries on self-describing data in files (such as JSON, Parquet, text) and HBase tables, without requiring metadata definitions in the Hive metastore.
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
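What makes Drill "schema-free" is that a raw file path can stand in for a table name, with no metastore entry at all. A sketch, with the file path and fields invented for illustration:

```python
# Hypothetical Drill query against a raw JSON file: the dfs-plugin path
# itself is the table, and fields are discovered from the data.
drill_query = """
SELECT t.user_id, count(*) AS events
FROM dfs.`/logs/2015/10/events.json` t
GROUP BY t.user_id
"""
print(drill_query)
```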

Sunday, April 5, 2015

Lambda architecture

http://lambda-architecture.net

Eric Baldeschwieler

The Lambda Architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.

  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing. 
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views. 
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way. 
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. 
  5. Any incoming query can be answered by merging results from batch views and real-time views.
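Step 5 above, the query-time merge, can be sketched in a few lines. The views here are plain dicts with made-up page counts; in practice they would live in a serving store and a realtime store:

```python
# Merging a pre-computed batch view with a real-time view that covers
# only the data arrived since the last batch run. Numbers are illustrative.
batch_view = {"page_a": 1200, "page_b": 800}   # built by the batch layer
realtime_view = {"page_a": 7, "page_c": 3}     # recent events, speed layer

def query(page):
    """Answer a query by adding the real-time delta to the batch result."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 1207
print(query("page_c"))  # 3
```

Once a new batch run absorbs the recent data, the corresponding real-time view can simply be discarded.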

Example realtime applications:
Recommender
Newsfeed

Message bus like Kafka, Flume, Scribe
Realtime engine: Kafka, Storm, Samza, DataTorrent
Serving Store: Cassandra, MySQL
Service: Slider, Twill, HBase, Sqoop

Wednesday, April 1, 2015

AWS stack

http://en.wikipedia.org/wiki/Amazon_Web_Services#List_of_products

Database: http://aws.amazon.com/products/?nc2=h_ql_summits#databases-1
Analytics: http://aws.amazon.com/products/?nc2=h_ql_summits#analytics-1
Compute: http://aws.amazon.com/products/?nc2=h_ql_summits#compute-1
Storage: http://aws.amazon.com/products/?nc2=h_ql_summits#storage-1

Thursday, March 19, 2015

Machine Learning

http://cs229.stanford.edu/materials.html
https://www.youtube.com/watch?v=UzxYlbK2c7E

Four parts of the class:

  • Supervised learning: providing the computer with existing datasets that include the right answers.
    • Regression (continuous output) and classification (discrete output) are examples of supervised learning.
  • Learning theory: understanding how and why learning algorithms work (e.g., how to prove that an algorithm that reads zip codes works), which functions an algorithm can approximate, and how much training data is needed.
  • Unsupervised learning: being given a dataset and asked to find interesting structure in it (vs. being given the right answers).
    • Clustering is one example.
  • Reinforcement learning: being asked to make a sequence of decisions over time.
    • Good behavior is reinforced over bad behavior; for example, when flying a helicopter, good behavior means not crashing.
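The supervised-learning case above can be made concrete with a toy one-variable regression, fitted by the closed-form least-squares solution. The data points are made up for the example:

```python
# Toy supervised learning: fit y = slope*x + intercept by least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, the "right answers"

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict a continuous output for an unseen input."""
    return slope * x + intercept

print(slope, intercept)
```

Classification is the same idea with a discrete output, e.g. thresholding the prediction into two classes.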

Thursday, January 29, 2015

About big data

From: http://www.slideshare.net/timoelliott/ukisug2014-big-data-presentation

Types of big data:


  • Human generated data:
    • Swipe of credit card
    • Scan of bar code
    • Actions captured on mobile phones
  • Machine generated data:
    • Data loggers
    • Sensors


From hindsight, to insight to foresight:

  • Descriptive: what happened
  • Diagnostic: why did it happen
  • Predictive: will it happen again
  • Prescriptive: how can we make it happen

Types of data (sorted by % use in descending order) - Ref 2:
  • Time series
  • Business transactions
  • Geospatial/location
  • Graph (network)
  • Clickstream
  • Health records
  • Sensor
  • Image
  • Genomic

According to Gartner (Ref 3):
  • Rising toward the peak of inflated expectations:
    • Data science
    • Predictive analytics
  • On the downslope into the trough of disillusionment:
    • Complex event processing
    • Big Data
    • Content analytics

Big data can be used to:
  • Engage and empower data consumers:
    • Discover new business opportunities
    • Identify new product opportunities
    • Make more reliable decisions
  • Plan and optimize: improve operational efficiency
  • Personalize the experience
Data lake: a storage repository that holds a vast amount of raw data in its native format until it is needed.
HTAP = OLTP + OLAP: a new generation of in-memory data platforms that can perform both online transaction processing (OLTP) and online analytical processing (OLAP) without requiring data duplication.

Ref 2: from http://www.paradigm4.com/wp-content/uploads/2014/06/P4-data-scientist-survey-FINAL.pdf
Ref 3: Gartner 2014 hype cycle: http://www.gartner.com/newsroom/id/2819918

Wednesday, June 25, 2014

Hadoop - Beyond Map Reduce, Pig and Hive

Hadoop 2.0

6/8/14 - Hadoop Maturity Summit: http://gigaom.com/2014/06/08/hadoop-maturity-summits/

YARN

Part of Hadoop 2.0, YARN (Yet Another Resource Negotiator) is a resource manager that enables non-MapReduce jobs to run on Hadoop and leverage HDFS. YARN provides a generic resource-management framework for implementing distributed applications.
MapReduce only allows batch processing; YARN unlocks real-time data processing on Hadoop.

Spark http://projects.apache.org/projects/spark.html

Apache Spark is a fast and general engine for large-scale data processing. It offers high-level APIs in Java, Scala and Python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.
Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming.

MLlib

MLlib is Spark's scalable machine learning library. It fits into Spark's APIs and interoperates with NumPy in Python. It can leverage any Hadoop data source (HDFS, HBase, or local files).
Machine Learning Library (MLlib) guide: http://spark.apache.org/docs/latest/mllib-guide.html

Cassandra http://cassandra.apache.org/

Cassandra is a massively scalable open-source NoSQL database, well suited for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.
Cassandra sports a “masterless” architecture, meaning all nodes play the same role.
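The masterless design works because every node can locate the replicas for a key with the same deterministic hash, so there is no coordinator to lose. A simplified sketch (real Cassandra uses consistent hashing with token ranges and virtual nodes, not a plain modulo; node names are made up):

```python
import hashlib

# Toy placement: hash the key, pick a primary node, then take the next
# nodes around the "ring" as replicas.
nodes = ["node-a", "node-b", "node-c"]

def replicas(key, replication_factor=2):
    """Return the nodes holding copies of this key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

print(replicas("user:42"))
```

Because placement is a pure function of the key, any node can serve as the coordinator for any request.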

Kafka http://kafka.apache.org/

Kafka is a distributed pub/sub and message queuing system.
Kafka organizes streams of messages into topics, which are partitioned and replicated across multiple machines called brokers.
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Basic messaging terminology:

  • Kafka maintains feeds of messages in categories called topics.
  • A process that publishes messages to a Kafka topic is called a producer.
  • A process that subscribes to topics and processes the feed of published messages is called a consumer.
  • Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
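How a producer spreads messages over a topic's partitions can be sketched as a stable hash of the message key, so all messages with the same key land on the same partition and keep their relative order. This is a simplification: the real Kafka clients use murmur2 hashing, and the partition count here is made up:

```python
import zlib

# Hypothetical topic with 4 partitions.
NUM_PARTITIONS = 4

def partition_for(key: bytes) -> int:
    """Map a message key to a partition with a stable hash."""
    return zlib.crc32(key) % NUM_PARTITIONS

# The same key always maps to the same partition, preserving per-key order.
p1 = partition_for(b"user-42")
p2 = partition_for(b"user-42")
print(p1)
```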

Storm
http://vimeo.com/40972420

HBase
http://hbase.apache.org/book.html#other.info.videos
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.

Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL. Cassandra uses the Gossip protocol for internode communications, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper -- an entirely separate distributed application -- to handle corresponding tasks.

Sunday, June 22, 2014

Hadoop - Intro

http://youtu.be/d2xeNpfzsYI

MapReduce process: 1) Map (emit key:value pairs) -- 2) Shuffle-sort (group and sort by key) -- 3) Reduce (aggregate results)
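The three phases can be sketched with the classic word-count example, in plain Python (the input lines are made up; a real job would distribute each phase across the cluster):

```python
from collections import defaultdict

# Toy input split into "records".
lines = ["the quick fox", "the lazy dog", "the fox"]

# 1) Map: emit one (word, 1) key:value pair per word.
mapped = [(word, 1) for line in lines for word in line.split()]

# 2) Shuffle-sort: sort the pairs and group the values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# 3) Reduce: aggregate each group into a final count.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["the"])  # 3
```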