Hadoop Ecosystem
by Ed Sarausad
1. Hive
1.1. SQL-like querying (HiveQL)
1.2. Structured data warehousing
1.3. Partition columns instead of indexes
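Below is a minimal sketch of querying Hive from Java over JDBC (HiveServer2); the host/port, the logs table, and its partition column dt are illustrative assumptions.

    // Minimal sketch: issuing a HiveQL query over JDBC (HiveServer2).
    // Host, database, table, and partition column are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "user", "");
            Statement stmt = conn.createStatement();
            // Hive tables are typically partitioned rather than indexed;
            // restricting on the partition column prunes whole directories.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) FROM logs WHERE dt = '2014-01-01' GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }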
2. Pig
2.1. Scripting for Hadoop (Pig Latin)
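A minimal sketch of driving a Pig Latin word count from Java through the embedded PigServer API; file paths and aliases are placeholders.

    // Minimal sketch: a Pig Latin word count run via the embedded PigServer.
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigScriptExample {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
            pig.store("counts", "wordcount-out"); // compiles and runs the pipeline
        }
    }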
3. HBase
3.1. Non-relational column store
3.2. Transactional lookups
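A minimal sketch of point reads and writes with the HBase Java client; the users table and its info column family are assumptions.

    // Minimal sketch: low-latency point lookups with the HBase client API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseLookupExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write one cell: row key -> column family "info", qualifier "email"
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                        Bytes.toBytes("user42@example.com"));
                table.put(put);
                // Point lookup by row key
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }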
4. Flume
4.1. Log collector
4.2. Streams collected event data into Hadoop (HDFS)
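Flume agents are configured rather than coded, so the sketch below is a properties file wiring a source through a channel into an HDFS sink; the agent name, component names, and the netcat source are illustrative assumptions.

    # Flume agent "a1": one source, one channel, one HDFS sink (names are arbitrary)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # netcat is used here only for simplicity; log sources are more typical
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1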
5. Oozie
5.1. Links jobs (workflow processing)
5.2. Workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute; a workflow job scheduled by a coordinator has to wait until its trigger conditions are met
5.3. Coordinator jobs are recurrent Oozie workflow jobs triggered by time and data availability
5.4. Bundles package multiple coordinator and workflow jobs and manage the lifecycle of those jobs
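A minimal sketch of submitting a workflow job with the Oozie Java client; the server URL and the HDFS application path (which must hold a deployed workflow.xml defining the DAG) are placeholders.

    // Minimal sketch: submitting and starting an Oozie workflow job.
    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");
            Properties conf = client.createConfiguration();
            // Path to a deployed workflow.xml describing the DAG of actions
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/my-wf");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            String jobId = client.run(conf); // submit and start the workflow job
            System.out.println("Workflow job submitted: " + jobId);
            WorkflowJob status = client.getJobInfo(jobId);
            System.out.println("Status: " + status.getStatus());
        }
    }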
6. Avro
6.1. Data parsing
6.2. Binary data serialization
6.3. RPC
6.4. Language-neutral
6.5. Optional codegen
6.6. Schema evolution
6.7. Untagged data
6.8. Dynamic typing
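A minimal sketch of schema-driven binary serialization with Avro's generic API (no code generation); the User schema and file name are assumptions.

    // Minimal sketch: write and read Avro records dynamically, without codegen.
    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            // Schemas are defined in JSON; readers can evolve independently of writers
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                    + "{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"age\",\"type\":\"int\"}]}");
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Ada");
            user.put("age", 36);

            File file = new File("users.avro");
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file); // schema travels with the file: untagged data
                writer.append(user);
            }
            try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord r : reader) {
                    System.out.println(r.get("name") + " " + r.get("age"));
                }
            }
        }
    }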
7. Mahout
7.1. Machine learning library
7.2. Algorithms implemented on MapReduce
8. Sqoop
8.1. Connects non-Hadoop stores (RDBMS)
8.2. Moves data between RDBMS and Hadoop
8.3. Auto-generates Java InputFormat code for data access
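A minimal sketch of a Sqoop import from the command line; the JDBC URL, credentials, table, and target directory are placeholders.

    # Minimal sketch: import one RDBMS table into HDFS with 4 parallel mappers
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4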
9. MapReduce
9.1. Distributed compute
9.2. Maps query onto nodes
9.3. Reduces aggregated results into answers
9.4. A Combiner can be used to optimize reducer performance
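The classic word count, sketched here with the reducer reused as a combiner so map output is pre-aggregated before the shuffle; input and output paths come from the command line.

    // Minimal sketch: word count with a combiner to cut shuffle traffic.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {           // map: emit (word, 1) per token
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;                           // reduce: aggregate counts per word
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combiner optimizes reducer load
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }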
10. Ambari
10.1. Cluster deployment and admin
10.2. Driven by Hortonworks
11. ZooKeeper
11.1. Coordinator of shared state between apps
11.2. Naming, configuration, and synchronization services
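A minimal sketch of publishing and reading shared configuration through the ZooKeeper Java client; the znode path and ensemble address are placeholders.

    // Minimal sketch: shared state in a znode that any process can read.
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperConfigExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});
            // Publish a piece of shared configuration under a named znode
            zk.create("/app-config", "maxConnections=100".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // Any process in the cluster can read (and watch) the same state
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }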
12. YARN
12.1. Cluster management
12.2. Introduced in Hadoop 2
12.3. Resource manager
12.4. Job scheduler
13. BigTop
13.1. Packages the Hadoop ecosystem
13.2. Tests Hadoop ecosystem packages
14. HDFS
14.1. Distributed storage
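A minimal sketch of writing and reading a file through the HDFS FileSystem API; the path is a placeholder and a default filesystem is assumed to be configured in core-site.xml.

    // Minimal sketch: round-trip a small file through HDFS.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/hello.txt");
            // Files are split into blocks and replicated across DataNodes
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello, hdfs");
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }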
15. Related Apache Ecosystems
16. Spark
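For comparison with the MapReduce version above, a minimal word-count sketch in Spark's Java API (assuming Spark 2.x), which keeps intermediate data in memory between stages; paths are placeholders.

    // Minimal sketch: word count on Spark, run locally for illustration.
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile("input.txt");
                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);
                counts.saveAsTextFile("spark-wordcount-out");
            }
        }
    }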
17. Impala
17.1. SQL query engine
17.2. Queries data stored in HDFS and HBase
17.3. Real-time (interactive) queries
18. Cascading
18.1. Higher-level abstraction over MapReduce
18.2. Creates Flows that assemble MapReduce jobs