Hadoop talk, Nathan Milford, Outbrain


1. Several schedulers

1.1. Fair scheduler

1.1.1. Most used

1.1.2. developed at Facebook

1.1.3. all jobs get a fair share

1.1.4. eventually, users will start battling over their share of the cluster

2. @NathanMilford

3. Test nodes

4. less good for random access

5. If you lose it, your whole cluster is useless

6. Hadoop HDFS

6.1. expects failure

6.1.1. the only single point of failure is the NameNode

6.2. Nodes

6.2.1. NameNode

6.2.1.1. Coordinates storage & access

6.2.1.2. Holds metadata (file names, permissions &c)

6.2.1.3. Shouldn't be commodity hardware

6.2.1.4. Should have lots of RAM

6.2.1.5. Yahoo's 1400-node cluster has 30GB in the name node

6.2.2. Data node

6.2.2.1. monitored by the name node periodically; if it doesn't answer, it is removed

6.2.2.2. holds data in opaque blocks stored on the local file system

6.2.2.3. due to replication of at least 3, you can lose 2 data nodes without losing data

6.2.3. Should be backed up frequently

6.2.4. Secondary name node

6.2.4.1. badly named: not a backup of the name node

6.2.4.2. needs as much RAM as the name node

6.2.4.3. periodically combines the in-RAM state with the on-disk filesystem image

6.3. intended for commodity hardware

6.3.1. still enterprise servers, but not monsters

6.4. by default, replication of 3

6.4.1. the higher the replication, the better it will perform
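The fault-tolerance claim above can be checked with a toy simulation (node names and cluster size are made up for illustration):

```python
import itertools

def block_survives(replica_nodes, failed_nodes):
    """A block is still readable if at least one replica lives on a healthy node."""
    return any(node not in failed_nodes for node in replica_nodes)

# A block replicated on 3 of 5 data nodes (hypothetical cluster).
replicas = {"dn1", "dn2", "dn3"}
all_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

# Every possible failure of exactly 2 nodes still leaves a replica alive.
ok = all(block_survives(replicas, set(failed))
         for failed in itertools.combinations(all_nodes, 2))
print(ok)  # True: replication factor 3 tolerates any 2 simultaneous node failures
```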

6.5. optimized for large files

6.5.1. files smaller than the 64MB block size won't be padded to a full block, neither on the file system nor in name node RAM
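Even without padding, every file still costs name node RAM for its metadata, which is why many small files hurt. A back-of-the-envelope sketch, assuming the commonly cited ~150 bytes per namespace object (the exact figure varies by version):

```python
OBJ_BYTES = 150  # rough rule of thumb per namespace object (assumption)

def namenode_bytes(num_files, blocks_per_file=1):
    # each file costs one file object plus one object per block
    return num_files * OBJ_BYTES * (1 + blocks_per_file)

# The same ~640 GB of data as 10 huge files vs. 10 million tiny files:
large = namenode_bytes(10, blocks_per_file=1024)      # ~1.5 MB of metadata
small = namenode_bytes(10_000_000, blocks_per_file=1) # ~3 GB of metadata
print(large, small)
```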

6.6. All NameNode data stored in RAM

6.6.1. for fast lookup

6.7. Workflow

6.7.1. client asks the name node for a file

6.7.2. the name node redirects it to the specific data nodes holding the blocks
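This two-step workflow keeps bulk data off the name node; it serves only metadata. A toy sketch of the idea (data structures and names are illustrative, not the real API):

```python
# Name node: holds only metadata, mapping filename -> (block_id, data_node) pairs.
NAMENODE_META = {
    "/logs/2011-10-01.log": [("blk_1", "dn1"), ("blk_2", "dn3")],
}
# Data nodes: hold the opaque blocks themselves.
DATANODE_BLOCKS = {
    "dn1": {"blk_1": b"first 64MB..."},
    "dn3": {"blk_2": b"rest of file"},
}

def read_file(path):
    # Step 1: ask the name node only for block locations.
    locations = NAMENODE_META[path]
    # Step 2: fetch each block directly from the data node that holds it;
    # the bulk data never flows through the name node.
    return b"".join(DATANODE_BLOCKS[dn][blk] for blk, dn in locations)

print(read_file("/logs/2011-10-01.log"))
```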

7. About

7.1. Ops Engineer at Outbrain

8. Usage

8.1. 15 Hadoop nodes

8.2. Used for Reports

8.2.1. Customer facing reports

9. Why

9.1. Scales really well

9.2. Alternatives that scale up (rather than out) eventually hit a wall

9.3. When using sharding, you need a reliable framework to work with it

9.4. When writing scripts yourself (e.g., ETL), they usually start "eating their own tail"

9.5. Why not commercial alternatives?

9.5.1. Popular

9.5.2. Initially used in science, now also in banks &c

9.5.3. Strong community, plus many companies that support it

10. 2 distros of Hadoop

10.1. ASF (Apache) vs. CDH

10.2. ASF has the bleeding edge features

10.3. according to the use-case, you'll need to setup the cluster differently

10.3.1. e.g. if in the shuffle stage lots of data gets transferred, you'll need a hardware & network solution

10.4. most use CDH

10.4.1. RPMs

10.4.2. stable

11. abstraction between ops & dev

11.1. dev treat the data as an abstract dataset, not caring where it is

11.2. ops just do the plumbing & make it work

12. design principles

12.1. support

12.1.1. partial failure

12.1.2. data recovery

12.1.3. consistent

12.1.4. scalable

12.2. "scatter & gather"

12.2.1. moving data is expensive

12.2.2. moving computation is cheap

12.2.3. inter-node chatter is kept to a minimum

12.3. share nothing

12.4. map/reduce is very simple, but can be applied to almost anything

13. know before you build

14. Ecosystem

14.1. Hadoop MapReduce

14.1.1. MapReduce daemons

14.1.1.1. 2 daemons: JobTracker & TaskTracker

14.1.1.2. anything that fails will be restarted

14.1.1.3. if it fails too many times, it gets blacklisted
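The restart-then-blacklist policy can be sketched as follows (the threshold and tracker names are made up; the real value is configurable):

```python
MAX_FAILURES = 4  # hypothetical threshold; Hadoop's is configurable

failures = {}   # tracker -> failure count
blacklist = set()

def report_failure(tracker):
    """Failed tasks are retried elsewhere; a tracker that fails too
    often is blacklisted and stops receiving tasks."""
    failures[tracker] = failures.get(tracker, 0) + 1
    if failures[tracker] >= MAX_FAILURES:
        blacklist.add(tracker)

for _ in range(4):
    report_failure("tracker-7")
print("tracker-7" in blacklist)  # True
```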

14.1.2. MapReduce workflow: the Client submits a Job to the JobTracker; the JobTracker splits it into Tasks assigned to TaskTrackers; each TaskTracker runs its Task in a child JVM

14.1.3. Processing slots per machine: rule of thumb is number of cores + 2

14.1.4. Scheduling: you can set min/max requirements for tasks

14.2. Hive

14.2.1. Data warehouse built on top of Hadoop

14.2.2. Looks like SQL: HQL is interpreted into MapReduce jobs

14.3. Pig

14.3.1. Scripting language

14.3.1.1. Developed at Yahoo

14.3.1.2. Similar in functionality to Hive, but with a more programmatic approach

14.3.1.3. Nice web interface to monitor nodes, trackers & tasks

15. Other awesome stuff used in Outbrain

15.1. Sqoop

15.1.1. Developed by Cloudera

15.1.2. Moves data into and out of your cluster

15.2. Oozie

15.2.1. Like LinkedIn's Azkaban

15.3. Flume

15.3.1. Streaming log transport

15.3.2. Stream logs directly into Hive

15.3.3. Use case: Meebo needs to decide within 2 hours whether to retire an ad, but full processing takes longer; using a Hive decorator, they analyze a sample in Hive in near real-time to make the decision

15.4. More stuff from Cloudera

15.4.1. Hue

15.4.1.1. Python-based

15.4.1.2. Open source

15.4.2. Beeswax

16. Outbrain's Cluster

16.1. 15 nodes

17. Resources

17.1. Eric Sammer's Productizing Hadoop talk

17.1.1. best

17.1.2. from last Hadoop conference

17.2. ...

18. Tips

18.1. Replace the default Java-based compression with a native codec
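One common way to do this in the Hadoop 0.20/1.x era was to enable a native codec such as Snappy for intermediate map output. A sketch of the relevant `mapred-site.xml` properties (property names changed in later Hadoop versions, so treat this as an assumption to verify against your distribution):

```xml
<!-- mapred-site.xml: compress intermediate map output with a native codec -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```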