Become an expert in Hadoop by gaining hands-on knowledge of MapReduce, Hadoop architecture, Pig and Hive, Flume, and the Apache workflow scheduler Oozie. Also get familiar with HBase, ZooKeeper, and Sqoop concepts while working on industry-based use cases and projects.
Why get Big Data Hadoop Developer Certification from Collabera TACT?
New job opportunities are arising for IT professionals in the field of “Big Data & Hadoop,” and the scope for them is enormous. According to a recent study, there will be 181,000 Big Data roles in the U.S. in 2018. By 2020, the Big Data & Hadoop market is estimated to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion.
The Big Data Hadoop Developer certification offered by Collabera TACT covers the key ideas and skills for managing Big Data with Hadoop, Apache’s open-source platform. You gain in-depth knowledge of the core concepts through the course and apply it to wide-ranging industry use cases. The certification opens up new opportunities with organizations of all sizes and equips you to write code on the MapReduce framework. The course also includes advanced modules such as YARN, ZooKeeper, Oozie, Flume and Sqoop.
Big Data Hadoop Developer Course Objective
- Learn to write complex code in MapReduce on both MRv1 and MRv2 (YARN) and understand Hadoop architecture.
- Perform analytics and learn the high-level scripting frameworks Pig and Hive.
- Get a full understanding of the Hadoop system and its advanced elements such as Flume and the Apache workflow scheduler Oozie.
- Get familiar with other concepts: HBase, ZooKeeper and Sqoop.
- Get hands-on experience with various configuration environments of a Hadoop cluster.
- Learn about optimization and troubleshooting.
- Acquire in-depth knowledge of Hadoop architecture by learning about the Hadoop Distributed File System (HDFS 1.0 and HDFS 2.0).
- Get to work on a real-life project based on industry standards.
Anyone who wants to pursue a career in Big Data and Hadoop should have a basic understanding of Core Java. However, it is not mandatory, as Collabera TACT offers complimentary self-paced Java tutorials that will help you brush up your Java skills.
Project 1: “Twitter Analysis”
The general observation is that 80% of data is unstructured, while the remaining 20% is in structured form. With an RDBMS, we can store and process only structured data, while Hadoop enables us to store and process unstructured data as well.
Today Twitter has become a significant and reliable source of data for analyzing what consumers think about something (sentiment analysis). It also helps in identifying trending topics and discussions. In this case study we will gather data from Twitter, using various means, for some interesting analysis.
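As a rough illustration of the trending-topic analysis described above, the sketch below counts hashtags in a handful of made-up tweets. The sample tweets and the `top_hashtags` helper are hypothetical and not part of the course material; in the project itself the data would come from Twitter and be processed on Hadoop.

```python
import re
from collections import Counter

# Matches a '#' followed by word characters, capturing the tag without the '#'.
HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(tweets, n=2):
    """Return the n most frequent hashtags (lower-cased) across the tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts.most_common(n)

# Hypothetical sample data, invented for this sketch.
sample_tweets = [
    "Loving the new phone #Gadget #tech",
    "Battery life could be better #gadget",
    "Big data meetup tonight #Hadoop #tech",
    "#gadget unboxing video is up",
]

print(top_hashtags(sample_tweets))
```

Lower-casing the tags before counting treats `#Gadget` and `#gadget` as the same topic, which is usually what a trend report wants.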
Project 2: “Click Stream Analysis”
E-commerce websites have been observed to impact the economy of their region in a huge way, and this trend has been observed globally. Every e-commerce website keeps a record of user activity and stores it as a clickstream. This record is used to analyze the browsing patterns of a particular user, which helps the site recommend products with high accuracy the next time the user visits. It also helps e-commerce websites design personalized promotional emails for their users.
In this case study we will see how to analyze the clickstream and user data using Pig and Hive. We will gather the user data from an RDBMS and capture the user behaviour (clickstream) in HDFS using Flume. Thereafter, we will analyze this data using Pig and Hive. We will also automate the Click Stream Analysis by putting the workflow engine Oozie to use.
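The core of such an analysis is a join between user records and click events followed by an aggregation. The sketch below shows that idea in plain Python under simplifying assumptions: the `users` and `clicks` data and the `favourite_products` helper are hypothetical, and in the case study the same logic would be expressed in Pig or Hive over data stored in HDFS.

```python
from collections import Counter, defaultdict

# Hypothetical user records, standing in for rows exported from an RDBMS.
users = {1: "asha", 2: "ben"}

# Hypothetical click events (user_id, product), standing in for collected logs.
clicks = [
    (1, "laptop"), (1, "mouse"), (1, "laptop"),
    (2, "phone"), (2, "phone"), (2, "case"),
    (9, "tablet"),  # click from a user id that has no user record
]

def favourite_products(users, clicks):
    """Join clicks to users and return each known user's most-clicked product."""
    per_user = defaultdict(Counter)
    for user_id, product in clicks:
        if user_id in users:  # inner join: skip clicks from unknown users
            per_user[user_id][product] += 1
    return {users[uid]: counts.most_common(1)[0][0]
            for uid, counts in per_user.items()}

print(favourite_products(users, clicks))
```

A result like this could feed the product recommendations or promotional emails mentioned above.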
Introduction/Installation of VirtualBox and the Big Data VM, Introduction to Linux, Why Linux?, Windows and the Linux equivalents, Different flavors of Linux, Unity Shell (Ubuntu UI), Basic Linux Commands (enough to get started with Hadoop)
3V (Volume, Variety, Velocity) characteristics, Structured and Unstructured Data, Applications and use cases of Big Data, Limitations of traditional large-scale systems, How a distributed way of computing is superior (cost and scale), Opportunities and challenges with Big Data
HDFS Overview and Architecture, Deployment Architecture, Name Node, Data Node and Checkpoint Node (aka Secondary Name Node), Safe mode, Configuration files, HDFS Data Flows (Read v/s Write)
CRC Checksum, Data replication, Rack awareness and Block placement policy, Small files problem
Command Line Interface, File System, Administrative, Web Interface
Load Balancer, DistCp (Distributed Copy), HDFS Federation, HDFS High Availability, Hadoop Archives
MapReduce overview, Functional Programming paradigms, How to think in a MapReduce way
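To give a feel for "thinking in a MapReduce way", here is a minimal single-process sketch of the classic word count, split into map, shuffle and reduce steps. The function names are illustrative only and are not Hadoop APIs; on a real cluster the framework performs the shuffle and runs mappers and reducers in parallel.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, grouped[key]) for key in sorted(grouped))

print(word_count(["big data", "Big Hadoop"]))
```

The mapper and reducer never share state; all communication goes through key/value pairs, which is what lets the framework distribute the work.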
Legacy MR v/s Next Generation MapReduce (aka YARN/MRv2), Slots v/s Containers, Schedulers, Shuffling, Sorting, Hadoop Data Types, Input and Output Formats, Input Splits – Partitioning (Hash Partitioner v/s Custom Partitioner), Configuration files, Distributed Cache
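Partitioning decides which reducer receives each key. Hadoop's default hash partitioner clears the sign bit of the key's hash code and takes the remainder by the number of reducers. The sketch below mirrors that rule in Python; as an assumption for illustration it uses a Java `String.hashCode()`-style hash for ASCII keys (Hadoop's own key types compute their hash differently), and `first_letter_partition` is a hypothetical custom partitioner.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() for ASCII text (32-bit signed result)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - (1 << 32) if h >= (1 << 31) else h

def hash_partition(key, num_reducers):
    """Clear the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

def first_letter_partition(key, num_reducers):
    """Hypothetical custom partitioner: route keys by their first character."""
    return ord(key[0].lower()) % num_reducers

for key in ["ab", "hadoop", "pig"]:
    print(key, hash_partition(key, 4), first_letter_partition(key, 4))
```

A custom partitioner is typically chosen when related keys must reach the same reducer, at the risk of uneven load if the chosen rule skews the data.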
Adhoc querying, Graph Computing Engines
Standalone mode (in Eclipse), Pseudo-distributed mode (as in the Big Data VM), Fully distributed mode (as in Production), MR API, Old and the new MR API, Java Client API, Hadoop data types and custom Writable
Different input and output formats, Saving Binary Data using Sequence Files and Avro Files, Hadoop Streaming (developing and debugging non-Java MR programs – Ruby and Python)
Speculative execution, Combiners, JVM Reuse, Compression
Sorting, Term Frequency, Inverse Document Frequency, Student Database, Max Temperature, Different ways of joining data, Word Co-Occurrence
PageRank, Inverted Index
Introduction and Architecture, Different Modes of executing Pig constructs, Data Types, Dynamic Invokers, Pig Streaming, Macros, Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT, etc.), User Defined Functions, Use Cases
Introduction and Architecture, Different Modes of executing Hive queries, Metastore Implementations, HiveQL (DDL & DML Operations), External v/s Managed Tables, Views, Partitions & Buckets, User Defined Functions, Transformations using Non-Java, Use Cases
NoSQL Databases – 1 (Theoretical Concepts), NoSQL Concepts, Review of RDBMS
Need for NoSQL, Brewer's CAP Theorem, ACID v/s BASE, Schema on Read vs. Schema on Write, Different levels of consistency, Bloom filters
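Of the topics above, Bloom filters lend themselves to a short sketch. A Bloom filter answers "definitely not present" or "possibly present" using a fixed-size bit array, which is why stores such as HBase can use one to skip files that cannot contain a requested row. The class below is a minimal teaching version; its name, sizes and salted SHA-256 hashing scheme are assumptions for illustration, not the implementation used by any particular database.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))
```

Growing the bit array or tuning the number of hash functions lowers the false-positive rate at the cost of memory and hashing work.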
Key Value, Columnar, Document, Graph
HBase Architecture, Master and the Region Server, Catalog tables (ROOT and META), Major and Minor compaction, Configuration files, HBase v/s Cassandra
Java API, Client API, Filters, Scan Caching and Batching, Command Line Interface, REST API
HBase Data Modeling, Bulk loading data in HBase, HBase Coprocessors – Endpoints (similar to Stored Procedures in RDBMS), HBase Coprocessors – Observers (similar to Triggers in RDBMS)
Introduction to RDD, Installation and Configuration of Spark, Spark Architecture, Different interfaces to Spark, Sample Python programs in Spark
Use case of YARN, YARN Architecture, YARN Demo
Use case of Oozie, Oozie Architecture, Oozie Demo
Use case of Flume, Flume Architecture, Flume Demo
Use case of Sqoop, Sqoop Architecture, Sqoop Demo
Cloudera Hadoop cluster on the Amazon Cloud (Practice), Using EMR (Elastic MapReduce), Using EC2 (Elastic Compute Cloud)
Standalone mode (Theory), Distributed mode (Theory), Pseudo-distributed, Fully distributed
Hadoop industry solutions, Importing/exporting data across RDBMS and HDFS using Sqoop, Getting real-time events into HDFS using Flume, Creating workflows in Oozie, Introduction to Graph processing, Graph processing with Neo4J, Using the Mongo Document Database, Using the Cassandra Columnar Database, Distributed Coordination with ZooKeeper
Click Stream Analysis using Pig and Hive, Analyzing the Twitter data with Hive, Further ideas for data analysis
Our instructors/trainers are Cloudera and Hortonworks certified professionals. They have more than 12 years of industry experience and are Subject Matter Experts in Big Data.
To attend the live virtual training, you need an internet connection of at least 2 Mbps.
Yes, Collabera TACT’s Virtual Machine can be installed on any local system. The Collabera training team will assist you with the installation.
To install the Hadoop environment, you need 4 GB of RAM, a 32/64-bit OS, 50 GB of free hard disk space and a processor with virtualization technology enabled.
The online live training course runs over 8 weekends, in 16 sessions totalling 46 to 48 hours.
Candidates need not worry about missing a training session. They can view the recorded sessions available on the LMS. We also have a technical support team to assist candidates with any queries.
Access to the Learning Management System (LMS) is for a lifetime and includes class recordings, presentations, sample code and projects. You will also have 2 years of access to the Hadoop cluster.
Yes, we do have an option of group discount. To know more about group discount, contact firstname.lastname@example.org.
Yes, a course completion certificate is provided once you successfully complete the training program. You will be evaluated on a few parameters such as attendance in sessions, an objective examination and others. Based on your overall performance, you will be certified by Collabera TACT.