|Platform MapReduce Distributed Runtime Engine
|Platform MapReduce --- Proven, Reliable, Efficient
Platform MapReduce is a best-of-breed, next-generation distributed runtime engine for enterprise-class Hadoop MapReduce applications, compatible with many popular MapReduce distributions. Unlike the distributed workload engines found in open source and other commercial MapReduce distributions, Platform MapReduce is designed to provide enterprise-class MapReduce runtime capabilities: high resource utilization and predictability, high availability, an open architecture supporting multiple applications and file systems, better manageability, and enterprise-class security. Platform MapReduce is built on Platform Computing's years of expertise in distributed workload scheduling and resource management, both proven technologies powering the most demanding, mission-critical workloads at many Fortune 500 companies. As a best-of-breed solution, Platform MapReduce delivers unprecedented distributed workload runtime services for your MapReduce applications.
Platform Computing's MapReduce solution includes (see architecture diagram below):
- Application adapter technology for executing Hadoop MapReduce jobs without changing code or recompiling. Support includes MapReduce Java, Pig, and Hive application code; other Hadoop projects such as Oozie are also supported.
- Multiple Application Programming Interfaces (APIs) for executing other commercial applications. Examples include C/C++, C#/.NET, Java, Python, R, native binaries, and others.
- Support for mixed application workloads (MapReduce and Non-MapReduce APIs) executing on the same set of shared resources within the same cluster.
- Platform MapReduce Workload Engine that automates, distributes, and manages MapReduce workloads according to users' Service Level Agreements (SLAs).
- Platform Resource Orchestrator that allocates and manages distributed pools of resources including clusters, servers, CPUs, and memory. It allows multiple applications to share a common set of resources.
- Advanced file system and data access framework - Provides connectivity to different types of file systems and database architectures, eliminating the need to migrate existing data while optimizing resource utilization.
- Rich set of management, troubleshooting, and reporting tools - Single GUI interface for managing and troubleshooting multiple MapReduce applications across a shared set of resources. Full version compatibility with support for rolling upgrades.
|Benefits and Features
High Resource Utilization and Predictability
Enterprise-class Manageability and Security
- Support for multiple MapReduce applications sharing the MapReduce distributed workload cluster
- Sophisticated scheduler to manage various types of workloads simultaneously across the cluster
- Application deployment, workload scheduling policies, tuning, and general monitoring and administration
- Automated failover of HDFS NameNode, Job Tracker and Task Tracker
- Automatic job restart
Multiple Application Workload Capable
- Open architecture for application development and end user access
- Open architecture supporting various types of distributed file systems, with input data sources that can differ from output data sources
Guaranteed SLA Requirements
- Supports multiple MapReduce applications and mixed workloads running simultaneously across a set of common resources
Hadoop Compatibility And Optional Support For HDFS
- Priority-based scheduling with a high degree of granularity
- Supports up to 10,000 levels of prioritization for extremely granular scheduling logic
- 100% compatibility with Hadoop applications and fully integrated, optional support for HDFS for Platform MapReduce customers
- A complete, fully supported MapReduce stack from a single vendor
|Key Features and Functions
Policy Driven Workload Scheduler
The Platform MapReduce policy-driven workload scheduler allows multiple jobs to execute in parallel, driven by a powerful priority scheduling engine. Platform MapReduce provides 10,000 levels of priority and fair-share scheduling of Mapper and Reducer jobs, all at the job level for better granularity and control. In contrast, the open source Hadoop solution does not provide fair-share scheduling of its Mapper and Reducer jobs in its default deployment; instead, it uses a FIFO methodology. Even with its fair-share scheduling plug-in options, open source Hadoop is restricted to five priority levels.
A distinct advantage of the Platform MapReduce policy-driven workload scheduler is resource priority for preemptive jobs. Once a preemptive job is submitted, it receives all the resources needed to complete; existing jobs wait for additional resources until the preemptive job is done. Once the preemptive job has completed, the previously running jobs resume under normal fair-share scheduling priorities.
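The scheduling behavior described above can be sketched in a few lines of Python. This is an illustrative toy model, not Platform MapReduce's actual implementation: the class and job names are hypothetical, and a preemptive job is modeled simply as one that jumps ahead of all 10,000 normal priority levels.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                                  # lower value = higher priority (0..9999)
    name: str = field(compare=False)
    preemptive: bool = field(default=False, compare=False)

class PriorityScheduler:
    """Toy priority scheduler: preemptive jobs outrank all normal levels."""
    def __init__(self):
        self._queue = []

    def submit(self, job):
        # A preemptive job gets an effective priority of -1, ahead of
        # every normal level; others keep their submitted priority.
        effective = -1 if job.preemptive else job.priority
        heapq.heappush(self._queue, Job(effective, job.name, job.preemptive))

    def next_job(self):
        # Dispatch the highest-priority job waiting in the queue.
        return heapq.heappop(self._queue).name

sched = PriorityScheduler()
sched.submit(Job(500, "nightly-etl"))
sched.submit(Job(10, "ad-hoc-query"))
sched.submit(Job(4000, "log-archive", preemptive=True))
```

With these submissions, the preemptive "log-archive" job dispatches first despite its low nominal priority; the remaining jobs then run in priority order.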
High Resource Availability
Platform MapReduce guarantees uptime within the distributed runtime engine: there is no single point of failure. If the server running the Master Job Tracker fails, Platform MapReduce jobs continue by automatically restarting the Master Job Tracker on a new server, followed by automatically restarting the currently running MapReduce tasks. Recovery of any tasks in progress at the time of failure is automatic, and completed tasks do not have to be re-run. The open source Hadoop implementation does not provide this capability. In addition, Platform MapReduce provides automatic recovery of individual Map and Reduce tasks if the compute server running them fails: they are either restarted on the current compute server (if possible) or restarted on an alternative server immediately.
For Hadoop file systems, Platform MapReduce offers automatic failover of the NameNode within the Hadoop Distributed File System, along with file system recovery and dependent job recovery.
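The restart-on-an-alternative-server behavior can be illustrated with a minimal sketch, assuming a task is a function of the server it runs on and a server failure surfaces as a `ConnectionError`. The function and server names here are hypothetical, not part of the product's API.

```python
def run_with_failover(task, servers):
    """Run `task` on the first healthy server, restarting it on the
    next candidate server if the current one fails."""
    for server in servers:
        try:
            return task(server)
        except ConnectionError:
            continue  # server down: restart the task elsewhere
    raise RuntimeError("no healthy server available")

def flaky_task(server):
    # Simulate a compute-server failure on node-1.
    if server == "node-1":
        raise ConnectionError("node-1 is down")
    return f"task completed on {server}"

result = run_with_failover(flaky_task, ["node-1", "node-2"])
```

The task transparently completes on "node-2" after "node-1" fails, without the caller re-submitting anything.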
Open Architecture For Application Development And End User Access
Platform MapReduce is built on an open architecture to support multiple MapReduce applications, including 100% Hadoop application compatibility for Java-based MapReduce jobs. Jobs built with Hadoop MapReduce technology (Java, Pig, Hive) require no changes to their programming logic to execute on Platform MapReduce; with no changes to the programming logic, customers avoid any need to recompile code. Examples of applications supported by the Platform Application Adapter technology include Pig, Hive, Oozie, Dumbo, natively written Java MapReduce programs, and others.
Open Architecture For Choice Of File Systems
Platform MapReduce provides a method for leveraging multiple file system types as well as database architectures. It fully supports HDFS, GPFS, and other distributed file system and data types. In addition, for MapReduce processes, the input data source file system type can differ from the output data source file system. This supports many uses, including extract, transform, and load (ETL) workflows. For example, the input data source could be HDFS while the output data source is a database, eliminating the need to stage data prior to a batch load process.
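The decoupling of input and output data sources can be sketched as a pair of pluggable connectors. The class names below (`HDFSSource`, `DatabaseSink`) are hypothetical stand-ins, and the data is faked in memory; a real connector would stream HDFS blocks and issue database inserts.

```python
class HDFSSource:
    """Stand-in for an HDFS input connector (hypothetical name)."""
    def __init__(self, path):
        self.path = path

    def read(self):
        # A real connector would stream records out of HDFS blocks.
        return ["alice,3", "bob,5"]

class DatabaseSink:
    """Stand-in for a database output connector."""
    def __init__(self):
        self.rows = []

    def write(self, records):
        self.rows.extend(records)

def run_etl(source, sink, transform):
    # Input and output are independent plug-ins, so transformed
    # records flow straight to the sink with no staging step.
    sink.write([transform(r) for r in source.read()])

sink = DatabaseSink()
run_etl(HDFSSource("/data/in"), sink, str.upper)
```

Swapping either connector (say, GPFS in, HDFS out) requires no change to the ETL logic itself, which is the point of the open file-system architecture.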
Supporting Multiple MapReduce Applications Running On The Same Cluster
Platform MapReduce supports up to 300 separate MapReduce applications (also known as Job Trackers) running on the same distributed file system cluster. Multiple applications (or Job Trackers) can run simultaneously and dynamically share resources across application boundaries. This eliminates siloed IT operations in which a cluster is dedicated to a single Job Tracker, and increases resource utilization dramatically while still maintaining a single management interface. In addition, Platform MapReduce supports mixed types of workloads (MapReduce as well as other distributed workloads) running on a single cluster, allowing customers to leverage existing resources, drive up utilization of all resources, balance the workload across applications dynamically, and thus maximize their IT infrastructure.
Platform MapReduce offers three times the scalability of its open source alternatives per distributed file system cluster:
- Up to 5,000 nodes and 40,000 cores
- Up to 40,000 concurrent tasks
- Up to 1,000,000 total tasks in a single job
- Up to 1,000 concurrent jobs with 300 MapReduce (Job Tracker) applications
Platform MapReduce is architected to support a virtually unlimited number of priority levels for job scheduling; open source Hadoop, by contrast, supports only five.
Greater monitoring and troubleshooting capabilities
Platform MapReduce monitors CPU and memory utilization levels and allocates resources accordingly. It can pull log data from individual servers and manage it from a single interface.
Support For Rolling Upgrades
Platform MapReduce supports multiple versions of MapReduce applications running on the same cluster, so there is no need to take down the entire cluster for a software upgrade. Customers can select the servers to upgrade; once upgraded, these servers co-exist with nodes running the previous version of the product, allowing upgrades to proceed incrementally across a set of servers without taking down the entire cluster.
Platform MapReduce automatically cleans up intermediate and temporary files upon job completion. As part of the MapReduce logic, all temporary and intermediate files are removed from the local servers at the completion of the last Reduce task associated with a job. With Platform MapReduce, cleanup is done automatically at the job level, and the cleanup task depends on all Reduce tasks completing first. This dependency ensures files are not deleted prematurely and can be leveraged again if a failure occurs in the middle of a job.
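The cleanup dependency described above can be sketched as a toy job runner: mappers write intermediate files, every reducer must finish before the cleanup step runs, and only then are the intermediates removed. This is an illustrative model, not the product's internal logic.

```python
import os
import tempfile

def run_job(mappers, reducers):
    """Toy job runner: cleanup is a job-level step that depends on
    every Reduce task finishing first."""
    tmpdir = tempfile.mkdtemp()

    # Map phase: each mapper writes an intermediate file locally.
    paths = []
    for i, mapper in enumerate(mappers):
        path = os.path.join(tmpdir, f"part-{i}")
        with open(path, "w") as f:
            f.write(mapper())
        paths.append(path)

    # Reduce phase: intermediates must survive until the last reducer
    # completes, so a mid-job failure could still re-use them.
    def read(p):
        with open(p) as f:
            return f.read()
    results = [reduce_fn([read(p) for p in paths]) for reduce_fn in reducers]

    # Cleanup task: runs only after all reduces are done.
    for p in paths:
        os.remove(p)
    os.rmdir(tmpdir)
    return results

out = run_job([lambda: "a", lambda: "b"],
              [lambda parts: "".join(parts)])
```

If a reducer raised partway through, the `for p in paths` cleanup loop would never run, leaving the intermediate files in place for recovery, which mirrors the dependency the text describes.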
Platform MapReduce is the industry's fastest distributed resource infrastructure solution, with the ability to harness resources from distributed clusters in remote data centers. Platform MapReduce helps organizations run complex data simulations with sub-millisecond latency and throughput of over 7,300 tasks per second.
Flexible Resource Sharing
Platform MapReduce reacts quickly and dynamically to changes in application demand, bringing a flexible amount of computing power to your applications, even streaming data to the distributed resources. Based on workload volume, Platform MapReduce can grow or shrink distributed resources by re-allocating up to 1,000 CPUs per second to match the current workload, reducing cost while maximizing results.
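The grow-or-shrink behavior amounts to moving each job's allocation toward its current demand, capped by a per-cycle reallocation rate. The sketch below assumes a 1,000-CPU-per-cycle cap purely to echo the figure quoted above; the function name and model are illustrative, not the product's API.

```python
def rebalance(allocated_cpus, demanded_cpus, max_delta=1000):
    """Move a job's CPU allocation toward current demand, shifting at
    most `max_delta` CPUs per scheduling cycle (illustrative model)."""
    delta = max(-max_delta, min(max_delta, demanded_cpus - allocated_cpus))
    return allocated_cpus + delta
```

A job spiking from 200 allocated CPUs to a demand of 5,000 grows by the full 1,000-CPU cap in one cycle; a job winding down shrinks at the same bounded rate, freeing capacity for other applications.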
Platform MapReduce MultiCore Optimizer
The Platform MapReduce MultiCore Optimizer helps increase application performance and lower infrastructure cost through higher utilization of multi-core servers. This capability makes the most of multi-core servers for both multi-threaded and single-threaded applications. Application performance and scalability are improved by reducing I/O and memory contention on multi-core systems, increasing utilization by running multiple I/O-intensive tasks per core, and efficiently matching resources to non-uniform workloads.
Platform MapReduce Data Affinity
Platform MapReduce includes powerful data affinity capabilities to significantly improve application performance and resource utilization by taking into account data locality when scheduling MapReduce workloads. Its data affinity solution virtually eliminates the time it takes to access large data volumes required by data intensive MapReduce applications. It increases overall application performance by up to 400% through faster file access.
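At its core, data-affinity scheduling means preferring a node that already holds a local replica of the task's input block, falling back to remote nodes only when no local replica exists. A minimal sketch of that placement decision, with hypothetical node and block names:

```python
def schedule_task(block_id, node_blocks, node_load):
    """Pick the least-loaded node that already hosts the task's data
    block; fall back to any node when no local replica exists."""
    local = [n for n, blocks in node_blocks.items() if block_id in blocks]
    candidates = local or list(node_blocks)
    return min(candidates, key=lambda n: node_load[n])

# Which blocks each node hosts, and how busy each node is.
node_blocks = {"node-a": {"b1", "b2"}, "node-b": {"b2"}, "node-c": set()}
node_load = {"node-a": 3, "node-b": 1, "node-c": 0}

local_pick = schedule_task("b2", node_blocks, node_load)
remote_pick = schedule_task("b9", node_blocks, node_load)
```

A task reading block "b2" lands on "node-b" (a local replica, and less loaded than "node-a") even though idle "node-c" exists; only a task whose block has no replica anywhere falls back to the least-loaded node. Avoiding the network transfer in the common case is what drives the file-access speedup described above.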