CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. (2015)
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3595
SPECIAL ISSUE PAPER
Evaluating map reduce tasks scheduling algorithms over cloud
Qutaibah Althebyan1,*,†, Yaser Jararweh2, Qussai Yaseen3, Omar AlQudah2
and Mahmoud Al-Ayyoub2

1 Software Engineering Department, Jordan University of Science and Technology, Irbid, Jordan
2 Computer Science Department, Jordan University of Science and Technology, Irbid, Jordan
3 Computer Information Systems Department, Jordan University of Science and Technology, Irbid, Jordan

SUMMARY

Efficiently scheduling MapReduce tasks is considered one of the major challenges facing MapReduce frameworks. Many algorithms have been introduced to tackle this issue, and most of them focus on the data locality property for task scheduling. Data locality may cause lower physical resource utilization in non-virtualized clusters and higher power consumption. Virtualized clusters provide a viable solution that supports both data locality and better cluster resource utilization. In this paper, we evaluate the major MapReduce scheduling algorithms, namely FIFO, Matchmaking, Delay, and multithreading locality (MTL), on a virtualized infrastructure. Two major factors are used to test the evaluated algorithms: the simulation time and the energy consumption. The evaluated schedulers are compared, and the results show the superiority of the MTL scheduler over the other existing schedulers. We also present a comparison study between virtualized and non-virtualized clusters for MapReduce task scheduling. Copyright © 2015 John Wiley & Sons, Ltd.
Received 12 June 2015; Accepted 16 June 2015

KEY WORDS: map reduce; schedulers; cloud computing; virtualization; Hadoop scalability
1. INTRODUCTION

Over the past few years, massive amounts of data have been produced on a daily (or even hourly) basis, and these amounts are expected to increase significantly each year. For example, McAfee and Brynjolfsson  estimated that in 2012 the world would produce 2.5 Exabytes of data daily, whereas Villars et al.  estimated the data to be produced in 2014 at 7 Zettabytes.
There are many reasons for this explosion of data, such as the wide and ubiquitous use of electronic devices including computers, sensor networks, and smartphones. Data are also generated by the many healthcare devices and sites that produce and transmit huge amounts of information. The chief reason, however, is the extensive use of social communication and media sharing sites in everyday activities. Consequently, this huge amount of generated data needs to be handled using novel and efficient data management systems, giving rise to the term 'big data'.
The term big data currently refers to huge and complex datasets that no traditional data processing system can handle efficiently. Many researchers have proposed different definitions of the term. One of the most popular is that of McKinsey , in which big data is defined as 'datasets whose size are beyond the ability of typical database software tools to capture, store, manage and analyze.' New technologies are needed to extract value from those datasets; such processed data might be used in other fields such as artificial intelligence, data mining, health care, and social networks. International Business Machines (IBM) researchers  characterized big data by the 3Vs: variety, volume, and velocity. Variety refers to the multiple types/formats in which big data is generated, such as digits, texts, audio, video, and log files.

*Correspondence to: Qutaibah Althebyan, Software Engineering Department, Jordan University of Science and Technology, Irbid, Jordan.
†E-mail: firstname.lastname@example.org
The second characteristic, volume, refers to the huge size of big data, which can reach hundreds or thousands of terabytes. The third characteristic, velocity, means that processing and analysis must be performed fast enough to extract the value of the data within an appropriate time. These characteristics drive the development of new methodologies to deal with such huge amounts of data; hence the term 'big data management'.
Big data operations span many technologies, for example, cloud computing, distributed systems, data warehouses, Hadoop, and MapReduce. MapReduce is one of the technologies utilized to handle such big data. It is a software framework introduced by Google for processing large amounts of data in parallel . It provides a set of features such as user-defined functions, automatic parallelization and distribution, fault tolerance, and high availability through data replication.
MapReduce works in two phases: the map phase and the reduce phase. In the map phase, a dedicated node called the master node takes the input, divides it into smaller data splits, and assigns them to worker nodes. A worker node may perform the same splitting operation, leading to a hierarchical tree structure. Each worker processes its assigned splits and sends the results back to the master node. The reduce phase then begins by sorting and shuffling the partitions produced by the map phase: the master node directs the reduce workers to fetch their intermediate pairs through remote procedure calls, and after acquiring its data, each reducer groups and sorts it by key. Once the data are sorted, the reducer calls the reduce function on each key and its associated values, and the output is written to the corresponding output files, completing the job. The MapReduce framework nevertheless suffers from drawbacks related to task scheduling, which can severely harm data management efficiency and performance. Hence, MapReduce scheduling techniques need improvement, especially in task scheduling, in order to obtain better cluster utilization and improve the other factors that affect the performance of the whole system. A scalable scheduler is therefore needed to address these problems.
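The map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a minimal single-process word-count illustration of the programming model, not the distributed Hadoop implementation; the function names and the serial execution of "workers" are ours.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Map: emit an intermediate (key, value) pair for each word in a split.
    return [(word, 1) for word in split.split()]

def shuffle(intermediate_pairs):
    # Shuffle/sort: group intermediate values by key, as the reducers
    # would after fetching their partitions from the mappers.
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values observed for one key.
    return key, sum(values)

# The "master" assigns one split per worker; here we run them serially.
splits = ["big data big", "data locality"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'locality': 1}
```

In a real cluster, each `map_phase` call would run on the worker node holding that split, and the shuffle would move data over the network between workers.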
A term highly related to cloud computing and MapReduce is virtualization. Virtualization is the technology that allows multiple operating systems, referred to as virtual machines (VMs), to run simultaneously on a single physical machine while sharing its underlying resources. The benefits of sharing those resources include (i) the ability to run multiple different operating systems (e.g., Windows and Linux) and (ii) maximizing the use of the machine's resources through several isolated operating systems . A hypervisor is the software layer that provides the virtualization; it emulates the underlying hardware resources for the different VMs.
Normally, operating systems have direct access to the physical hardware resources. Under virtualization, they do not; instead, they access those resources through the hypervisor, which executes privileged instructions on behalf of the VMs. Virtualization technology thus allows the resources of a single physical machine to be allocated among multiple different users. Popular virtualization technologies include Xen, KVM, and VMware.
In cloud computing, virtualization technology is used to dynamically grow or shrink the resources of applications by creating or destroying VMs. It also helps co-locate VMs onto a small number of physical machines, which may reduce the number of active physical machines. This approach is called server consolidation .
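Server consolidation is essentially a bin-packing problem: place VM loads onto as few hosts as possible so the rest can be powered down. The sketch below uses the classic first-fit-decreasing heuristic; the function name, the normalized load values, and the choice of heuristic are illustrative assumptions, not a method from this paper.

```python
def consolidate(vm_loads, host_capacity):
    """First-fit-decreasing packing: place each VM on the first host
    with room, opening a new host only when none fits. Fewer active
    hosts means idle machines can be powered down to save energy."""
    hosts = []  # each host is a list of the VM loads placed on it
    for load in sorted(vm_loads, reverse=True):
        for host in hosts:
            if sum(host) + load <= host_capacity:
                host.append(load)
                break
        else:
            hosts.append([load])  # no existing host fits: activate one
    return hosts

# Six VMs with CPU demands expressed as fractions of one host's capacity:
placement = consolidate([0.5, 0.3, 0.7, 0.2, 0.4, 0.3], host_capacity=1.0)
print(len(placement))  # number of active physical machines needed
```

Real consolidation managers also weigh migration cost and multi-dimensional resources (CPU, memory, I/O), but the packing intuition is the same.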
Hence, a new scheduling algorithm called the multithreading locality (MTL) scheduler has been proposed [7, 8]. To demonstrate the superiority of this scheduler over existing algorithms, an evaluation process is conducted that tests the performance of the new scheduler against existing schedulers (First In First Out (FIFO), Delay, and Matchmaking (MM)) over a virtualized environment. The evaluation tests system performance in terms of both the simulation time and the energy consumption, and it concludes with a comparison study between virtualized and non-virtualized clusters for MapReduce task scheduling. To avoid confusion, note that the two terms 'virtualized infrastructure' and 'cloud infrastructure' are used interchangeably throughout the paper.
The rest of the paper is organized as follows. Section 2 discusses some of the related works.
Section 3 highlights various existing scheduling algorithms, including our proposed scheduler (MTL). Section 4 presents our evaluation and experimental results. Finally, Section 5 concludes our work and outlines future directions.
2. BACKGROUND AND RELATED WORK
Many technologies produce huge amounts of data that need to be handled using big data management schemes. Examples include, but are not limited to, distributed systems, cloud computing, data warehouses, Hadoop, and MapReduce. In distributed systems, the workload is decomposed into small tasks that are distributed among multiple nodes connected through networks and processed in parallel to achieve better performance at reduced cost. Another system that handles huge amounts of data is the data warehouse, a special database designed for managing, storing, and processing huge amounts of data. Data warehouses use technologies such as the ETL process, which has three phases (extract, transform, and load) and moves data from operational stores into business intelligence systems. These three phases are usually run in parallel to guarantee better and more efficient outcomes.
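The three ETL phases can be sketched as a small pipeline. This is a toy illustration with invented record fields (`sku`, `qty`, `price`) and an in-memory "warehouse"; real ETL tools operate on databases and files.

```python
def extract(store):
    # Extract: pull raw records from an operational store.
    return iter(store)

def transform(records):
    # Transform: clean and reshape each record into the warehouse schema.
    for r in records:
        yield {"sku": r["sku"].upper(), "revenue": r["qty"] * r["price"]}

def load(rows, warehouse):
    # Load: append the transformed rows into the warehouse table.
    warehouse.extend(rows)

warehouse = []
source = [{"sku": "a1", "qty": 2, "price": 5.0},
          {"sku": "b2", "qty": 1, "price": 3.5}]
load(transform(extract(source)), warehouse)
print(warehouse[0])  # {'sku': 'A1', 'revenue': 10.0}
```

Because `transform` is a generator, records stream through all three stages one at a time, loosely mirroring how the phases overlap when run in parallel.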
Cloud computing is a parallel and distributed system with scalable resources that is continuously evolving and spreading. It provides services via networks: common, shared computing resources (software, hardware, infrastructure, etc.) are used to deliver a set of utilities, serving as an alternative to local servers, database storage, or even local computing services and functions. In fact, cloud computing provides a good environment for techniques that need scalable resources and distributed systems, such as the MapReduce framework . Examples of emerging cloud computing platforms include Amazon EC2, Google App Engine, Aneka, and Microsoft Azure.

Hadoop is one of the most well-known cloud computing frameworks. It is an open-source software implementation of Google's MapReduce developed by Apache, designed to handle huge amounts of data by building on distributed systems. Hadoop uses the MapReduce framework , which scales processing from single servers to several thousands of servers. Hadoop hides the details of the parallel processing performed across distributed nodes, as well as any other issues the system might undergo, such as the failure of a node, the failure of a task or subtask, and the consolidation of results after processing and computation. Hadoop consists of two main components: the Hadoop distributed file system (HDFS) and the MapReduce framework. MapReduce is a software framework introduced by Google for parallel processing of large amounts of data ; it exploits the massive computing infrastructure available to handle the major issues introduced by the rising use of big data. The MapReduce framework consists of several components: a master, mappers, and reducers.
The master is responsible for scheduling job component tasks among the mappers and reducers. Mappers process the input data into a form suitable as input to the reducers, and reducers process the data coming from multiple mappers and return the result .
These huge amounts of data result from many diverse jobs that usually share large-scale computing clusters, and they need to be processed in a specialized manner. A major step in this specialized processing (in the map phase of the MapReduce framework, as just mentioned) is scheduling. Unless the scheduling is optimized, dealing with this big data will not produce results that enhance the utilization of the system as a whole. Examples of such utilization goals include (but are not limited to) high availability, high cluster utilization, fairness, data locality (placing computation near its input data), and scalability. Incorporating all such factors makes an efficient scheduling model hard to achieve.
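Data locality, the factor most of the evaluated schedulers revolve around, can be illustrated with a tiny assignment routine: when a node asks for work, prefer a pending task whose input split is already stored on that node. This is a minimal sketch of the idea only; the function and data-structure names are ours, not Hadoop's.

```python
def assign_task(node, pending_tasks, block_locations):
    """Locality-aware assignment: prefer a task whose input split is
    stored on the requesting node, falling back to any pending task.
    `block_locations` maps task id -> set of nodes holding its split."""
    for task in pending_tasks:
        if node in block_locations[task]:
            pending_tasks.remove(task)
            return task  # node-local: computation runs next to its data
    # No local task: hand out the oldest task and accept a remote read.
    return pending_tasks.pop(0) if pending_tasks else None

tasks = ["t1", "t2", "t3"]
locations = {"t1": {"nodeA"}, "t2": {"nodeB"}, "t3": {"nodeA", "nodeC"}}
print(assign_task("nodeB", tasks, locations))  # 't2' (local to nodeB)
```

Schedulers differ mainly in how long they are willing to wait for a local slot before taking the fallback branch; the Delay scheduler discussed later, for instance, postpones non-local assignments for a bounded time.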
In fact, many techniques are currently used for scheduling tasks in the MapReduce framework. FIFO is one of the basic traditional schedulers ; more details about FIFO are provided in the next section. Another scheduler is the MM scheduler , which is also elaborated in the next section.
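FIFO's core behavior is simple enough to sketch: jobs are served strictly in arrival order, and every free task slot takes a task from the oldest submitted job. The class below is a toy model of that policy under our own simplifying assumption that a job is just an id plus a task count.

```python
from collections import deque

class FIFOScheduler:
    """Jobs run strictly in submission order: a free task slot always
    receives the next task of the oldest job still in the queue."""
    def __init__(self):
        self.queue = deque()

    def submit(self, job_id, num_tasks):
        self.queue.append([job_id, num_tasks])

    def next_task(self):
        if not self.queue:
            return None
        job = self.queue[0]
        job[1] -= 1
        if job[1] == 0:          # all of this job's tasks handed out
            self.queue.popleft()
        return job[0]

sched = FIFOScheduler()
sched.submit("job1", 2)
sched.submit("job2", 1)
print([sched.next_task() for _ in range(3)])  # ['job1', 'job1', 'job2']
```

The model also exposes FIFO's well-known weakness: a long job at the head of the queue starves every job behind it, which is exactly what fair share and capacity schedulers try to avoid.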
Another scheduler is the fair share scheduler, which gives every job a fair share of the cluster over a predefined time slot, preventing greedy jobs from monopolizing resources . An alternative is the capacity scheduler, which partitions the resources into multiple pools with a dedicated job queue for every pool. The authors of this scheduler observed that the order in which jobs are executed affects the overall execution time of all jobs, so they concentrate on ordering jobs to achieve the best system performance. The balanced pool approach uses Johnson's algorithm, which yields optimal schedules for two-stage problems such as map and reduce scheduling .
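Johnson's rule for a two-stage flow shop, mentioned above in the context of balanced pools, is short enough to state in code. Treating each job as a (map time, reduce time) pair is our illustrative framing; the numbers below are invented for the example.

```python
def johnson_order(jobs):
    """Johnson's rule for two machines: jobs whose first-stage time is
    no larger than their second-stage time go to the front, sorted by
    increasing first-stage time; the remaining jobs go to the back,
    sorted by decreasing second-stage time. The resulting sequence
    minimizes the makespan of the two-stage schedule."""
    front = sorted((j for j in jobs if jobs[j][0] <= jobs[j][1]),
                   key=lambda j: jobs[j][0])
    back = sorted((j for j in jobs if jobs[j][0] > jobs[j][1]),
                  key=lambda j: jobs[j][1], reverse=True)
    return front + back

# (map_time, reduce_time) per job -- illustrative numbers only
times = {"A": (3, 6), "B": (5, 2), "C": (1, 2), "D": (6, 6)}
print(johnson_order(times))  # ['C', 'A', 'D', 'B']
```

Intuitively, short map stages are pushed early so the reduce stage starts as soon as possible, and short reduce stages are pushed late so the map stage never waits at the end.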