FREE ELECTRONIC LIBRARY - Abstracts, online materials

Pages:   || 2 | 3 | 4 |

«CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2015) Published online in Wiley Online Library ...»

-- [ Page 1 ] --


Concurrency Computat.: Pract. Exper. (2015)

Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3595


Evaluating map reduce tasks scheduling algorithms over cloud

computing infrastructure

Qutaibah Althebyan1,*,†, Yaser Jararweh2, Qussai Yaseen3, Omar AlQudah2

and Mahmoud Al-Ayyoub2 1 Software Engineering Department, Jordan University of Science and Technology, Irbid, Jordan 2 Computer Science Department, Jordan University of Science and Technology, Irbid, Jordan 3 Computer Information Systems Department, Jordan University of Science and Technology, Irbid, Jordan SUMMARY Efficiently scheduling MapReduce tasks is considered as one of the major challenges that face MapReduce frameworks. Many algorithms were introduced to tackle this issue. Most of these algorithms are focusing on the data locality property for tasks scheduling. The data locality may cause less physical resources utilization in non-virtualized clusters and more power consumption. Virtualized clusters provide a viable solution to support both data locality and better cluster resources utilization. In this paper, we evaluate the major MapReduce scheduling algorithms such as FIFO, Matchmaking, Delay, and multithreading locality (MTL) on virtualized infrastructure. Two major factors are used to test the evaluated algorithms: the simulation time and the energy consumption. The evaluated schedulers are compared, and the results show the superiority and the preference of the MTL scheduler over the other existing schedulers. Also, we present a comparison study between virtualized and non-virtualized clusters for MapReduce tasks scheduling. Copyright © 2015 John Wiley & Sons, Ltd.

Received 12 June 2015; Accepted 16 June 2015 map reduce; schedulers; cloud computing; virtualization; Hadoop scalability


1. INTRODUCTION Over the past few years, it has been noticed that massive amounts of data are being produced on a daily (or even an hourly) basis. These amounts are expected to increase significantly each year. For example, Mcafee and Brynjolfsson [1] estimated that in 2012, the world would produce 2.5 Exabytes of data daily, whereas Villars et al. [2] estimated the data to be produced in 2014 to be 7 Zettabytes.

There are many reasons for this explosion of data such as the wide and ubiquitous use of electronic devices such as computers, sensor networks, and smartphones. They are also generated by the many healthcare devices and sites that generate and transmit huge amounts of data. Another reason that is in fact on top of all other reasons is the extensive use of socialcommunication and media sharing sites in several daily activities. Consequently, this huge amount of generated data needs to be handled using novel and efficient data management systems giving rise to the term ‘big data’.

The big data term is currently used to refer to huge and complex datasets that no traditional data processing system can handle efficiently. Many researchers presented different definitions to the big data term. One of the most popular ones is that of McKinsey [3] in which big data is defined *Correspondence to: Qutaibah Althebyan, Software Engineering Department, Jordan University of Science and Technology, Irbid, Jordan.

† E-mail: qaalthebyan@just.edu.jo

–  –  –

as ‘datasets whose size are beyond the ability of typical database software tools to capture, store, manage and analyze.’ New technologies are needed to be able to extract values from those datasets;

such processed data might be used in other fields such as artificial intelligence, data mining, health care, and social networks. International Business Machine researchers [4] characterized big data with the 3Vs: variety, volume, and velocity. Variety is used to refer to the multiple types/formats in which big data is generated such as digits, texts, audios, videos, and log files.

The second characteristic is the huge volume of big data which can reach hundreds or thousands of terabytes. The third characteristic is the velocity where processing and analyzing data must be performed in a fast manner to extract value of data within an appropriate time. These characteristics drive for developing new methodologies to deal with such huge amounts of data. So, comes to existence the term ‘big data management’.

Big data operations are widely used in many technologies, for example, cloud computing, distributed systems, data warehouse, Hadoop, and MapReduce. MapReduce is one of these technologies that are utilized to handle such big data. It is a software framework introduced by Google for processing large amounts of data in a parallel manner [5]. In fact, it provides a set of features such as user-defined functions, automatic parallelization and distribution, fault tolerance, and high availability by data replicating.

MapReduce works in two phases: the map phase and the reduce phase. In the map phase, a dedicated node called the master node takes the input, divides it into smaller shared data splits, and assigns them to worker nodes. The worker nodes may perform the same splitting operation, leading to a hierarchal tree structure. The worker node processes the assigned splits and sends the results back to the master node. The reduce phase then begins with sorting and shuffling the partitions that are produced by the map phase. The master node confirms the reduce workers to obtain their intermediate pairs through remote procedure calls. After acquiring the data for each reducer, the data are grouped and sorted by the key. Once all reducers get sorted, the reducer calls reduce function for pairs of data. The output of reduce function writes to the corresponding output files leading to obtain all the needed processing achieved. MapReduce framework suffers from many drawbacks. Such drawbacks are related to the MapReduce task scheduling, which severely harm the data management efficiency and performance. Hence, the problems in MapReduce scheduling techniques need some amendments, especially in the task scheduling in order to obtain more cluster utilizations, as well as other factors that affect the performance of the whole system. So, a need for a scalable scheduler to face these kinds of problems is needed.

A term that is highly related to cloud computing and MapReduce is virtualization. Virtualization is the technology, which gives the ability to run multiple operating systems simultaneously on a single physical machine sharing its underlying resources that are referred to as virtual machines (VMs). Some of the benefits of sharing those resources are (i) the ability of running multiple and different operating systems (e.g., Windows and Linux) and (ii) using several separated operating systems maximizes the use of the machine’s resources [6]. A hypervisor is the software layer that provides the virtualization. It also emulates the underlying hardware resources to the different VMs.

Normally, operating systems have direct access to the physical resources hardware. In case of virtualization, operating systems do not have direct access to hardware resources; instead, they access those resources through the hypervisor. The hypervisor in its role executes the privileged instructions on behalf of VMs. Virtualization technology allows resource allocation of a single physical machine among multiple different users. Some of the popular virtualization technologies are XEN, KVM, and VMware.

In cloud computing, virtualization technology is used to dynamically allocate or reduce resources of applications by creating or destroying VMs. It also helps to co-locate VMs to a small number of physical machines which may reduce the number of active physical machines. This approach is called server consolidation [6].

Hence, a new scheduling algorithm called multithreading locality scheduler (MTL) has been proposed [7, 8]. To prove the superiority of this scheduler over other existing algorithms and to make sure that this new scheduler outperforms other existing scheduler, an evaluation process is conducted. This evaluation process tests the performance of the new scheduler against other existing schedulers (First in First out (FIFO), Delay, and Matchmaking (MM)) over a virtualized

–  –  –

environment. The evaluation process is conducted to test the system performance in terms of both the simulation time and the energy consumption over a virtualized environment. The evaluation process is finally concluded by a comparison study between virtualized and non-virtualized clusters for MapReduce tasks scheduling. It is important to mention that (for the reader not to be confused) the two terms ‘virtualized infrastructure’ and ‘cloud infrastructure’ have been used interchangeably throughout the whole paper.

The rest of the paper is organized as follows. Section 2 discusses some of the related works.

Section 3 highlights some details about various existing scheduling algorithms including our proposed scheduler (MTL). In Section 4, we show our evaluation and experimental results. Finally, we conclude our work and drew some future highlights in Section 5.


Many technologies produce huge amounts of data that need to be handled using big data management schemes. Examples of these include, but not limited to, distributed systems, cloud computing, data warehouse, Hadoop, and MapReduce. In distributed systems, the system is decomposed into small tasks. These tasks are distributed among multinodes that are connected through networks.

These tasks are processed in parallel to achieve better performance while reducing cost. Another system that produces huge amounts of data is the data warehouse. The data warehouse is a special database designed for managing, storing, and processing huge amounts of data. The data warehouse uses different technologies such as the ETL process which has three phases (extract, transform, and load). The ETL process is used to upload data from operational stores and business intelligence.

These three phases are usually run in parallel to guarantee better and more efficient outcomes.

Cloud computing is a parallel distributed system with scalable resources that is continuously evolving and spreading. It provides services via networks. Cloud computing refers to the use of some common and shared computing resources (including the software, the hardware, the infrastructure, etc.) to deliver a set of utilities and serves as an alternative approach to having local servers, database storages, or even computing services and functions. In fact, cloud computing is presented to provide good environment for techniques that need scalable resources and distributed systems such as MapReduce framework [5]. There are several examples for emerging cloud computing platforms, namely, Amazon EC2, Google App Engine, Aneka, and Microsoft Azure. Hadoop is one of the most well-known cloud computing frameworks. Hadoop is an open source implementation software framework of Google’s MapReduce developed by Apache. It is presented to handle huge amounts of data utilizing its core feature of having the distributed systems feature. Hadoop uses MapReduce framework [9] that upgrades the processing from single servers to several thousands of servers. Hadoop hides the details of the parallel processing that is performed for several distributed nodes. Moreover, it hides any other arising issue that the system might undergo, such as the failure of any node, the failure of any task or subtask, and the consolidation of the results after processing and computation. Hadoop is usually decomposed into two main components: Hadoop distributed file systems ‘HDFS’ and MapReduce framework. MapReduce is a software framework introduced by Google for parallel processing of large amounts of data [10]. It is mainly based on parallel processing of computing infrastructure that exploits the massive computing infrastructure available to handle many major issues introduced by the rising use of the big data. The MapReduce framework consists of several components which are master, mappers, and reducers. A master is responsible for scheduling job component tasks for mappers and reducers. Mappers are responsible for processing the input data to be suitable as input to reducers to be processed. Reducers are responsible for processing the data which comes from multiple mappers and return the result [10].

The huge amounts of data are a result of several jobs of different diversity, which usually share large-scale computing clusters. These need to be processed in a specialized manner. A major step in this specialized processing (in the map phase component of the MapReduce framework as we just mentioned) of the data is scheduling. Unless the scheduling is not optimized, dealing with this big data will not produce good results that will enhance the system utilization as a whole. Examples of

–  –  –

such utilization include (but are not limited to) high availability, high cluster utilization, fairness, data locality (computation near to its input data), scalability, and so on. Incorporating such factors makes it not easy to achieve an efficient scheduling model.

In fact, there are many techniques that are currently used for scheduling tasks in the MapReduce framework. FIFO is one of those basic traditional schedulers [11]. More details about FIFO will be provided in the next section. Another scheduler is the MM scheduler [11]. More elaboration about this scheduler will be drawn in the next section.

Another scheduler is the fair share scheduler that gives every job a fair share of the cluster over a respected time slot. This time slot is predefined in order to prevent any greedy jobs from resources reserving [11]. An alternative scheduler is the capacity scheduler that makes partitions for the resources and divides them into multipools with dedicated job queue for every pool. The authors of this scheduler observed that the order of executing job has an impact on all execution times for all jobs. So, they try to concentrate more on ordering jobs to achieve best system performance. Balanced pool uses Johnson’s algorithm that has optimal scheduling for two stages problems such as map and reduce scheduling [11].

Pages:   || 2 | 3 | 4 |

Similar works:

«Forêts Domaniales et Jeux de Droits dans le Cameroun Méridional. Le Réveil d’Un Vieux Débat Sans Issue ou La Croisée des Chemins? Un Paquet d’Arguments et d’Outils de Négociation en Support à l’Initiative des Droits et des Ressources et à la Prise de Décision Rapport Principal Phil René Oyono, Centre pour l’Environnement et le Développement (CED) Yaoundé, Janvier 2009 Remerciements Nous remercions infiniment les communautés de villages riverains des Réserves Forestières...»

«The Mounties or the R.C.M.P. Written by Nadine Fabbi, Assistant Director, Canadian Studies Center Henry M. Jackson School of International Studies, University of Washington For a hard copy of this essay, other background materials, or questions contact Na dine at: 206-543-6269 or nfabbi@u.washington.edu Last Updated: March 2003 Introduction Almost everyone has heard of the Mounties, Canada’s federal and provincial police force (except for Québec and Ontario). The Mounties are famous around...»

«FLEXIBLE FURLER 7.0 FLEXIBLE FURLER 9.0 INSTALLATION / OPERATING INSTRUCTIONS 44 James Street, Homer, NY 13077 Tel. 607 749 4599 Fax 607 749 4604 Website: www.sailcdi.com Updated October 2006 WARNINGS – READ BEFORE INSTALLING OR USING YOUR FURLER Improper installation of the Flexible Furler or improper reinstallation of the forestay can cause failure of the forestay, and could result in the loss of the mast and injury to crew members. PRE-INSTALLATION WARNING: You must use toggles at both...»

«Calhoun: The NPS Institutional Archive Theses and Dissertations Thesis Collection 2013-12 The culture of nationalism Uselman, Jerome J Monterey, California: Naval Postgraduate School http://hdl.handle.net/10945/39031 NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS THE CULTURE OF NATIONALISM by Jerome J. Uselman December 2013 Thesis Advisor: Donald Abenheim Co-Advisor: Carolyn Halladay Approved for public release; distribution is unlimited THIS PAGE INTENTIONALLY LEFT BLANK REPORT...»

«A Theoretical Framework and Operational Strategies for Creating ―Rivers‖ of Unrestricted Income for Nonprofit Charities Jeffrey L. Greim Bay Path College jgreim@baypath.edu July 2012 The author acknowledges the assistance of his Bay Path College colleagues including Dr. Melissa Morriss-Olson, Dr. Kristen Barnett, A. Rima Dael, and Dr. James Wilson. Executive Summary Since the financial crisis of 2008, the nonprofit sector has faced a very challenging funding environment even as the demand...»

«Actual and Preferred Work Hours / 1171 You Can’t Always Get the Hours You Want: Mismatches between Actual and Preferred Work Hours in the U.S.* JEREMY REYNOLDS, University of Georgia Abstract Data from the 1997 International Social Survey Programme show that a majority of U.S. employees prefer to work a different number of hours than they actually work. Employees are divided in their preferences: many want to spend less time at work, but there are also many who want to increase their hours....»

«DECEMBER MEETING, 2008 The University of Michigan Ann Arbor December 18, 2008 The Regents convened at 3:05 p.m. in the Regents’ Room. Present were President Coleman and Regents Darlow, Deitch, Maynard, McGowan, Newman, Richner, Taylor, and White. Also present were Vice President and Secretary Churchill, Vice President Forrest, Vice President Harper, Executive Vice President Kelch, Vice President Lampe, Chancellor Little, Vice President May, Chancellor Person, Vice President Scarnecchia,...»

«DISTRIBUTION OF DAWES ANNUITIES (FINANCE MINISTERS' AGREEMENT, 1925) Final protocol and agreement signed at Paris January 14, 1925 Entered into force January 14,1925 Extended and amended by agreements of September 21, 1925/ and January 13, 1927 2 Supplanted by New (Young) Plan of January 20, 1930 a 1925 For. ReI. (II) 145 and 1919 For. ReI. (Paris Peace Conference, XUI) 902 FINAL PROTOCOL The representatives of the Governments of Belgium, France, Great Britain, the United States of America,...»

«Autumn 2016 The Newsletter from Henlow Parish Council : Published Quarterly Henlow Parish Councillors Richard Brown Granary Barn, High Street SG16 6BS 01462 816314 Mark Cooper Old Manor Farm SG16 6BB 01462 814249 Steven Dixon (Chair) Cityfield Farm SG16 6DD 07973 127077 Lucie Flint Brook House, 77 High St SG16 6AB 01462 811239 Phillipa Gray 15 Newtown SG16 6AJ 01462 816130 Michele Joy Louise House, Newtown SG16 6AJ 01462 811783 Lee Matthews 21 The Railway SG16 6FN 07771 561030 David Shelvey 49...»

«Volume IX | Number VI | June 2016 Oversimplification in Target Date Funds Endangers Participants’ Retirement Savings How are custom solutions evolving to mitigate risk? Part III Last month we featured Part II of Oversimplification in Target Date Funds Endangers Participants’ Retirement Savings How are custom solutions evolving to mitigate risk? Part II introduced version 2.0 of target date funds (TDFs), an approach which allows plan sponsors to develop a glidepath best suited for their...»

«Analytic Report Participatory Citizenship in the European Union Institute of Education Bryony Hoskins, David Kerr, Hermann J. Abs, Jan Germen Janmaat, Jo Morrison, Rebecca Ridley, Juliet Sizmur Report 2 European Commission, Europe for Citizens Programme Submitted 10th May 2012 Commissioned by the European Commission, Europe for Citizens Programme, Brussels Tender No. EACEA/2010/02 Contact information Name: Bryony Hoskins Address: Southampton Education School, University of Southampton, UK...»

«Implementation of EU external border cooperation after 2013, particularly on borders with the Russian Federation MINISTRY FOR FOREIGN AFFAIRS OF FINLAND Implementation of EU external border cooperation after 2013, particularly on borders with the Russian Federation Pekka Järviö JARVIO ASSOCIATES HELSINKI The purpose of this discussion paper is to analyse the present situation concerning EU external border cooperation, particularly on borders with the Russian Federation. This paper is based on...»

<<  HOME   |    CONTACTS
2017 www.abstract.dislib.info - Abstracts, online materials

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.