CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. (2015). Published online in Wiley Online Library.
Once a thread finds local data for its task, it immediately notifies the other threads to stop searching for this task, and it starts searching for the next task. If none of the threads is able to find more local tasks, the threads start assigning just one non-local map task to each node in the block for this heartbeat.
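The locality search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node/task structures, names, and the fallback choice of node are all assumptions made for the example.

```python
import threading

def search_block(block_nodes, task, found_event, results, lock):
    """One search thread: scan the nodes of one block for a local copy of
    the task's input split, and stop as soon as any thread succeeds."""
    for node in block_nodes:
        if found_event.is_set():           # another thread already found local data
            return
        if task["split"] in node["splits"]:
            with lock:
                if not found_event.is_set():
                    results[task["id"]] = node["id"]
                    found_event.set()      # notify the other threads to stop searching
            return

def schedule_task(blocks, task):
    """Launch one search thread per block (No-block threads in total); if no
    thread finds local data, fall back to a non-local assignment."""
    found_event, lock, results = threading.Event(), threading.Lock(), {}
    threads = [threading.Thread(target=search_block,
                                args=(b, task, found_event, results, lock))
               for b in blocks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if task["id"] in results:
        return results[task["id"]], True   # local assignment
    # no local data anywhere: assign the task non-locally (here, arbitrarily,
    # to the first node of the first block)
    return blocks[0][0]["id"], False
```

The `threading.Event` plays the role of the "stop searching" notification: the first thread to find local data sets it, and every other thread checks it before examining the next node.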
In this paper, we conducted a simulation over a virtualized environment as well as over a non-virtualized environment. The main goal is to show the improvements that can be achieved in terms of both simulation time and energy consumption. To show the behavior of the aforementioned scheduling algorithms, including our scheduler (MTL), we ran a simulation of our proposed MapReduce framework. We also compare the virtualized and non-virtualized environments with respect to both the simulation time and the energy consumption. Our simulation uses the map phase only, for the same reasons stated previously. For this simulation, we used the CloudExp simulator, an extended version of the CloudSim simulator [24, 25] that supports MapReduce scheduling algorithms. For our MapReduce system, we exploit CloudExp characteristics such as scalability to demonstrate the improvements achieved by our proposed algorithm (MTL). In this simulation, every host is connected to a dedicated storage unit with the same ID. The physical host specification is fully described in Table I, where every host is either virtualized or non-virtualized. In our scheduler's evaluation, the hosts are divided among a number of blocks. Likewise, the tasks are divided among a number of jobs, where each task has several properties, as explained in Table II.
The following parameters are used in our simulation; a brief description of each is given below.
No-block: specific to our algorithm; it determines the number of threads that search for data locality.
No-host: the number of nodes used in the experiment.
No-file: the number of files that contain the tasks' data and are spread among the nodes.
No-job: the number of jobs that come from different users.
No-task: the number of tasks that need to be executed. The tasks are distributed among the jobs randomly.
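The parameter set, and the random distribution of tasks among jobs, can be sketched as follows. The concrete values below are illustrative assumptions, not the paper's actual settings.

```python
import random

# Illustrative simulation parameters (names mirror the paper's list;
# the values here are assumptions chosen for the example).
params = {
    "No-block": 4,       # threads searching for data locality (MTL only)
    "No-host": 100,      # nodes used in the experiment
    "No-file": 500,      # data files spread among the nodes
    "No-job": 10,        # jobs coming from different users
    "No-task": 50_000,   # tasks to execute
}

def distribute_tasks(no_task, no_job, seed=0):
    """Randomly distribute the tasks among the jobs, as described above."""
    rng = random.Random(seed)
    jobs = [[] for _ in range(no_job)]
    for task_id in range(no_task):
        jobs[rng.randrange(no_job)].append(task_id)
    return jobs

jobs = distribute_tasks(params["No-task"], params["No-job"])
```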
In our simulation, we show the behavior of the considered schedulers over both a non-virtualized and a virtualized environment. The simulation results show the superiority of our proposed MTL scheduler over the other existing schedulers in both environments. We ran our MTL scheduler against the FIFO, MM, and Delay schedulers discussed in Section 3, comparing them in terms of two factors: the simulation time and the energy consumption of the nodes.
The simulation was conducted with different parameter values, ranging from small to large numbers of nodes and tasks, while varying the number of VMs from one to four. The results shown shortly demonstrate the scalability of our proposed MTL scheduler and its superiority over the other existing schedulers. They also show the improvements achieved in both simulation time and energy consumption when the virtualized environment is used instead of the non-virtualized one. Only samples of our results are shown in the following subsections.
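The parameter sweep described above amounts to a grid of configurations. The sketch below shows one hypothetical way to organize such a sweep; the scheduler names come from the paper, but the grid structure and the `simulate` callback are assumptions for illustration.

```python
from itertools import product

# Experiment grid: scheduler x environment (physical, 2 VMs, 4 VMs) x
# workload size, mirroring the sweeps reported in the subsections below.
schedulers = ["FIFO", "MM", "Delay", "MTL"]
environments = ["physical", "2-VMs", "4-VMs"]
task_counts = [50_000, 150_000, 200_000]

def run_experiments(simulate):
    """Run a user-supplied `simulate(scheduler, env, n_tasks)` function,
    returning (simulation_time, energy), over every configuration."""
    results = {}
    for sched, env, n in product(schedulers, environments, task_counts):
        results[(sched, env, n)] = simulate(sched, env, n)
    return results
```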
4.1. Non-virtualized versus virtualized for 50,000 tasks

Figures 2 and 3 show the behavior of all the considered algorithms with 50,000 tasks in a non-virtualized (physical node) environment against a virtualized environment (with two and four VMs), in terms of the simulation time and the energy consumption, respectively.
Figure 2 shows the simulation time of all the considered schedulers over the virtualized and non-virtualized environments. As can be noticed, the virtualized environment yields a good improvement in simulation time over the non-virtualized environment, especially when moving from the physical node to two VMs. The improvement diminishes when the number of VMs increases to four, which is due to the management overhead introduced by the VMs. It is also important to emphasize that our scheduler, MTL, achieves the best simulation time in both environments, which demonstrates its superiority over the other considered schedulers. Another point worth mentioning is that the FIFO scheduler gains the most from moving to the virtualized environment. This result is expected because FIFO is the scheduler that most often violates the locality principle; hence, it is the algorithm that benefits most from virtualization.
Figure 3 shows the energy consumption of all the considered schedulers over the virtualized environment (two and four VMs) and the non-virtualized environment (physical node). As can be noticed, the virtualized environment yields a good improvement in energy consumption over the non-virtualized environment. It is also important to emphasize that our scheduler, MTL, achieves the best energy consumption in both environments, again demonstrating its superiority over the other existing schedulers. The same observation applies to the FIFO scheduler, which again is the algorithm that benefits most from virtualization.
4.2. Non-virtualized versus virtualized for 150,000 tasks

In this experiment, we increase the number of tasks to 150,000 to investigate the behavior of the considered schedulers under both the virtualized and non-virtualized environments. Figures 4 and 5 show the simulation time and the energy consumption for all the considered schedulers.
As can be noticed in Figures 4 and 5, on the physical node the MTL achieved the best values for both the simulation time and the energy consumption. The two figures also show that a substantial improvement is achieved in both metrics when two and four VMs are used, which demonstrates the benefit of virtualization over non-virtualization and, hence, the scalability of our MTL scheduler. Moreover, although the additional improvement with four VMs is slight, the MTL still achieved the best results for both the simulation time and the energy consumption. This, in turn, demonstrates the superiority of our MTL scheduler over all the other schedulers in both the virtualized and non-virtualized environments.
4.3. Non-virtualized versus virtualized for 200,000 tasks

In this experiment, we increase the number of tasks to 200,000 to further investigate the behavior of the considered schedulers under both the virtualized and non-virtualized environments. Figures 6 and 7 show the simulation time and the energy consumption for all the considered schedulers.

Figure 8. Simulation time comparison: virtualized versus non-virtualized.
Figure 9. Energy consumption comparison: virtualized versus non-virtualized.
Figures 6 and 7 show the same improvement percentages as achieved in the previous experiments. Again, all schedulers show noticeable improvements in the virtualized environment over the non-virtualized one, and our MTL scheduler achieved the best results among all the schedulers for both the simulation time and the energy consumption. The improvement when increasing the number of VMs from two to four is slight, as a result of the overhead produced by the additional VMs.
Figures 8 and 9 emphasize the aforementioned conclusions about the performance improvements achieved when considering the virtualized environment over the non-virtualized environment, and the superiority of our MTL scheduler in both the virtualized and non-virtualized environments.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a framework through which we evaluated several scheduling algorithms for big data management. Specifically, the FIFO, MM, and Delay schedulers, as well as our own scheduler, MTL, were evaluated in both a virtualized and a non-virtualized environment. Two factors were considered in the evaluation: the processing time and the energy consumption.
To test our proposed evaluation procedure, we built a MapReduce system using the CloudExp simulator. The conducted experiments showed great improvements when considering the virtualized environment over the non-virtualized environment. Moreover, the experiments showed that our proposed algorithm achieved better results compared with the existing FIFO, Delay, and MM algorithms in terms of processing time and energy consumption in both environments.
The experiments showed that our proposed algorithm is fast and scalable, keeping pace with the rapid growth of data regardless of whether the environment is virtualized or non-virtualized. As future work, we intend to implement and evaluate our proposed MTL algorithm, as well as other existing scheduling algorithms, in a real Hadoop cluster with real big data, such as Facebook social network data, in order to confirm the speed and scalability of our algorithm. We also plan to extend our findings to a new real big data application that introduces a small-scale cloud computing system, as in .
REFERENCES

1. Mcafee A, Brynjolfsson E. Big Data: The Management Revolution. Harvard Business Review: MA, USA, 2012.
2. Villars R, Olofson C, Eastwood M. Big Data: What It Is And Why You Should Care. White paper, MA, USA, 2011.
3. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big Data: The Next Frontier For Innovation, Competition, And Productivity, 2011.
4. Statchuk C, Rope D. Enhancing Enterprise Systems with Big Data. IBM Company: USA, 2013.
5. Schad J. Flying Yellow Elephant: Predictable and Efﬁcient MapReduce in the Cloud. Information Systems Group, Saarland University, 2010.
6. Krishnadhan D, Purushottam K, Umesh B. VirtualCloud - A Cloud Environment Simulator. Department of Computer Science and Engineering Indian Institute of Technology: Bombay, 2010.
7. Althebyan Q, Alqudah O, Jararweh Y, Yaseen Q. Multi-threading based MapReduce tasks scheduling. 5th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 2014; 1–3.
8. Althebyan Q, Alqudah O, Jararweh Y, Yaseen Q. A scalable Map Reduce tasks scheduling: a threading based approach. International Journal of Computational Science and Engineering 2015.
9. The Apache software foundation. Hadoop Apache, 2012. (Available from: http://hadoop.apache.org/) [Accessed on 10 February 2014].
10. Dean J, Ghemawat S. MapReduce: simpliﬁed data processing on large clusters. OSDI 04: In the Proceedings of the 6th Symposium on Operating Systems Design and Implementation, California, USA, 2004.
11. Rao BT, Reddy LSS. Survey on improved scheduling in Hadoop MapReduce in cloud environments. International Journal of Computer Applications 2011; 34(9):0975–8887.
12. Under the hood: scheduling MapReduce jobs more efficiently with Corona. (Available from: https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona) [Accessed on 10 February 2014].
13. Chen J, Wang D, Zhao D. A task scheduling algorithm for Hadoop platform. Journal of Computers 2013; 8(4):
14. Verma A, Cherkasova L, Campbell R. Two sides of a coin: optimizing the schedule of MapReduce jobs to minimize their makespan and improve cluster performance. In the Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Washington DC, USA, 2012;
15. Kc K, Anyanwu K. Scheduling Hadoop jobs to meet deadlines. IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, Indiana, USA, 2010; 388–392.
16. Palanisamy B, Singh A, Liu L, Jain B. Purlieus: locality-aware resource allocation for MapReduce in a cloud. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, Washington, USA, 2011; Article Number 58.
17. Guo Z, Fox G, Zhou M. Investigation of data locality in MapReduce. Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2012), Ottawa, Canada, 2012; 419–426.
18. Hammoud M, Sakr M. Locality-aware reduce task scheduling for MapReduce. Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), Athens, Greece, 2011;
19. Xie J, Meng F, Wang H, Cheng J, Pan H, Qin X. Research on scheduling scheme for Hadoop clusters. International Conference on Computational Science (ICC) 2013; 17:49–52.
20. Liu J, Wu T, Wei Lin M, Chen S. An efﬁcient job scheduling for MapReduce clusters. International Journal of Future Generation Communication and Networking 2015; 8(2):391–398.