INFLUENCE OF SHORTEST PATH ALGORITHMS ON ENERGY CONSUMPTION OF MULTI-CORE PROCESSORS

Modern multi-core processors, operating systems and applied software are designed with energy efficiency in mind, which significantly reduces energy consumption. The energy efficiency of software depends on the algorithms it implements and on the way it exploits hardware resources. In this paper, we consider sequential and parallel implementations of four algorithms for shortest path search in dense weighted graphs and measure and analyze their runtime, energy consumption, performance states and operating frequency on the Intel Core i7-10700 8-core processor. Our goal is to find out how each of the algorithms influences the processor's energy consumption, how the processor and operating system analyze the workload and take actions to increase or reduce the operating frequency and to disable cores, and which algorithms are preferable in sequential and parallel modes. The graph extension-based algorithm (GEA) proved to be the most energy efficient among the sequentially implemented algorithms. The classical Floyd-Warshall algorithm (FW) consumed up to twice as much energy, and the blocked homogeneous (BFW) and heterogeneous (HBFW) algorithms consumed up to 52.2 % and 21.2 % more energy than GEA. Parallel implementations of BFW and HBFW are faster than the parallel implementation of FW by up to 4.41 times, are more energy efficient by up to 3.23 times, and consume up to 2.22 times less energy than their sequential counterparts. The sequential GEA consumes less energy than the parallel FW, although it is slower. The multi-core processor runs FW at an average frequency of 4235 MHz and runs BFW and HBFW at lower frequencies of 4059 MHz and 4035 MHz respectively.


Introduction
Multi-core CPUs are at the heart of modern computing platforms, whose share of the total energy consumption is rapidly increasing. The energy consumption of computing systems and devices accounts for 20 % of the global electricity demand [1,2], and it is predicted to reach up to 50 % of global electricity by 2030. A model for estimating the power consumption of multi-core processors with high accuracy is presented in [3].
Power management is one of the most critical issues in the design of today's microprocessors [4,5]. Its goal is to maximize performance within a given power budget. Power management techniques must balance the demand for higher performance against the impact of aggressive power consumption and negative thermal effects. The most widely adopted power saving technique in current multi-core processors is dynamic frequency tuning based on Dynamic Voltage and Frequency Scaling (DVFS). Many studies use DVFS to adjust the frequency of processor cores and to save power. They fall into two groups: profiling techniques and hardware performance monitors. The profiling techniques measure the behavior of applications and analyze the obtained results to tune the processor frequency. The hardware performance monitors collect information about CPU usage at run time and then tune the frequency of the multi-core processor to save power without significant overhead.
Energy consumption can also be decreased by optimizing machine code and creating green software. The contribution of this paper is a methodology for developing and selecting applied algorithms (using shortest path algorithms as an example) which significantly reduce energy consumption and increase performance.

All pairs shortest path algorithms
Let G = (V, E) be a simple directed dense graph with real edge weights, consisting of a set V, |V| = N, of vertices numbered 1 through N and a set E of edges. Let W be a cost adjacency matrix for G. Thus, w(i, i) = 0, 1 ≤ i ≤ N; w(i, j) is the cost (weight) of edge (i, j) if (i, j) ∈ E, and w(i, j) = ∞ if i ≠ j and (i, j) ∉ E. Let us consider the problem and algorithms of shortest path search in graph G.
Floyd-Warshall algorithm (FW). Let D be a matrix of distances and element D(i, j) be the length of a shortest path from i to j. Let SP(i, j, k) be a function that returns the length of the shortest path from i to j passing through vertices from set {1, 2 … k}. The goal of FW [6-8] is to find SP(i, j, N), i, j = 1 … N. If we have SP(i, j, k-1), then SP(i, j, k) can be defined recursively:

SP(i, j, k) = min(SP(i, j, k-1), SP(i, k, k-1) + SP(k, j, k-1)),   (1)

with the base case SP(i, j, 0) = w(i, j). The FW algorithm is derived from definition (1). It has the same computational complexity of Θ(|V|^3) no matter how many edges the graph contains. An advantage of the algorithm is the simplicity of its organization of computations. Its drawback is the recalculation of all elements of matrix D in every iteration of the loop along k. FW can be parallelized with OpenMP. An alternative to FW is proposed in [9].
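Recurrence (1) translates directly into three nested loops over the N×N distance matrix. The following minimal sequential sketch illustrates this (the flat-array layout and the function name are ours, not taken from the paper):

```cpp
#include <vector>
#include <algorithm>

// Floyd-Warshall: D initially holds the cost adjacency matrix W
// (stored row-major as a flat array); on return, D[i*N+j] is the
// shortest-path distance from i to j. "Infinity" is a large finite
// double, so adding two infinities does not overflow.
void floyd_warshall(std::vector<double>& D, int N) {
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                D[i * N + j] = std::min(D[i * N + j],
                                        D[i * N + k] + D[k * N + j]);
}
```

The k-i-j loop order matters: k must be outermost so that every pair (i, j) is relaxed over intermediate vertex k before k+1 is considered.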
Graph extension-based algorithm (GEA). The algorithm was proposed in [10,11]. In GEA, the process of calculating the shortest paths is associated with the stepwise addition of vertices to graph G. Therefore, the shortest path distances are represented by a sequence of matrices D[1×1], D[2×2], …, D[N×N].

GEA uses two operations:
1) adding row k and column k to matrix D[(k-1)×(k-1)], obtaining matrix D[k×k];
2) updating matrix D[(k-1)×(k-1)] over row k and column k.
These operations are described as

d(i, k) = min(w(i, k), min_{1≤j<k}(d(i, j) + w(j, k))), d(k, j) = min(w(k, j), min_{1≤i<k}(w(k, i) + d(i, j))), i, j = 1 … k-1;
d(i, j) = min(d(i, j), d(i, k) + d(k, j)), i, j = 1 … k-1.

Then the obtained algorithm is formally transformed into a more efficient one using the inference technique proposed in [10,11]. The transformation rules of resynchronization of computations, reordering of instructions and merging of loops are used to do it. The resulting algorithm, GEA, has a smaller number of loop iterations along variables i and j, fewer accesses to memory, and improved spatial and temporal data reference locality. Therefore, it reduces the cache pressure in the multi-core processor and speeds up the computations.
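As an illustration of the two operations, the extension scheme can be organized as follows. This is a naive sketch of the idea only, not the transformed algorithm obtained in [10,11]; the in-place layout (the extended matrices share one N×N array, initialized with W) is our assumption:

```cpp
#include <vector>
#include <algorithm>

// Graph-extension scheme: vertices are added to the graph one by one.
// When vertex k is added, operation 1 computes row k and column k of D
// over the already-known distances among vertices 0..k-1, and operation 2
// updates those distances over the new row and column.
void gea(std::vector<double>& D, int N) {
    for (int k = 1; k < N; ++k) {
        // operation 1: extend D[(k-1)x(k-1)] with row k and column k
        for (int j = 0; j < k; ++j)
            for (int i = 0; i < k; ++i) {
                D[i * N + k] = std::min(D[i * N + k],
                                        D[i * N + j] + D[j * N + k]);
                D[k * N + i] = std::min(D[k * N + i],
                                        D[k * N + j] + D[j * N + i]);
            }
        // operation 2: update the (k-1)x(k-1) submatrix over row/column k
        for (int i = 0; i < k; ++i)
            for (int j = 0; j < k; ++j)
                D[i * N + j] = std::min(D[i * N + j],
                                        D[i * N + k] + D[k * N + j]);
    }
}
```

After iteration k, the top-left (k+1)×(k+1) submatrix holds the exact shortest distances of the subgraph induced by the first k+1 vertices, which is the invariant the stepwise extension maintains.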
Blocked FW algorithm (BFW). It was proposed in [12-18] as a further development of FW. BFW divides the set V of graph vertices into subsets V1 … VM of size S and splits matrix D into blocks of size S × S each, creating a block matrix B[M × M], where the equality M·S = N holds. The blocks are recalculated in a loop along the block count m = 1 … M in three phases:
1) calculation of the diagonal block B(m, m) of type D0;
2) calculation of 2(M-1) cross blocks B(v, m) and B(m, v) of types C1 and C2;
3) calculation of (M-1)^2 peripheral blocks of type P3.
The advantages of BFW are: 1) the localization of data accesses within blocks, which increases the efficiency of the hierarchical memory; 2) the capability of parallel computation of blocks on multi-core processors. BFW can be parallelized with OpenMP in fork-join style. Cooperative threaded algorithms [19-21] are based on BFW.
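The three recalculation phases can be sketched as follows. This is a sequential sketch under the assumption that N is divisible by S; the universal block-calculation kernel BCA relaxes block B1 over blocks B2 and B3:

```cpp
#include <vector>
#include <algorithm>

// Basic block calculation: B1(i,j) = min(B1(i,j), B2(i,k) + B3(k,j)).
// Blocks are S x S sub-matrices of the flat N x N matrix D, addressed
// by a pointer to their top-left element.
static void bca(double* B1, const double* B2, const double* B3,
                int S, int N) {
    for (int k = 0; k < S; ++k)
        for (int i = 0; i < S; ++i)
            for (int j = 0; j < S; ++j)
                B1[i * N + j] = std::min(B1[i * N + j],
                                         B2[i * N + k] + B3[k * N + j]);
}

// Blocked Floyd-Warshall over an M x M grid of S x S blocks.
void bfw(std::vector<double>& D, int N, int S) {
    const int M = N / S;  // assumes N is divisible by S
    auto blk = [&](int a, int b) { return &D[(a * N + b) * S]; };
    for (int m = 0; m < M; ++m) {
        bca(blk(m, m), blk(m, m), blk(m, m), S, N);          // phase 1: D0
        for (int v = 0; v < M; ++v)
            if (v != m) {
                bca(blk(v, m), blk(v, m), blk(m, m), S, N);  // phase 2: C1
                bca(blk(m, v), blk(m, m), blk(m, v), S, N);  // phase 2: C2
            }
        for (int v = 0; v < M; ++v)
            for (int u = 0; u < M; ++u)
                if (v != m && u != m)
                    bca(blk(v, u), blk(v, m), blk(m, u), S, N);  // phase 3: P3
    }
}
```

The phase order encodes the data dependences: the diagonal block depends only on itself, the cross blocks depend on the diagonal, and each peripheral block depends on the two cross blocks in its row and column.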
Heterogeneous blocked FW algorithm (HBFW). The algorithm was proposed in [11,22]. It inherits BFW and distinguishes four types of blocks: diagonal D0, vertical C1 of the cross, horizontal C2 of the cross, and peripheral P3. For each block type it provides a separate block calculation algorithm of higher performance. The separate algorithms account for the features of the block types: they allow reducing the number of iterations in nested loops, exploiting the sequential reference locality of data in CPU caches, and speeding up the computations. After replacing in BFW the four calls of function BCA with calls of four separate functions D0, C1, C2 and P3 using 1, 2, 2 and 3 unique arguments, we obtain the heterogeneous HBFW. Function D0 implements the GEA algorithm applied to block B1. Function C1 is inferred by applying the stepwise graph extension concept to block B1 calculated over block B3 [22]. Function C2 is inferred in a similar way by applying the stepwise graph extension concept to block B1 calculated over block B2 [22]. Function P3 is inferred from BCA by reordering loops. All four functions improve the spatial and temporal data reference locality and make the hierarchical memory operation more efficient. Moreover, functions D0, C1 and C2 reduce the number of iterations in loops and the number of accesses to main memory. HBFW can be parallelized at task level with OpenMP using the fork-join parallelization style.

Measuring energy consumption of multi-core processor
We used Intel VTune Profiler 2023.0 and the built-in Intel SoC Watch utility to measure energy consumption. Intel SoC Watch is a command line tool for monitoring metrics related to power consumption on Intel architecture platforms. It can report power states of the system/CPU/GPU devices, processor frequencies and throttling reasons, total energy consumption over a period, power consumption rate, and other metrics that provide insight into the system's energy efficiency. Intel SoC Watch collects data from both the hardware and the operating system with low overhead. Our experiments aimed at measuring energy consumption in Joules (J). To do so, we measured the runtime of each algorithm in its single-thread and multi-threaded implementations, and measured the average rate of energy consumption in Watts (W) of the CPU package over the full duration of each algorithm's execution. The CPU package energy consumption covers all cores, the per-core private L1 and L2 caches, the shared L3 cache, and the other hardware components included in the CPU package.
All runs of the program implementations of the four shortest path algorithms FW, GEA, BFW and HBFW were carried out on a desktop computer equipped with an Intel Core i7-10700 processor, which contains 8 cores (16 hardware threads) and supports the Intel Turbo Boost 2.0, Intel Turbo Boost Max 3.0 and Enhanced Intel SpeedStep technologies. Each core has private L1 and L2 caches (512 KB and 2 MB in total), and all cores share a 16 MB L3 cache. The base frequency is 2.90 GHz and can increase up to 4.80 GHz. The algorithms were implemented in C++ and compiled with GNU GCC v12.2.0.

Influence of single-thread implementations of algorithms on processor energy consumption
The sequential versions of the FW, GEA, BFW and HBFW algorithms are implemented as single-thread applications written in C++. The single thread executes on one core and one logical processor at any time. The other cores are in the idle state; therefore, the energy consumption is related to a part of the processor components: the core, its L1 and L2 caches, and the shared L3 cache. The experiments mainly show how efficiently the algorithms exploit the processor's hierarchical memory.
The first series of experiments demonstrates how the block size in BFW and HBFW influences the processor energy consumption. Figures 1-3 show that on the graph of 4800 vertices HBFW consumes less energy than BFW for all block sizes. The first reason is that the runtime of HBFW is lower than that of BFW (Figure 2). The second reason is that the consumption rate of HBFW is lower than that of BFW for most block sizes (Figure 3). The figures also show that GEA has the lowest energy consumption and runtime of all the algorithms at any block size; FW appears to be the worst with respect to both runtime and energy consumption. At the same time, FW and GEA have the same energy consumption rate (Figure 3). Figure 5 depicts the speedups that GEA, BFW and HBFW achieve in comparison with FW. FW is the slowest algorithm; therefore, all the speedups exceed 1. GEA has the lowest runtime; as a result, it has the lowest energy consumption and yields the highest speedup. HBFW has a lower runtime and therefore a lower energy consumption than BFW. Interestingly, there is a graph size (a local optimum at 3600 vertices) for which the speedup of all three algorithms is the highest.

Influence of parallel implementations of algorithms on processor energy consumption
The parallel multi-threaded implementations [23,24] of the algorithms, FW-OMP, BFW-OMP and HBFW-OMP, were generated by the OpenMP compiler. We did not succeed in generating an acceptable parallel implementation of GEA using OpenMP. In these implementations, the energy consumption is related to all cores, caches, and other components of the CPU package. Figures 7, 8 and 9 show, on the graph of 4800 vertices, how the block size in the multi-threaded BFW-OMP and HBFW-OMP influences the processor energy consumption. These algorithms consume less energy than the single-thread GEA and the multi-threaded FW-OMP for block sizes 48-300 (Figure 7). For larger block sizes, GEA and even FW-OMP can outperform BFW-OMP and HBFW-OMP. Interestingly, GEA's energy consumption is about half that of FW-OMP. The patterns depicted in Figure 8 for the algorithms' runtimes explain the patterns of energy consumption in Figure 7. The runtimes of BFW-OMP and HBFW-OMP are the lowest for most block sizes. The runtimes of FW-OMP and GEA are close to each other. Figure 9 shows that the single-thread GEA's energy consumption rate is significantly lower than those of the multi-threaded HBFW-OMP and BFW-OMP, which in turn have a lower energy rate than FW-OMP.
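A fork-join parallelization of the blocked scheme might look as follows. This is a sketch of the approach rather than the paper's generated code: within each iteration the cross blocks are mutually independent, as are the peripheral blocks, so both phases can be distributed over threads; when compiled without -fopenmp, the pragmas are ignored and the code degrades to a correct sequential version:

```cpp
#include <vector>
#include <algorithm>

// Block relaxation kernel: B1(i,j) = min(B1(i,j), B2(i,k) + B3(k,j)).
static void bca(double* B1, const double* B2, const double* B3,
                int S, int N) {
    for (int k = 0; k < S; ++k)
        for (int i = 0; i < S; ++i)
            for (int j = 0; j < S; ++j)
                B1[i * N + j] = std::min(B1[i * N + j],
                                         B2[i * N + k] + B3[k * N + j]);
}

// Fork-join parallel blocked Floyd-Warshall (OpenMP).
void bfw_omp(std::vector<double>& D, int N, int S) {
    const int M = N / S;  // assumes N is divisible by S
    auto blk = [&](int a, int b) { return &D[(a * N + b) * S]; };
    for (int m = 0; m < M; ++m) {
        // phase 1: the diagonal block depends only on itself -> sequential
        bca(blk(m, m), blk(m, m), blk(m, m), S, N);
        // phase 2: the 2(M-1) cross blocks are independent of each other
        #pragma omp parallel for
        for (int v = 0; v < M; ++v)
            if (v != m) {
                bca(blk(v, m), blk(v, m), blk(m, m), S, N);  // C1
                bca(blk(m, v), blk(m, m), blk(m, v), S, N);  // C2
            }
        // phase 3: the (M-1)^2 peripheral blocks are mutually independent
        #pragma omp parallel for collapse(2)
        for (int v = 0; v < M; ++v)
            for (int u = 0; u < M; ++u)
                if (v != m && u != m)
                    bca(blk(v, u), blk(v, m), blk(m, u), S, N);
    }
}
```

The implicit barrier at the end of each parallel for provides the join between phases, so no explicit synchronization is needed; this is the fork-join style referred to in the text.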
It can be observed from Figure 10 that FW-OMP is less energy efficient than the other algorithms for any graph size. The single-thread GEA outperforms the multi-threaded algorithms on the graph of 1200 vertices. On larger graphs, BFW-OMP and HBFW-OMP consume less energy than GEA and FW-OMP. The speedup of the multi-threaded FW implementation over the single-thread one reaches 3.26 times (Figure 12). The speedups of the parallel BFW and HBFW reach 7.89 and 6.30 times. The speedup grows with graph size up to 3600 vertices and then decreases. Figure 13 shows that the multi-threaded FW-OMP algorithm outperforms the single-thread GEA by up to 46 % with respect to runtime, but GEA consumes up to 57 % less energy than FW-OMP.

Influence of CPU performance state and operating frequency on energy consumption
The Intel Core i7-10700 CPU supports 22 performance states (also known as PX states), where P0 corresponds to top performance, in which the processor uses its maximum capabilities and therefore may consume maximum power. The P1-P21 states correspond to active states in which the processor's performance capabilities are truncated to reduce power consumption. The current PX state and the transitions between the states are determined by hardware. The operating system can request a change of state based on workload requirements and awareness of processor capabilities. However, in addition to the operating system request, the final decision accounts for various system constraints such as workload demand and thermal limits. During all conducted experiments all CPU cores resided in the top-performance P0 state. However, depending on the type of workload (different parallel algorithms), the active CPU frequency and the percentage of residency at that frequency changed significantly.
Figures 14, 15 and 16 depict the percentage of residency in different CPU frequency intervals alongside the average frequency of each logical processor during execution of the FW-OMP, BFW-OMP and HBFW-OMP algorithms respectively on the graph of 4800 vertices. Figure 14 shows that algorithm FW-OMP operates over 60 % of its active time in the 4600-4501 MHz frequency interval, close to the maximum Turbo Boost frequency of the target CPU, and around 20 % of its active time in the 4100-3901 MHz interval. This gives an average operating frequency of 4400 MHz.
At the same time, both BFW-OMP and HBFW-OMP (Figures 15 and 16) spend most of their active time in the frequency intervals of 4400-4200 MHz and 3700-3600 MHz (around 30 % and 25 % of overall time respectively), which leads to an average operating frequency of 4000 MHz. These significant differences in operating frequencies, along with the levels of reference locality in big data processing, result in up to 3 times lower energy consumption of both BFW-OMP and HBFW-OMP compared with the FW-OMP algorithm (see Figure 10).

Conclusion
Modern multi-core processors are designed to exploit every possibility to reduce energy consumption. The development of algorithms and computer programs which force the processor's components to consume less energy is an additional, external source of increasing the energy efficiency of hardware. Using four algorithms for searching for shortest paths in large dense directed weighted graphs, and sequential and parallel implementations of these algorithms, we have measured and analyzed how the processor energy consumption depends on the algorithm properties, and how the processor accounts for these properties to tune its behavior with the objective of increasing its energy efficiency. The paper has identified the most energy efficient algorithms for searching for shortest paths in dense graphs.

Figure 9. Average rate (W) of FW-OMP (triangle), GEA (circle), BFW-OMP (square) and HBFW-OMP (diamond) algorithms vs. block size on the graph of 4800 vertices

Figures 11 and 12 compare the multi-threaded OpenMP implementations against the single-thread implementations of the FW, BFW and HBFW algorithms depending on the graph size. The energy consumption of FW is higher for the multi-threaded implementation than for the single-thread one on almost all graph sizes (Figure 11). On the contrary, the multi-threaded BFW and HBFW consume less energy than their single-thread counterparts.

Figure 13. Relative energy consumption (solid line) and runtime (dashed line) given by GEA against FW-OMP vs. graph size