OPTIMIZATION OF DATA ALLOCATION IN HIERARCHICAL MEMORY FOR BLOCKED SHORTEST PATHS ALGORITHMS

This paper is devoted to reducing data transfer between the main memory and a direct-mapped cache for blocked shortest paths algorithms (BSPA), which represent data by a D[M×M] matrix of blocks. For large graphs, the cache size S = δ×M^2, δ < 1, is smaller than the matrix size. The cache assigns a group of main memory blocks to a single cache block. BSPA recalculates a block many times over one or two other blocks and may access up to three blocks simultaneously. If these blocks are assigned to the same cache block, they conflict, which causes intensive data transfer between memory levels. The distribution of blocks among groups and the block conflict count strongly depend on the allocation and ordering of the matrix blocks in main memory. To solve the problem of optimal block allocation, the paper introduces a weighted block-conflict graph and distinguishes two cases of block mapping: non-conflict and minimum-conflict. In the first case, it formulates an equitable color-class-size constrained coloring problem on the conflict graph and solves it with deterministic and random algorithms. In the second case, it formulates a weighted defective color-count constrained coloring problem on the conflict graph and solves it with a random algorithm. Experimental results show that the equitable random algorithm provides an upper bound of the cache size that is very close to the lower bound estimated from the size of a complete subgraph, and that a non-conflict matrix allocation is possible at δ = 0.5 for M = 4 and at δ = 0.1 for M = 20. For a small cache, the weighted defective algorithm leaves up to 8.8 times fewer conflicts than the original BSPA allocation. The proposed model and algorithms are applicable to set-associative caches as well.


Introduction
The shortest paths search problem in weighted graphs is formulated in different settings [1][2][3][4]. The all-pair shortest paths (APSP) problem has many application domains, from city traffic optimization to computer games. Although APSP algorithms (including the Floyd-Warshall algorithm) have polynomial computational complexity and have been studied for a long time, their implementation on modern multi-processor computing systems remains an attractive research area, since real graphs can reach very large sizes.
The execution time of a parallel APSP algorithm mostly depends on how it distributes the work among the processor cores and on the throughput and load of each core. The hierarchical memory is also a key contributor to the execution time [5,6]. Caches form an intermediate level between the CPU and main memory and accelerate data access. If a program accesses data that is not in the cache, a miss occurs. The key step in improving cache performance is reducing the miss rate [7][8][9].
The hierarchical memory employs three strategies for mapping main memory blocks to cache blocks: direct mapping, set-associative mapping and fully associative mapping. The cache usually stores far fewer blocks than the main memory. That is why main memory blocks are grouped when they are mapped to a cache block. During execution of an algorithm, blocks of the same group compete for the cache block, and conflicts may occur among blocks that are requested simultaneously. Optimizing the distribution of the set of blocks among the set of groups may greatly reduce the conflict count and the data miss rate.
The temporal and spatial localities [11] of the data accesses generated by the executing algorithm allow a reduction of data misses in the cache. Locality can also help in the efficient allocation of data in the main memory. The paper considers a complement to the locality approach, which allocates the data [12][13][14] of a blocked algorithm so that conflicting blocks of the slow main memory are mapped to different block locations of the fast cache. The placement order of the main memory blocks determines the group associated with each cache block. The paper formulates the data allocation problem for blocked shortest paths algorithms, proposes a weighted block-conflict graph model, and develops efficient extensions of equitable and defective coloring algorithms that aim to minimize the cache size, decrease the number of remaining conflicts among blocks, and reduce the algorithm execution time.
3, 2021 SYSTEM ANALYSIS AND APPLIED INFORMATION SCIENCE

Blocked all pairs shortest paths algorithms
Let G = (V, E) be a weighted graph, where V, |V| = N, and E are the vertex and edge sets respectively. A weight function assigns a weight w_ij to an edge (i, j) ∈ E. Matrix W[N×N] represents the function, in which the element (i, j) equals w_ij if (i, j) ∈ E; by the usual convention, it equals 0 if i = j and ∞ otherwise. The all-pair shortest paths problem is to find the paths of the shortest length between all pairs of vertices i, j ∈ V. The Floyd-Warshall (FW) algorithm [1,2] uses a matrix D that describes the all-pair shortest path lengths. The computational complexity of the algorithm is O(N^3). For large matrices, the execution time of FW is high, and a significant part of the time is due to the hierarchical memory operation.
Let the matrix D[N×N] be partitioned into an M×M matrix of smaller square blocks B_ij, 0 ≤ i, j < M, each of size B×B, where B = N / M. Algorithm 1, known as the blocked Floyd-Warshall (BFW) algorithm [3], iteratively calls a function BCA(B_1, B_2, B_3), realized by Algorithm 2, which calculates block B_1 over blocks B_2 and B_3. Figure 1 illustrates the behavior of BFW on matrix D[4×4]. In each iteration, BFW calculates the diagonal block D0, the blocks C1 and C2 of the cross, and the peripheral blocks P3, and moves the cross from the top-left corner towards the bottom-right one. Work [4] extended BFW to the heterogeneous four-type-block algorithm HBFW. BSPA denotes both BFW and HBFW. The computational complexity of BSPA and FW is the same. The advantage of BSPA is its ability to localize data and computations within blocks, which is important for efficient cache operation and for organizing the parallel computation of blocks [7][8][9]. However, BSPA does not address the allocation of data in hierarchical memory.
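The iteration structure described above can be illustrated by a minimal Python sketch. It follows the textbook blocked Floyd-Warshall scheme and is not the authors' Algorithm 1 or Algorithm 2; the names bca and blocked_floyd_warshall, the list-of-lists matrix, and the assumption that N is divisible by M are ours.

```python
INF = float("inf")

def bca(d, b, b1, b2, b3):
    """Recalculate block b1 over blocks b2 and b3 (each a (row, col) block index).

    b is the block side length; d is the full N x N distance matrix.
    """
    (r1, c1), (r2, c2), (r3, c3) = [(i * b, j * b) for i, j in (b1, b2, b3)]
    for k in range(b):
        for i in range(b):
            for j in range(b):
                cand = d[r2 + i][c2 + k] + d[r3 + k][c3 + j]
                if cand < d[r1 + i][c1 + j]:
                    d[r1 + i][c1 + j] = cand

def blocked_floyd_warshall(d, m):
    """In-place blocked Floyd-Warshall on an M x M grid of blocks."""
    b = len(d) // m  # assumption: N is divisible by M
    for k in range(m):
        bca(d, b, (k, k), (k, k), (k, k))            # diagonal block
        for t in range(m):
            if t != k:
                bca(d, b, (k, t), (k, k), (k, t))    # cross block in block row k
                bca(d, b, (t, k), (t, k), (k, k))    # cross block in block column k
        for i in range(m):
            for j in range(m):
                if i != k and j != k:
                    bca(d, b, (i, j), (i, k), (k, j))  # peripheral block
    return d
```

Each call of bca touches at most three distinct blocks, which is the property the conflict model of the following sections relies on.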

Formulation of data allocation problem
In blocked algorithms that process big data, the overall size of the blocks is larger than the available cache size; therefore, several blocks are mapped to the same slots of the direct-mapped cache (Fig. 2). Thus, the main memory blocks 0, 4, … are assigned to slot group 0 of the cache. A problem arises when the executed program accesses blocks 0 and 4 simultaneously. In this case, the blocks are in conflict, cache thrashing takes place, and the program execution slows down significantly. An appropriate allocation of blocks in the main memory can solve the problem: the conflicting blocks have to be assigned to different cache slots, which leads to a reordering of blocks in the main memory. An exhaustive analysis of the executed algorithm is a way to construct a non-conflict or minimum-conflict block allocation. The paper proposes a weighted block-conflict graph model, which allows a block placement with a minimum number of conflicts to be found for BSPA.
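The mapping rule can be made concrete with a small Python sketch. It assumes a direct-mapped cache of 4 block slots (inferred from the statement that blocks 0, 4, … share slot group 0) and that the block stored at memory position p goes to slot p mod S; the function names are hypothetical.

```python
def slot_group(position: int, cache_slots: int) -> int:
    """Direct mapping: the block at a memory position goes to slot position mod S."""
    return position % cache_slots

def conflicting_pairs(placement, accessed, cache_slots):
    """Pairs of simultaneously accessed blocks that fall into the same slot group.

    placement[p] is the block stored at main memory position p.
    """
    position = {block: p for p, block in enumerate(placement)}
    blocks = sorted(set(accessed))
    return [(a, b)
            for i, a in enumerate(blocks)
            for b in blocks[i + 1:]
            if slot_group(position[a], cache_slots) == slot_group(position[b], cache_slots)]

row_major = list(range(16))
print(conflicting_pairs(row_major, [0, 4], 4))   # [(0, 4)]

# Swapping blocks 4 and 5 in memory moves block 4 to another slot group.
reordered = [0, 1, 2, 3, 5, 4] + list(range(6, 16))
print(conflicting_pairs(reordered, [0, 4], 4))   # []
```

The second call shows why reordering the blocks in main memory is enough to remove a conflict without changing the cache.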

Weighted block-conflict graph
Figure 3 shows an enumeration and the initial row-major layout of the 16 blocks of matrix D[4×4] in the main memory. Fig. 4 depicts the matrix of the ternary block conflict relation. In the matrix, every filled cell indicates a tuple (i, j, w) of the relation, where w is the conflict count between blocks i and j. For BSPA, w ∈ {1, 2}. For instance, the cell (0, 5) indicates the absence of conflicts between blocks 0 and 5 and does not describe a tuple. The cell (0, 12) describes the tuple (0, 12, 2), which indicates the presence of 2 conflicts between blocks 0 and 12.
In Fig. 4, the two rightmost columns, edge and weight, give for each block the number of other blocks it conflicts with and the overall conflict count respectively. For instance, block 0 conflicts with six other blocks, with an overall conflict count of 12.
A weighted undirected graph G_T = (T, C), where T is a set of blocks and C is a set of weighted edges (Fig. 5), is an alternative representation of the conflict relation. An edge (i, j) ∈ C has a weight (conflict count) w(i, j). In Figure 5, the edges drawn as solid lines have a weight of 2, and the edges drawn as dashed lines have a weight of 1.
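The conflict relation can be enumerated programmatically. The Python sketch below assumes the BFW access model in which, in iteration k, block (i, j) is recalculated over blocks (i, k) and (k, j), and every pair of distinct blocks touched by one such call counts as one conflict. This model is an assumption on our side (HBFW may differ), and the name conflict_weights is hypothetical; blocks are numbered in row-major order as in Fig. 3.

```python
from collections import Counter
from itertools import combinations

def conflict_weights(m):
    """Weighted block-conflict graph for an M x M block matrix.

    Returns a Counter mapping an edge (a, b), a < b, to its conflict count.
    """
    w = Counter()
    bid = lambda i, j: i * m + j  # row-major block number
    for k in range(m):
        for i in range(m):
            for j in range(m):
                # One call BCA(B1=(i,j), B2=(i,k), B3=(k,j)); the set merges
                # coinciding blocks (diagonal and cross blocks).
                touched = sorted({bid(i, j), bid(i, k), bid(k, j)})
                for a, b in combinations(touched, 2):
                    w[a, b] += 1
    return w

w = conflict_weights(4)
neighbours_of_0 = sorted(b if a == 0 else a for (a, b) in w if 0 in (a, b))
print(w[0, 12], (0, 5) in w)   # 2 False
print(neighbours_of_0)         # [1, 2, 3, 4, 8, 12]
```

Under this model the sketch agrees with the examples quoted from Fig. 4: weights are 1 or 2, blocks 0 and 12 have 2 conflicts, blocks 0 and 5 have none, and block 0 has six conflicting blocks with an overall conflict count of 12.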
Assertion 1. Graph G_T has a complete subgraph whose chromatic number is 2×M-1.
The proof considers the subgraph induced by the vertices that correspond to the 2×M-1 blocks of a cross and shows that all these vertices are pairwise adjacent in the graph.
The number 2×M-1 is therefore a lower bound of the conflict graph chromatic number χ(G_T).

Non-conflict allocation of matrix blocks
In work [15], the authors proposed a graph coloring technique for minimizing the storage consumed by an algorithm. The technique models and evaluates the lifetime of each variable and assigns two variables to the same memory location if their lifetimes do not intersect.
A proper coloring of the graph G_T is a mapping µ: T → R_µ of the set T of vertices to a set R_µ of colors such that for any two adjacent vertices t_i, t_j ∈ T the inequality µ(t_i) ≠ µ(t_j) holds. A color class T_µ(r) ⊆ T is the set of vertices labeled by a single color r ∈ R_µ. In a properly colored graph, each color class is an independent vertex set. Let the color classes T_µ(1) ∪ … ∪ T_µ(χ) = T represent the coloring µ, where χ = |R_µ|. Let Ω be the set of all proper colorings of graph G_T. Then the chromatic number of G_T is
$$\chi(G_T) = \min_{\mu \in \Omega} |R_\mu|. \tag{1}$$
The chromatic number χ(G_T) determines the size of the direct-mapped cache that is sufficient for a non-conflict allocation of matrix D[M×M]. Let ο(G_T) be the maximum color class size in the coloring µ. Then
$$\rho(G_T) = \chi(G_T) \times o(G_T) \tag{2}$$
determines the number ρ(G_T) of blocks needed for a proper allocation of the matrix in the main memory.
The inequality ρ(G_T) ≥ M^2 must hold, and η = ρ(G_T) - M^2 is the number of garbage blocks that are added to matrix D. Fig. 6 shows the result of applying the coloring technique to the block conflict graph G_T depicted in Fig. 5. The graph chromatic number χ(G_T) equals 7, and the maximum color class size ο(G_T) equals 4. Although the matrix has only 16 blocks, as many as 7 × 4 = 28 main memory blocks are needed for the non-conflict allocation of D[4×4], so η = 12. Fig. 6a depicts the mapping of the 16 block-vertices to 7 colors. Fig. 6b depicts the assignment of blocks to the cache slot groups and the placement of the blocks in main memory. A filled cell represents a garbage block denoted by 'x'. Since the color classes have different sizes, the placement 0, 1, 2, 3, 4, 8, 9, 5, 11, 7, 6, 14, 13, 12, 10, x, x, x, x, x, x, 15, x, x, x, x, x, x causes heavy fragmentation of the main memory.
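The relation between a coloring and the memory placement can be sketched in Python. The color classes below are read off the placement quoted above (memory position p belongs to slot group p mod 7); this is an inference, since Fig. 6 itself is not reproduced here, and the function name place_blocks is hypothetical.

```python
def place_blocks(color_classes, garbage="x"):
    """Lay out color classes in main memory, one class per cache slot group.

    The memory has chi * omicron positions; position p maps to slot p mod chi,
    so row r of the layout holds the r-th block of every class, padded with
    garbage blocks where a class is shorter than the largest one.
    """
    omicron = max(len(cls) for cls in color_classes)  # maximum class size
    placement = []
    for row in range(omicron):
        for cls in color_classes:
            placement.append(cls[row] if row < len(cls) else garbage)
    return placement

classes = [[0, 5, 10, 15], [1, 11], [2, 7], [3, 6], [4, 14], [8, 13], [9, 12]]
placement = place_blocks(classes)
print(len(placement), placement.count("x"))   # 28 12
print(placement[:15])
# [0, 1, 2, 3, 4, 8, 9, 5, 11, 7, 6, 14, 13, 12, 10]
```

With 7 classes and a largest class of 4 blocks, the layout needs 28 positions, 12 of which are garbage, which is why a single oversized class is so costly.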

Optimization of non-conflict block allocation
This section targets two goals: first, to minimize the size of the cache that supports a non-conflict block allocation, and second, to reduce the main memory fragmentation. Fig. 6b shows that the known coloring algorithm introduces too many garbage blocks. This is because the algorithm minimizes the number of colors by generating a color class of the largest possible size for each color, which leads to a high value of ο(G_T) and to an imbalanced load of the cache slots. As a result, the cache size and the main memory fragmentation are large, and the algorithm is not capable of generating a satisfactory block matrix placement.
Work [16] introduces equitable coloring, which aims at balancing the sizes of color classes. It assigns colors to vertices in such a way that no two adjacent vertices have the same color and the sizes of any two color classes differ by at most one. A theorem in [17] proves that any graph with maximum degree Δ has an equitable coloring with Δ + 1 colors. Applied to the graph with Δ = 11 (Fig. 5), the theorem gives a color count of 12, which is much larger than the graph chromatic number of 7 (Fig. 6). The theorem thus provides a bound that is too pessimistic to be practically acceptable.
We introduce a color-class-size constraint CSC and formulate a new csc-coloring problem on graph G_T to find a constrained chromatic number γ(G_T):
$$\gamma(G_T) = \min_{\mu \in \Omega} |R_\mu| \tag{3}$$
subject to
$$|T_\mu(r)| \le CSC, \quad r \in R_\mu. \tag{4}$$
The CSC constraint describes a requirement for the number of blocks assigned to the same slot group in the cache. The formulation aims at both obtaining a low fragmentation of main memory and minimizing the cache size.

Color-class-size constrained coloring algorithms
Since the graph chromatic number problem is NP-hard, we propose two heuristic color-class-size constrained coloring algorithms: Algorithm 3 is a constrained deterministic graph coloring (CDGC), and Algorithm 4 is a constrained random graph coloring (CRGC).
CDGC traverses all vertices and, for each vertex, chooses an earlier introduced proper color if there is one; otherwise, it adds a new color and assigns it to the current vertex. A color is proper for a vertex if it does not label an adjacent vertex and its color class, with the vertex added, does not exceed CSC in size. CRGC randomly generates many proper csc-colorings and returns the best of them as output. While generating the next coloring, it randomly selects an uncolored vertex and randomly selects an earlier introduced proper color if there is one; otherwise, it adds a new color and assigns it to the current vertex.
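Both heuristics can be sketched in a few lines of Python. The sketch is not the authors' Algorithm 3 or Algorithm 4: it assumes that CDGC visits vertices in increasing order and takes the first proper color, and that CRGC ranks colorings by color count only; all function names and the demo graph (a 6-vertex cycle) are hypothetical.

```python
import random

def is_proper(color, v, adj, coloring, classes, csc):
    """A color is proper for v if its class has room and no neighbour of v uses it."""
    return (len(classes[color]) < csc
            and all(coloring.get(u) != color for u in adj[v]))

def color_in_order(order, adj, csc, pick):
    """Color vertices in the given order; pick() chooses among proper colors."""
    coloring, classes = {}, []
    for v in order:
        proper = [c for c in range(len(classes))
                  if is_proper(c, v, adj, coloring, classes, csc)]
        if proper:
            c = pick(proper)
        else:                      # no proper color: introduce a new one
            c = len(classes)
            classes.append([])
        coloring[v] = c
        classes[c].append(v)
    return classes

def cdgc(adj, csc):
    """Deterministic variant: fixed vertex order, first proper color."""
    return color_in_order(sorted(adj), adj, csc, pick=lambda cs: cs[0])

def crgc(adj, csc, runs=200, seed=0):
    """Random variant: random vertex order and color choice, best of many runs."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        order = list(adj)
        rng.shuffle(order)
        classes = color_in_order(order, adj, csc, pick=rng.choice)
        if best is None or len(classes) < len(best):
            best = classes
    return best

cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(cdgc(cycle, csc=2))        # [[0, 2], [1, 3], [4], [5]]
print(len(crgc(cycle, csc=2)))
```

On the cycle with CSC = 2 the deterministic order ends with two singleton classes, although three classes of two vertices exist; random restarts give the search a chance to find such balanced colorings.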
We have implemented both algorithms and conducted experiments on various matrix configurations. Fig. 7 reports the results that the CRGC algorithm obtained for the D[4×4] matrix. Fig. 7a depicts the optimal csc-coloring of the 16 blocks. Fig. 7b depicts the optimal placement of the blocks in the main memory. Table 1 compares the CDGC and CRGC algorithms. The comparison concerns three parameters: the cache size, the overall block count in main memory, and the garbage block count within the overall count. CRGC has reduced the cache size by up to 17.1 % against CDGC. It has also introduced far fewer garbage blocks.
Table 2 reports conflict graph parameters, namely the vertex count, the edge count, the maximum, minimum and average vertex degree, and the chromatic number upper bound, depending on M. Table 3 reports the lower bound evaluated by Assertion 1 and the upper bound evaluated by CRGC for the cache size, memory size and garbage block count that are sufficient for a non-conflict allocation of matrix D, depending on M and CSC. If M equals 4 and 6, the lower and upper bounds are the same, which means that CRGC has given the minimum cache size. If M equals 8, 10 and 12, the upper bound of the cache size exceeds the lower bound by 1, 2 and 2 blocks respectively, but the load of a cache block is one memory block lower, and the garbage block count is reduced from 11, 14 and 17 to 0, 5 and 6 respectively. The allocations of matrix D given by CRGC are therefore much better than those given by the lower bound. If M equals 5, 7, 9 and 11, the upper bound is 1, 1, 1 and 2 cache blocks worse than the lower bound respectively and gives a larger main memory fragmentation. The overall conclusion is that in most cases CRGC has given optimal results, and in the other cases it has given high-quality solutions that are close to optimal.
Fig. 8 shows the reduction of the cache size relative to the main memory size in the non-conflict allocation of matrix D depending on M. It can be observed that the increase in the number of matrix blocks leads to a relative reduction of the cache size from 50 % at M = 4 down to about 10 % at M = 20.
In the paper, we have extended the concept of defective coloring to the concept of a weighted defective coloring µ of graph G_T. In such a coloring, at least one color class T_µ(r) ⊆ T, r ∈ R_µ, is a dependent vertex set. Since the class contains at least one weighted edge, we define a weighted defect with Equation (5).
The weighted defect ω of the coloring µ accumulates the weights of the edges whose end vertices both belong to the same color class. We formulate the defective weighted constrained coloring problem (6) - (9) as follows: minimize the weighted defect of µ subject to the color-count constraint CCC on the number of colors |R_µ| and to |T_µ(r)| ≤ CSC, µ ∈ Ω and r ∈ R_µ. In the case of CCC × CSC = M^2, a solution of problem (6) - (9) gives a block-matrix allocation without garbage blocks in the main memory and with a minimum of conflicts among the blocks assigned to the same cache block. A permutation of the blocks of matrix D represents the allocation. Fig. 9 shows such an allocation of matrix D[4×4], whose weighted defect is 3 conflicts. In the figure, each column represents a color class corresponding to a single cache block. The allocation of blocks in main memory is: 0, 2, 1, 3, 6, 4, 7, 5, 11, 9, 10, 8, 13, 15, 12, 14.
We have developed Algorithm 5 of defective weighted constrained random coloring (DW-CRGC) of the conflict graph. The algorithm iteratively generates RunCount random vertex permutations (orders) and selects the coloring that has the minimum weighted defect ω. Each iteration produces a graph vertex coloring that meets the given constraints. After selecting a vertex u ∈ T \ L, where L ⊆ T is the subset of already colored vertices, the algorithm chooses a color c using seven parameters:
• an overall weighted defect D(c) on L;
• a weighted additional defect d(c) after including u in c;
• a maximum defect D_max = max D(c) over all c;
• a maximum defect d_max = max d(c) over all c;
• a weight function W(c) on L, whose maximum value indicates the color selected for vertex u;
• the factors α and β of the weight function.
W(c) depends on two parameters: the weighted defect D(c) of c over all colored vertices and the additional defect d(c) due to coloring vertex u. In (10), we assume that the first term is zero if D_max = 0 and that the second term is zero if d_max = 0. Algorithm 5 adds the vertex (block) u to the class BestC and recalculates D(BestC) and D_max. After coloring all vertices, the algorithm updates BestColoring and its defect ω if the obtained Coloring is better than BestColoring.
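The color selection step can be illustrated by the following Python sketch. It is not the authors' Algorithm 5. Since Equation (10) is not available here, the sketch assumes the form W(c) = α(1 - D(c)/D_max) + β(1 - d(c)/d_max), with each term set to zero when its maximum is zero; it also assumes β = 1 - α and takes ω as the largest per-class defect. All names are hypothetical.

```python
import random

def edge_weight(w, a, b):
    """Conflict count of edge (a, b) in a dict keyed by vertex pairs; 0 if absent."""
    return w.get((a, b), w.get((b, a), 0))

def class_defect(cls, w):
    """Sum of the weights of all edges inside one color class."""
    return sum(edge_weight(w, a, b)
               for i, a in enumerate(cls) for b in cls[i + 1:])

def dw_crgc(vertices, w, ccc, csc, alpha=0.3, beta=0.7, run_count=100, seed=0):
    """Random-restart weighted defective coloring with CCC colors of size <= CSC."""
    assert ccc * csc >= len(vertices), "constraints leave no room for all vertices"
    rng = random.Random(seed)
    best, best_defect = None, None
    for _ in range(run_count):
        order = list(vertices)
        rng.shuffle(order)
        classes = [[] for _ in range(ccc)]
        D = [0] * ccc                         # defect D(c) of each class so far
        for u in order:
            free = [c for c in range(ccc) if len(classes[c]) < csc]
            # d(c): additional defect if u joins class c
            d = {c: sum(edge_weight(w, u, v) for v in classes[c]) for c in free}
            D_max, d_max = max(D), max(d.values())

            def W(c):                         # assumed form of the weight function
                t1 = alpha * (1 - D[c] / D_max) if D_max else 0.0
                t2 = beta * (1 - d[c] / d_max) if d_max else 0.0
                return t1 + t2

            best_c = max(free, key=W)
            classes[best_c].append(u)
            D[best_c] += d[best_c]
        defect = max(D)                       # assumption: omega = worst class
        if best is None or defect < best_defect:
            best, best_defect = classes, defect
    return best, best_defect
```

A call such as dw_crgc(range(16), weights, ccc=4, csc=4) would return four classes of at most four blocks; reading the classes column by column gives a garbage-free memory placement.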
We have implemented Algorithm 5 in C/C++ and performed several experiments. Table 4 reports the results for D[6×6]: the weighted defect depending on the CSC constraint and the factors α and β. When α = 1, the algorithm yields the maximum defect. It gives a lower defect when α is closer to zero (in our experiment, at α = 0.3). The explanation is that balancing the load among cache blocks (D(c) and D_max are responsible for the balancing) is less important than avoiding conflicts when mapping main memory blocks to cache blocks (d(c) and d_max are responsible for the avoidance). CSC has taken the values 3, 4, 6, 9 and 12, which guarantee the absence of garbage blocks at the D size of 36 blocks. With increasing CSC, the weighted defect has decreased to 42, 22, 6, 2 and 0 respectively. At CSC = 12, the algorithm has generated a non-conflict block allocation.
Table 5 compares the row-major memory allocation of BSPA (Fig. 3) against the optimized allocation (Fig. 9) produced by the defective weighted coloring algorithm DW-CRGC for matrix D[M×M] at M = 4, …, 12 and CSC = CCC = M. In both cases, the allocation is defective since the conflict graph chromatic number is larger than M. With the increase of M from 6 to 12, the minimized weighted defect ω per cache block given by DW-CRGC has grown from 6 to 15 conflicts. The results given by the row-major allocation of BSPA are much worse: from 30 to 132 conflicts respectively. The gain of DW-CRGC has increased from 5.0 to 8.8 times.
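The per-cache-block defects quoted above can be cross-checked with a short Python sketch. It assumes the BFW access model in which block (i, j) is recalculated over blocks (i, k) and (k, j) in iteration k, with one conflict counted per pair of distinct blocks touched by a call, and a cache of M slots with memory position p mapped to slot p mod M. The model and the function names are our assumptions, not the authors' code.

```python
from collections import Counter
from itertools import combinations

def conflict_weights(m):
    """Edge (a, b), a < b, -> conflict count under the assumed BFW access model."""
    w = Counter()
    bid = lambda i, j: i * m + j
    for k in range(m):
        for i in range(m):
            for j in range(m):
                touched = sorted({bid(i, j), bid(i, k), bid(k, j)})
                for a, b in combinations(touched, 2):
                    w[a, b] += 1
    return w

def class_defects(placement, cache_slots, w):
    """Weighted defect of every slot group for a given main memory placement."""
    classes = [placement[s::cache_slots] for s in range(cache_slots)]
    return [sum(w.get((min(a, b), max(a, b)), 0) for a, b in combinations(cls, 2))
            for cls in classes]

fig9 = [0, 2, 1, 3, 6, 4, 7, 5, 11, 9, 10, 8, 13, 15, 12, 14]
print(class_defects(fig9, 4, conflict_weights(4)))             # [3, 3, 3, 3]
print(class_defects(list(range(16)), 4, conflict_weights(4)))  # [12, 12, 12, 12]
print(max(class_defects(list(range(36)), 6, conflict_weights(6))))     # 30
print(max(class_defects(list(range(144)), 12, conflict_weights(12))))  # 132
```

Under these assumptions the row-major layout places a whole block column in each slot group, where every pair of blocks conflicts twice; this gives M(M-1) conflicts per cache block, that is, 30 at M = 6 and 132 at M = 12, while the allocation listed for D[4×4] yields 3 conflicts per cache block.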

Conclusion
The paper has formulated the problem of optimizing the data allocation in main and cache memory to reduce the data miss count during execution of blocked all-pair shortest paths algorithms. We have introduced the model of a weighted block-conflict graph for solving the problem. The known coloring techniques do not solve the problem efficiently, since they generate color classes of different sizes and cause heavy fragmentation of the main memory. The paper has introduced two types of block allocation: non-conflict and weighted defective. We have proposed the color-class-size constrained coloring algorithms for the non-conflict allocation. Experimental results have shown the gain that our random coloring algorithm provides over the deterministic one. To minimize the conflict count at a restricted cache size, we have extended the known concept of defective coloring to the concept of weighted defective coloring of the block conflict graph. Our random weighted constrained defective coloring algorithm minimizes the number of conflicts and balances the load on the cache slots for the given cache size. The model and algorithms primarily target the direct-mapped cache, although with modifications they are also applicable to the set-associative cache.