A SELECTION MECHANISM USING MULTI-CRITERIA EVALUATION AND HIERARCHICAL CLASSIFYING TREE FOR RESUME DATA PROCESSING

The paper considers the problem of optimal feature selection for resume data processing by combining a multicriteria evaluation technique with hierarchical classifying tree technology, which makes it possible to build a selection mechanism without collecting learning data from real applicants. Instead, the learning data are generated by means of the technique used in a full factorial experiment with quite a restricted number of samples. The suggested approach minimizes the number of features used in selecting the best candidates and replaces quantitative ratings of candidates with a multi-phase classifying procedure. These peculiarities make the suggested selection mechanism more flexible and form a basis for applying it under conditions characterized by vagueness and fuzziness of the applicant data.


Introduction
Automatic resume data processing is one of the important applications of data mining and text mining technologies [1]. Several widely known resume processing systems exist [2, 3]. As a rule, however, they are restricted to a strict curriculum vitae (CV) presentation format and a fixed system of criteria priorities used to select the best candidates. To raise system flexibility, the system should be able to adapt to the specifics of a specialty, that is, to change criteria and their priorities according to practical needs. It is also important to minimize the number of criteria in order to reduce the size of the personal information database. From this point of view, the paper suggests a technique combining multicriteria decision making (MDM) [4] and the hierarchical classification tree (HCT) mechanism [5] in a way that excludes the necessity of collecting data for HCT learning, using MDM instead. It gives a formal approach realizing a mathematical model of criteria evaluation and generation of the classification tree(s) on the basis of an optimal feature set. The paper develops the ideas of the authors' work [6].

Problem formalization
Let the initial feature set include the following attributes: age (F1), education (F2), professional experience (F3), knowledge of foreign languages (F4), participation in big projects (F5), publications in scientific journals (F6), participation in scientific conferences (F7), marital status (F8), work in other organizations (F9). The first step to be performed is to find the integral evaluation function I in the form

I(F1, …, F9) = α1·f1(F1) + α2·f2(F2) + … + α9·f9(F9),   (1)

with αi standing for the feature priorities (normalized non-negative numbers whose total sum equals 1), and fi(Fi) representing utility functions. To define the analytical form of I, one can use T. Saaty's method of hierarchies [7], the Relief procedure [8], or other techniques used in MDM, so the details are omitted here. Now suppose that the priorities have been fixed, so that I takes a concrete numerical form (2). Starting from (2), one should define the hierarchical classification tree to be used as a selecting means in CV processing [9]. The inputs of the HCT are the ordered sets of attributes of candidates for a vacant position (further we use the term Data Set (DS) for short). The HCT filters DS into two categories: accepted (Acc1) and declined (Dec1). Clearly, the number of persons in Acc1 may be greater than 1. In that case, another HCT should be used to perform a more rigid selection. Again, if Acc2 contains more than one person, the next filtering is performed according to the scheme outlined later on. This iterative procedure may finally be resolved by random selection from Accn (n > 1). Our nearest goal is to show how to minimize the feature set size and build the feature set for selecting Acci.
Clearly, formula (1) may contain extra features which should be deleted. To determine which features are excessive and to get a non-linear (in general) evaluating function I, one should resolve two different mathematical problems. In practice, instead of defining a non-linear function I, one builds an HCT, which performs «hidden computations» replacing the direct evaluation of I.
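As a minimal illustration of (1), the sketch below computes I as a weighted sum of utility values. The priority vector and the candidate's utilities are hypothetical placeholders, not the actual coefficients of equation (2).

```python
# Sketch of the integral evaluation I = sum(alpha_i * f_i(F_i)) from (1).
# alphas and f_vals are illustrative placeholders, not the paper's equation (2).

def integral_evaluation(utilities, weights):
    """Weighted sum of normalized utility values f_i(F_i), each in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "priorities must sum to 1"
    return sum(a * f for a, f in zip(weights, utilities))

# Hypothetical priorities for the nine features F1..F9 (they sum to 1).
alphas = [0.15, 0.20, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.10]
# Hypothetical utility values f_i(F_i) for one candidate.
f_vals = [0.85, 0.85, 0.15, 0.85, 0.15, 0.85, 0.15, 0.85, 0.15]

I = integral_evaluation(f_vals, alphas)      # 0.535 for these numbers
label = "A" if I > 0.5 else "B"              # class split used later on
```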

Reduction of the feature set
We introduce some basic ideas of [6] and consider Table 1 with some data samples from DS (an explanation is given later on).
One uses formula (2) to compute the integral evaluation function I. To get the values in columns f1, …, f9, one may apply the technique of complete factorial experiments. According to this technique, each feature (utility function) fi(Fi) takes only two possible values: one at a 15 % distance from the minimum value (that is, from «0» with respect to the utility function), and the other at the same distance from the maximum value (i.e., from «1»). Let all data objects be divided into two classes A and B: for instance, each sample in class A has a value of I greater than 0.5 and, on the contrary, each sample from class B has a value of I not exceeding 0.5.
Definition 1. Feature Ft discriminates between two samples x ∈ A and y ∈ B if and only if Fxt ≠ Fyt.
Reformulating this definition in terms of utilities gives
Definition 2. Feature Ft discriminates between two samples x ∈ A and y ∈ B if and only if ft(Fxt) ≠ ft(Fyt).
Definition 3. A set π of features Fi is discriminating with respect to a given data set DS if, for each two data objects di and dj from DS belonging to different classes, there is some feature Fp ∈ π discriminating between di and dj.
Definition 4. A set π is a minimum-size discriminating set for a given data set DS provided that it contains the minimum number of features among all discriminating sets.
Lemma. With respect to the integral evaluation function I from (1) and a given DS, two minimum-size discriminating sets π(F), containing features F1, F2, …, FZ, and π(f), containing utility functions fk(Fk), k = 1, …, z, have the same size, i.e. z = Z, and are in one-to-one correspondence with each other.
Proof. For simplicity, let there be only two different classes A and B. Suppose Ft discriminates between two samples dr and ds, but ft does not, that is, Ft(dr) ≠ Ft(ds) while ft(Ft(dr)) = ft(Ft(ds)). Clearly, there must be another feature Fp discriminating dr and ds and belonging to π(F). Indeed, if no other features from π(F) discriminated between dr and ds, then the remaining utility values would be pairwise equal and therefore I(dr) = I(ds) with respect to (1), which would mean that dr and ds belong to the same class, which is impossible. Hence, there is at least one feature Fq ∈ π(F) with Fq(dr) ≠ Fq(ds) and fq(Fq(dr)) ≠ fq(Fq(ds)). One can then include fq in π(f) and exclude ft. These considerations remain valid for every pair of data objects dr and ds from different classes in DS and show how to make the two sets π(F) and π(f) correspond one-to-one to each other. This ends the proof.
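Under the two-level factorial scheme described at the start of this section (each utility fixed at a 15 % distance from its extremes, i.e. at 0.15 or 0.85), the learning data can be generated exhaustively. The priority vector below is again a hypothetical placeholder for equation (2).

```python
# Complete two-level factorial design over the nine utility functions:
# each f_i(F_i) is either 15 % above the minimum (0.15) or 15 % below
# the maximum (0.85), giving 2**9 = 512 learning samples.
import itertools

LOW, HIGH = 0.15, 0.85
alphas = [0.15, 0.20, 0.20, 0.10, 0.10, 0.05, 0.05, 0.05, 0.10]  # placeholder weights

dataset = []                                   # (utility vector, class) pairs
for combo in itertools.product((LOW, HIGH), repeat=9):
    I = sum(a * f for a, f in zip(alphas, combo))
    dataset.append((combo, "A" if I > 0.5 else "B"))   # class split at I = 0.5
```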

Finding minimum-size discriminating set
The next step is to build the discriminating 0,1-matrix M corresponding to the full Table 1, with elements mkij = 1 if and only if feature fk discriminates between samples i and j; otherwise mkij = 0 (see Fig. 1). The rows correspond to the features (utility functions); the columns are represented by pairs (i, j), with i and j specifying rows in Table 1. For instance, the entry in row f2 and column (i, j) equals 1 exactly when f2 discriminates between samples i and j.
[Table 1. Fragment of DS relating to the example]
[Fig. 1. Discriminating 0,1-matrix for Table 1]
Evidently, there are no columns corresponding to pairs of objects from the same class; also, there should be no all-zero columns. We do not consider the case when no feature discriminates between some pair of objects from different classes (this would indicate insufficiency of the criteria used in the model). Our task has thus been reduced to finding a minimum-size cover for M.
Definition 5. One says that row k covers column (i, j) in the 0,1-matrix M if and only if mkij = 1.
Definition 6. A minimum-size covering set πmin(f) for M consists of the minimum possible number of features fi such that each column of M is covered by at least one row from πmin(f).
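Given a labeled sample, the discriminating 0,1-matrix can be built as follows; this is a minimal sketch, and the tiny three-feature example at the end is hypothetical.

```python
# Rows of M are features, columns are cross-class pairs (i, j);
# an entry is 1 iff the feature takes different values on samples i and j.

def discriminating_matrix(samples, labels):
    """samples: list of feature tuples; labels: parallel list of 'A'/'B'."""
    a_idx = [i for i, c in enumerate(labels) if c == "A"]
    b_idx = [j for j, c in enumerate(labels) if c == "B"]
    columns = {}                     # (i, j) -> 0/1 tuple, one bit per feature
    for i in a_idx:
        for j in b_idx:
            columns[(i, j)] = tuple(
                int(x != y) for x, y in zip(samples[i], samples[j]))
    return columns

samples = [(0.15, 0.85, 0.85), (0.85, 0.85, 0.15), (0.15, 0.15, 0.85)]
labels = ["A", "B", "B"]
M = discriminating_matrix(samples, labels)
# pair (0, 1): features 0 and 2 differ -> column (1, 0, 1)
# pair (0, 2): only feature 1 differs  -> column (0, 1, 0)
```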
The problem of finding a minimum-size feature set πmin(f) can be resolved as explained in [6]. The technique applied in [6] uses the group resolution principle (grp), resembling logical resolution-based inference with more than two parent formulas participating in producing the logical resolvent (see the details in [10]).
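The grp technique of [6, 10] is not reproduced here; as a stand-in, the sketch below finds a small cover for M with the standard greedy set-cover heuristic (repeatedly pick the row covering the most still-uncovered columns), which yields a small, though not always minimum, cover.

```python
# Greedy set cover over the discriminating matrix: columns map a pair id
# to a 0/1 tuple indexed by feature (row) number.

def greedy_cover(columns):
    n_rows = len(next(iter(columns.values())))
    uncovered = {c for c, col in columns.items() if any(col)}
    chosen = []
    while uncovered:
        # Pick the row that covers the largest number of uncovered columns.
        best = max(range(n_rows),
                   key=lambda k: sum(columns[c][k] for c in uncovered))
        chosen.append(best)
        uncovered = {c for c in uncovered if not columns[c][best]}
    return sorted(chosen)

cols = {"p1": (1, 0, 1), "p2": (0, 1, 0), "p3": (0, 1, 1)}
cover = greedy_cover(cols)           # rows 0 and 1 cover every column here
```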
Return to Table 1. Its form, used in the complete factorial experiment, is defined for a DS with 2^9 = 512 data objects. Theoretically, this table produces a discriminating matrix M with 9 rows and 256^2/2 = 32 768 columns. However, only 512 columns remain unique, with the remaining 32 256 columns repeating some others. So, the maximum size of M is restricted to 9 rows and 512 columns for the case under consideration (this number of columns can theoretically be obtained from approximately 33 different data objects). Evidently, such a matrix M can easily be generated programmatically. The problem consists in finding a minimum-size cover of M, which may be efficiently realized with the help of grp or another existing technique [11]. Then, given the features from πmin(f), it is possible to build a classification tree, for instance with the help of Python analytical means.
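Once πmin(f) is known, the tree can be built with standard Python tooling (e.g. scikit-learn's DecisionTreeClassifier). The dependency-free sketch below illustrates the idea for two-level factorial data, where every split is a 0.5 threshold; its naive splitting rule (first feature that separates the node) is a simplification for illustration, not the paper's exact procedure.

```python
# Recursive construction of a tiny classifying tree on binary-level features.
# A node is (feature_index, low_subtree, high_subtree); a leaf is a class label.

def build_tree(rows, labels, features):
    if len(set(labels)) == 1:
        return labels[0]                        # pure leaf
    for k in features:
        lo = [i for i, r in enumerate(rows) if r[k] < 0.5]
        hi = [i for i, r in enumerate(rows) if r[k] >= 0.5]
        if lo and hi:                           # first feature that splits
            return (k,
                    build_tree([rows[i] for i in lo],
                               [labels[i] for i in lo], features),
                    build_tree([rows[i] for i in hi],
                               [labels[i] for i in hi], features))
    return max(set(labels), key=labels.count)   # mixed leaf: majority class

def classify(tree, row):
    while isinstance(tree, tuple):
        k, lo, hi = tree
        tree = lo if row[k] < 0.5 else hi
    return tree

# Toy DS on two selected features: accepted only when both utilities are high.
rows = [(0.15, 0.15), (0.15, 0.85), (0.85, 0.15), (0.85, 0.85)]
labels = ["B", "B", "B", "A"]
tree = build_tree(rows, labels, (0, 1))
```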

Experimental results
The experiments showed that all 9 features must be taken into consideration to build a classifying tree. However, some combinations of features may be excluded, as candidates with such profiles get very low resulting I-estimations (e.g., 0.3 or lower). This led to the reduction of the feature set to 7 features constituting the minimum-size cover set πmin(f) for the discriminating matrix (DM), sufficient to correctly classify the persons into Acc1 and Dec1 by means of the HCT. They are: age, education, practical experience, knowledge of foreign languages, publications, marital status, and work in other organizations. This is the so-called first-level tree HCT1, as it uses only two classes A and B, where class A is represented by the persons with values of the integral evaluation function I greater than 0.5, and B comprises the rest of the candidates.
The classification mechanism used in an HCT differentiates objects by comparing their features (not by computing some integral evaluation criterion like I). In general, an HCT may realize some kind of complex non-linear estimation.
A problem may arise when more candidates than required remain qualified as accepted. To decrease the number of candidates remaining after the first selection, one can use a second classification tree HCT2, created by analogy with HCT1. However, in the case of HCT2 one should define a higher boundary level of the integral evaluation function I separating class A from class B. For example, if I ≥ 0.6 then the candidate is qualified as accepted; otherwise, as declined. The corresponding changes should be made in the DS used in the factorial experiments to build HCT2. Our program now resulted in 8 features, excluding F5 (participation in big projects). This process should be continued to build HCT3 (for I ≥ 0.7), HCT4 (for I ≥ 0.8), HCT5 (for I ≥ 0.9), and HCT6 (for I ≥ 0.95).
In our case, HCT3 uses 5 features: F2, F3, F5, F7, F8, while HCT4 and the subsequent trees use only 3 features: F2, F4, F6. So, a collection of classification trees has been created to provide sequential reduction (if necessary) of the number of presumably accepted candidates. If, despite the filtering, more than one candidate remains, the final selection is realized as random sampling.
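The multi-phase filtering above can be sketched as a cascade of selectors with rising I-thresholds. Here score stands in for the per-level HCT decision (a real HCT compares features rather than computing I explicitly), and the candidate names and numbers are purely illustrative.

```python
# Cascade selection: tighten the acceptance threshold level by level until
# one candidate remains; resolve any final tie by random sampling.
import random

def cascade_select(candidates, score,
                   thresholds=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95)):
    pool = list(candidates)
    for t in thresholds:
        accepted = [c for c in pool if score(c) > t]
        if len(accepted) == 1:
            return accepted[0]                 # unique winner found
        if not accepted:
            break                              # keep the last non-empty pool
        pool = accepted
    return random.choice(pool)                 # final random tie-break

scores = {"x": 0.55, "y": 0.65, "z": 0.92}     # hypothetical I-estimations
winner = cascade_select(["x", "y", "z"], scores.get)   # -> "z"
```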

Conclusion
The main advantage of the outlined technique consists in saving memory, as there is no need to store a database with feature values. Instead, a collection of hierarchical classifying trees with reduced feature set(s) is used. Each HCT processes a vector of normalized feature values (in the range [0, 1]). To build an HCT, one uses a factorial experiment resulting in the discrimination matrix, which is used to find a minimum-size covering set containing an optimal feature collection. As a final step, one applies a Python procedure to build a classifying tree. Varying the boundary level of I between the sets Acci and Deci, one obtains a collection of HCTs to filter the accepted candidates as much as necessary. The number of features in the sequence of HCTs decreases for high levels of I. If more than one candidate remains at the end of the filtering process, random selection is performed. The experts are in a position to test different models represented by equation (1) in order to find the feature weights most relevant to their preferences.