**1. Introduction**

The study of the nutrient components of plants used for animal feed has gained importance in scientific research, with the aim of improving nutrition in both ruminants and non-ruminants. The applications of artificial intelligence in the different fields of life and science represent an advance in research: examples include contributions in medicine across most specialties, such as imaging software that uses pattern recognition to detect pathological anomalies (Verdecia et al., 2018).

Protein plants used in animal nutrition are of great importance to the livestock community as a substitute for concentrates, which become more expensive every day. Among them, the so-called plants of excellence gain relevance among farmers every day for their properties; it is therefore necessary to know the behavior of their components in order to formulate a quality diet (Díaz et al., 2007; Otegui & Totaro, 2007; Alonso-Peña, 2011; Verdecia et al., 2018). Hence, several authors have dedicated time and resources to investigating the behavior of the metabolites, cell wall components and digestibility of these plants used in livestock production (Rincón-Tuexi et al., 2006; Ramírez-Lozano, 2010; T. Ruiz et al., 2011; T. E. Ruiz et al., 2014).

Agriculture is currently committed to so-called efficient agriculture, which is equipped with research and applications in the field of artificial intelligence to improve yields. In the present research, lazy algorithms are analyzed with the learning databases of four plant varieties to determine which of these algorithms adapts best when simulating laboratory results in the determination of secondary metabolites, cell wall components and quality components of the species under study (Herrera et al., 2017).

Meat and milk production in ruminants is conditioned by the use of forage plants in the diet. In the tropics, the use of legumes has increased in the search for better production indicators, as well as other feeding alternatives, in several cases obtaining indicators similar to those of conventional systems (Mahecha & Rosales, 2005; Mahecha et al., 2007). Forage plants, beyond being one of the main and excellent components of ruminant nutrition, offer various advantages: they prevent soil erosion, maintain humidity, and provide organic matter. *Gliricidia sepium*, *Erythrina variegata*, *Leucaena leucocephala* and *Tithonia diversifolia* are among those preferred and used in the tropics (Cabrera, 2008).

The aim of the present research is to predict the phytochemical components, cell wall components and digestibility of four varieties of protein plants. To this end, the main problem addressed is determining the adaptability of multiple regression algorithms to the database provided by the pasture and forage specialists of the University of Granma.

In previous research, many regression algorithms have been tested in order to evaluate their behavior with the databases obtained. These analyses have shown that lazy algorithms are the ones that best adapt to these data (Barrios et al., 2015).

The present research studies the lazy algorithms in the MULAN library (Tsoumakas et al., 2011), which is built on top of the WEKA workbench developed by the University of Waikato. The aRMSE (Average Root Mean Square Error) is evaluated as the main performance measure to determine the algorithm that best suits the database. The peculiarity of these algorithms is that, since they work with little data, they rely on the probability that an object may resemble another to estimate or predict a value. Hence, the objective of this research is to evaluate regression algorithms, to determine the behavior of the expressions that best adapt to the procedures of a traditional laboratory, and to estimate the chemical components of protein plants; to this end, the MULAN Java library has been used, which contains machine learning algorithms capable of adapting to dissimilar problems.

**2. Methodology**

**Regression and Classification Tasks**

One of the most important problems in machine learning is defining the type of solution: it is necessary to take into account the types of variables, or types of data, present in the dataset (Alzubi et al., 2018; Coraddu et al., 2016; González, 2015). It is therefore very important to define the types of machine learning task. To frame the solution to the problem, we first define classification and regression:

● Classification is the task of predicting a discrete class label.

● Regression is the task of predicting a continuous quantity.

There is some overlap between the algorithms for classification and regression, for example:

● A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.

● A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.

Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks (Alebele et al., 2020; Mastelini et al., 2020). Others cannot, or cannot easily, be used for both problem types, such as linear regression for regression predictive modeling and logistic regression for classification predictive modeling. Importantly, the way we evaluate classification and regression predictions varies and does not overlap, for example:

● Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

● Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
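The contrast between the two evaluation schemes can be shown with a minimal Python sketch; the label and crude protein values below are invented purely for illustration. A discrete prediction is scored by exact agreement (accuracy), a continuous one by its numeric deviation (RMSE):

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of exactly matching class labels (a classification metric)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error of continuous predictions (a regression metric)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Classification: discrete labels, scored by accuracy.
print(accuracy(["legume", "grass", "grass"], ["legume", "grass", "legume"]))

# Regression: continuous values (e.g. % crude protein), scored by RMSE.
print(rmse([20.1, 18.4, 22.0], [19.8, 18.9, 21.5]))
```

Swapping the metrics makes no sense: the accuracy of continuous predictions would almost always be zero, and the RMSE of class labels is undefined.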

**Multi-Target Regression Task**

In machine learning, to predict a vector of values for any task, the model must first be given a dataset containing all the examples from which to build a system. The process comprises three steps: training; evaluation of the trained model to define its quality; and finally testing, in which the model is given a vector of real values and returns a vector of predicted values (Džeroski et al., 2000; Despotovic et al., 2016; Waegeman et al., 2019; Chen et al., 2021).

The learning process is carried out by a learning algorithm capable of learning from the dataset and returning a result vector. Many such algorithms exist, classified as rule-based, decision-tree-based, lazy, support vector regression, etc., each with its specific characteristics (Nogueira & Koch, 2019).

At present, problems solved by means of regression have reached high levels of applicability. In various real-life scenarios, these models drive the behavior of systems or support rational decision-making. Current models have reached considerable complexity, with problems in which several dependent and several independent variables concur, a challenge that has drawn significant attention from researchers. Among the most current regression techniques is Multiple Target Regression (MTR), whose main task is to simultaneously predict each target variable from several independent variables.

Among the latest contributions to this technique is the proposal by Borchani et al. (2015), which establishes two ways of solving the problem according to its approach: problem transformation methods and method adaptation methods (Chen et al., 2021). They differ in how they exploit the interrelationship between variables to make a prediction: adaptation-based methods take into account the relationships between the output variables, while transformation-based methods decompose the multi-objective problem into several single-output problems (Fang et al., 2015; Zhang et al., 2017; Zhen et al., 2017; Wang et al., 2018; Joshi et al., 2020).

According to Spyromitros-Xioufis et al. (2016), when an MTR problem is modeled the data consist of two vectors: an input vector and a target vector. If X = {x1, x2, …, xd} is defined as the set of input variables and Y = {y1, y2, …, ym} as the set of target variables, each example can then be defined as a pair of vectors x = (x1, …, xd) and y = (y1, …, ym).
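The transformation (single-target decomposition) approach mentioned above can be sketched as follows. This is a minimal illustration, not the models used in this study: `MeanRegressor` is a toy stand-in for any base regressor, and the (input, target) values are hypothetical:

```python
class MeanRegressor:
    """Toy single-target baseline: always predicts the training mean."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self
    def predict(self, x):
        return self.mean

class SingleTargetDecomposition:
    """Transformation approach to MTR: the m-target problem is split into
    m independent single-target problems, one regressor per target."""
    def __init__(self, base_factory):
        self.base_factory = base_factory  # callable returning a fresh regressor
        self.models = []
    def fit(self, X, Y):
        m = len(Y[0])  # number of target variables
        self.models = [self.base_factory().fit(X, [row[j] for row in Y])
                       for j in range(m)]
        return self
    def predict(self, x):
        return [model.predict(x) for model in self.models]

# Hypothetical toy data: inputs -> (protein %, digestibility %) target pairs.
X = [[30, 25.0], [45, 26.1], [60, 24.3]]
Y = [[22.0, 64.0], [19.5, 60.0], [17.0, 56.0]]
mtr = SingleTargetDecomposition(MeanRegressor).fit(X, Y)
print(mtr.predict([40, 25.5]))  # one predicted value per target
```

An adaptation-based method would instead fit one model over all targets jointly, so that correlations between, say, protein and digestibility inform each other's prediction.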

Cross-validation is a technique used to evaluate the results of a statistical analysis and ensure that they are independent of the partition between training and test data. It consists of repeating the analysis and calculating the arithmetic mean of the evaluation measures over the different partitions. It is used in settings where the main objective is prediction and one wants to estimate the accuracy a model would achieve in practice, and it is widely used in artificial intelligence projects to validate generated models. Cross-validation is thus a way to predict the fit of a model to a hypothetical set of test data (Refaeilzadeh et al., 2016; Berrar, 2019).
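The mechanics of k-fold cross-validation can be sketched as follows; the constant-target data and the `MeanRegressor` baseline are hypothetical, used only to show the procedure:

```python
import math
import random

class MeanRegressor:
    """Toy model used only to illustrate the procedure."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self
    def predict(self, x):
        return self.mean

def kfold_rmse(X, y, model_factory, k=10, seed=7):
    """k-fold cross-validation: the data are partitioned into k folds; each
    fold serves once as the test set while the remaining folds train the
    model, and the per-fold RMSEs are averaged into one error estimate."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_rmses = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = model_factory().fit([X[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        sq_errors = [(model.predict(X[i]) - y[i]) ** 2 for i in test_idx]
        fold_rmses.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return sum(fold_rmses) / len(fold_rmses)

X = [[i] for i in range(20)]
y = [5.0] * 20                               # constant target -> error 0
print(kfold_rmse(X, y, MeanRegressor, k=5))  # 0.0
```

Because every example is used for testing exactly once, the averaged error is far less sensitive to a lucky or unlucky single train/test split.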

**Regression algorithms**

The regression algorithms studied in this research are found in the WEKA library; they are the following:

The IBk algorithm does not build a model; instead, it generates a prediction for a test instance just in time. It uses a distance measure to locate the k instances in the training data closest to each test instance and uses those selected instances to make a prediction (Amin & Habib, 2015).
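A minimal sketch of this prediction scheme, assuming Euclidean distance and an unweighted average of the neighbors' targets (WEKA's IBk also supports distance-weighted variants); the instance values (regrowth age, temperature → protein %) are hypothetical:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ibk_predict(train_X, train_y, query, k=3):
    """IBk-style lazy prediction: no model is built in advance; at query
    time the k training instances nearest to the query are located and
    their target values are averaged."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical instances: (regrowth age in days, mean temperature) -> protein %.
train_X = [[30, 24.0], [45, 25.5], [60, 26.0], [90, 27.5]]
train_y = [22.0, 19.5, 17.0, 14.0]
print(ibk_predict(train_X, train_y, [40, 25.0], k=3))  # 19.5
```

All the work happens at query time, which is exactly why such "lazy" learners cope well with small datasets: there is no global model to over- or under-fit.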

Locally Weighted Learning (LWL), related to the LOESS or LOWESS family, is a nonparametric regression method that combines multiple regression models in a k-nearest-neighbor-based scheme: unlike global learners such as classical feedforward neural networks or support vector machines, which fit a single model to all the data, it fits a weighted local model around each query instance (Cambronero & Moreno, 2006; Mariño, 2015).

The principal difference of K* with respect to other instance-based (IB) algorithms is its use of the concept of entropy to define its distance metric, which is calculated as the complexity of transforming one instance into another, taking into account the probability of that transformation occurring in a random-walk manner. Classification with K* is performed by summing the probabilities from the new instance to all the members of a category (Cleary & Trigg, 1995). The same is done for the remaining categories, and the category with the highest probability is finally selected (Barrios et al., 2015).
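As a very rough illustration of the idea only: the exponential weighting below is a stand-in for a transformation probability, not Cleary & Trigg's actual entropic measure, and the data values are hypothetical. What it shows is the structural difference from IBk: every training instance contributes a probability-like weight, rather than only the k nearest:

```python
import math

def kstar_like_predict(train_X, train_y, query, scale=1.0):
    """Crude sketch of the K* idea: each training instance contributes with
    a weight meant to mimic a transformation probability. Here that
    probability is approximated by exp(-distance / scale); the real K*
    derives it from an entropic (complexity-based) measure over random-walk
    transformations, which is not reproduced here."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    weights = [math.exp(-dist(x, query) / scale) for x in train_X]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# A query equidistant from two instances receives the midpoint value.
print(kstar_like_predict([[0.0], [2.0]], [10.0, 20.0], [1.0]))
```

For regression, as here, the category sum becomes a probability-weighted average of the target values.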

**Regression evaluation metrics**

Several metrics exist to evaluate regression models:

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells you how concentrated the data are around the line of best fit.

One way to assess how well a regression model fits a dataset is to calculate the root mean square error, a metric that tells us the average distance between the values predicted by the model and the actual values in the dataset (Despotovic et al., 2016).

Average Root Mean Square Error (Average RMSE) is the average of the per-target RMSE values of the dataset.

Relative Root Mean Squared Error (RRMSE) is calculated by dividing the RMSE by the average value of the measured data (Despotovic et al., 2016).

Mean Absolute Error (MAE) is the magnitude of the difference between a prediction and the true value of the observation. MAE takes the average of the absolute errors over a group of predictions and observations as a measurement of the magnitude of errors for the entire group.

Average Mean Absolute Error (Average MAE) is the average of the per-target MAE values of the dataset.

Relative Mean Absolute Error (Relative MAE) represents the ratio of the error between the measured and predicted values to the measured value, over all the points (Li et al., 2018).

Average Relative Absolute Error (Average Relative MAE) is the average of the per-target Relative MAE values of the dataset.
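These metrics follow directly from their definitions; a minimal sketch, with hypothetical two-target example values (e.g. protein % and digestibility %):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for a single target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error for a single target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rrmse(y_true, y_pred):
    """Relative RMSE: RMSE divided by the mean of the measured values."""
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))

def average_rmse(Y_true, Y_pred):
    """aRMSE for a multi-target problem: the per-target RMSEs are averaged.
    Y_true and Y_pred are lists of rows, one row per example."""
    m = len(Y_true[0])
    per_target = [rmse([row[j] for row in Y_true], [row[j] for row in Y_pred])
                  for j in range(m)]
    return sum(per_target) / m

# Two targets, three examples (hypothetical values).
Y_true = [[20.0, 60.0], [18.0, 58.0], [22.0, 64.0]]
Y_pred = [[19.0, 61.0], [18.0, 57.0], [23.0, 64.0]]
print(average_rmse(Y_true, Y_pred))
```

The averaged variants (aRMSE, Average MAE) collapse a multi-target evaluation into a single comparable number, which is what allows the competing algorithms below to be ranked.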

**3. Results and discussion**

Because the principal objective of the research is to find the algorithm that shows the best results in learning the databases of the treated species, all the algorithms in the MULAN library were tested to observe their learning behavior. For this task, all algorithms applicable to regression problems were selected. We focused specifically on instance-based ones because, in previous research, these showed the best learning results. The results shown in the next table reflect the learning of the competing algorithms; those with the best learning performance were the lazy algorithms, and in this case only the IBk and KStar algorithms were tested. This is because, being instance-based, they can work with little data and use the probability that one object is similar to another to predict values given an input. When predicting a numerical value, this is expressed through how closely one number can approximate another; for this, approximation measures are used, such as the nearest-neighbor technique employed by IBk, while the measure used by KStar is based on the relative entropy between objects.

Table 1 includes the results of a study of the regression algorithms of the MULAN tool, in which all were put to compete in order to observe the learning behavior for each database of all the species studied. As previously explained, the ones that showed the best results were those based on instances (IBk and KStar). Three databases were created for each species, corresponding to secondary metabolites, cell wall components and digestibility, and all algorithms were tested on each database of each variety. The instance-based algorithms (IBk and KStar) showed the best learning performance with these databases because they learn from previous cases and use distance measurements between two points (protein, digestibility and cell wall composition values) to predict the real values of the different constituents of the nutritional value of each of the forage species or protein plants studied.

Pascual et al. (2016) used artificial neural networks to estimate the yield components and nutritional value of three species of pasture grasses, through a multilayer perceptron network that allowed these indicators to be estimated for the first time in Cuba, from learning databases built with information collected from scientific publications and data from referenced laboratories of Humboldt University, Germany. These studies served as the basis for our research, since neural networks and regression models are the most recommended approaches when one wants to predict a numerical value from a set of real values. From the comparison of the two solutions in our study, better values were obtained with the multiple-objective regression models. This is because, during the learning process, networks can produce so-called false positives, which generates over-learning of the model; in models based on neural networks it is very complex to control the internal process and the interaction between the neurons that make up the model, which is why it is often difficult to detect an over-learning phenomenon.

Meanwhile, Estrada-Jiménez et al. (2018), through a comparison of the regression algorithms of the WEKA tool, reported, for the estimation of the phytochemical components of *Leucaena leucocephala* and *Tithonia diversifolia* from climate variables, regrowth age and primary compounds (nitrogen and sugars) produced by the photosynthetic activity of the plant, a better response for the KStar algorithm when evaluating the performance of the predictions using the aRRMSE function. These initial tests served to establish, for the present study, the division of the training sets by process, which favored a higher performance of the instance-based algorithms.

After carrying out the preliminary study with all the algorithms and seeing that those with the best performance were the instance-based ones, the work centered on evaluating only the algorithms of this type in the tool. Table 2 shows the principal evaluation measures for this type of task. As can be seen, the tool has three algorithms for regression of this kind; putting them to compete, the most efficient training was that of KStar. With this result, the other evaluation measures described above can also be examined. The principal purpose is to find the algorithm that learns best, in order to later create a tool that automatically predicts these values from input data: soil data, regrowth age, primary metabolites and climate.

In previous research (Spyromitros-Xioufis et al., 2016; Santana et al., 2017; Estrada-Jiménez et al., 2019; Waegeman et al., 2019; Estrada-Jiménez et al., 2020; Chen et al., 2021), various machine learning algorithms, including regressors, have also been tested with satisfactory results. Here only the lazy algorithms were put to the test, since these are based on the probability that an object may resemble others. In those investigations the datasets had not been divided; that variant was used to test the behavior of all types of algorithms and select the best one through the aRMSE, and the best ones always turned out to be the lazy algorithms, hence the decision in this investigation to test only the lazy ones.

Also, several models have been compared that in most cases included a single dataset, which in a first experiment contained all the data of the studied varieties without considering the flow of the processes we tried to simulate; even so, the lazy algorithms always showed better adaptation to the data despite this drawback.

Then, in consultation with specialists from the department of pastures and forages of the University of Granma, it was decided to separate the dataset according to how the process flow is carried out in a laboratory. The datasets were therefore separated, and three datasets were created, corresponding to phytochemical components, cell wall components and digestibility components, as shown in Table 2. When comparing the aRMSE values, it can be observed that the aRMSE decreases with respect to Table 2, so it can be affirmed that, with the new variant of separating the dataset, the algorithms simulate with better quality the behavior of the plants studied in this research.

From the data studied through the learning of the algorithms, it was possible to verify that the algorithms with the best adaptation to the data were indeed the lazy ones. The performance measure used responds to the fact that in regression problems these are the measures to be applied, which is not the case when the problem is classification. As used by several authors (Karalič & Bratko, 1997; Tuia et al., 2011; Osojnik et al., 2017; Reyes et al., 2018; Camejo-Corona et al., 2019), the aRMSE measure is the most representative for a model, even though one of its known limitations is precisely that the average encompasses all the values included within it; therefore, a very high or very low value at any point may directly affect the average.

In comparison with the research developed by Estrada-Jiménez et al. (2018), it was possible to include the data referring to soil components, digestibility and cell wall components; that work proposed a model that predicted only the secondary metabolites from the primary metabolites, climate and regrowth age. A significant detail is the reduction of the error evaluated to select the regression algorithm. In addition, in the present paper the aRMSE is optimized by creating the databases by process to be determined, that is, one learning dataset each for secondary metabolites, cell wall components and digestibility.

Painuli et al. (2014) reported the effectiveness of the KStar algorithm in predicting the wear of agricultural machinery parts, based on the collection of a set of data and characteristics of these parts, from which the learning dataset was formed; with the application of this algorithm, the effectiveness of the predictions was evaluated at 78%, a value considered high due to the adaptability of the algorithm to the dataset (Painuli et al., 2014).

The use of artificial intelligence as a powerful tool to predict different life processes in different branches of science is a practice that has gained popularity in recent years due to its practical utility and high levels of precision. In this sense, Erdal et al. (2018) developed studies with instance-based (lazy) algorithms to simulate the evaluation of concrete quality. At first, all the algorithms of the WEKA tool, which contains the relevant libraries for data mining, were evaluated; then the data were evaluated only with the lazy algorithms, and from the error it was possible to determine the high performance of the instance-based algorithms (LWL, IBk and KStar). Meanwhile, Maliha et al. (2019), predicting the causes and appearance of cancer with the J48 and KStar algorithms, found an accuracy of 99.3% for logistic regression, 99.5% for KStar and 99.1% for J48.

In turn, Zighed and Bounour (2019) used the KStar algorithm to assess software maintenance, based on the quantity of code to be implemented for the maintenance of a specific software product. Prediction models based on data collected from two object-oriented systems were created. In addition, models created with linear regression, neural network, decision tree and SVM algorithms were compared using the WEKA tool, where the prediction accuracy of all models was compared using cross-validation. As a result, it was shown that KStar produces better results, predicting more accurately than the other techniques. It should be noted that the present study, using this tool, eliminates the multicollinearity between the input variables, removing the correlation between them to avoid setbacks.

Khosravi et al. (2021), using field data from one station, succeeded in predicting flow depth, water surface width and longitudinal water surface slope using independent data mining techniques: instance-based learning (IBk), KStar, locally weighted learning (LWL), and the Vote, Attribute Selected Classifier (ASC), Regression by Discretization (RBD) and Cross-validation Parameter Selection (CVPS) combinations (Vote-IBk, Vote-KStar, Vote-LWL, ASC-IBk, ASC-KStar, ASC-LWL, RBD-IBk, RBD-KStar, RBD-LWL, CVPS-IBk, CVPS-KStar, CVPS-LWL). Through a comparison of predictive performance and a sensitivity analysis of the driving variables, the results reveal that the Vote-KStar model had the highest performance in predicting depth and width, and ASC-KStar in the estimation of the slope.

The results obtained in this research attest to the good adaptability of the algorithms and of artificial intelligence for predicting the components studied. For future studies it is therefore advisable to use these instance-based procedures, given their behavior compared with others.

**4. Conclusions and recommendations**

● The aRRMSE was optimized with respect to previous investigations.

● The species that showed the best behavior was *Leucaena leucocephala* with the KStar algorithm.

● Three datasets were created for each plant variety to evaluate the behavior of lazy algorithms.

● The datasets of the plant varieties created were tested with the lazy algorithms of the WEKA tool, developed by the University of Waikato, to evaluate their adaptability to the datasets.

● The results of the training with each of the datasets were evaluated with the conventional metrics for the evaluation of the regression algorithms. It was verified that the algorithm that presented the best aRMSE turned out to be KStar, which shows that this is the one that can best simulate the behavior of the properties of the varieties studied.

● We recommend testing the proposed and trained models with test cases designed by specialists from the Center for Animal Production Studies of the University of Granma.

● We also recommend considering these results for use in a computational tool that can gather and learn from these databases with these algorithms to simulate the behavior of plant components, and evaluating the methodology used and the process flow in the study of other varieties of plants used for animal nutrition.