Evaluation of Ridge, Elastic Net and Lasso Regression Methods in the Presence of Multicollinearity: A Simulation Study

This study evaluates the performance of the Ridge, Elastic Net and Lasso regression methods in handling different degrees of multicollinearity among the independent variables of a multiple regression model, using simulated data. Data sets were simulated with sample sizes n = 200, 1000, 10000, 50000 and 100000 and p = 10 independent variables, and the performances of the three methods were compared using the Mean Square Error (MSE). The study found that the Elastic Net method outperforms the Ridge and Lasso methods in estimating the regression coefficients when the degree of multicollinearity is low, moderate or high, for any sample size, while the Lasso method is the most accurate estimator of the regression coefficients when the data contain severe multicollinearity and the sample size is less than 10000 observations.


Introduction
Multiple linear regression is frequently employed to build a model that predicts expected responses, or to explore the link between a dependent variable and a set of independent variables. The first goal, prediction accuracy, is critical; the second goal, controlling the model's complexity, is equally important. Common linear regression procedures often perform poorly on both counts, prediction performance and model parsimony (Doreswamy and Vastrad, 2013). Regression analysis rests on a number of assumptions about the model, the most important of which is the absence of multicollinearity, alongside homogeneity of variance, absence of autocorrelation, and linearity. If one or more assumptions are violated, the model becomes unreliable and is no longer suitable for estimating population parameters (Herawati et al., 2018).
Multicollinearity occurs in multiple regression analysis when there is a close association or interaction between two or more independent variables. Multicollinearity can produce inaccurate regression coefficient estimates, inflate the standard errors of the regression coefficients, deflate the partial t-tests for the regression coefficients, yield false non-significant p-values, and reduce the predictability of the model (Draper and Smith, 1998; Gujarati, 1995).
The key issue with multicollinearity is that as the degree of collinearity rises, the coefficient estimates in the regression model become unstable and the standard errors of the coefficients become wildly inflated. Multicollinearity is of two types: full (perfect, exact) multicollinearity, and partial (less than perfect) multicollinearity. The first type exists when the independent variables are related in an exact linear way; under this condition, no unique least squares solution to the multiple regression problem can be computed (Slinker and Glantz, 1985).
Since multicollinearity is a serious problem when making inferences or building predictive models, it is crucial to find the best way to deal with it (Judge, 1988). Multicollinearity can be detected using a variety of techniques and methods. Common approaches include examining pair-wise scatter plots of the independent variables for near-perfect relationships, analyzing the correlation matrix for high correlations, computing the variance inflation factors (VIF), inspecting the eigenvalues of the correlation matrix of the independent variables, and checking the signs of the regression coefficients (Montgomery and Peck, 1992; Kutner et al., 2005).
One remedy is to reduce the variance of the estimates at the cost of introducing a small amount of bias. Scholars call these methods "regularization or shrinkage methods", and they are broadly beneficial for the predictive performance of the model. Regularization is crucial in the analysis of modern data. To overcome the shortcomings of ordinary least squares regression in terms of prediction precision, regularized regression methods for the linear model were introduced. Regularization helps formalize a unique solution to this otherwise ill-posed problem. These techniques shrink some coefficients toward zero; this does not by itself perform descriptor selection, but it reduces variance at the expense of a small increase in bias, and thereby improves the generalization of the estimates (Doreswamy and Vastrad, 2013). Ridge, Lasso (the least absolute shrinkage and selection operator) and Elastic Net are among these methods.
Using simulated data, this study examines these three regression methods to determine which one copes best with multicollinearity.

Materials and Methods:
First, we need to consider the basics of regression and which parameters of the equation change under each model. The relationship between a dependent variable and a set of independent variables can be estimated using a multiple linear regression model fitted by the Ordinary Least Squares (OLS) method. If the data comprise n observations {(x_i, y_i)}, i = 1, ..., n, where each observation has a scalar response y_i and a vector of p independent variables x_ij, j = 1, ..., p, the multiple linear regression model can be written as

y_i = β_0 + Σ_{j=1}^{p} β_j x_ij + ε_i,  i = 1, ..., n,

and OLS estimates the regression coefficients by minimizing the sum of squared distances between the observed and predicted values of the dependent variable (Montgomery and Peck, 1992).
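As a rough illustration (the paper's own simulations used R, not this code), the OLS estimate can be obtained by solving the normal equations (X'X)β = X'y. The following pure-Python sketch does exactly that on a tiny noise-free data set; all names are illustrative.

```python
def matmul_t(X, Y):
    """Compute X^T @ Y for lists-of-lists."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    return [[sum(X[i][a] * Y[i][b] for i in range(n)) for b in range(q)]
            for a in range(p)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """OLS via the normal equations (X'X) beta = X'y."""
    XtX = matmul_t(X, X)
    Xty = [row[0] for row in matmul_t(X, [[v] for v in y])]
    return solve(XtX, Xty)

# Noise-free toy data: y = 1 + 2*x1 + 3*x2; first column of ones = intercept.
X = [[1.0, x1, x2] for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]]
y = [1 + 2 * row[1] + 3 * row[2] for row in X]
beta = ols(X, y)  # recovers [1.0, 2.0, 3.0] up to rounding
```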
When building a regression model, the model becomes more complicated as the number of observations and variables grows, and major optimization issues arise (Zou and Hastie, 2005). Furthermore, classical regression analysis fails when assumptions such as constant variance, absence of multicollinearity, and normality are not met (Ogutu et al., 2012). As a result, large coefficients in the model must be corrected, or penalized.
Regularized regression is a form of regression in which the coefficient estimates are shrunk toward zero. It penalizes the magnitude of the coefficients as well as the magnitude of the error term, discouraging complex models mainly to prevent over-fitting. A typical least squares model has flaws, such as failing to generalize well to data sets other than its training data. Regularization greatly reduces the model's variance while having little effect on its bias. The effect on bias and variance is governed by the tuning parameter λ used in the regularization schemes mentioned. As the value of λ increases, the magnitude of the coefficients decreases, lowering the variance. Up to a point, this increase in λ is advantageous, since it reduces variance (thus preventing over-fitting) without losing important properties of the fit; beyond a certain value, however, the model loses important properties, resulting in bias and under-fitting. Accordingly, the value of λ should be selected carefully (Biswas, 2019). There are three such regularization schemes: Ridge, Lasso, and Elastic Net.
In Ridge regression the penalty uses the squared values of the coefficients, while in Lasso regression it uses their absolute values. Elastic Net regression combines the Ridge and Lasso biased-estimation penalties (Zou and Hastie, 2005).

Ridge Regression:
It is well known that Ordinary Least Squares (OLS) is unstable and produces estimates with large variance when multicollinearity is present among the independent variables, i.e. when the columns of X are strongly correlated. Hoerl et al. (1975) developed Ridge regression; this approach is an adjustment of the least squares method that allows biased estimators of the regression coefficients (Myers, 1986).
The Ridge regression approach adds a ridge parameter λ to the diagonal of the matrix X'X; since the diagonal of ones in the correlation matrix can be pictured as a ridge, the method is called ridge regression (Hoerl and Kennard, 2000).
The ridge formula for the coefficients is

β̂_ridge = (X'X + λI)^(−1) X'y.

When λ equals zero, the ridge estimator reduces to the Ordinary Least Squares (OLS) estimator; if the same λ is applied to all coefficients, the resulting estimators are called the ordinary ridge estimators (Hoerl, 1962; Hoerl et al., 1975). It is often convenient to write ridge regression in Lagrangian form:

β̂_ridge = argmin_β { ||y − Xβ||² + λ||β||²₂ },

where λ ≥ 0 is the tuning parameter (penalty, regularization) that controls the strength of the penalty (linear shrinkage) by weighting the data-dependent empirical error against the penalty term. The larger the value of λ, the greater the amount of shrinkage. Since the appropriate value of λ depends on the data, it can be found using data-driven techniques such as cross-validation (Doreswamy and Vastrad, 2013).
By constraining the coefficient estimates, Ridge regression can overcome this multicollinearity; as a result, it decreases the estimator's variance while also introducing some bias (James et al., 2013).
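A minimal pure-Python sketch of the closed form β̂_ridge = (X'X + λI)^(−1) X'y, on two nearly collinear predictors (the data and names are illustrative, not from the paper's simulation). With λ = 0 OLS puts all the weight on the first predictor; a positive λ splits the coefficient almost equally between the two correlated columns and shrinks the total, which is the stabilizing behaviour described above.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(X, y, lam):
    """Ridge closed form: solve (X'X + lam*I) beta = X'y."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtX, Xty)

# Two nearly collinear (centered) predictors: x2 is almost identical to x1.
X = [[-1.0, -1.01], [0.0, 0.02], [1.0, 0.99]]
y = [-2.0, 0.0, 2.0]
b_ols   = ridge(X, y, 0.0)  # lam = 0 recovers OLS: all weight on x1
b_ridge = ridge(X, y, 1.0)  # penalty splits the weight and shrinks the sum
```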

Lasso Regression:
Lasso regression approaches are broadly used in handling big databases, such as those used in drug discovery, where efficient and fast algorithms are required (Hastie and Friedman, 2010); the Lasso estimator is also known as basis pursuit (Chen et al., 1998). Still, when there are steep correlations between descriptors, Lasso will choose one and ignore the others. The Lasso penalty drives many coefficients to be exactly zero, with only a small subset remaining nonzero. The Lasso estimator is obtained as the sparse solution of the L1-penalized least squares problem (Tibshirani, 1996):

β̂_lasso = argmin_β { ||y − Xβ||² + λ Σ_{j=1}^{p} |β_j| }

(Herawati et al., 2018). The Lasso estimation method handles both the multicollinearity issue and feature selection together in the high-dimensional linear regression model. Nonetheless, according to Zou and Hastie (2005), the Lasso estimation procedure is unstable when the number of predictors is greater than the number of observations, and the prediction performance of Ridge dominates Lasso when there is high multicollinearity among the predictors.
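The L1-penalized problem above is commonly solved by cyclic coordinate descent with soft-thresholding. The pure-Python sketch below (illustrative names; the scaling of λ follows the objective as written, minimizing 0.5·||y − Xβ||² + λ·Σ|β_j|) shows the behaviour noted in the text: given two highly correlated predictors, Lasso keeps one and sets the other exactly to zero.

```python
def soft_threshold(rho, lam):
    """S(rho, lam): shrink rho toward zero by lam, clipping at zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on 0.5*||y - Xb||^2 + lam*sum|b_j|."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual, excluding feature j's current contribution
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, lam) / z
    return b

# Two highly correlated predictors: Lasso keeps x1, zeroes out x2.
X = [[-1.0, -1.01], [0.0, 0.02], [1.0, 0.99]]
y = [-2.0, 0.0, 2.0]
b = lasso_cd(X, y, lam=0.5)  # converges to roughly [1.75, 0.0]
```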

Elastic Net Regression:
According to Friedman et al. (2010), the Elastic Net is an extension of the Lasso that is robust to strong correlations among the predictor variables. To prevent the instability of the Lasso solution paths when predictor variables are strongly correlated, Zou and Hastie (2005) proposed the Elastic Net for analyzing high-dimensional data. The Elastic Net estimator is

β̂_EN = argmin_β { ||y − Xβ||² + λ [ α||β||₁ + (1 − α)||β||²₂ ] }.

Note that when α = 0 the Elastic Net estimator is equivalent to Ridge, and when α = 1 it is equivalent to Lasso. If λ = 0, the Elastic Net method reduces to ordinary least squares regression.
Hence, the Ridge, Lasso and Elastic Net estimators can be written in a common penalized form for the (possibly misspecified) regression model:

β̂ = argmin_β { ||y − Xβ||² + λ [ α||β||₁ + (1 − α)||β||²₂ ] },

with α = 0 giving Ridge and α = 1 giving Lasso. The MSE, which is the expected prediction error of an estimator β̂, is given by

MSE(β̂) = E[ (β̂ − β)'(β̂ − β) ].

In brief, the following are some salient distinctions between Lasso, Ridge and Elastic Net (Hastie et al., 2001):
• Lasso yields a sparse selection, unlike Ridge, which does not.
• Ridge regression shrinks the coefficients of extremely correlated variables towards one another, whereas Lasso is indifferent between them and picks one over the other; in context, one cannot know in advance which variable will be chosen. Elastic Net is a compromise between the two, attempting to shrink and to perform a sparse selection at the same time.
• Ridge estimators are indifferent to multiplicative scaling of the data: if both X and y are multiplied by a constant, the coefficients of the fit do not change for a given λ. For Lasso, however, the fit is not independent of the scaling; the λ parameter must be scaled up by the multiplier to obtain the same result. The situation is more complicated for Elastic Net.
• Compared with Lasso, Ridge penalizes the largest β's more than it penalizes the small ones (since it squares them in the penalty term), while Lasso penalizes small and large coefficients more uniformly. Sometimes this is of no consequence, but when a forecasting problem involves one strong predictor, Ridge shrinks that predictor's effect more than Lasso does.
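The coordinate-descent update for the Elastic Net differs from the Lasso's only in the denominator, which absorbs the ridge part of the penalty: b_j = S(ρ_j, λα) / (z_j + λ(1 − α)). The illustrative pure-Python sketch below (toy data, not the paper's simulation) demonstrates the grouping effect described above: on two highly correlated predictors, the Elastic Net retains both with nearly equal coefficients, where the Lasso would keep only one.

```python
def soft_threshold(rho, lam):
    """S(rho, lam): shrink rho toward zero by lam, clipping at zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def enet_cd(X, y, lam, alpha, n_iter=200):
    """Elastic Net via coordinate descent:
    0.5*||y - Xb||^2 + lam*(alpha*sum|b_j| + 0.5*(1-alpha)*sum(b_j^2))."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # L1 part enters the threshold, L2 part enters the denominator
            b[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
    return b

# Same highly correlated toy predictors as before: Elastic Net keeps both,
# with nearly equal coefficients (the "grouping effect").
X = [[-1.0, -1.01], [0.0, 0.02], [1.0, 0.99]]
y = [-2.0, 0.0, 2.0]
b = enet_cd(X, y, lam=0.5, alpha=0.5)
```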

Results: Simulation Study
Using the R package, we simulated the linear regression model for n = 200, 1000, 10000, 50000, 100000 observations and p = 10 independent variables. To explore the effect of different degrees of multicollinearity on the estimators, we considered values of the correlation parameter γ representing low, moderate, high and severe multicollinearity, and generated the independent variables following McDonald and Galarneau (1975):

x_ij = (1 − γ²)^(1/2) z_ij + γ z_{i,p+1},  i = 1, ..., n,  j = 1, ..., p,

where the z_ij are independent standard normal random variables, so that the correlation between any two predictors is γ². Tables 1-3 show that Elastic Net outperformed Ridge and Lasso at n = 200, 1000, 10000, 50000, 100000 observations when the degree of multicollinearity was low, moderate or high.
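The McDonald and Galarneau (1975) generation scheme can be sketched as follows (a pure-Python stand-in for the R code actually used in the study; function names and the seed are illustrative). With γ = 0.9, any two generated predictors should be correlated at roughly γ² = 0.81.

```python
import math
import random

def gen_predictors(n, p, gamma, seed=42):
    """McDonald & Galarneau (1975): x_ij = sqrt(1-gamma^2)*z_ij + gamma*z_{i,p+1},
    giving pairwise predictor correlation approximately gamma^2."""
    rng = random.Random(seed)
    z = [[rng.gauss(0, 1) for _ in range(p + 1)] for _ in range(n)]
    w = math.sqrt(1 - gamma ** 2)
    return [[w * z[i][j] + gamma * z[i][p] for j in range(p)] for i in range(n)]

def corr(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

X = gen_predictors(n=2000, p=10, gamma=0.9)
r12 = corr([row[0] for row in X], [row[1] for row in X])  # near 0.81
```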
Table 4 shows that the Lasso method outperformed Ridge and Elastic Net under severe multicollinearity (γ = 0.99) at n = 200, 1000, 10000 observations, while at n = 50000 and 100000 observations the Elastic Net method was the best, making it the most accurate estimator of the regression coefficients at large sample sizes.

Conclusion
According to the outcomes of the simulation at p = 10 and n = 200, 1000, 10000, 50000, 100000 observations, with different degrees of multicollinearity among the independent variables, the results can be summarized as follows:
• The Elastic Net method outperforms the Ridge and Lasso methods in estimating the regression coefficients when the degree of multicollinearity is low, moderate or high.
• Elastic Net regression outperformed the other two methods in the case of low, moderate and high multicollinearity when 0 < α < 1, while severe multicollinearity requires the higher value α = 1 (i.e. the Lasso). This is consistent with the theoretical framework.
• The Ridge method was the least suitable estimator of the regression coefficients compared with the Lasso and Elastic Net methods.
• In general, when studying relationships among interconnected economic and social factors, we recommend that the decision maker use the Elastic Net method for any sample size.
• We also recommend that the decision maker use the Lasso method when working with real data in which the relationships between the variables exhibit severe multicollinearity and the sample size is less than 10000 observations.