Application of the logistic regression analysis to assess credibility of the farm

Logistic regression is a useful tool of statistical analysis used in various fields of research, especially to classify units according to their parameters or to estimate the chance that an event occurs. In economics, the method is typically used to estimate bankruptcy and credit models or to predict consumer behavior. The objective of this paper is to present an application of logistic regression analysis to assess the credibility (creditworthiness) of a farm. The paper can also serve as a guide through the process of modelling, model verification and interpretation of results. The data were individual farm data covering large farms from the database of the Ministry of Agriculture and Rural Development in Slovakia for the period 2009 to 2013. A total of 4000 observations were used to estimate the final model, and 427 observations were used as the sample for model verification. The logistic regression model was then estimated and verified. From the initial set of 13 variables, 7 significant variables were selected for the final model. The factor that increased the probability of getting a loan most was the proportion of loans; the factor that decreased this probability most was the proportion of crop production. According to standard indicators, the quality and predictive ability of the final model were fair; however, the model could be improved by including additional variables, and it should be tested further on more test samples. The paper offers better insight into the application of logistic regression and suggests directions for further development of the topic.


INTRODUCTION
In the 19th century, Sir Francis Galton introduced the now very popular statistical method called regression. The method is used in many areas of research, but it was first applied in genetics. Galton estimated a regression model to predict the height of a child from data on the parents. He found that the difference between a child's height and the average height in the population is proportional to the parents' deviation from the population average.
The classical linear econometric model is not appropriate when the dependent variable is binary or categorical. The main reasons are that probability is not linear in the predictors and that predicted values must fall in the interval between 0 and 1. The linear regression model does not satisfy these requirements. The need for a method that satisfies both conditions led to the development of the logistic regression model. The method has already been applied in various fields of research, for example in healthcare, social, geographic, ecological and physical research, and in economics. This paper focuses on the application of logistic regression in economics and finance, especially in assessing creditworthiness or bankruptcy risk. It shows how the method can be used to assess farm solvency or, more precisely, to estimate the chance that a farm will get a bank loan. The method has already been used in this area of research, as the examples below show.
In France, logistic regression was used to predict the bankruptcy of individual enterprises. Because suitable traditional prognostic models were lacking, Jabeur [1] developed a model that includes financial ratios as explanatory variables, deals with the correlation among them and applies penalizing weights to erroneous data in the matrix. The model was applied to predict business failure, which was appreciated not only by bankers but also by investors in France. The proposed model allows them to predict bankruptcy in advance and helps them to avoid bad investments.
When analyzing the financial statements of corporate entities, Nikolic et al. [2] used a logistic regression model to predict the credit score. The researchers proposed a corporate credit scoring model that predicts the probability of bankruptcy one year in advance. They used a test sample to verify the predictive ability of the estimated models. Logistic regression was evaluated as the best predictive credit scoring model among the candidate solutions. Their model includes the eight explanatory variables that showed the best predictive performance. The analysis was conducted in Serbia, so the final model could be implemented in a bank that operates in the same area or elsewhere in South Eastern Europe. In other regions, an analogous model can be built using a similar technique.
Serener [3] analyzed the use of internet banking and proposed a model to estimate the probability that customers use it. The explanatory variables considered were age, gender, income, marital status, education, occupation and experience with online shopping. The results suggest that clients aged 56-65 are less likely to use these services than respondents aged 18-25. Married persons are less likely to use internet banking than single respondents. The likelihood of using internet banking increased with the individual's income. Similarly, university graduates are more likely to use internet banking than people with lower education. Respondents who already had experience with online shopping tended to use internet banking. Among the professions considered in the research, bank staff are the most likely to use internet banking. Logistic regression can also be applied to marketing support: sales support could be targeted at the specific groups of people identified by the logistic regression results.
Many more examples of logistic regression applications could be mentioned. Because of its valuable properties and the availability of software solutions, it can be applied in many fields of research. This paper focuses on the application of logistic regression in finance to assess farm creditworthiness and to determine the factors that influence it. The paper describes the whole procedure of model specification, estimation, verification and interpretation of the results. It can therefore also be used as a practical guide to applying logistic regression.

MATERIAL AND METHODS
The data used to present the logistic regression model were farm data for the period 2009 to 2013, divided into two groups according to whether or not the farm received a bank loan. The individual farm data cover large farms from the database of the Ministry of Agriculture and Rural Development in Slovakia (Information letters of farms with double-entry accounting). The dataset was divided into two parts. The first part, with 4000 observations, was used to estimate the logistic regression model. The second part, with 427 observations, was used to verify the predictive ability of the model. The model was estimated and verified in the R software environment (CRAN).

Model
Let Y be a binary response variable equal to 1 if the attribute is present in an observation and 0 if it is not, and let x = (x1, x2, x3, …, xk) be a set of explanatory variables, which can be discrete, continuous or a combination of both. Initially, 13 variables were considered as exogenous factors in the model. After backward elimination and model selection, the following 7 variables remained in the final model: Debt (firm debt), ploan (proportion of loans), Ebitda, size (size of firm), revpha (revenue per ha), own (number of owners) and pprv (proportion of crop production).

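The logistic regression model expresses the log odds of the conditional probability as a linear function of the predictors:

$$\ln\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \tag{1}$$

where β = (β1, …, βk) is the vector of k predictor coefficients and β0 is the intercept. The vector of parameters is estimated by the maximum likelihood method. The following likelihood function is maximized over the n observations, with p_i = P(Y_i = 1 | x_i):

$$L(\beta) = \prod_{i=1}^{n} p_i^{\,y_i}\,(1-p_i)^{\,1-y_i} \tag{2}$$

The predicted probability can then be expressed as follows:

$$P(Y=1\mid x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}{1+\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)} \tag{3}$$

Logistic regression does not require the assumptions of the classical linear econometric model based on ordinary least squares. A linear relationship between the dependent and independent variables is not required, and the explained variable and the error term do not need to be normally distributed. Logistic regression also does not require homoscedastic variances, and it can handle nominal or ordinal data as explanatory variables.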
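The estimation step described above can be illustrated with a minimal sketch in pure Python. It maps the linear predictor to a probability with the logistic function, evaluates the Bernoulli log-likelihood and maximizes it by plain gradient ascent. The function names, the learning rate, the iteration count and the toy data are assumptions made for illustration only; the paper estimated its model in R, and statistical packages typically use a faster iterative scheme than the one shown here.

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(beta, x):
    """P(Y=1 | x) for coefficients beta = [intercept, b1, ..., bk]."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return sigmoid(z)

def log_likelihood(beta, X, y):
    """Logarithm of the Bernoulli likelihood of the data under beta."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(beta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize the log-likelihood by gradient ascent (illustrative only)."""
    k = len(X[0])
    n = len(X)
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)  # observed minus predicted
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Made-up toy data: one predictor, outcome not perfectly separable.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(X, y)
```

After fitting, `predict_proba(beta_hat, [2.0])` returns the estimated probability for a unit with predictor value 2.0. The toy data are deliberately not perfectly separable, because under perfect separation the maximum likelihood estimate does not exist as a finite value.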

Likelihood ratio test
This method compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors.
Let L1 be the maximum value of the likelihood without the additional restrictions (unrestricted model) and L2 the maximum value of the likelihood when the parameters are restricted (reduced model). Calculate the ratio:

$$\lambda = \frac{L_2}{L_1} \tag{4}$$

The result is always between 0 and 1. The test statistic can then be calculated as:

$$-2\ln\lambda = -2\,(\ln L_2 - \ln L_1) \tag{5}$$

It follows a chi-square distribution with k degrees of freedom (k is the number of restrictions in the reduced model). H0 states that the reduced model is adequate; a p-value below 0.05 suggests rejecting the null hypothesis, which provides evidence in favor of the unrestricted (current) model.
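The test can be sketched in a few lines of Python. The statistic is minus twice the difference between the log-likelihoods of the reduced and the unrestricted model, and the p-value is the upper tail of a chi-square distribution. The helper `chi2_sf` is written here for integer degrees of freedom so that the sketch needs only the standard library, and the two log-likelihood values in the example are hypothetical, not results from the paper.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting values: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_reduced, loglik_full, n_restrictions):
    """Return the likelihood ratio statistic and its chi-square p-value."""
    stat = -2.0 * (loglik_reduced - loglik_full)
    return stat, chi2_sf(stat, n_restrictions)

# Hypothetical log-likelihoods: intercept-only model vs. model with 7 variables.
stat, p_value = likelihood_ratio_test(-2710.0, -2650.0, 7)
```

A p-value below the chosen significance level, as in this made-up example, would lead to rejecting the reduced model in favor of the model with the additional variables.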

Pseudo R2
The usual R2 cannot be applied to logistic regression because of the binary nature of the dependent variable. The estimated logistic function yields probabilities and therefore cannot fit the real observations, which take only the values 0 or 1. It was therefore necessary to introduce an indicator that better reflects the nature of binary data. McFadden's pseudo R2 can be calculated using the following equation:

$$R^2_{\text{McFadden}} = 1 - \frac{\ln L_c}{\ln L_{\text{intercept}}} \tag{6}$$

where Lc is the maximized likelihood value of the current fitted model and Lintercept is the likelihood value of the model with only the intercept and no covariates. When two models are compared on the same data, McFadden's pseudo R2 is higher for the model with the higher likelihood.
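As a small illustration, McFadden's pseudo R2 can be computed directly from two log-likelihood values. The sketch below also shows how the intercept-only log-likelihood follows from the counts of the two outcomes, since an intercept-only model predicts the sample share of ones for every observation. The function names and the numbers in the example are assumptions for illustration and are not taken from the paper.

```python
import math

def intercept_only_loglik(n_pos, n_neg):
    """Log-likelihood of a model that predicts the sample share of ones."""
    p = n_pos / (n_pos + n_neg)
    return n_pos * math.log(p) + n_neg * math.log(1.0 - p)

def mcfadden_r2(loglik_model, loglik_intercept):
    """McFadden's pseudo R2: 1 - ln(L_model) / ln(L_intercept)."""
    return 1.0 - loglik_model / loglik_intercept

# Hypothetical values: fitted model vs. intercept-only model.
r2 = mcfadden_r2(-2650.0, -2710.0)
```

Because both log-likelihoods are negative and the fitted model cannot have a lower likelihood than the intercept-only model nested in it, the value lies between 0 and 1.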

Classification Rate
The classification rate is calculated by comparing the predicted probabilities with the real outcomes in the control group of data. If P(Y=1|X) > 0.5, the predicted Y is 1; if P(Y=1|X) < 0.5, the predicted Y is 0. In other applications, a different cut-off may be considered to assess the model. A higher classification rate indicates a better model.
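The rule above can be written as a short Python sketch. The function name, the example probabilities and the handling of a probability exactly equal to the cut-off (assigned to 0 here) are assumptions; the text does not specify the tie case.

```python
def classification_rate(probs, actual, threshold=0.5):
    """Share of observations whose predicted class matches the real outcome."""
    predicted = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == y)
    return correct / len(actual)

# Made-up predicted probabilities and real outcomes for five units.
probs = [0.9, 0.2, 0.6, 0.4, 0.7]
actual = [1, 0, 0, 1, 1]
rate_default = classification_rate(probs, actual)        # cut-off 0.5
rate_lower = classification_rate(probs, actual, 0.3)     # alternative cut-off
```

Changing the `threshold` argument shows how the choice of cut-off affects the share of correct predictions on the same data.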

RESULTS AND DISCUSSION
First, the dataset was divided into two parts: the first part, with 4000 observations, was used for the modelling process (variable selection, model estimation and verification), and the second part, with 427 observations, was used to test the predictive ability of the final model. Then the variable selection process was applied; from the total of 13 variables, 7 were selected as factors that significantly influence the probability of getting a loan. These variables were used to estimate the final logistic regression model. The estimated coefficients and their statistics are shown in Table 1. Overall model quality can also be evaluated using the ROC curve and the area under it.
The curve shows the relationship between the true positive rate (correctly predicted loans) and the false positive rate (cases where the model predicted a loan that did not occur). The ROC curve is shown in Figure 1. In this case, the model is slightly in favor of true positive predictions. Another important indicator derived from the ROC curve is the AUC, the area under the curve. The AUC of the presented model is 0.69. According to common rules of thumb, this means that the model is close to fair quality. As already mentioned, the coefficients of the estimated model do not indicate the direct influence of the explanatory variables on the probability of getting a loan, but their influence on the log odds of getting a loan. The estimated function is not linear; the influence of each explanatory variable on the final probability depends on the value of X and differs between low and high values of X. To describe the real influence of an independent variable on the probability, it would be necessary to describe the effect of a change at low, medium and high values of this variable. It is much easier to describe the constant effect on the odds. Therefore, to assess the influence of each variable on the chance of getting a loan, each coefficient is exponentiated to obtain the odds ratio for that variable. The odds ratios of the explanatory variables with their confidence intervals are shown in Table 3. An odds ratio higher than one means a positive effect of the explanatory variable on the chance of getting a loan; an odds ratio lower than one means a negative effect. The influence is significantly positive or negative only if the confidence interval of the odds ratio does not contain 1 (a value of 1 means neither a positive nor a negative influence).
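The contrast between the constant effect on the odds and the varying effect on the probability can be made explicit with a short derivation. The notation p(x) for P(Y=1|x) and the coefficients β0, β1, …, βk are assumed here to follow the standard logistic model described in the methods section.

```latex
\begin{align*}
\text{odds}(x) &= \frac{p(x)}{1-p(x)}
  = \exp\Big(\beta_0 + \sum_{j=1}^{k}\beta_j x_j\Big), \\[4pt]
\frac{\text{odds}(x_1,\dots,x_j+1,\dots,x_k)}{\text{odds}(x_1,\dots,x_j,\dots,x_k)}
  &= e^{\beta_j}, \\[4pt]
\frac{\partial p(x)}{\partial x_j} &= \beta_j\, p(x)\,\bigl(1-p(x)\bigr).
\end{align*}
```

The second line shows that a one-unit increase in x_j multiplies the odds by the same factor regardless of the starting point. The third line shows that the change in probability depends on p(x) itself: it is largest in absolute value at p(x) = 0.5, where it equals β_j/4, and it shrinks toward zero as p(x) approaches 0 or 1.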
The confidence intervals give the range of values in which the odds ratio is expected to lie with 95% confidence. The highest positive influence on the chance of getting a loan comes from the proportion of loans, followed by debt; farm size and the number of owners have only a slightly positive influence. For example, if the number of owners increases by 1, the odds of getting a loan increase by 0.1%. The remaining odds ratios can be interpreted analogously. On the other hand, the highest negative influence on the chance of getting a loan comes from the proportion of crop production, while EBITDA and revenue per ha have a slightly negative influence. When the creditworthiness of a farm has to be assessed, the parameters of the individual farm can be entered into the model and the probability of getting a loan can be estimated to assess its solvency.
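The interpretation of odds ratios can be illustrated with a brief Python sketch. It exponentiates a coefficient and the ends of a Wald-type interval (coefficient plus or minus 1.96 standard errors) and checks whether the interval excludes 1. The Wald construction is an assumption, since the paper does not state how its intervals were computed. The coefficient is chosen so that the odds ratio equals 1.001, which matches the 0.1% increase quoted for the number of owners; the standard error is hypothetical.

```python
import math

def odds_ratio_ci(coef, std_err, z=1.959963984540054):
    """Odds ratio with a Wald-type 95% interval (assumed construction)."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

def is_significant(ci_low, ci_high):
    """True if the interval does not contain 1, the value of no effect."""
    return not (ci_low <= 1.0 <= ci_high)

coef_own = math.log(1.001)   # implies an odds ratio of 1.001
se_own = 0.0002              # hypothetical standard error

odds_ratio, ci_low, ci_high = odds_ratio_ci(coef_own, se_own)
percent_change = (odds_ratio - 1.0) * 100.0  # change in odds per extra owner
significant = is_significant(ci_low, ci_high)
```

The same two functions can be applied to any coefficient of a fitted model; an interval that lies entirely below 1 would indicate a significantly negative effect on the odds.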

CONCLUSIONS
The main objective of this paper was to demonstrate the application of logistic regression when units have to be classified according to their attributes. In economics, the method is usually used for insurance and banking purposes, in creditworthiness (rating) models and bankruptcy models, or in targeted marketing. The application in this paper was the evaluation of the solvency of farms seeking a loan. First, from the set of 13 explanatory variables, a set of 7 variables was selected as the exogenous variables of the final model. All the variables and the overall model were found significant according to the usual statistical procedures. Model quality was assessed by applying the model to a test set of data and calculating McFadden's R square, the ROC curve and the area under the curve, and the number of correct predictions. According to these indicators the model is fair (60% of correct predictions). However, if the model were to be used in practice, additional explanatory variables should be considered to improve its predictive ability. It should also be applied to more test data groups to obtain cross-validated verification and better insight into its prediction accuracy. When interpreting the results of the model, it should be noted that the estimated coefficients do not express the direct influence of the explanatory variables on the final probability of getting a loan, but their influence on the log odds of getting a loan. Probability in this kind of model is non-linear; there is therefore no constant influence of a variable, and the change in probability depends on the specific value of the explanatory variable and differs between low, medium and high values. This is why odds ratios (exponentiated coefficients) are usually estimated and interpreted in logistic models to assess the influence of the variables. In the presented example, the highest positive influence on the chance of getting a farm loan came from the variables debt and proportion of loans, which suggests that if a farm has already successfully obtained a loan in the past, it would probably also get a new loan. The highest negative influence on the chance of getting a loan came from the proportion of crop production. The output of the model is the probability of getting a bank loan. In conclusion, producing a final model would require further modification and cross-validation, but the paper has shown that logistic regression is an efficient tool for data classification. The method can be applied in various fields of research and helps to gain better insight into processes whose outcome is a categorical variable.
ROC curve
The ROC curve and the area under the curve (AUC) are typical performance indicators for a binary classifier. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system: 0.90-1 = excellent, 0.80-0.90 = good, 0.70-0.80 = fair, 0.60-0.70 = poor, 0.50-0.60 = failed.
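A minimal Python sketch of the AUC and of the point system quoted above is given below. It uses the equivalent pairwise definition of the AUC: the share of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The function names and the example scores are assumptions, as is the treatment of the overlapping band limits (each lower limit is taken as inclusive).

```python
def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

def grade(auc_value):
    """Map an AUC value to the academic point system (lower limits inclusive)."""
    bands = [(0.90, "excellent"), (0.80, "good"), (0.70, "fair"), (0.60, "poor")]
    for lower, label in bands:
        if auc_value >= lower:
            return label
    return "failed"  # also returned for values below 0.50

# Made-up predicted probabilities and real outcomes for six units.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
example_auc = auc(scores, labels)
```

Under this reading of the bands, an AUC of 0.69 falls at the top of the 0.60-0.70 band, just below the threshold for a fair model.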

Figure 1
ROC curve

Table 1
Estimated logistic regression model

The estimated coefficients in this case do not express the direct influence of the explanatory variables on the probability of getting a loan, but their influence on the log odds of getting a loan. It is therefore difficult to evaluate the influence of each variable on the dependent variable from the coefficient value alone; on the other hand, the significance of each variable can be evaluated from its p-value in the final model. Significance is indicated in the last column of Table 1. Most of the factors are significant at alpha = 0.001; only the variable revenue per ha is significant at alpha = 0.01. The overall significance of the model, evaluated by the p-value of the likelihood ratio test, is close to 0, which suggests a significant model. The same test was also applied to evaluate the significance of individual variables (Table 2). In this case, the likelihood of the model with only the intercept (without explanatory variables) was compared with the likelihood of the model including the additional explanatory variable. If adding the explanatory variable improved the predictive ability of the model, the variable is denoted as significant. This procedure differs from the method used in the coefficients table above.

Table 2
Evaluation of variables significance

Table 3
Odds ratios with confidence intervals