Classification model of poverty risk in the European Union

This paper analyzes an at-risk-of-poverty dataset with the WEKA machine learning toolkit, with the aim of mining relationships in selected Eurostat data for efficient classification. We applied eight classification algorithms to the dataset and used WEKA to search for the best one. We evaluated the algorithms using accuracy measures such as the Kappa statistic, TP rate, FP rate, Precision, Recall, F-measure, ROC Area and PRC Area. Overall model accuracy was monitored by the number of correctly classified instances. We report the values of the monitored indicators for the best-performing algorithm, J48.


INTRODUCTION
Poverty and income inequality are highly topical issues, not least because of the COVID-19 pandemic we are currently experiencing. They matter not only in developing countries: reducing income inequality and poverty is also an important goal for the Member States of the European Union. Monitoring the development of poverty levels is important in determining the socio-economic progress of society [12]. Eurostat publications state that one-fifth or more of the population was at risk of poverty in as many as seven EU countries in 2018 [2].
Poverty and social exclusion are multidimensional phenomena. Just as there is no single correct definition of poverty, there is no single generally accepted way of measuring it [11]. The at-risk-of-poverty threshold is set at 60% of the national median equivalised disposable income and is expressed in purchasing power parity (PPP). Living standards are often compared between countries on the basis of gross domestic product (GDP) per capita, which expresses in monetary terms the total size of the economy divided by the number of people living in it and is used to measure a country's wealth and prosperity. However, this headline indicator provides no information on the distribution of income within a country, nor on the non-monetary factors that can play an important role in determining the quality of living conditions of the population [2].
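The 60% threshold described above can be illustrated with a minimal sketch. The income values are invented for illustration only and are not Eurostat data; the function name `at_risk_of_poverty_rate` is likewise a hypothetical helper, not part of any library.

```python
from statistics import median

def at_risk_of_poverty_rate(incomes, share=0.60):
    """Return (threshold, rate) for a list of equivalised disposable incomes.

    threshold = share * median income; rate = fraction of incomes below it.
    """
    threshold = share * median(incomes)
    below = sum(1 for income in incomes if income < threshold)
    return threshold, below / len(incomes)

# Hypothetical equivalised disposable incomes (assumption, not real data)
incomes = [4000, 9000, 12000, 15000, 18000, 20000, 22000, 26000, 31000, 45000]
threshold, rate = at_risk_of_poverty_rate(incomes)
print(f"threshold = {threshold:.0f}, at-risk-of-poverty rate = {rate:.0%}")
```

Because the threshold is tied to the national median, the indicator is a relative measure: it reflects the shape of the income distribution rather than an absolute living standard.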
Long-term observations of income inequality and poverty show that countries with higher income inequality tend to have high levels of poverty, while countries with low income inequality tend to have low at-risk-of-poverty rates. Janovičová & Bartová assessed the development of income inequality and the poverty risk rate in the V4 countries over the years 2005-2017 using a panel of annual data and econometric models [5]. They found that in Poland and Hungary the at-risk-of-poverty rate was significantly higher than in the Czech Republic and Slovakia in the observed period. Carlsen and Bruggemann [1] studied inequality within the 27 EU Member States using partial ordering methodology applied to a multi-indicator system. They found that Luxembourg, the Netherlands, Austria and Finland had rather low inequality, whereas Bulgaria and Romania had the highest degree of inequality in the period under review. They also found that Luxembourg and Hungary were isolated countries, i.e., incomparable to any other EU Member State. Drawing on Eurostat data (the EU-SILC survey), Muster [7] presented the dynamics of in-work poverty in individual EU countries in 2006-2019. He reported a particularly significant increase in poverty over that period in Bulgaria, Germany, Hungary, Malta and the Netherlands. Among the factors with a key impact on the impoverishment of the economically active he included a low level of education, flexible work, part-time work, young age, little work experience and living in multi-person households. Janovičová [4] assessed the development of income inequality and the poverty risk rate in 19 EU Member States over the years 2005-2017. She found that the proportion of the population aged 65 and over, the unemployment rate and the share of people aged 18-59 living in jobless households have a statistically significant positive effect on the growth of income inequality and the at-risk-of-poverty rate.
Policymakers need accurate data on poverty prevalence to design anti-poverty policies [12]. Žilinský et al. [12] argue that subjective poverty indicators provide essential information and should be taken into account as a supplementary dimension when assessing the poverty level in a society. They found that, with the exception of a few countries, all three subjective poverty indices (the headcount ratio, the poverty gap index and the severity of poverty index) show consistently decreasing trends in subjective poverty in EU Member States. Their results suggest that objective poverty measures should take housing costs into account, because social subjective poverty lines are considerably higher for households paying mortgages and for tenants paying rent than for outright homeowners.
Ivanová and Grmanová [3] studied the sustainability of EU labor immigration in terms of poverty inequalities and employment. They argue that immigrants from outside the EU are at significantly higher risk of poverty because, in most EU countries, the employment rate in the group "nationals" is lower than in the group "foreign" from the EU. Tkachova et al. [10] determined that the policy of integration of immigrants does not ensure the achievement of the goal of inclusive and equitable socio-economic welfare. Another particularly vulnerable group in terms of the risk of poverty is the unemployed. Almost half (48.6%) of all unemployed people in the EU27 were at risk of poverty in 2018, with by far the highest rate recorded in Germany (69.4%). Another 11 EU Member States (Lithuania, Malta, Latvia, Sweden, Bulgaria, Hungary, the Czech Republic, Estonia, Slovakia, Spain and Belgium) reported that at least half of the unemployed were at risk of poverty in 2018 [2].
The Waikato Environment for Knowledge Analysis (WEKA) gives researchers easy access to state-of-the-art machine learning techniques [9]. This data analysis software contains many machine learning algorithms [8]. It provides a large number of classifiers for data mining tasks, together with tools for analyzing the output they produce [6]. In this article we focus on classification as one of the data mining techniques suited to extracting patterns from data.

MATERIAL AND METHODS
We obtained the data as secondary data from the international Eurostat database. We compiled a dataset covering 28 EU countries: Austria, Belgium, Bulgaria, Croatia, Cyprus, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, the Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden and the United Kingdom (Table 1).
Table 1 Values of indicators forming the research dataset before its modification. Source: data Eurostat, author processing, output from WEKA
We used 11 numeric attributes for the analyses (Table 1). There were no missing data in the dataset; it was adjusted by discretizing the variables needed for the classification methods (Table 2). As the classification attribute, we set the at-risk-of-poverty rate (RiskRP), which we discretized into three nominal categories (Table 2).
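The text does not state how the cut points for the three categories were chosen. As one possible illustration, the sketch below uses equal-width binning; this choice, the category labels and the sample rates are all assumptions, and `equal_width_bins` is a hypothetical helper rather than a WEKA function.

```python
def equal_width_bins(values, labels=("low", "medium", "high")):
    """Map numeric values to nominal labels using equal-width intervals.

    Assumption: the range [min, max] is split into len(labels) intervals
    of equal width; the maximum value is assigned to the last interval.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for value in values:
        if width == 0:
            index = 0  # all values identical: a single category
        else:
            index = min(int((value - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned

# Hypothetical at-risk-of-poverty rates in percent (not Eurostat values)
rates = [10.0, 12.5, 14.0, 16.5, 19.0, 22.0]
categories = equal_width_bins(rates)
print(list(zip(rates, categories)))
```

Equal-frequency binning, which puts roughly the same number of countries in each category, would be an equally plausible alternative and can produce different class boundaries on skewed data.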
We selected the relevant data on the basis of correlation coefficients expressing the relationship between the individual attributes and the classification attribute. Based on the obtained values of the correlation coefficients, we can conclude that there is a relationship between the at-risk-of-poverty rate and 6 attributes: Income quintile share ratio S80/S20; Gini coefficient of equivalised disposable income; Gross domestic product per capita; Population by educational attainment level (upper secondary and post-secondary non-tertiary education, levels 3-4); Population by educational attainment level (less than primary, primary and lower secondary education, levels 0-2); and Total general government expenditure. In the following analyses we use only these 6 attributes.
We used data mining methods to extract models describing the investigated data. We tested several classification methods: a method using information theory (the J48 algorithm), methods based on decision trees (Random Forest, Random Tree), methods based on conditional probability (Bayes Net, Naive Bayes), the PART rule learner, a classifier based on the k-nearest neighbors principle (the lazy classifier IBk, instance-based learning with parameter k), and the Bagging meta-algorithm. We used the WEKA toolkit to analyze the performance of the classifiers. We used several test options to build and test the models: cross-validation, use of the training set, and a 66% percentage split. We chose the best model based on the values of the following indicators: correctly classified instances, incorrectly classified instances, the kappa statistic, the area under the receiver operating characteristic (ROC) curve, and the area under the precision-recall curve (PRC).
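The correlation-based attribute selection described above can be sketched as follows. This is a minimal illustration, assuming the Pearson coefficient and a cut-off of |r| >= 0.3; neither the type of coefficient nor the cut-off is stated in the text, and the attribute values are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def select_attributes(columns, target, min_abs_r=0.3):
    """Keep attributes whose |r| with the target reaches min_abs_r.

    min_abs_r is a hypothetical threshold chosen for this sketch.
    """
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= min_abs_r}

# Hypothetical values for five countries (not Eurostat data)
risk = [12.0, 15.0, 17.0, 21.0, 23.0]
columns = {
    "Gini": [24.0, 27.0, 30.0, 33.0, 36.0],
    "Noise": [5.0, 1.0, 4.0, 1.0, 5.0],
}
selected = select_attributes(columns, risk)
print(selected)
```

Filtering on the absolute value keeps attributes with a strong negative relationship as well as those with a strong positive one.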

RESULTS AND DISCUSSION
Given the values of the monitored indicators, we chose the model created by the J48 algorithm as the best model (Figure 1). Although this model achieved the same numbers of correctly and incorrectly classified instances as the models created by the PART rule learner, the Bagging meta-algorithm, Bayes Net and the lazy IBk classifier, J48 achieved a lower error rate and a slightly higher area under the ROC curve. The correctly classified instances amounted to 89.29% and the incorrectly classified instances to 10.71%. The Kappa statistic reached 0.84, which is considered very good. It indicates a high degree of agreement between two sets of categorized data, the observed and the predicted values. Mean absolute error measures how close the predicted values are to the actual values, i.e. how close the predicted model is to the actual one [6]. The mean absolute error reached the value of 0.12. Root mean square error (RMSE) measures the differences between the values predicted by a model and the values actually observed, so a small RMSE means better model accuracy [6]. The root mean square error reached the value of 0.25. The relative absolute error between the predicted and actual values was 27%. The ratio of the number of observations predicted as a low at-risk-of-poverty rate to the total number of observations in the low at-risk-of-poverty category was . The ratio of the number of observations predicted as a medium at-risk-of-poverty rate to the total number of observations in the medium at-risk-of-poverty category was . All observations predicted as a high at-risk-of-poverty rate belonged to the high at-risk-of-poverty category. The false negative rate reached 0.11 in the low at-risk-of-poverty category and 0.59 in the medium at-risk-of-poverty category.
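To make the relationship between the share of correctly classified instances and the Kappa statistic concrete, the sketch below computes both from a confusion matrix. The off-diagonal entries follow the misclassifications reported for the confusion matrix (one low country predicted as medium, two medium countries predicted as low); the class totals are an assumption made for this example and are not taken from Figure 1.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Class order: low, medium, high. Class totals (10, 11, 7) are assumed.
matrix = [
    [9, 1, 0],
    [2, 9, 0],
    [0, 0, 7],
]
accuracy, kappa = accuracy_and_kappa(matrix)
print(f"accuracy = {accuracy:.4f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw accuracy because it discounts the agreement that would be expected by chance given the class frequencies.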
Detailed accuracy by class (false positive rate, Precision, Recall, Matthews Correlation Coefficient, ROC Area, PRC Area) is shown in Figure 1. We observed a high correlation between observed and predicted values (83%). The area under the ROC curve is shown graphically for two categories in Figure 2 and Figure 3. The area under the precision-recall curve took values of 79%, 85% and 100%, which indicates high precision and recall for all categories.
Source: data Eurostat, author processing by WEKA
According to the confusion matrix (Figure 1), one country had a low at-risk-of-poverty rate but was predicted as medium, and in two cases countries with a medium at-risk-of-poverty rate were predicted to have a low rate.
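The per-class measures named above can be derived from a confusion matrix as in the following sketch. The matrix is illustrative: its off-diagonal entries match the misclassifications described in the text, while the class totals are assumed and do not come from Figure 1. The helper `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Compute recall (TP rate), precision and FP rate for each class.

    Rows are actual classes, columns are predicted classes. The sketch
    assumes every class occurs and is predicted at least once.
    """
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # actual k, predicted other
        fp = sum(row[k] for row in matrix) - tp   # predicted k, actual other
        tn = total - tp - fn - fp
        metrics[label] = {
            "recall": tp / (tp + fn),
            "precision": tp / (tp + fp),
            "fp_rate": fp / (fp + tn),
        }
    return metrics

labels = ["low", "medium", "high"]
matrix = [
    [9, 1, 0],   # assumed class totals: 10 low,
    [2, 9, 0],   # 11 medium,
    [0, 0, 7],   # 7 high
]
metrics = per_class_metrics(matrix, labels)
for label in labels:
    print(label, {name: round(v, 3) for name, v in metrics[label].items()})
```

Reading the measures per class shows where the errors concentrate: here they occur only between the low and medium categories, while the high category is separated cleanly.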
The decision tree is shown in Figure 4. J48 builds the decision tree from a labeled training data set using information gain: at each node, the attribute with the highest normalized information gain is chosen for the split. The splitting procedure stops if all instances in a subset belong to the same class [9]. The algorithm selected the variable Income quintile share ratio (S80/S20) as the root decision node. The tree contains two intermediate nodes (branches) formed by the variables Gini coefficient of equivalised disposable income (Gini) and the population with secondary education (PopSecEdu). The tree ends in 5 leaf nodes (leaves), which also show the numbers of correctly and incorrectly classified instances.
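The normalized information gain (gain ratio) criterion mentioned above can be sketched for a single binary split on a numeric attribute. This is a simplified illustration, not J48 itself: it omits candidate threshold search, pruning and other details of the algorithm, and the S80/S20 values and class labels are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    remainder = weights[0] * entropy(left) + weights[1] * entropy(right)
    information_gain = entropy(labels) - remainder
    split_info = -sum(w * log2(w) for w in weights)
    return information_gain / split_info

# Hypothetical S80/S20 values and risk categories (not Eurostat data)
s80s20 = [3.4, 3.8, 4.1, 4.9, 5.6, 6.9]
classes = ["low", "low", "low", "medium", "medium", "high"]
good_split = gain_ratio(s80s20, classes, threshold=4.5)
weak_split = gain_ratio(s80s20, classes, threshold=3.6)
print(f"threshold 4.5: {good_split:.3f}, threshold 3.6: {weak_split:.3f}")
```

Dividing the information gain by the split information penalizes splits that fragment the data unevenly, which is why the criterion is described as normalized.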

CONCLUSIONS
In this article, we dealt with one data mining technique, classification. We applied classification methods to a dataset obtained from the international Eurostat database. As the classification attribute, we set the at-risk-of-poverty rate in the population. We focused on the EU28 countries and 11 attributes. We found that the classification attribute was significantly positively affected by Total general government expenditure, GDP per capita, the Income quintile share ratio (S80/S20), the Gini coefficient of equivalised disposable income, and the proportion of the population by level of primary and secondary education. We used WEKA to search for the best classification algorithm. The models created by the classification techniques were built on training data. To evaluate the performance of the classifiers, we used the WEKA data mining tool and accuracy measures such as the Kappa statistic, TP rate, FP rate, Precision, Recall, F-measure, ROC Area and PRC Area. Overall, the most suitable algorithm for predicting the at-risk-of-poverty rate in the monitored countries was J48.