*Francisca Dias*

Is there a relationship between any of those indicators and GDP per capita?

How strong is that relationship?

What is the effect of each indicator on GDP per capita?

Given any indicator, can GDP per capita be predicted?

I want to know the impact of each one of these 6 Worldwide Governance Indicators separately (univariate model), and later include all of them in the model (multivariate). The sample includes all countries, and the indicators are from the year 2016. The dataset was taken from the World Bank.

I am applying two different Python libraries to estimate the relationship between the dependent variable (log GDP per capita) and the independent variables (features) in the model: **statsmodels** and **scikit-learn**. They will both lead to the same results.

I will interpret the results for both the univariate and the multivariate model and introduce some concepts and problems that may
arise. I will address **multicollinearity** and finally train/test split the data so I can validate the model.

**Data Preprocessing**

- Clean the data: rename the columns, convert the values of all variables into numeric types, delete columns I will not need, map the feature names to acronyms, and transpose the variables in order to have them as NumPy arrays.

**Simple Linear Regression**

- Draw a heatmap, search for correlated variables, convert GDP per capita to log GDP per capita, plot the variables, explain the logic behind linear regression, and plot all countries on the map.

**Univariate Regression with Stats Model**

- Start with the univariate model and display the regression summary. Plot the OLS relationship between GE and GDP per capita.

**Linear Regression with Stats Model**

**Linear Regression with Scikit-Learn**

**Multivariate Regression with Stats Model**

- Now I include all variables in the model.

**Multivariate Regression with Scikit-Learn**

- Include all variables in the model using scikit-learn.

**Model Optimization**

Dataset was taken from the **World Bank Database**.

**Control of Corruption: Percentile Rank**
Control of Corruption captures perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as "capture" of the state by elites and private interests. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.

**Government Effectiveness: Percentile Rank**
Government Effectiveness captures perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.

**Political Stability and Absence of Violence/Terrorism**
Political Stability and Absence of Violence/Terrorism measures perceptions of the likelihood of political instability and/or politically-motivated violence, including terrorism. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.

**Regulatory Quality: Percentile Rank**
Regulatory Quality captures perceptions of the ability of the government to formulate and implement sound policies and regulations that permit and promote private sector development. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.

**Rule of Law: Percentile Rank**
Rule of Law captures perceptions of the extent to which agents have confidence in and abide by the rules of society, and in particular the quality of contract enforcement, property rights, the police, and the courts, as well as the likelihood of crime and violence. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.

**Voice and Accountability: Percentile Rank**
Voice and Accountability captures perceptions of the extent to which a country's citizens are able to participate in selecting their government, as well as freedom of expression, freedom of association, and a free media. Percentile rank indicates the country's rank among all countries covered by the aggregate indicator, with 0 corresponding to lowest rank, and 100 to highest rank.
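The percentile-rank idea behind all six indicators can be illustrated in a few lines. A minimal sketch with made-up scores and country codes (the World Bank derives the real ranks from its aggregate governance estimates, and its convention maps the lowest-ranked country to 0 rather than to 1/n):

```python
import pandas as pd

# Hypothetical aggregate governance scores for five countries (not real WGI data)
scores = pd.Series({'AAA': -1.2, 'BBB': 0.3, 'CCC': 1.5, 'DDD': -0.4, 'EEE': 0.9})

# Percentile rank on a 0-100 scale: each country's position among all countries
pct_rank = scores.rank(pct=True) * 100
print(pct_rank.sort_values())
```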

**GDP per capita (constant 2010 US$):**
GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2010 U.S. dollars.

This dataset can be found here.

Import the libraries.

In [1]:

```
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
```

Import the dataset.

In [2]:

```
gov_data = pd.read_csv('World_Bank_Governance_Data.csv')
```

Rename the columns.

In [3]:

```
gov_data.rename(columns={'Country Name': 'Name',
'Country Code': 'Code',
'Series Name': 'Series',
'Series Code': 'Series_Code',
'2016 [YR2016]': 'Values' }, inplace=True)
```

Convert the datatype from 'Object' to Numeric.

In [4]:

```
# convert_objects is deprecated in recent pandas; coerce the 'Values' column instead
gov_data['Values'] = pd.to_numeric(gov_data['Values'], errors='coerce')
```

Delete unnecessary columns.

In [5]:

```
del gov_data['Name']
```

In [6]:

```
del gov_data['Series_Code']
```

Map the dataset to new features names.

In [7]:

```
gov_data['Series'] = gov_data['Series'].map({'Control of Corruption: Percentile Rank': 'CC',
'Voice and Accountability: Percentile Rank':'VA',
'Government Effectiveness: Percentile Rank':'GE',
'Regulatory Quality: Percentile Rank':'RQ',
'Rule of Law: Percentile Rank':'RL',
'Political Stability and Absence of Violence/Terrorism: Percentile Rank':'PS'})
```

What are the features?

- Control of Corruption: Percentile Rank
- Voice and Accountability: Percentile Rank
- Government Effectiveness: Percentile Rank
- Regulatory Quality: Percentile Rank
- Rule of Law: Percentile Rank
- Political Stability and Absence of Violence/Terrorism: Percentile Rank

What is the response?

- Log GDP per capita

Transform the dataset in order to have the features in columns so I can perform the analysis.

In [8]:

```
gov_data_2 = gov_data.set_index('Code')
```

In [9]:

```
gov_data_3 = pd.pivot_table(gov_data_2,index=["Code"],values=["Values"],columns=["Series"], aggfunc=[np.sum])
```

In [10]:

```
gov_data_4 = pd.DataFrame(gov_data_3.to_records())
```

In [11]:

```
# Rename the flattened pivot columns to the feature acronyms
gov_data_4.columns = ['Code', 'CC', 'GE', 'PS', 'RL', 'RQ', 'VA']
```

In [12]:

```
df1 = gov_data_4
```

Import the dataset related to GDP per capita.

In [13]:

```
gdp_data = pd.read_csv('/Users/FranciscaDias/Kaggle_Temporary/Kaggle_Competions/8.Data_Extract_From_Global_Economic_Prospects/World_Bank_Governance/World_GDP_constant.csv')
```

In [14]:

```
gdp_data.head()
```

Out[14]:

Rename the columns so I can merge with the first table.

In [15]:

```
gdp_data.rename(columns={'Country Name': 'Name',
'Country Code': 'Code',
'Series Name': 'Series',
'Series Code': 'Series_Code',
'2016 [YR2016]': 'GDP_Values' }, inplace=True)
```

In [16]:

```
del gdp_data['Name']
```

In [17]:

```
del gdp_data['Series']
```

In [18]:

```
del gdp_data['Series_Code']
```

In [19]:

```
# convert_objects is deprecated in recent pandas; coerce the GDP column instead
gdp_data['GDP_Values'] = pd.to_numeric(gdp_data['GDP_Values'], errors='coerce')
```

In [20]:

```
gdp_data["log_gdp"] = np.log(gdp_data['GDP_Values'])
```

In [21]:

```
complete = pd.merge(df1, gdp_data, on='Code', how='outer')
complete.head(3)
```

Out[21]:

How can we measure the impact of governance indicators on GDP per capita?

I will start by plotting a heatmap where I can see the correlation coefficient between variables.

Next I will show the difference between GDP per capita and the log of gdp per capita.

I will plot the linear relationship between Government Effectiveness (GE) and log GDP, first aggregated, and then separately so we can see the countries.

I will measure the relationship between GE and log GDP through OLS and interpret the results.

In [22]:

```
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white")
# Compute the correlation matrix
corr = complete.corr()
f, ax = plt.subplots(figsize=(10, 7))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, annot=True, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
```

The heatmap already shows strong correlations among the indicators; this is where **multicollinearity** really
occurs.

In [23]:

```
total = complete.isnull().sum().sort_values(ascending=False)
percent = (complete.isnull().sum()/complete.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head()
```

Out[23]:

We have to make sure there is no missing data.

In [24]:

```
complete=complete.dropna(axis=0)
```

Below I calculate the correlation using two libraries: SciPy and NumPy.

In [25]:

```
from scipy.stats.stats import pearsonr
from numpy import corrcoef
a = complete['GE']
b = complete['log_gdp']
print(pearsonr(a,b))
print(np.corrcoef(a,b))
```

Our estimated correlation between the target (log_gdp) and the GE indicator is 0.86, which is positive and strong.

Since we are modelling the response (target) using just one predictor, GE, this is a simple bivariate correlation.

In [26]:

```
sns.distplot(complete['GDP_Values']);
```

The distribution of GDP per capita is right-skewed, with the mean larger than the median.

In [27]:

```
sns.distplot(complete['log_gdp']);
```

In [28]:

```
complete.GE.describe()
```

Out[28]:

In [29]:

```
complete.plot(x='GE', y='log_gdp', kind='scatter')
plt.show()
```

This plot shows a positive relationship between GE and log GDP per capita.

Higher Government Effectiveness appears to be positively correlated with wealthier economies.

In order to describe this relationship I choose the linear model, which can be written as:

```
log gdp = β0 + β1 GE + ui
```

β0 is the intercept on the y axis

β1 is the line's slope

ui, the random error, captures the deviations of observations from the line due to factors that were not considered in the model

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

y = β0 + β1 x

What does each term represent?

y is the response

x is the feature

β0 is the intercept

β1 is the coefficient for x

Together, β0 and β1 are called the model coefficients. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict log GDP!
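Before turning to statsmodels, the coefficients can be "learned" directly by least squares. A minimal sketch on synthetic, noise-free data (all numbers here are invented), where the known line should be recovered exactly:

```python
import numpy as np

# Synthetic data generated from a known line: y = 6.0 + 0.04 * x (no noise,
# so least squares should recover the coefficients exactly)
x = np.array([10., 25., 40., 55., 70., 85.])
y = 6.0 + 0.04 * x

# Design matrix with a column of ones for the intercept beta0
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem min ||X b - y||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [6.0, 0.04]
```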

In [30]:

```
X = complete['GE']
y = complete['log_gdp']
f, ax = plt.subplots(figsize=(13, 9))
labels = complete['Code']
plt.scatter(X, y, marker='')
for i, label in enumerate(labels):
    plt.annotate(label, (X.iloc[i], y.iloc[i]))
plt.xlabel('Government Effectiveness: Percentile Rank in 2016')
plt.ylabel('Log GDP (constant 2010 US$)')
plt.title('OLS relationship between Government Effectiveness and Income')
sns.regplot(x="GE", y="log_gdp", data=complete);
```

By sorting the dataframe in descending order of log GDP and calling the head and tail methods, we can see the countries with the highest and lowest GDP per capita, respectively.

Countries with highest gdp per capita:

Luxembourg

Norway

Switzerland

Ireland

Qatar

Countries with lowest gdp per capita:

Niger

Congo, Dem. Rep.

Liberia

Central African Republic

Burundi

In [31]:

```
highest = complete.sort_values(['log_gdp'], ascending=[False])
highest.head(5)
```

Out[31]:

In [32]:

```
highest.tail(5)
```

Out[32]:

In [33]:

```
complete.columns
```

Out[33]:

The steps to building and using a model are:

Define: this is where we choose the model.

Fit: capture patterns from the provided data.

Predict: apply the fitted model to new inputs.

Evaluate: measure the model's accuracy.
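The four steps above can be sketched with scikit-learn on toy data (the numbers are invented for illustration; a noiseless line, so the fit is exact):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data: an exact linear relationship between a single feature and the target
X = np.array([[10.], [30.], [50.], [70.], [90.]])
y = 6.0 + 0.04 * X.ravel()

model = LinearRegression()   # Define: choose the model
model.fit(X, y)              # Fit: learn the coefficients from the data
preds = model.predict(X)     # Predict: apply the learned line
print(r2_score(y, preds))    # Evaluate: R^2 is 1.0 for a perfect fit
```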

In [34]:

```
import statsmodels.api as sm
```

In [35]:

```
complete['const'] = 1
reg1 = sm.OLS(complete['log_gdp'],complete[['const', 'GE']])
results = reg1.fit()
results.summary()
```

Out[35]:

In [36]:

```
import statsmodels.formula.api as smf
```

In [37]:

```
reg2 = smf.ols(formula = 'log_gdp ~ GE', data = complete)
results = reg2.fit()
results.summary()
```

Out[37]:

We can now write our estimated relationship as

log gdp = 0.04 GE + 6.357

This equation describes the line that best fits our data.

Let's calculate the average GE in our dataset:

In [38]:

```
complete.GE.mean()
```

Out[38]:

Let us estimate the expected log GDP per capita at the average GE of 49:

In [39]:

```
y = 0.04 * 49 + 6.357
y
```

Out[39]:

Just a reminder that each feature's beta measures the change in the outcome when that feature increases by one unit.

For instance, let us see what happens to log_gdp if we increase one point on GE, from 49 to 50:

In [40]:

```
y = 0.04 * 50 + 6.357
y
```

Out[40]:

In [41]:

```
y0=8.317
y1=8.357
(y1-y0)*100
```

Out[41]:

This result can be interpreted as follows:

A one-unit increase in GE raises log GDP by 0.04, which corresponds to an increase in GDP per capita of roughly 4%.
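A caveat on this reading: interpreting a log-coefficient of 0.04 as "4%" is an approximation; the exact percentage change implied by the model is exp(β1) − 1. A quick check:

```python
import numpy as np

beta_ge = 0.04  # estimated GE coefficient from the regression above

# Approximate interpretation: a 0.04 change in log GDP is read as a 4% change in GDP
approx_pct = beta_ge * 100

# Exact interpretation: GDP is multiplied by exp(0.04) per one-point GE increase
exact_pct = (np.exp(beta_ge) - 1) * 100

print(approx_pct, round(exact_pct, 2))  # 4.0 vs 4.08
```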

The same prediction can be obtained by calling the predict() method.

We can compare the observed and predicted values of log GDP by plotting them on the graph below.

In [42]:

```
f, ax = plt.subplots(figsize=(12, 9))
sns.regplot(x="GE", y = results.predict(), data = complete, label='predicted')
sns.regplot(x="GE", y = complete['log_gdp'], data = complete,label='observed')
plt.xlabel('Government Effectiveness: Percentile Rank in 2016')
plt.ylabel('Log GDP (constant 2010 US$)')
plt.title('OLS relationship between Government Effectiveness and Income')
plt.legend();
```

In [43]:

```
from sklearn import linear_model
```

In [44]:

```
linear_regression = linear_model.LinearRegression()
linear_regression
```

Out[44]:

In [45]:

```
complete.head()
```

Out[45]:

In [46]:

```
feature_cols = ['GE', 'const']  # note: LinearRegression fits its own intercept, so 'const' is redundant here
X = complete[feature_cols]
y = complete.log_gdp
```

In [47]:

```
linear_regression.fit(X, y)
```

Out[47]:

In [48]:

```
# print intercept and coefficients
print(linear_regression.intercept_)
print(linear_regression.coef_)
```

In [49]:

```
X1 = ['const', 'GE', 'CC', 'PS', 'RL', 'RQ', 'VA']
```

In [50]:

```
# Estimate an OLS regression on the full set of variables
reg3 = sm.OLS(complete['log_gdp'], complete[X1])
results = reg3.fit()
results.summary()
```

Out[50]:

In [51]:

```
reg4 = smf.ols(formula = 'log_gdp ~ GE + CC + PS + RL + RQ + VA', data = complete)
results = reg4.fit()
results.summary()
```

Out[51]:

**Notes on the Results:**

- The R-squared increased when we added more variables to the model, as we would expect.

- For this reason we should look at the adjusted R-squared, since it accounts for the complexity of the model and gives a more realistic measure.

- One useful check is to compare the R-squared with the adjusted R-squared: if the gap exceeds 20%, we have probably added redundant variables to the model.

- In our case the R-squared exceeds the adjusted R-squared by about 1%.

- We should also be cautious with p-values. Low p-values (using p < 0.05 as a rejection rule) imply that a feature's effect on log GDP is statistically significant; features that fail this test, here CC, RL, RQ and VA, can challenge our model.

- As for the Cond. No., a score greater than 30, as in our case, signals numerically unstable results. This instability is due to multicollinearity.
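The two diagnostics mentioned above can be reproduced in a few lines of NumPy. The R-squared, n and p values below are hypothetical, and the near-collinear design matrix is synthetic:

```python
import numpy as np

# Adjusted R^2 penalizes complexity: with n observations and p predictors,
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2, n, p = 0.80, 150, 6  # hypothetical values
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))

# Condition number: ratio of largest to smallest singular value of the design
# matrix; nearly collinear columns inflate it well past the rule-of-thumb of 30
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost an exact copy of x1
X = np.column_stack([x1, x2])
print(np.linalg.cond(X))
```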

Remember the correlation matrix from the beginning?

In [52]:

```
correlation_matrix = complete.iloc[:, 1:7]
corr = correlation_matrix.corr()
corr
```

Out[52]:

We can see that there is strong correlation between the variables: all coefficients are above 0.5.

Another way to see associations among variables is to use the **eigenvectors** of the correlation matrix. They recombine the variance among the variables, which makes multicollinearity easier to spot.

In [53]:

```
# Consider all columns except Code, GDP_Values, log_gdp and the constant
variables = complete.columns[1:-3]
variables
```

Out[53]:

In [54]:

```
eigenvalues, eigenvectors = np.linalg.eig(corr)
eigenvalues
```

Out[54]:

In [55]:

```
eigenvectors
```

Out[55]:

Let's investigate the eigenvector on last column, index 5:

In [56]:

```
eigenvectors[:, 5]
```

Out[56]:

In [57]:

```
print(variables[1], variables[3])
```

Removing these two columns would be the best solution.

In [58]:

```
feature_cols = ['CC', 'PS','RQ', 'VA']
X = complete[feature_cols]
y = complete.log_gdp
```

Finally, I split the data into training data and **validation data**, so the model can be evaluated on observations it has never seen.

In [59]:

```
from sklearn.model_selection import train_test_split
```

In [60]:

```
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)
```

In [61]:

```
from sklearn import linear_model
linear_regression = linear_model.LinearRegression()
linear_regression.fit(train_X, train_y)
```

Out[61]:

In [62]:

```
# get predicted prices on validation data
val_predictions = linear_regression.predict(val_X)
```

In [65]:

```
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, val_predictions))
```
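Mean absolute error is simply the average absolute gap between observed and predicted values, measured in the same units as the target. A hand computation with made-up numbers:

```python
import numpy as np

# Hypothetical observed vs predicted log GDP values
observed = np.array([8.3, 9.1, 7.5, 10.2])
predicted = np.array([8.0, 9.4, 7.9, 10.0])

# MAE = mean of |observed - predicted|
mae = np.mean(np.abs(observed - predicted))
print(mae)  # 0.3
```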
