Machine Learning and Data Analysis - Week 3 | Lasso Regression

This is the third assignment of the third course (of five) of the Data Analysis and Interpretation Specialization; detailed information about it can be seen here.

See the entire code and output for this week here.

This assignment deals with Lasso Regression Analysis, which identifies a subset of predictors that minimizes prediction error for a quantitative response variable.
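For context, the lasso estimate minimizes the ordinary least-squares loss plus an L1 penalty on the coefficients (standard material, written here in scikit-learn's parameterization):

$$\hat{\beta} = \underset{\beta}{\arg\min}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert$$

As the penalty weight α grows, more coefficients are shrunk to exactly zero, which is what makes lasso a variable-selection method: the predictors whose coefficients remain nonzero form the selected subset.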

Index

- Variables
- Coefficients
- Conclusion

Variables

Details of my project can be seen here. For this assignment I added 10 variables; to make things easier, I made a summary below:

| Variable Name | Description | Type | Kind |
|---|---|---|---|
| Income | GDP per capita | C | E |
| Alcohol | Alcohol consumption | C | E |
| Army | Armed forces rate | Q | E |
| bCancer | Breast cancer rate | Q | E |
| CO2 | CO2 emissions rate | Q | E |
| Female-employ | Female employment rate | Q | E |
| Net-rate | Internet users rate | Q | E |
| polity | Democracy indicator score | Q | E |
| relectricperperson | Residential electricity consumption per person | Q | E |
| suicide | Suicide rate per 100,000 | Q | E |
| Employ | Employment rate | Q | E |
| Urban | Urbanization rate | Q | E |
| Life | Life expectancy | Q | R |

Type: C = Categorical, Q = Quantitative
Kind: E = Explanatory, R = Response
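The code below refers to a `features` list that is defined elsewhere in the project and not shown in this post. As a hedged reconstruction, it presumably holds the explanatory column names, matching the coefficient table further down:

# Hypothetical reconstruction (the exact definition is not shown in this post):
# the explanatory column names as they appear in the coefficient table below.
features = ['income', 'alcohol', 'army', 'bcancer', 'co2', 'female-employ',
            'net-rate', 'polity', 'relectricperperson', 'suicideper100th',
            'employ', 'urban']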

Coefficients

Data were randomly split into a training set that included 70% of the observations and a test set that included the other 30%. The least angle regression (LARS) algorithm with k=10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Below I show the code and the table with the generated coefficients:


# select predictor variables and target variable as separate data sets
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from tabulate import tabulate

pred_cols = features          # list of explanatory column names (defined earlier)
predvar = data1[pred_cols]

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
predictors = to_num(pred_cols, predictors)   # helper (defined earlier) that coerces columns to numeric

target = data1.life

for p in pred_cols:           # iterate over the column names themselves, not over [pred_cols]
    predictors[p] = preprocessing.scale(predictors[p].astype('float64'))

# split data into train (70%) and test (30%) sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model, estimated via LARS with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
coefs = dict(zip(predictors.columns, model.coef_))

print(tabulate(pd.DataFrame(list(coefs.items())), tablefmt="grid", headers=['Variable', 'Coef']))

| Variable | Coef |
|---|---|
| female-employ | -1.31691 |
| employ | 0 |
| army | 0.815667 |
| suicideper100th | -0.71429 |
| relectricperperson | 0 |
| polity | 0.726064 |
| net-rate | 5.43781 |
| co2 | 1.1599 |
| urban | 0.821296 |
| alcohol | -0.0884026 |
| income | 0 |
| bcancer | 0.393847 |

As we can see, of the 12 explanatory variables, nine were assigned nonzero coefficients; the other three (employ, relectricperperson, and income) were shrunk to exactly zero and thus dropped from the model.
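The methods paragraph above says the model was validated using the test set, but that step is not shown in the code. A minimal sketch of what the validation could look like, reusing the `model`, `pred_*`, and `tar_*` variables from the block above (`mean_squared_error` and `score` are standard scikit-learn calls):

from sklearn.metrics import mean_squared_error

# mean squared error from training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE:', train_error)
print('test data MSE:', test_error)

# R-square from training and test data
print('training data R-square:', model.score(pred_train, tar_train))
print('test data R-square:', model.score(pred_test, tar_test))

Comparable training and test errors would suggest the selected model generalizes reasonably well; a much larger test error would point to overfitting.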

Conclusion

(Plot omitted; see the full code and output linked below.)

Of the 12 predictor variables, 9 were retained in the selected model (see the coefficient table above). During the estimation process, the internet use rate and CO2 emissions were the variables most strongly positively associated with life expectancy.
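The plot can be reproduced with code along these lines; this is a sketch based on the attributes `LassoLarsCV` exposes (`alphas_`, `coef_path_`, `cv_alphas_`, `mse_path_`), noting that `cv_alphas_` has been deprecated in recent scikit-learn releases:

import numpy as np
import matplotlib.pyplot as plt

# coefficient progression: one line per predictor as the penalty relaxes
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha selected by CV')
plt.xlabel('-log(alpha)')
plt.ylabel('Regression coefficients')
plt.title('Regression coefficient progression for lasso paths')
plt.legend()
plt.show()

# mean squared error for each cross-validation fold along the path
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds')
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k')
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.legend()
plt.show()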

The explanatory variables were chosen somewhat arbitrarily, with no goal beyond this assignment; because of that, there may well be issues (a lurking variable, for example) that would require deeper study and analysis.

See the entire code and output for this week here.

Written on November 3, 2016