Machine Learning and Data Analysis - Week 2 | Lasso Regression

This is the second assignment of the fourth course (of five) in the Data Analysis and Interpretation Specialization; detailed information about it can be seen here.

See the entire code and output for this week here.

This assignment deals with lasso regression analysis, which identifies a subset of predictors that minimizes prediction error for a quantitative response variable.
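For reference, the lasso estimates the regression coefficients by adding an L1 penalty to the ordinary least-squares criterion (this is the standard formulation, not something specific to this assignment):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|$$

The penalty parameter $\lambda$ controls the amount of shrinkage: as $\lambda$ grows, more coefficients are driven to exactly zero, which is what performs the variable selection.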

Index

Variables

Details of my project can be seen here. For this assignment I use 12 explanatory variables; to make them easier to follow, I made a summary below:

| Variable Name | Description | Type | Kind |
|--------------------|------------------------------------------------|:----:|:----:|
| Income | GDP per capita | C | E |
| Alcohol | Alcohol consumption | C | E |
| Army | Armed forces rate | Q | E |
| bCancer | Breast cancer rate | Q | E |
| CO2 | CO2 emissions rate | Q | E |
| Female-employ | Female employment rate | Q | E |
| Net-rate | Internet users rate | Q | E |
| polity | Democracy indicator score | Q | E |
| relectricperperson | Residential electricity consumption per person | Q | E |
| suicide | Suicide rate per 100,000 | Q | E |
| Employ | Employment rate | Q | E |
| Urban | Urbanization rate | Q | E |
| Life | Life expectancy | Q | R |

Type: C = Categorical, Q = Quantitative
Kind: E = Explanatory, R = Response

Coefficients

Data were randomly split into a training set that included 70% of the observations and a test set that included the remaining 30%. The least angle regression (LAR) algorithm with k=10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Below I show the code and the resulting table of coefficients.
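The code assumes that a data1 DataFrame and a features list were defined earlier in the notebook. A minimal sketch of that setup might look like the following (the file name and the exact column names are assumptions, not taken from the original notebook):

import pandas as pd

# load the GapMinder data set (file name assumed)
data1 = pd.read_csv('gapminder.csv', low_memory=False)

# explanatory variables, matching the summary table above (column names assumed)
features = ['income', 'alcohol', 'army', 'bcancer', 'co2', 'female-employ',
            'net-rate', 'polity', 'relectricperperson', 'suicideper100th',
            'employ', 'urban']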


# select predictor variables and target variable as separate data sets
pred_cols = features
predvar = data1[pred_cols]

# standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
# to_num is a helper defined earlier in the full notebook that coerces each column to numeric
predictors = to_num(pred_cols, predictors)

target = data1.life

from sklearn import preprocessing

for p in pred_cols:
    predictors[p] = preprocessing.scale(predictors[p].astype('float64'))

# split data into train (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=123)

# specify the lasso regression model, estimated with LAR and 10-fold cross-validation
from sklearn.linear_model import LassoLarsCV
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
import pandas as pd
from tabulate import tabulate
coefs = dict(zip(predictors.columns, model.coef_))
print(tabulate(pd.DataFrame(list(coefs.items())), tablefmt="grid",
               headers=['Variable', 'Coef']))

| Variable | Coef |
|--------------------|-----------:|
| female-employ | -1.31691 |
| employ | 0 |
| army | 0.815667 |
| suicideper100th | -0.71429 |
| relectricperperson | 0 |
| polity | 0.726064 |
| net-rate | 5.43781 |
| co2 | 1.1599 |
| urban | 0.821296 |
| alcohol | -0.0884026 |
| income | 0 |
| bcancer | 0.393847 |

As we can see, of the 12 explanatory variables, nine were retained with nonzero coefficients; income, employ, and relectricperperson were shrunk to exactly zero.
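The model was also validated on the test set, as described above. A minimal sketch of how that validation can be computed with standard scikit-learn calls (the actual numbers are in the full notebook linked above):

from sklearn.metrics import mean_squared_error

# mean squared error on training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE:', train_error)
print('test data MSE:', test_error)

# proportion of variance explained (R-squared) on training and test data
print('training data R-square:', model.score(pred_train, tar_train))
print('test data R-square:', model.score(pred_test, tar_test))

Similar training and test errors suggest the model generalizes reasonably; a much larger test error would indicate overfitting.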

Conclusion

(Plot: progression of the regression coefficients at each step of the lasso estimation process.)

Of the 12 predictor variables, nine were retained in the selected model (the other three had their coefficients shrunk to exactly zero). During the estimation process, the internet users rate and CO2 emissions were most strongly associated with life expectancy.
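A plot like the one above can be drawn from the fitted LassoLarsCV model, which exposes the regularization path through its documented alphas_ and coef_path_ attributes. A minimal sketch, assuming matplotlib is available:

import numpy as np
import matplotlib.pyplot as plt

# coefficient progression along the regularization path
m_log_alphas = -np.log10(model.alphas_)
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha selected by CV')
plt.xlabel('-log(alpha)')
plt.ylabel('Regression coefficients')
plt.title('Regression coefficients progression for lasso paths')
plt.legend()
plt.show()

Coefficients that are still at zero where the dashed vertical line crosses the path are the ones dropped from the selected model.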

The explanatory variables were chosen arbitrarily, solely for the purposes of this assignment. Because of that, there may well be details (a lurking variable, for example) that would require deeper study and analysis.

See the entire code and output for this week here.

Written on October 12, 2016