Machine Learning and Data Analysis - Week 4 | K-Means

This is the fourth assignment of the fourth course (of five) in the Data Analysis and Interpretation Specialization; detailed information about it can be seen here.

See the entire code and output for this week here.

This assignment deals with k-means clustering. "Cluster analysis is an unsupervised machine learning method that partitions the observations in a data set into a smaller set of clusters where each observation belongs to only one cluster. The goal of cluster analysis is to group, or cluster, observations into subsets based on their similarity of responses on multiple variables."
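
As a quick illustration of the method (this is not part of the assignment code, just a minimal self-contained sketch on made-up two-dimensional data):

import numpy as np
from sklearn.cluster import KMeans

# toy data: two well-separated blobs of 2-D points (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_[:5])        # cluster assignment of the first five points
print(km.cluster_centers_)   # one centroid per cluster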

Variables

Details of my project can be seen here. For this assignment I used 12 explanatory variables and one response variable, summarized below:

Variable Name        Description                              Type  Kind
Income               GDP per capita                           C     E
Alcohol              Alcohol consumption                      C     E
Army                 Armed forces rate                        Q     E
bCancer              Breast cancer rate                       Q     E
CO2                  CO2 emissions rate                       Q     E
Female-employ        Female employment rate                   Q     E
Net-rate             Internet users rate                      Q     E
polity               Democracy indicator score                Q     E
relectricperperson   Residential electricity use per person   Q     E
suicide              Suicide rate per 100,000                 Q     E
Employ               Employment rate                          Q     E
Urban                Urbanization rate                        Q     E
Life                 Life expectancy                          Q     R

Type: C = Categorical, Q = Quantitative
Kind: E = Explanatory, R = Response

Average Distance From Observations

Data were randomly split into a training set that included 70% of the observations (N=84) and a test set that included 30% of the observations (N=36). A series of k-means cluster analyses was conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The average distance of the observations from their assigned cluster centroids was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
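
The post does not show how the training data clus_train was prepared; the sketch below follows the usual course pattern of standardizing each clustering variable and then splitting 70/30. The lower-case column names and the data0 data frame are assumptions, since the data-loading code is not shown:

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# the 12 clustering (explanatory) variables; column names are assumed
cluster_vars = ['income', 'alcohol', 'army', 'bcancer', 'co2', 'femaleemploy',
                'netrate', 'polity', 'relectricperperson', 'suicide', 'employ', 'urban']
clustervar = data0[cluster_vars].copy()

# standardize each clustering variable to mean 0 and standard deviation 1
for col in cluster_vars:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))

# 70% training / 30% test split, as described above
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)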

# k-means cluster analysis for 1-9 clusters
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

range_clusters = range(1, 10)   # k = 1 through 9
meandist = []

for k in range_clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # average Euclidean distance of each observation to its nearest centroid
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

# plot the elbow curve once, after all k solutions have been fit
plt.subplots(figsize=(12, 8))
plt.plot(range_clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()

Figure 1. Elbow curve of average distance by number of clusters.

The elbow curve was inconclusive, suggesting that the 2-, 3-, and 5-cluster solutions might all be worth interpreting. The results below are for an interpretation of the 5-cluster solution (the ANOVA and post hoc output further down show five clusters, numbered 0-4).
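
The post does not include the step that fits the chosen solution and attaches the cluster labels to the training observations (producing the merged_train data frame used in the validation code below). A sketch of that step, following the course pattern; model5 is a hypothetical name:

# fit the interpreted 5-cluster solution (model5 is a hypothetical name)
model5 = KMeans(n_clusters=5)
model5.fit(clus_train)

# turn the original row index into a column named 'index' and attach the labels
merged_train = clus_train.reset_index(level=0)
merged_train['cluster'] = model5.labels_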

First Two Canonical Variables

Canonical discriminant analysis was used to reduce the 12 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in the clusters were generally distinct but showed considerable spread, suggesting higher within-cluster variance.

Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
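
The plotting code for Figure 2 is not shown in the post; the course materials approximate the canonical variables with the first two principal components, so a sketch along those lines (reusing the hypothetical model5 from the sketch above):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 12 standardized clustering variables onto the first two components
pca_2 = PCA(n_components=2)
plot_columns = pca_2.fit_transform(clus_train)

plt.scatter(plot_columns[:, 0], plot_columns[:, 1], c=model5.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of canonical variables by cluster')
plt.show()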

Cluster Differences In Life Expectancy Using ANOVA

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on the response variable, life expectancy (the code below keeps the gpa_* variable names from the course template, but the variable analyzed is life). A Tukey HSD test was used for post hoc comparisons between the clusters. Results indicated significant overall differences between the clusters on life expectancy. The Tukey post hoc comparisons rejected the null hypothesis for the comparisons between groups 0 and 3, 0 and 4, 1 and 3, 1 and 4, 2 and 3, 2 and 4, and 3 and 4.

import pandas as pd
from sklearn.model_selection import train_test_split

# validation variable: life expectancy (the gpa_* names come from the course template)
gpa_data = data0['life']
# split the life expectancy data into train and test sets (same split as the clusters)
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
# join life expectancy onto the cluster assignments of the training observations
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['life', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='life ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   life   R-squared:                       0.529
Model:                            OLS   Adj. R-squared:                  0.505
Method:                 Least Squares   F-statistic:                     22.15
Date:                Sat, 12 Nov 2016   Prob (F-statistic):           2.73e-12
Time:                        14:57:59   Log-Likelihood:                -271.73
No. Observations:                  84   AIC:                             553.5
Df Residuals:                      79   BIC:                             565.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          71.4555      1.758     40.646      0.000        67.956    74.955
C(cluster)[T.1]     0.5833      2.537      0.230      0.819        -4.467     5.634
C(cluster)[T.2]    -0.3662      2.183     -0.168      0.867        -4.711     3.979
C(cluster)[T.3]   -10.6704      2.258     -4.725      0.000       -15.165    -6.176
C(cluster)[T.4]     9.4513      2.402      3.935      0.000         4.670    14.232
==============================================================================
Omnibus:                       17.615   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               22.602
Skew:                          -0.982   Prob(JB):                     1.24e-05
Kurtosis:                       4.613   Cond. No.                         6.62
==============================================================================
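
To make the ANOVA result easier to interpret, the mean and spread of life expectancy within each cluster can also be examined; a short sketch using the sub1 data frame defined above:

# mean and standard deviation of life expectancy within each cluster
print(sub1.groupby('cluster').mean())
print(sub1.groupby('cluster').std())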

mc1 = multi.MultiComparison(sub1['life'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================
group1 group2 meandiff  lower    upper  reject
----------------------------------------------
  0      1     0.5833  -6.5008   7.6674 False
  0      2    -0.3662  -6.4602   5.7277 False
  0      3    -10.6704 -16.9748  -4.366  True
  0      4     9.4513   2.7457  16.1569  True
  1      2    -0.9495   -7.206   5.3069 False
  1      3    -11.2537 -17.7154  -4.792  True
  1      4     8.868    2.0143  15.7216  True
  2      3    -10.3041 -15.6619 -4.9464  True
  2      4     9.8175   3.993    15.642  True
  3      4    20.1217  14.0773   26.166  True
----------------------------------------------

See the entire code and output for this week here.

Written on November 9, 2016