PYTHON (JUPYTER NOTEBOOK) PROJECT – GENERAL DESCRIPTION

· This assignment must be delivered as a Jupyter Python Notebook (.ipynb)
· Deep understanding and knowledge of clustering algorithms is required
· Delivery date: end of Sunday 2nd June (CET time)

The aim is to cluster 51 objects (cases) according to a set of “clustering variables” through the implementation of various unsupervised clustering algorithms. Being an unsupervised learning problem, the number of clusters in the data (if any) is unknown.

For the clusters found by the various clustering algorithms, it will be of interest to compare them according to:
· the clustering variables, and
· an additional set of variables not used in the clustering (the “profiling variables”).

Some basic references are included at the end of this document to facilitate understanding of the problem and of what must be delivered. They are denoted as [xxx] in what follows. There is no need to read these references in full; scrolling through the relevant pages, formulas and images is usually enough to understand what is required.

I highlight in this format those observations that require particular attention to ensure that this Python project delivers what is required and asked.

For any doubts, please contact me by email at XXXXXXXXXX. The coder in charge of this assignment is expected to have questions about the various tasks, so I would expect to be contacted as the project advances.

DATASET

In the Excel file “Clustering Dataset clean”:
· Number of cases (objects) = 51 (rows 3:53, 51 objects or cases in total)
· Clustering Variables Set 1 (“CVS”) (cols B:U, 20 variables or features in total)
· these are the variables used to cluster the data
· Profiling Variables Set (“PVS”) (cols V:CG, 64 variables or features in total)
· these variables are NOT used by the clustering algorithms.
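As a sketch of how this layout could be loaded and split, assuming pandas; the synthetic frame below stands in for the attached “Clustering Dataset clean” workbook, and the commented read_excel call (header row, sheet) is an assumption to adapt to the real file:

```python
import numpy as np
import pandas as pd

# Stand-in for the attached workbook; in the notebook this would be e.g.
#   data = pd.read_excel("Clustering Dataset clean.xlsx", header=1)
# (the header row and sheet are assumptions -- check against the real file).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(51, 84)),
                    columns=[f"v{i}" for i in range(84)])

cvs = data.iloc[:, :20]    # Clustering Variables Set (cols B:U, 20 features)
pvs = data.iloc[:, 20:]    # Profiling Variables Set (cols V:CG, 64 features)
print(cvs.shape, pvs.shape)  # (51, 20) (51, 64)
```

From here, `cvs.describe()`, `cvs.skew()`, `cvs.kurtosis()` and `scipy.stats.kstest` cover the basic statistics asked for in Part I.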
Instead, the PVS variables are used to further characterize the clusters.

PART I

INITIAL DATA EXPLORATION AND DATA PRE-PROCESSING

· Obtain basic statistical information for ALL the variables in CVS and PVS: mean, max, min, median, std dev, skewness, kurtosis, Kolmogorov-Smirnov test for normality, …
· For CVS, PVS and CVS+PVS, obtain the Pearson correlation and the Spearman correlation matrices.
· Normalize the variables according to these two Normalization Methods (“NM”):
· NM1 – Z-score: subtract the mean and divide by the standard deviation
· NM2 – Min-Max Method: subtract the minimum and divide by the absolute difference between minimum and maximum

In what follows, all calculations related to clustering will be done using BOTH normalization methods.

PRINCIPAL COMPONENT ANALYSIS (PCA)

Using both normalization methods:
· Perform standard PCA on CVS
· Show the factor loadings, order the factors by % variance explained, and show the variance explained (by each factor and cumulative)
· Show plots of the CVS data on the two most important factors

PART II

Using both normalization methods, implement for dataset CVS the following 8 Clustering Methods (“CM”).

Before specifying the clustering algorithms, a common definition of standard performance and evaluation metrics is given. Note that specific nuances should be considered depending on the type of clustering algorithm (partitional, hierarchical, density-based). For example, in partitional algorithms the clustering structure will depend on the chosen number of clusters (k), whereas in hierarchical algorithms the clustering structure will be determined by the ‘height’ at which the corresponding dendrogram is cut.

ERROR MEASURES

· For the overall clustering structure (see slides XXXXXXXXXX in [TAN04] or slide #8 & slide #20 in [RICCO]):
· TSS (SSE) – Total Sum of Squares
· WSS – Within-Clusters Sum of Squares
· BSS – Between-Clusters Sum of Squares
· TSS (SSE) = WSS + BSS

SILHOUETTE WIDTH (COEFFICIENT) AND PLOT

· See slide #10 in [RICCO2] or slides #19/#46 in [UMASS1].
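As a concrete reading of the silhouette definition — a sketch only, assuming Euclidean distance and random stand-ins for the normalized CVS and for a cluster assignment; sklearn.metrics.silhouette_samples gives the same per-object widths:

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_widths(X, labels):
    """Per-object silhouette widths s(i) = (b - a) / max(a, b).
    Assumes every cluster has at least 2 members (singletons, which get
    s = 0 by convention, are not handled in this sketch)."""
    D = cdist(X, X)                      # pairwise Euclidean distances
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        # a: mean distance to the other members of i's own cluster
        a = D[i, own & (np.arange(n) != i)].mean()
        # b: mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

rng = np.random.default_rng(0)
X = rng.normal(size=(51, 20))            # stand-in for the normalized CVS
labels = rng.integers(0, 3, size=51)     # stand-in for a 3-cluster result
s = silhouette_widths(X, labels)
per_cluster = {c: s[labels == c].mean() for c in np.unique(labels)}
overall = s.mean()                       # Overall Average Silhouette Width
```

Sorting `s` within each cluster and bar-plotting the sorted values produces the Silhouette plot requested below.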
· Preferably, I would like Silhouette plots like the one in [DATANOVIA]

DENDROGRAM PLOTS

· Preferably, I would like dendrogram plots like the one in [DATANOVIA]

DISTANCE MEASURES

· Although the main distance metric used will be the Euclidean distance, set up the Python project so that it also accepts the Manhattan and Mahalanobis distance metrics. See pages 1-2 in [NELSON12]

PARTITIONAL CLUSTERING ALGORITHMS

· CM1 – Standard K-means
· CM2 – K-medoids as in [MAIONE18], also known as “Partitioning Around Medoids (PAM)”. See reference [TAN04]
· CM3 – Bisecting K-means algorithm. See reference [TAN04]

As these 3 algorithms are susceptible to initialization issues (i.e. the choice of the initial cluster centers), run n = 200 iterations with different random initializations of the cluster centers.

For each run, consider k = 1, 2, 3, 4, 5 clusters. Averaging across the n = 200 runs:
· Compute TSS, WSS and BSS for k = 1, 2, 3, 4, 5
· Compute the Overall Average Silhouette Width for k = 1, 2, 3, 4, 5
· Compute the Silhouette Width of each cluster for k = 1, 2, 3, 4, 5 (e.g. for k = 4, 4 Silhouette coefficients)
· Plot the histogram distribution of TSS, WSS and BSS for k = 1, 2, 3, 4, 5
· Plot the histogram distribution of the Overall Average Silhouette for k = 1, 2, 3, 4, 5
· On the same plot, taking k (number of clusters) as the x-axis, show on the y-axis both
· the average (across the number of clusters) of the within-cluster dissimilarities WSS (as in slide #19 in [UMASS], or slide #94 in [TAN04]), and
· the Average Silhouette Width (as in slide #19 in [UMASS] or slide #10 in [RICCO2])

In the following sections, performance and other metrics for these algorithms should be taken as their average values across the n = 200 runs.

HIERARCHICAL CLUSTERING ALGORITHMS

· 4 clustering algorithms: all 4 are of the Agglomerative type
· Use the following approaches to measure the distance between clusters.
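The multi-initialization loop for CM1 can be sketched as follows — assuming scipy.cluster.vq.kmeans2 as the k-means engine, a random stand-in for the normalized CVS, and only 20 runs instead of the required 200 to keep the example short:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
X = rng.normal(size=(51, 20))            # stand-in for the normalized CVS
tss = ((X - X.mean(axis=0))**2).sum()    # TSS is fixed for a given dataset

n_runs = 20                              # the spec asks for n = 200 runs
avg_wss, avg_bss = {}, {}
for k in range(1, 6):                    # k = 1, 2, 3, 4, 5
    wss_runs = []
    for seed in range(n_runs):           # different random initializations
        centers, labels = kmeans2(X, k, minit='++', seed=seed)
        wss = sum(((X[labels == j] - centers[j])**2).sum() for j in range(k))
        wss_runs.append(wss)
    avg_wss[k] = float(np.mean(wss_runs))   # average across the runs
    avg_bss[k] = tss - avg_wss[k]           # from the identity TSS = WSS + BSS
# the per-k wss_runs lists also feed the requested histogram plots
```

kmeans2 is a stand-in here: sklearn.cluster.KMeans (with n_init=1 and a varying random_state), a PAM implementation, or a bisecting K-means slot into the same loop, silhouette widths would be accumulated per run in the same way, and the averaged dictionaries give the WSS/BSS-versus-k curves.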
See reference [TAN04].
· CM4 – Single Link
· CM5 – Complete Link
· CM6 – Average Link
· CM7 – Ward’s Method

For each of these hierarchical clustering methods, obtain SSE (WSS) and BSS, and display the corresponding dendrograms.

DENSITY-BASED CLUSTERING ALGORITHMS

· CM8 – DBSCAN
· Use various combinations of the two parameters of this method (see reference [TAN04]):
· Eps
· MinPoints

PART III

POST-PROCESSING

For all the clustering structures obtained by the 8 clustering algorithms, calculate (see slides #99-100 in [TAN04]):
· For each cluster in each clustering structure, the Within-Cluster Sum of Squares (WSS) as a measure of cluster cohesion
· For the overall clustering structure (see slide #99 in [TAN04], or slides #8/#20 in [RICCO]):
· TSS – Total Sum of Squares
· WSS – Within-Clusters Sum of Squares
· BSS – Between-Clusters Sum of Squares

For all of them, show the (2-dimensional) plot of the 51 objects with respect to the two most important principal components obtained in the previous section, with different colourings/markers representing the different clusters. See slides #22-23 in [RICCO].

For all of them, calculate the Silhouette coefficient of each cluster and the average Silhouette coefficient of the overall clustering structure (see slide #102 in [TAN04]), and display the typical graph as in slide #22 in [UMASS]. Also show a similar graph with respect to the number of clusters, as in slide #10 in [RICCO2].

In addition, following [RICCO], perform the following and show similar tables and graphs:

UNIVARIATE CHARACTERIZATION

· Characterizing the partition (see slides #8-11 in [RICCO])
· Characterizing the clusters / Quantitative variables – V-test (see slides #12-14 in [RICCO])
· Characterizing the clusters / One group vs. the others – Effect size (see slides #15-17 in [RICCO])
· Characterizing the clusters / Categorical variables – V-test (see slide #18 in [RICCO])

MULTIVARIATE CHARACTERIZATION

· Characterizing the partition / Percentage of variance explained (see slide #20 in [RICCO]).
Already calculated (TSS, WSS, BSS).
· Characterizing the partition / Evaluating the proximity between the clusters (see slide #21 in [RICCO])
· Characterizing the clusters / In combination with factor analysis (see slides #22-23 in [RICCO])
· Characterizing the clusters / Using a supervised approach – e.g. Discriminant Analysis (see slide #24 in [RICCO])

PROFILING VARIABLES

Last, compute basic statistics for the clusters obtained in each clustering structure with respect to the “Profiling Variables”.

PART IV

CLUSTERING EVALUATION

As there is no external information to validate the goodness of the various clustering structures, following [TAN04], calculate and display as applicable (some have already been calculated):
· TSS, BSS, WSS
· Correlation between the “Proximity/Similarity” and “Incidence” matrices (see slide #87 in [TAN04])
· Similarity matrix as in slide #89 in [TAN04]
· Cophenetic correlation (see slides #49-51 in [UMASS]) as in slide #51 in [UMASS]
· Silhouette plot as in slide #22 in [UMASS1] (with Silhouette coefficients for each cluster and the average for the whole clustering structure)

Also, as per slide #97 in [TAN04], generate n = 500 sets of random data spanning the same ranges as the features of dataset CVS, and:
· Obtain the average SSE and display the same histogram as in slide #97 of [TAN04]
· Do the same as per slide #98 of [TAN04] but for the correlation between the incidence and proximity matrices
· Do the same but with the average Silhouette coefficient
· Do the same but with the average cophenetic correlation

Given the (total) SSE, the (average) correlation between the incidence and proximity matrices, the (average) cophenetic correlation, and the (average) Silhouette coefficient obtained by each clustering method, obtain the likelihood of such values (obtained by each clustering algorithm) given these random runs (some sort of p-value).

REFERENCES

[DATANOVIA] Cluster Validation Statistics.
Available at: https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/

[NELSON12] Nelson, J.D. XXXXXXXXXX On K-Means Clustering Using Mahalanobis Distance. Available at: https://library.ndsu.edu/ir/bitstream/handle/10365/26766/On%20K-Means%20Clustering%20Using%20Mahalanobis%20Distance.pdf?sequence=1

[MAIONE18] Maione, C., Nelson, D.R., and Melgaço Barbosa, R. XXXXXXXXXX Research on social data by means of cluster analysis. Applied Computing and Informatics. https://doi.org/10.1016/j.aci XXXXXXXXXX

[RICCO] Rakotomalala, R. Interpreting Cluster Analysis Results. Available at: http://eric.univ-lyon2.fr/~ricco/cours/slides/en/classif_interpretation.pdf

[RICCO2] Rakotomalala, R. Cluster analysis with Python – HAC and K-Means. Available at: https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.pdf

[RICCO3] Rakotomalala, R. K-Means Clustering. Available at: https://eric.univ-lyon2.fr/~ricco/cours/didacticiels/Python/en/cah_kmeans_avec_python.pdf

[TAN04] Tan, Steinbach and Kumar XXXXXXXXXX Data Mining Cluster Analysis: Basic Concepts and Algorithms. Available at: https://www-users.cs.umn.edu/~kumar001/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

[UMASS] UMASS Landscape Ecology Lab. Finding groups – cluster analysis. Part 1. Available at: https://www.umass.edu/landeco/teaching/multivariate/schedule/cluster1.pdf

ATTACHED FILES

clustering-calculation-clean-lt2kn23o.docx
dataset-clustering-clean-x0ybwvkd.xlsx