K-means Clustering - The Process and Results

Method

K-means Clustering Method

In this phase of the project, I ran K-means clustering on 4 of the top 6 predictors identified with LASSO: the number of rooms, year built, distance to the Sydney Central Business District, and building area.
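As a minimal sketch of how a feature matrix for these four predictors could be prepared, the snippet below z-scores each column so that no single predictor dominates the Euclidean distances K-means relies on. The column names and sample rows are hypothetical, and scaling is an assumption here rather than a confirmed step of this analysis.

```python
from statistics import mean, pstdev

# Hypothetical column names for the four predictors; the real dataset may differ.
FEATURES = ["rooms", "year_built", "distance_to_cbd", "building_area"]

def standardize(rows, features=FEATURES):
    """Return one z-scored feature vector per row (each row is a dict)."""
    columns = {f: [float(r[f]) for r in rows] for f in features}
    # Population mean and standard deviation per column.
    stats = {f: (mean(v), pstdev(v)) for f, v in columns.items()}
    return [
        # Fall back to a divisor of 1.0 if a column is constant.
        [(float(r[f]) - stats[f][0]) / (stats[f][1] or 1.0) for f in features]
        for r in rows
    ]

# Made-up rows purely for illustration; not the project's data.
houses = [
    {"rooms": 2, "year_built": 1960, "distance_to_cbd": 12.0, "building_area": 80.0},
    {"rooms": 3, "year_built": 1985, "distance_to_cbd": 8.5, "building_area": 130.0},
    {"rooms": 5, "year_built": 2005, "distance_to_cbd": 4.0, "building_area": 260.0},
]
X = standardize(houses)
```

Without some form of scaling, a predictor measured in large units (such as year built) would likely outweigh one measured in small units (such as the number of rooms).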

I then used the Elbow Method to choose the number of clusters: I computed the total within-cluster sum of squares for each candidate number of clusters and looked for the points where the curve bends. The curve has elbows at 3 and 5 clusters. In the visualization, the 3- and 5-cluster solutions split the data in a similar way, but the clusters in the 5-cluster solution are more fragmented and harder to interpret, so I carried the 3-cluster solution forward into the analysis.
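The Elbow Method described above can be sketched as follows. This is an illustrative, standard-library implementation of Lloyd's K-means algorithm with a deterministic farthest-first seeding (a simplification chosen for reproducibility), run on synthetic 2-D points. It is not the code or data used in this project, and the function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from its nearest already-chosen centroid."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, wcss)."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Total within-cluster sum of squares for the final assignment.
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

def elbow_curve(points, k_values):
    """Total within-cluster sum of squares for each candidate k."""
    return {k: kmeans(points, k)[2] for k in k_values}

# Synthetic data: three well-separated blobs, for illustration only.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
          for cx, cy in centers for _ in range(30)]

curve = elbow_curve(points, range(1, 7))
for k, wcss in curve.items():
    print(k, round(wcss, 1))
```

On data like this, the within-cluster sum of squares drops sharply up to the true number of groups and then flattens; that bend is the "elbow". Real data often shows more than one bend, as in the 3-versus-5 choice above, which is why the final decision also rests on how interpretable the clusters are.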

Results

Cluster visualization using the 3-cluster method

After visualizing the data on different sets of predictors, I characterized the 3 clusters as follows:

  • Group 1: Small houses at varying distances from the Sydney Central Business District

  • Group 2: Medium-sized houses that tend to be closer to the Sydney Central Business District than those in Group 1

  • Group 3: Large houses that are closest to the Sydney Central Business District

One interesting finding from this analysis is that the clusters do not line up with house type.

At first, we hypothesized that the clusters would align with the different types of houses in our dataset. However, as plotted here, each house type contains a mix of clusters. The house types therefore do not have distinctive characteristics on these predictors, and the clusters do not match the house types.
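One way to check this kind of mismatch numerically is to cross-tabulate cluster labels against house type. The sketch below uses made-up labels and hypothetical house-type names purely for illustration; it does not reproduce this project's results.

```python
from collections import Counter

def crosstab_shares(cluster_labels, house_types):
    """For each house type, the share of its houses that fall in each cluster."""
    counts = Counter(zip(house_types, cluster_labels))
    totals = Counter(house_types)
    clusters = sorted(set(cluster_labels))
    return {
        t: {c: counts[(t, c)] / totals[t] for c in clusters}
        for t in sorted(totals)
    }

# Made-up labels and hypothetical type names; not the project's data.
clusters = [1, 2, 3, 1, 2, 3, 1, 2]
types = ["house", "house", "house", "unit", "unit", "unit",
         "townhouse", "townhouse"]

shares = crosstab_shares(clusters, types)
for house_type, row in shares.items():
    print(house_type, row)
```

If the clusters matched the house types, each row would be dominated by a single cluster. Rows spread across several clusters correspond to the mix described above.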
