Summary

The purpose of this project is to walk through how to use two different types of unsupervised machine learning algorithms (Hierarchical Clustering and K-Means Clustering) in R using a fun Pokemon dataset. The data was obtained from The Complete Pokemon Dataset on Kaggle.

Preparations

This first chunk loads the necessary R packages.

library(tidyverse)
library(ape)
library(gt)
library(cluster)

Modeling

Data

The first step is to read in and observe the data. For this example, we’re only using Pokemon from the first generation of the games and looking at their battle stats. Both of the unsupervised machine learning algorithms require a matrix of numbers to run.

# Read in data
pokemon <- read.csv("pokemon.csv")
# Observe the structure of the data
str(pokemon)
## 'data.frame':    801 obs. of  41 variables:
##  $ abilities        : chr  "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
##  $ against_bug      : num  1 1 1 0.5 0.5 0.25 1 1 1 1 ...
##  $ against_dark     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_dragon   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_electric : num  0.5 0.5 0.5 1 1 2 2 2 2 1 ...
##  $ against_fairy    : num  0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
##  $ against_fight    : num  0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
##  $ against_fire     : num  2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
##  $ against_flying   : num  2 2 2 1 1 1 1 1 1 2 ...
##  $ against_ghost    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_grass    : num  0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
##  $ against_ground   : num  1 1 1 2 2 0 1 1 1 0.5 ...
##  $ against_ice      : num  2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
##  $ against_normal   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_poison   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_psychic  : num  2 2 2 1 1 1 1 1 1 1 ...
##  $ against_rock     : num  1 1 1 2 2 4 1 1 1 2 ...
##  $ against_steel    : num  1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
##  $ against_water    : num  0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
##  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ base_egg_steps   : int  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
##  $ base_happiness   : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ base_total       : int  318 405 625 309 405 634 314 405 630 195 ...
##  $ capture_rate     : chr  "45" "45" "45" "45" ...
##  $ classfication    : chr  "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
##  $ defense          : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
##  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ japanese_name    : chr  "Fushigidaneフシギダネ" "Fushigisouフシギソウ" "Fushigibanaフシギバナ" "Hitokageヒトカゲ" ...
##  $ name             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
##  $ percentage_male  : num  88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
##  $ pokedex_number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defense       : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1            : chr  "grass" "grass" "grass" "fire" ...
##  $ type2            : chr  "poison" "poison" "poison" "" ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ generation       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_legendary     : int  0 0 0 0 0 0 0 0 0 0 ...
# Filter for first generation pokemon and select their stats
pokeData <- pokemon %>%
  filter(generation == 1) %>%
  select(name,attack,defense,sp_attack,sp_defense,speed,hp)
# Turn the dataframe into a matrix and standardize each stat (mean 0, standard deviation 1)
# so that no single stat dominates the distance calculations
pokeMatrix <- as.matrix(pokeData[,-1]) %>%
  scale()
# Attach the names of the pokemon to the matrix rows
rownames(pokeMatrix) <- pokeData$name
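To see what the scaling step does, here is a minimal sketch on a tiny made-up matrix (the values and the names toy, toy_scaled, and manual are hypothetical and only for illustration). It checks that scale() leaves each column with mean 0 and standard deviation 1, and that this matches subtracting the column mean and dividing by the column standard deviation.

```r
# Toy stats for three hypothetical Pokemon (values are made up for illustration)
toy <- matrix(c(50, 100, 150,
                10,  20,  60),
              ncol = 2,
              dimnames = list(c("A", "B", "C"), c("attack", "hp")))

toy_scaled <- scale(toy)

# Each column now has mean 0 and standard deviation 1
colMeans(toy_scaled)
apply(toy_scaled, 2, sd)

# The same result by hand: subtract the column mean, divide by the column sd
manual <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")
all.equal(toy_scaled, manual, check.attributes = FALSE)
```

Without this step, a stat with a wide range (such as Chansey's HP) would contribute far more to the distances than the others.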

Hierarchical Clustering

For Hierarchical Clustering, we use the hclust function. The first input is a distance matrix that is made using the dist function on our matrix. There are several options for the method input, which sets the linkage rule used to measure the distance between clusters. The complete option, which is also the default, is a reasonable starting point.

poke_clust <- hclust(dist(pokeMatrix),method ="complete")
plot(poke_clust)
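To get a feel for what the method input changes, here is a small sketch on three made-up points on a line (the object names pts, d, and h_* are hypothetical). The first merge is the same for every method, but the height of the second merge depends on how the distance between the cluster {a, b} and the point c is measured: the largest pairwise distance (complete), the smallest (single), or the mean (average).

```r
# Three points on a line: a = 0, b = 1, c = 4 (made-up values)
pts <- matrix(c(0, 1, 4), ncol = 1, dimnames = list(c("a", "b", "c"), "x"))
d <- dist(pts)   # Euclidean distances: a-b = 1, a-c = 4, b-c = 3

# Merge heights under three linkage rules
h_complete <- hclust(d, method = "complete")$height  # 1 4   (max of 4 and 3)
h_single   <- hclust(d, method = "single")$height    # 1 3   (min of 4 and 3)
h_average  <- hclust(d, method = "average")$height   # 1 3.5 (mean of 4 and 3)

h_complete
h_single
h_average
```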

One way to decide on the number of clusters is to plot the resulting dendrogram. There’s no one right way to pick the number of clusters. We’ll use 8 clusters so that Chansey is the only Pokemon left in a cluster by itself, while Slowbro, the second-to-last Pokemon to be merged, is grouped into a cluster with other Pokemon.

To make the clusters, we pass the hclust object to the cutree function and ask for 8 clusters (k = 8). You can also make the cut at a chosen height on the dendrogram plot (the h argument). The as.phylo function from the ape package lets us plot the clusters as a phylogram to get the nice circular look. You could also use the standard plot function with the colors to show the results.

# Colors for clusters
colors <- c("gray", "blue", "green", "black", "navy", "deeppink", "red", "purple", "skyblue", "pink", "cyan", "yellow", "steelblue")
# Cut the tree into 8 clusters
poke_clust_cut <- cutree(poke_clust,k=8)
# Plot the tree as a fan-shaped phylogram, coloring each tip label by its cluster
plot(as.phylo(poke_clust),type = "fan", tip.color = colors[poke_clust_cut],
     label.offset = .1, cex = 0.7,no.margin = TRUE)
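Cutting by height instead of by number of clusters can be sketched on a small made-up example (the names pts, hc, by_k, and by_h are hypothetical). With five points on a line, the complete-linkage merges happen at heights 1, 1, 5, and 20, so cutting at a height of 3 leaves the same three groups as asking for k = 3.

```r
# Five points on a line (made-up values)
pts <- matrix(c(0, 1, 4, 5, 20), ncol = 1,
              dimnames = list(c("a", "b", "c", "d", "e"), "x"))
hc <- hclust(dist(pts), method = "complete")

hc$height                # merge heights: 1 1 5 20

by_k <- cutree(hc, k = 3)  # ask for three clusters
by_h <- cutree(hc, h = 3)  # cut the tree at height 3

by_k
all(by_k == by_h)        # both cuts give the same grouping
```

On a standard dendrogram drawn with plot(hc), rect.hclust(hc, k = 3) draws a box around each cluster, which is one way to show the cut without the ape package.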

This gives us a nice visualization to begin our analysis. At first glance, the green group appears to be weak, unevolved Pokemon. The gray group looks like midrange Pokemon. The rest seem to be different groupings of stronger Pokemon.

Now we can add the clusters to the original data and summarize to learn more about our new groupings.

clust_sum <- pokeData %>%
  mutate(cluster = poke_clust_cut) %>%
  group_by(cluster)  %>%
  summarize(attack = mean(attack),defense = mean(defense),sp_attack = mean(sp_attack),
            sp_defense = mean(sp_defense),speed = mean(speed),hp = mean(hp),n = n(),
            .groups = "drop") %>%
  # Label each cluster with the color used for it in the plot
  mutate(color = colors[cluster]) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense,
         color = tools::toTitleCase(color)) %>%
  arrange(desc(stat_tot)) %>%
  select(color,n,attack,defense,sp_attack,sp_defense,speed,hp,attack_tot,defense_tot,norm_tot,sp_tot,stat_tot) %>%
  mutate_if(is.numeric,round)
gt(clust_sum) %>%
  gt::cols_label(color = "Cluster",n = "Number of Pokemon",attack = "Attack",
                 defense = "Defense",sp_attack = "Attack",sp_defense = "Defense",
                 speed = "Speed",hp = "HP",attack_tot = "Attack",defense_tot = "Defense",
                 norm_tot = "Normal Stats",sp_tot = "Special Stats",stat_tot = "Stats") %>%
  gt::cols_align(align = "center") %>%
  gt::tab_spanner(label = "Normal",columns = c("attack","defense")) %>%
  tab_spanner(label = "Special", columns = c("sp_attack","sp_defense")) %>%
  tab_spanner(label = "Total",columns = c("attack_tot","defense_tot","norm_tot","sp_tot","stat_tot")) 
                               Normal           Special                      Total
Cluster     Number of Pokemon  Attack  Defense  Attack  Defense  Speed   HP  Attack  Defense  Normal Stats  Special Stats  Stats
Blue                       12     117       99     112      108     95   89     229      207           216            220    620
Red                         5      81       86      95       86     46  131     176      172           167            181    525
Black                      10     112       69      48       81    109   62     161      149           181            129    480
Navy                       15      54       57     111       84    112   56     165      141           112            194    474
Purple                      1       5        5      35      105     50  250      40      110            10            140    450
Gray                       63      73       69      72       70     60   66     145      139           142            142    410
Deeppink                   16      93      115      46       48     51   56     139      163           208             94    409
Green                      29      48       43      42       40     63   42      90       83            92             82    278

As expected, the green cluster has the lowest total stats of any grouping. The gray and deep pink clusters both have low total stats, but the gray group is more balanced while the deep pink group has much higher normal stats and weaker special stats. The blue group is the strongest and includes fully evolved or legendary Pokemon. The red cluster has the second-highest total stats, although that is still a large drop-off from the blue group. The black and navy clusters have similar overall stats but differ in their focus: black leans toward normal stats and navy toward special stats. Chansey sits alone in the purple cluster due to its exceptionally high HP stat and virtually nonexistent normal stats.

K-Means Clustering

There are different ways to run a K-Means-style clustering algorithm in R. We’ll use two different functions here. The pam function from the cluster package implements Partitioning Around Medoids (k-medoids), a close relative of k-means that uses actual observations as the cluster centers, and it returns the silhouette info we can use to pick the number of clusters. We run it once for each candidate number of clusters, store the average silhouette width from each run, and plot those widths against the number of clusters.

# Average silhouette width for each number of clusters (index 1 is unused)
avgWidth <- numeric(15)
for (k in 2:15) {
  poke_pam <- pam(pokeMatrix,k = k) 
  avgWidth[k] <- poke_pam$silinfo$avg.width
}
# Plot the average width against the number of clusters, colored by width
tibble(Clusters = 2:15,`Average Width` = avgWidth[2:15]) %>%
  ggplot(aes(x = Clusters,y = `Average Width`,color = `Average Width`)) +
  geom_point() +
  scale_color_viridis_c(direction = -1) +
  theme_minimal() + theme(legend.position = "none")

Picking the number of clusters is also subjective with k-means. This plot shows a clear minimum width at 6 clusters, although this is pretty rare. Generally, you want to pick the number of clusters that provides the smallest change in the width, where there is an “elbow” in the plot. Even a clear elbow is rare, and there’s rarely a right or wrong number of clusters.
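For reference, the silhouette width of a single observation is (b - a) / max(a, b), where a is its average distance to the other members of its own cluster and b is its average distance to the members of the nearest other cluster. It ranges from -1 to 1, and values close to 1 mean the observation sits well inside its cluster. The sketch below uses made-up data with two obvious groups (toy, ks, avg_width, fit, and sil are hypothetical names) to show where the average width reported by pam comes from.

```r
library(cluster)

set.seed(1)
# Two well-separated synthetic blobs in 2-D (made-up data)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

# Average silhouette width for k = 2..5
ks <- 2:5
avg_width <- sapply(ks, function(k) pam(toy, k = k)$silinfo$avg.width)
names(avg_width) <- ks
avg_width

# The average width is just the mean of the per-observation widths
fit <- pam(toy, k = 2)
sil <- silhouette(fit)
mean(sil[, "sil_width"])
fit$silinfo$avg.width
```

On data this clean, the width for k = 2 should stand out from the rest; real data like the Pokemon stats rarely gives such a clear signal.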

Now we use the kmeans function on our matrix (you could use the pam function again, although some of the later code would need to change). We specify that we want 6 clusters and set nstart to 20, so the algorithm is run from 20 random starting points and the best result is kept, which accounts for the randomness involved in the k-means algorithm. We then attach the cluster assignments to each Pokemon, compute the combined stats, and calculate summary statistics for each cluster.

set.seed(151)
poke_kmean <- kmeans(pokeMatrix, centers = 6, nstart = 20)
colors <- c("red","blue","cyan","green","orange","deeppink")
# Per-Pokemon data with cluster assignments and combined stats (used for the plot)
poke_k <- pokeData %>%
  mutate(cluster = poke_kmean$cluster) %>%
  arrange(cluster) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense)
# Per-cluster averages (used for the summary table)
poke_k_sum <- pokeData %>%
  mutate(cluster = poke_kmean$cluster) %>%
  group_by(cluster)  %>%
  summarize(n = n(),attack = mean(attack),defense = mean(defense),sp_attack = mean(sp_attack),
            sp_defense = mean(sp_defense),speed = mean(speed),hp = mean(hp),
            .groups = "drop") %>%
  # Label each cluster with the color used for it in the plot
  mutate(color = colors[cluster]) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense,
         color = tools::toTitleCase(color)) %>%
  arrange(desc(stat_tot))  %>%
  select(color,n,attack,defense,sp_attack,sp_defense,speed,hp,attack_tot,defense_tot,norm_tot,sp_tot,stat_tot) %>%
  mutate_if(is.numeric,round)
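The effect of nstart can be sketched on made-up data (the blobs and the names blob_centers, toy, single, and multi are hypothetical). Each k-means run begins from randomly chosen centers and can settle on a different solution, so kmeans keeps the best of its nstart runs. Starting from the same seed, the multi-start fit should therefore be at least as tight as the single-start one.

```r
set.seed(42)
# Four synthetic 2-D blobs (made-up data standing in for pokeMatrix)
blob_centers <- rbind(c(0, 0), c(4, 0), c(0, 4), c(4, 4))
toy <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(25, blob_centers[i, 1], 0.8),
        rnorm(25, blob_centers[i, 2], 0.8))
}))

# One random start versus the best of 20 random starts, from the same seed
set.seed(151)
single <- kmeans(toy, centers = 4, nstart = 1)
set.seed(151)
multi <- kmeans(toy, centers = 4, nstart = 20)

# Total within-cluster sum of squares: lower means tighter clusters
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

The two numbers will often match on easy data; the extra starts matter most when the clusters overlap, as they do for the Pokemon stats.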

One advantage of Hierarchical Clustering is the nice visualization of the grouping you can make with the dendrogram. It’s somewhat harder to visualize K-Means Clustering for high-dimensional data. One way is to use Principal Component Analysis to reduce the data down to two dimensions, but the result is still messy and the two dimensions are difficult to understand intuitively. Instead, we’ll plot the Pokemon based on their normal and special statistics. We’ll also make the same summary table as we did for the hierarchical clustering.

poke_k %>%
  ggplot(aes(x = sp_tot,y = norm_tot)) +
  ggimage::geom_pokemon(aes(image =tolower(name),color = as.factor(cluster))) +
  # geom_text(aes(label = name,color = as.factor(cluster)),size = 2) +
  labs(y = "Total Normal Stats",x = "Total Special Stats") +
  scale_color_manual(values = colors) +
  theme_minimal() + 
  theme(legend.position = "none")
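For comparison, the Principal Component Analysis route mentioned earlier can be sketched with prcomp from base R. This is only an illustration under stated assumptions: it uses the built-in USArrests dataset as a stand-in for pokeMatrix, and the choice of 3 clusters and the names stats, pca, var_explained, and pc_scores are arbitrary.

```r
# Built-in dataset standing in for pokeMatrix (an assumption for this sketch)
stats <- scale(as.matrix(USArrests))

# Principal components of the scaled data
pca <- prcomp(stats)

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)

# Coordinates of each observation on the first two components,
# with a k-means cluster label attached (3 clusters chosen arbitrarily)
pc_scores <- as.data.frame(pca$x[, 1:2])
set.seed(151)
pc_scores$cluster <- factor(kmeans(stats, centers = 3, nstart = 20)$cluster)
head(pc_scores)
```

Plotting PC1 against PC2 and coloring by cluster gives a two-dimensional view of the grouping, but the axes are blends of all the original stats, which is why they are harder to read than the normal and special totals.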

gt(poke_k_sum) %>%
  gt::cols_label(color = "Cluster",n = "Number of Pokemon",attack = "Attack",
                 defense = "Defense",sp_attack = "Attack",sp_defense = "Defense",
                 speed = "Speed",hp = "HP",attack_tot = "Attack",defense_tot = "Defense",
                 norm_tot = "Normal Stats",sp_tot = "Special Stats",stat_tot = "Stats") %>%
  gt::cols_align(align = "center") %>%
  gt::tab_spanner(label = "Normal",columns = c("attack","defense")) %>%
  tab_spanner(label = "Special", columns = c("sp_attack","sp_defense")) %>%
  tab_spanner(label = "Total",columns = c("attack_tot","defense_tot","norm_tot","sp_tot","stat_tot"))
                               Normal           Special                      Total
Cluster     Number of Pokemon  Attack  Defense  Attack  Defense  Speed   HP  Attack  Defense  Normal Stats  Special Stats  Stats
Green                      16     118       94     109      103     93   87     226      197           211            212    604
Red                         5      67       51      76       91     50  162     143      142           118            167    497
Cyan                       19      63       62     108       87    108   61     171      149           125            194    488
Blue                       38      91       71      67       78     75   70     158      150           162            145    452
Orange                     22      81      115      59       49     49   57     140      165           197            108    410
Deeppink                   51      51       47      49       47     57   48     100       94            98             96    298

Once again, we have a grouping of the most powerful Pokemon, represented by the green cluster in the upper right of the plot. The deep pink cluster shows the weak Pokemon in the bottom left of the plot. The cyan, blue, and orange clusters are midrange but are split mostly by the difference between their special and normal stats. The orange cluster is in the upper left (high normal stats, low special stats), the cyan cluster is in the bottom right (low normal stats, high special stats), and the blue cluster is in the center of the plot (average normal stats, average special stats). The clusters overlap each other because we are only looking at some of the summary statistics and can’t see all of the data. Finally, the red cluster appears to hold the odd ones out. The table shows that it has the highest average HP of any cluster, but this is likely because Chansey’s outlier HP shows up in this cluster.