Unsupervised Machine Learning with Pokemon

Summary

The purpose of this project is to walk through how to use two different types of unsupervised machine learning algorithms (Hierarchical Clustering and K-Means Clustering) in R using a fun Pokemon dataset. The data was obtained from The Complete Pokemon Dataset on Kaggle.

Preparations

This first chunk loads the necessary R packages.

library(tidyverse)
library(ape)
library(gt)
library(cluster)

Modeling

Data

The first step is to read in and observe the data. For this example, we’re only using Pokemon from the first generation of the games and looking at their battle stats. Both of the unsupervised machine learning algorithms require a matrix of numbers to run.

# Read in data
pokemon <- read.csv("pokemon.csv")
# Observe the structure of the data
str(pokemon)

## 'data.frame':    801 obs. of  41 variables:
##  $ abilities        : chr  "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
##  $ against_bug      : num  1 1 1 0.5 0.5 0.25 1 1 1 1 ...
##  $ against_dark     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_dragon   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_electric : num  0.5 0.5 0.5 1 1 2 2 2 2 1 ...
##  $ against_fairy    : num  0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
##  $ against_fight    : num  0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
##  $ against_fire     : num  2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
##  $ against_flying   : num  2 2 2 1 1 1 1 1 1 2 ...
##  $ against_ghost    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_grass    : num  0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
##  $ against_ground   : num  1 1 1 2 2 0 1 1 1 0.5 ...
##  $ against_ice      : num  2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
##  $ against_normal   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_poison   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_psychic  : num  2 2 2 1 1 1 1 1 1 1 ...
##  $ against_rock     : num  1 1 1 2 2 4 1 1 1 2 ...
##  $ against_steel    : num  1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
##  $ against_water    : num  0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
##  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ base_egg_steps   : int  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
##  $ base_happiness   : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ base_total       : int  318 405 625 309 405 634 314 405 630 195 ...
##  $ capture_rate     : chr  "45" "45" "45" "45" ...
##  $ classfication    : chr  "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
##  $ defense          : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
##  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ japanese_name    : chr  "Fushigidaneフシギダネ" "Fushigisouフシギソウ" "Fushigibanaフシギバナ" "Hitokageヒトカゲ" ...
##  $ name             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
##  $ percentage_male  : num  88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
##  $ pokedex_number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defense       : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1            : chr  "grass" "grass" "grass" "fire" ...
##  $ type2            : chr  "poison" "poison" "poison" "" ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ generation       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_legendary     : int  0 0 0 0 0 0 0 0 0 0 ...

# Filter for first generation pokemon and select their stats
pokeData <- pokemon %>%
  filter(generation == 1) %>%
  select(name,attack,defense,sp_attack,sp_defense,speed,hp)
# Turn the dataframe into a matrix and scale the values so that they are similar
pokeMatrix <- as.matrix(pokeData[,-1]) %>%
  scale()
# Attach the names of the pokemon to the matrix rows
rownames(pokeMatrix) <- pokeData$name
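
To see why the scaling step matters, here is a minimal sketch on made-up numbers (the toy matrix and its column names are assumptions for illustration, not part of the Pokemon data). It shows that scale() gives every column a mean of 0 and a standard deviation of 1, so no single column dominates the distances.

```r
# Toy stats on very different scales (made-up numbers, not from the dataset)
toy <- matrix(c(10, 20, 30, 40,
                1000, 3000, 2000, 4000),
              ncol = 2,
              dimnames = list(c("a", "b", "c", "d"), c("small", "large")))

toy_scaled <- scale(toy)

# After scaling, each column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# Unscaled distances are driven almost entirely by the "large" column;
# scaled distances weigh both columns equally
dist(toy)
dist(toy_scaled)
```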

Hierarchical Clustering

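For Hierarchical Clustering, we use the hclust function. The first input is a distance matrix that is made using the dist function on our matrix. There are several options for the method input, which controls how the distance between two clusters is measured. The complete option (complete linkage, which is also the default) is a reasonable starting point and will usually give you the results you’re looking for.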

poke_clust <- hclust(dist(pokeMatrix),method ="complete")
plot(poke_clust)
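
If you want to see how much the method input matters, a rough sketch like the one below fits the same distances with a few linkage methods and compares the resulting cluster sizes. The toy data and the choice of three groups are assumptions for illustration only; single linkage in particular often tends to chain points into one large cluster.

```r
set.seed(1)
# Toy data: 20 random points in two dimensions (made up for illustration)
toy <- matrix(rnorm(40), ncol = 2)
toy_dist <- dist(toy)

# Fit the same distances with three different linkage methods
methods <- c("complete", "average", "single")
fits <- lapply(methods, function(m) hclust(toy_dist, method = m))
names(fits) <- methods

# Cluster sizes when each tree is cut into 3 groups
sizes <- lapply(fits, function(fit) table(cutree(fit, k = 3)))
sizes
```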

One way to decide the proper number of clusters is by plotting the resulting dendrogram. There’s no one right way to pick the number of clusters. We’ll use 8 clusters so that Chansey is the only Pokemon in a cluster by itself, while Slowbro, the second-to-last Pokemon to be merged into the tree, is still grouped with other Pokemon.

To make the clusters, we use the cutree function with the hclust object as the input with 8 clusters. You can also make the cut using the height on the dendrogram plot. The as.phylo function from the ape package lets us plot the clusters as a phylogram to get the nice circular look. You could also use the standard plot function with the colors to show the results.

# Colors for clusters
colors <- c("gray", "blue", "green", "black", "navy", "deeppink", "red", "purple", "skyblue", "pink", "cyan", "yellow", "steelblue")
# Cut the tree into 8 clusters
poke_clust_cut <- cutree(poke_clust,k = 8)
# Plot the tree as a fan-shaped phylogram with labels colored by cluster
plot(as.phylo(poke_clust),type = "fan",tip.color = colors[poke_clust_cut],
     label.offset = .1,cex = 0.7,no.margin = TRUE)

This gives us a nice visualization to begin our analysis. At first glance, the green group appears to be weak, unevolved Pokemon. The gray group seems to be more midrange Pokemon. The rest seem to be different groupings of stronger Pokemon.
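
As mentioned earlier, the tree can also be cut at a height instead of at a fixed number of clusters. The sketch below uses made-up data and picks a height halfway between the two merges that separate 4 clusters from 3, so both ways of cutting should agree. The toy matrix and variable names are assumptions for illustration.

```r
set.seed(2)
# Toy data: 15 made-up points in two dimensions
toy <- matrix(rnorm(30), ncol = 2)
toy_clust <- hclust(dist(toy), method = "complete")

# Cut by number of clusters
by_k <- cutree(toy_clust, k = 4)

# Cut by height: choose a height between the merge that leaves 4 clusters
# and the merge that leaves 3 clusters
n_merges <- length(toy_clust$height)
cut_height <- mean(toy_clust$height[c(n_merges - 3, n_merges - 2)])
by_h <- cutree(toy_clust, h = cut_height)

# Both cuts should give the same grouping
all(by_k == by_h)
```

On the dendrogram plot, the same height could be read off the vertical axis by eye.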

Now we can add the clusters to the original data and summarize to learn more about our new groupings.

clust_sum <- pokeData %>%
  mutate(cluster = poke_clust_cut) %>%
  group_by(cluster)  %>%
  summarize(attack = mean(attack),defense = mean(defense),sp_attack = mean(sp_attack),
            sp_defense = mean(sp_defense),speed = mean(speed),hp = mean(hp),n = n(),
            .groups = "drop") %>%
  # Look up each cluster's plotting color by its cluster number
  mutate(color = colors[cluster]) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense,
         color = tools::toTitleCase(color)) %>%
  arrange(desc(stat_tot)) %>%
  select(color,n,attack,defense,sp_attack,sp_defense,speed,hp,attack_tot,defense_tot,norm_tot,sp_tot,stat_tot) %>%
  mutate_if(is.numeric,round)

gt(clust_sum) %>%
  gt::cols_label(color = "Cluster",n = "Number of Pokemon",attack = "Attack",
                 defense = "Defense",sp_attack = "Attack",sp_defense = "Defense",
                 speed = "Speed",hp = "HP",attack_tot = "Attack",defense_tot = "Defense",
                 norm_tot = "Normal Stats",sp_tot = "Special Stats",stat_tot = "Stats") %>%
  gt::cols_align(align = "center") %>%
  gt::tab_spanner(label = "Normal",columns = c("attack","defense")) %>%
  tab_spanner(label = "Special", columns = c("sp_attack","sp_defense")) %>%
  tab_spanner(label = "Total",columns = c("attack_tot","defense_tot","norm_tot","sp_tot","stat_tot"))

Cluster	Number of Pokemon	Normal Attack	Normal Defense	Special Attack	Special Defense	Speed	HP	Total Attack	Total Defense	Total Normal Stats	Total Special Stats	Total Stats
Blue	12	117	99	112	108	95	89	229	207	216	220	620
Red	5	81	86	95	86	46	131	176	172	167	181	525
Black	10	112	69	48	81	109	62	161	149	181	129	480
Navy	15	54	57	111	84	112	56	165	141	112	194	474
Purple	1	5	5	35	105	50	250	40	110	10	140	450
Gray	63	73	69	72	70	60	66	145	139	142	142	410
Deeppink	16	93	115	46	48	51	56	139	163	208	94	409
Green	29	48	43	42	40	63	42	90	83	92	82	278

As expected, the green cluster has the lowest total stats of any grouping. The gray and deep pink clusters both have low total stats, but the gray group is more balanced while the deep pink group has much higher normal stats and weaker special stats. The blue group is the strongest and includes fully evolved or legendary Pokemon. The red cluster has the second-highest total stats, although that is still a large drop-off from the blue group. The black and navy clusters have similar overall stats but are differentiated by their focus: black leans toward normal stats and navy toward special stats. Chansey sits alone in the purple cluster due to its exceptionally high HP stat and virtually nonexistent normal stats.

K-Means Clustering

There are different ways to run a K-Means-style clustering algorithm in R. We’ll use two different functions here. The pam function from the cluster package runs partitioning around medoids (a close relative of k-means that uses actual observations as the cluster centers) and returns the silhouette information we can use to pick the number of clusters. We run it for a range of cluster counts, keep the average silhouette width from each run, and plot those widths against the number of clusters.

# Average silhouette width for each number of clusters (index 1 is unused)
avgWidth <- numeric(15)
for (k in 2:15) {
  poke_pam <- pam(pokeMatrix,k = k)
  avgWidth[k] <- poke_pam$silinfo$avg.width
}
# Plot the widths against the number of clusters
tibble(Clusters = 2:15,`Average Width` = avgWidth[2:15]) %>%
  ggplot(aes(x = Clusters,y = `Average Width`,color = `Average Width`)) +
  geom_point() +
  scale_color_viridis_c(direction = -1) +
  theme_minimal() + theme(legend.position = "none")

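Picking the number of clusters is also subjective with k-means. This plot shows a clear minimum width at 6 clusters, although such a clear pattern is pretty rare. Keep in mind that a higher average silhouette width indicates tighter, better-separated clusters, so the usual rule is to favor the number of clusters with the largest width. The “elbow” rule, where adding another cluster stops changing the measure much, is normally applied to the within-cluster sum of squares. Even a clear elbow is rare, and there’s rarely a right or wrong number of clusters.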
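
To make the two criteria concrete, here is a small sketch on made-up data with three obvious groups. It computes both the total within-cluster sum of squares (for an elbow check) and the average silhouette width for a kmeans fit at each number of clusters. The toy data and variable names are assumptions, and the silhouette function from the cluster package is used here in place of pam's built-in silhouette info.

```r
library(cluster)

set.seed(3)
# Toy data: three well-separated blobs of 30 points each (made up)
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2),
             matrix(rnorm(60, mean = 20), ncol = 2))
toy_dist <- dist(toy)

ks <- 2:8
wss <- numeric(length(ks))   # total within-cluster sum of squares
sil <- numeric(length(ks))   # average silhouette width

for (i in seq_along(ks)) {
  fit <- kmeans(toy, centers = ks[i], nstart = 20)
  wss[i] <- fit$tot.withinss
  sil[i] <- mean(silhouette(fit$cluster, toy_dist)[, "sil_width"])
}

# Look for an elbow in wss and for the largest value of sil
data.frame(k = ks, wss = round(wss, 1), sil = round(sil, 3))

# Number of clusters with the largest average silhouette width
ks[which.max(sil)]
```

With groups this well separated, both criteria should point at three clusters. Real data such as the Pokemon stats is rarely this clean.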

Now we use the kmeans function on our matrix (you can use the pam function again, although some of the later code may be different). We specify that we want 6 clusters and set nstart to 20 so that the algorithm is run from 20 random starting configurations and the best result is kept, which accounts for the randomness involved in the k-means algorithm. We then attach the clusters to the data, calculate combined statistics for each Pokemon, and calculate summary statistics for each cluster.
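
Before running this on the Pokemon data, the effect of nstart can be seen on made-up data. The sketch below (toy matrix and variable names are assumptions) compares a single random start with the best of 20 starts by their total within-cluster sum of squares, where lower means tighter clusters.

```r
set.seed(4)
# Toy data: 200 made-up points in 4 dimensions
toy <- matrix(rnorm(800), ncol = 4)

# Re-seed before each call so the comparison is repeatable
set.seed(151)
one_start <- kmeans(toy, centers = 6, nstart = 1, iter.max = 50)
set.seed(151)
many_starts <- kmeans(toy, centers = 6, nstart = 20, iter.max = 50)

# Total within-cluster sum of squares for each fit
c(one = one_start$tot.withinss, many = many_starts$tot.withinss)
```

The fit with more starts should be at least as tight as the single-start fit, at the cost of some extra computation.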

set.seed(151)
poke_kmean <- kmeans(pokeMatrix, centers = 6, nstart = 20)
colors <- c("red","blue","cyan","green","orange","deeppink")
poke_k <- pokeData %>%
  mutate(cluster = poke_kmean$cluster) %>%
  arrange(cluster) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense)
poke_k_sum <- pokeData %>%
  mutate(cluster = poke_kmean$cluster) %>%
  group_by(cluster)  %>%
  summarize(n = n(),attack = mean(attack),defense = mean(defense),sp_attack = mean(sp_attack),
            sp_defense = mean(sp_defense),speed = mean(speed),hp = mean(hp),
            .groups = "drop") %>%
  # Look up each cluster's plotting color by its cluster number
  mutate(color = colors[cluster]) %>%
  mutate(stat_tot = attack+defense+sp_attack+sp_defense+speed+hp,
         attack_tot = attack + sp_attack,
         defense_tot = defense + sp_defense,
         sp_tot = sp_attack + sp_defense,
         norm_tot = attack + defense,
         color = tools::toTitleCase(color)) %>%
  arrange(desc(stat_tot)) %>%
  select(color,n,attack,defense,sp_attack,sp_defense,speed,hp,attack_tot,defense_tot,norm_tot,sp_tot,stat_tot) %>%
  mutate_if(is.numeric,round)

One advantage of Hierarchical Clustering is the nice visualization of the grouping you can make with the dendrogram. It’s somewhat harder to visualize K-Means Clustering for high-dimensional data. One way is to use Principal Component Analysis to reduce the data down to two dimensions, but the result is still messy and the two dimensions are difficult to understand intuitively. Instead, we’ll plot the Pokemon based on their normal and special statistics. We’ll also build the same summary table that we made for the hierarchical clustering.

poke_k %>%
  ggplot(aes(x = sp_tot,y = norm_tot)) +
  ggimage::geom_pokemon(aes(image =tolower(name),color = as.factor(cluster))) +
  # geom_text(aes(label = name,color = as.factor(cluster)),size = 2) +
  labs(y = "Total Normal Stats",x = "Total Special Stats") +
  scale_color_manual(values = colors) +
  theme_minimal() + 
  theme(legend.position = "none")
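
For comparison, the Principal Component Analysis route mentioned above could be sketched as follows. This uses made-up data as a stand-in for the scaled stat matrix and base R's prcomp function; the toy matrix, the three clusters, and the variable names are all assumptions for illustration.

```r
set.seed(5)
# Toy stand-in for a scaled stat matrix: 50 made-up rows, 6 columns
toy <- scale(matrix(rnorm(300), ncol = 6))
toy_k <- kmeans(toy, centers = 3, nstart = 20)

# Project the six columns onto their principal components
toy_pca <- prcomp(toy)

# Share of the total variance captured by the first two components
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
sum(var_share[1:2])

# Scatter of the first two components, colored by k-means cluster
plot(toy_pca$x[, 1:2], col = toy_k$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

The axes of such a plot are blends of all six columns, which is why they are harder to interpret than the normal and special totals used here.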

gt(poke_k_sum) %>%
  gt::cols_label(color = "Cluster",n = "Number of Pokemon",attack = "Attack",
                 defense = "Defense",sp_attack = "Attack",sp_defense = "Defense",
                 speed = "Speed",hp = "HP",attack_tot = "Attack",defense_tot = "Defense",
                 norm_tot = "Normal Stats",sp_tot = "Special Stats",stat_tot = "Stats") %>%
  gt::cols_align(align = "center") %>%
  gt::tab_spanner(label = "Normal",columns = c("attack","defense")) %>%
  tab_spanner(label = "Special", columns = c("sp_attack","sp_defense")) %>%
  tab_spanner(label = "Total",columns = c("attack_tot","defense_tot","norm_tot","sp_tot","stat_tot"))

Cluster	Number of Pokemon	Normal Attack	Normal Defense	Special Attack	Special Defense	Speed	HP	Total Attack	Total Defense	Total Normal Stats	Total Special Stats	Total Stats
Green	16	118	94	109	103	93	87	226	197	211	212	604
Red	5	67	51	76	91	50	162	143	142	118	167	497
Cyan	19	63	62	108	87	108	61	171	149	125	194	488
Blue	38	91	71	67	78	75	70	158	150	162	145	452
Orange	22	81	115	59	49	49	57	140	165	197	108	410
Deeppink	51	51	47	49	47	57	48	100	94	98	96	298

Once again, we have a grouping of the most powerful Pokemon, represented by the green cluster in the upper right of the plot. The deep pink cluster shows the weak Pokemon in the bottom left of the plot. The cyan, blue, and orange clusters are midrange but are split mostly by the difference between their special and normal stats. The orange cluster is in the upper left (high normal stats, low special stats), the cyan cluster is in the bottom right (low normal stats, high special stats), and the blue cluster is in the center of the plot (average normal stats, average special stats). The clusters overlap each other because the plot only shows two summary statistics and can’t show all of the data. Finally, the red cluster appears to be the odd one out. The table shows that it has the highest HP of any cluster, but this is likely because Chansey’s outlier HP shows up in this cluster.