Clustering Legends of Runeterra archetypes#

Goal#

Given a list of n Legends of Runeterra decks, how can we automatically group them into archetypes, i.e. groups of similar decks?

Clustering decks into archetypes allows for better calculation of metrics like win-rate and play-rate. This is a problem that comes back for every card game I have ever played, and is usually “simply” solved by humans sorting the decks themselves.

But what if deck clustering could be automated?

Approaches#

Intuition#

I started to think about how to “simply” group decklists. Two ways came to mind:

  • Grouping decks that share a pool of common cards

  • Grouping decks that have few different cards

Whichever distance we choose, we can start from individual decks and add new decks one by one, selecting the “closest” one to the cluster at each step. We can do this process in descending popularity order, so ties are broken by the popularity of decks. We can then select the biggest non-overlapping clusters found as good archetypes and re-start the process.

It does sound pretty close to k-means, as it also relies on a central representation for each cluster, but here we don’t have to select how many clusters we want from the start. Clusters of very different sizes are also not an issue with our approach.

We’ll call that method incremental constrained cluster growth and it’s the one I’ll be implementing in this blog post.

Existing clustering algorithms#

Cluster analysis is a complex and heavily researched domain, and many of its algorithms exhibit O(n**3) complexity.

The most popular clustering algorithm, or at least the only one I remember from my M.Sc., is k-means. It’s unfortunately a pretty poor fit here as we don’t know the number of archetypes/clusters we are looking for, and it’s also poorly adapted for clusters of vastly different sizes.

Hierarchical clustering is a better fit for our use-case:

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.

Deck archetypes naturally have multiple levels of “granularity” we can look at, from big macro archetypes defined by only a few cards to micro archetypes arising from a handful of card differences. Ordering them in a hierarchy sounds like a great fit, even though it will be harder to use for visualisation and analysis.

I will try hierarchical clustering and visualisation at some point, but I’m keeping it for a future blog post!
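
For the curious, here is a rough, untested sketch of what that could look like with scipy. The decklists variable (a list of 40-card sets like the ones built later in this post) and the cut-off value are placeholders for illustration; this is not what is implemented in the rest of this post.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def hierarchical_archetypes(decklists, max_distance=12):
    # Pairwise distance = number of card copies two 40-card decks do not share
    n = len(decklists)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            distances[i, j] = distances[j, i] = 40 - len(decklists[i] & decklists[j])

    # "complete" linkage bounds the maximum pairwise distance inside a cluster
    tree = linkage(squareform(distances), method="complete")

    # Cutting the tree at different heights gives different archetype granularities
    return fcluster(tree, t=max_distance, criterion="distance")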

Preparing data#

Games selection#

As Legends of Runeterra does not offer access to ranks through its API, I use TrueSkill to identify the best players on a given server and parse them in descending skill order. I only use ranked results to determine skill ratings.

I sanity-checked the approach by confirming that the best players identified by TrueSkill were indeed in Master rank. We can do that by checking account names, which is the only type of rank data available.

This allows me to parse games for the ~20 000 best players per server, which returns ~35 000 ranked games per day per server. As there are 3 servers in total, we have ~100 000 games per day of data. This likely covers ranks down to Platinum, and maybe even Gold, but we have no way to check as LoR’s API doesn’t let us query ranks for players ¯\_(ツ)_/¯
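
For reference, the rating update itself is simple with the trueskill package. This is only a simplified sketch: ranked_games, an iterable of (winner, loser) player ids, stands in for the real parsing pipeline.

from collections import defaultdict

import trueskill

ratings = defaultdict(trueskill.Rating)

for winner, loser in ranked_games:
    # rate_1vs1 returns the updated ratings of the winner and the loser
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Sorting by the conservative estimate mu - 3 * sigma gives us the players to parse first
best_players = sorted(
    ratings, key=lambda p: ratings[p].mu - 3 * ratings[p].sigma, reverse=True
)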

SQL query#

The data parsed from Riot’s API is subject to regular schema changes, so I’ve saved it as a JSONB column in Postgres.

I have written that query in many different ways but in the end making a subquery to create a player table was the easiest for me to maintain. I’m sure this query can be written in a smarter way but I’m still only starting to get used to JSONB direct querying with Postgres:

SELECT
  deck_code,
  factions,
  COUNT(*) as games,
  COUNT(CASE WHEN win::boolean THEN 1 END) as wins
FROM (
  SELECT
    jsonb_array_elements(lor_game.data->'info'->'players')->>'win' as win,
    jsonb_array_elements(lor_game.data->'info'->'players')->>'deck_code' as deck_code,
    jsonb_array_elements(lor_game.data->'info'->'players')->>'factions' as factions
  FROM lor_game
  WHERE lor_game.data->'info'->>'game_version' = 'live-green-3-13-42'
) as player
GROUP BY deck_code, factions
HAVING COUNT(*) > 10
ORDER BY games DESC

Result:

| Deck code | Factions | Games | Wins |
| --- | --- | --- | --- |
| CICQCAQDAMAQKBQBAEDA(...) | ["faction_Bilgewater_Name", "faction_Noxus_Name"] | 39880 | 24067 |
| CMCQCAQAAIAQIAADAECA(...) | ["faction_Demacia_Name", "faction_Shurima_Name"] | 24118 | 12756 |
| CUFACAIEGYAQEBR4AEBQ(...) | ["faction_Jhin_Name", "faction_Noxus_Name"] | 16387 | 9315 |
| CMBQCAQAAIBAIB3HQIAQ(...) | ["faction_Demacia_Name", "faction_Shurima_Name"] | 16153 | 9146 |

# Loading our env variables before we load our library
import dotenv

dotenv.load_dotenv(verbose=True, override=True)

# We checked for the latest patch value by looking at the most recent games
latest_version = "live-green-3-13-42"

# Connecting to the database
from neotokyo import db

session = db.connection.ghost_session_maker.session_maker()

# ORM
import sqlalchemy
from sqlalchemy.dialects import postgresql

# Defining our subquery, already filtered on the right game version
# Tbh the postgres JSONB syntax in SQLAlchemy is pretty disgusting
player_table = (
    sqlalchemy.select(
        sqlalchemy.func.jsonb_array_elements(
            db.models.LorGame.data["info"]["players"],
            type_=postgresql.JSONB,
        )["deck_code"].astext.label("deck_code"),
        sqlalchemy.func.jsonb_array_elements(
            db.models.LorGame.data["info"]["players"],
            type_=postgresql.JSONB,
        )["factions"].astext.label("factions"),
        sqlalchemy.func.jsonb_array_elements(
            db.models.LorGame.data["info"]["players"],
            type_=postgresql.JSONB,
        )["win"].astext.label("win"),
    )
    # We put our patch limit here so the subquery return is smaller
    .where(
        db.models.LorGame.data["info"]["game_version"].astext == latest_version,
    )
    .subquery()
)

games_count = sqlalchemy.func.count().label("games_count")
wins = sqlalchemy.func.count(
    sqlalchemy.case(
        (player_table.c.win == "true", 1),
    )
).label("wins")

# The disgusting player_table code lets us write a pretty clean and readable query at least!
query = (
    sqlalchemy.select(
        player_table.c.deck_code,
        player_table.c.factions,
        games_count,
        wins,
    )
    .order_by(games_count.desc())
    .group_by(
        player_table.c.deck_code,
        player_table.c.factions,
    )
    .having(games_count > 10)
)

# We get the results as a list
latest_patch_decks = session.execute(query).all()

print(f"Games found: {sum(dd.games_count for dd in latest_patch_decks):,}")
Games found: 2,639,173

Deck codes to deck lists#

Riot gives us deck codes, but we want a list of cards to easily compute the distance. Technically one could implement the algorithm in pure SQL, but it’s quite a pain.

We will instead use lor-deckcodes to transform deck codes into a list of card codes and counts, and will then transform card codes into card names using our own database.

To store decklists in an easy-to-compare format, we will use a set of 40 strings, with card names suffixed by their copy number:

{
  Veigar-1,
  Veigar-2,
  Veigar-3,
  Senna-1,
  ...
}
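
The suffix is what lets a plain set intersection count shared copies. A toy example with two hypothetical (and very short) decklists:

deck_a = {"Veigar-1", "Veigar-2", "Veigar-3", "Senna-1", "Senna-2"}
deck_b = {"Veigar-1", "Veigar-2", "Senna-1", "Senna-2", "Senna-3"}

# 2 shared copies of Veigar + 2 shared copies of Senna
assert len(deck_a & deck_b) == 4
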
# Defining constants
DECK_SIZE = 40

# Adding a small utility to compute card names
card_cache = {}


def get_card_name(code: str) -> str:
    if code not in card_cache:
        card_cache[code] = session.get(db.models.LorCard, code).data["name"]

    return card_cache[code]


from typing import Set


# A small function to have more readable card lists
#   We don't do it based on deck codes/decklists because we want to use it with archetypes later
def cards_table(cards: Set[str]) -> str:
    """Beautiful output of cards"""
    import tabulate

    # We could do all that with complex list comprehensions but the gain in performance is not worth the loss in readability
    tabulate_input = []
    current_row = []
    added_cards = set()

    for card in sorted(
        cards,
        # The last character is the # of copies and we want to go in descending order
        key=lambda x: x[-1],
        reverse=True,
    ):
        card_name = card[:-2]

        # If we already added the card we continue
        if card_name in added_cards:
            continue

        added_cards.add(card_name)
        current_row.append(f"{card[-1]}x {card_name}")

        # We put only 4 cards per row
        if len(current_row) == 4:
            tabulate_input.append(current_row)
            current_row = []

    if current_row:
        tabulate_input.append(current_row)

    return tabulate.tabulate(tabulate_input, tablefmt="html")


# Making a dataclass to hold our information
from dataclasses import dataclass

# Utilities for outputting HTML
from IPython.display import HTML, display

# A dataclass automatically creates its own __init__ function from the type hints, which is nice
@dataclass
class DeckData:
    deck_code: str

    factions: str
    cards: Set[str]

    games_count: int
    wins: int

    @property
    def winrate(self):
        return self.wins / self.games_count

    def display(self):
        display(
            HTML(
                f"""<h3>{self.deck_code}</h3>
<ul><li>Games: {self.games_count:,}</li>
<li>Winrate: {self.winrate*100:.2f}%</li>
<li>Factions: {self.factions}</li>
<li>Decklist: {cards_table(self.cards)}</li></ul>"""
            )
        )


import lor_deckcodes

decks_data = {}

for row in latest_patch_decks:
    deck = lor_deckcodes.LoRDeck.from_deckcode(row.deck_code)

    # We store the decklist as a set of 40 unique strings as it lets us use the intersection operator
    # I tried storing decklists as sets of unique integers but saw no significant performance improvement
    cards = set()
    for card in deck.cards:
        for i in range(card.count):
            cards.add(f"{get_card_name(card.card_code)}-{i+1}")

    # Making sure we do have 40 unique card strings in our set
    assert len(cards) == DECK_SIZE

    # Saving all the data we got for this specific deck code
    decks_data[row.deck_code] = DeckData(
        deck_code=row.deck_code,
        factions=row.factions,
        games_count=row.games_count,
        wins=row.wins,
        cards=cards,
    )


print(f"Decks found: {len(decks_data):,}")

first_deck_code = next(iter(decks_data))
decks_data[first_deck_code].display()

print("Raw cards data: ", decks_data[first_deck_code].cards)
Decks found: 46,092

CICQCAQDAMAQKBQBAEDAMFIEAIDBMGRGHICQCAYCBQHSKKACAEBAMLIBAYDB4AA

  • Games: 80,571
  • Winrate: 60.21%
  • Factions: ["faction_Bilgewater_Name", "faction_Noxus_Name"]
  • Decklist:
    3x Zap Sprayfin  3x Marai Warden  3x Legion Grenadier  3x Miss Fortune
    3x Decimate  3x Precious Pet  3x Legion Saboteur  3x Twisted Fate
    3x Noxian Fervor  3x Riptide Sermon  3x Legion Rearguard  3x Jagged Butcher
    2x Make it Rain  2x Eye of Nagakabouros
Raw cards data:  {'Noxian Fervor-1', 'Zap Sprayfin-2', 'Zap Sprayfin-3', 'Legion Rearguard-1', 'Make it Rain-1', 'Decimate-2', 'Marai Warden-3', 'Legion Grenadier-3', 'Miss Fortune-2', 'Eye of Nagakabouros-1', 'Miss Fortune-3', 'Legion Grenadier-1', 'Decimate-3', 'Precious Pet-3', 'Jagged Butcher-1', 'Zap Sprayfin-1', 'Riptide Sermon-2', 'Legion Saboteur-3', 'Twisted Fate-2', 'Twisted Fate-3', 'Precious Pet-2', 'Noxian Fervor-3', 'Legion Rearguard-2', 'Miss Fortune-1', 'Legion Saboteur-2', 'Make it Rain-2', 'Eye of Nagakabouros-2', 'Noxian Fervor-2', 'Marai Warden-2', 'Marai Warden-1', 'Riptide Sermon-1', 'Jagged Butcher-2', 'Riptide Sermon-3', 'Legion Grenadier-2', 'Decimate-1', 'Legion Saboteur-1', 'Legion Rearguard-3', 'Precious Pet-1', 'Twisted Fate-1', 'Jagged Butcher-3'}

Implementing incremental constrained cluster growth#

Basic idea#

  • Define the distance from a deck to a cluster

  • Iterate on decks in descending popularity (# games)

  • For each deck

    • Create a cluster containing only this deck

    • Add decks one by one, selecting the one with the shortest distance to the cluster at each step

  • Add non-overlapping clusters in descending size

  • Re-start iteration with any remaining decks

The archetype will be represented by its aggregate decklist, as defined by Frank Karsten in this ChannelFireball article.

Things that didn’t work out#

  • With a direct implementation, computational time was through the roof at ~24h with ~20,000 decks

    • Complexity O(n**3) does that

  • Even with multiple optimisations and shortcuts the method was still way too heavy and took multiple hours for each iteration

  • I tried an idea I called fast cutoff, which would directly select a cluster if it contained more than 1% of all decks

    • It did heavily speed up the process at the cost of clustering quality

    • I no longer needed it once I fixed my code

Deck factions#

The most important thing to add was “pre-clustering”. With O(n**3) complexity, splitting the data into 100 roughly equal sets makes each subset about a million times cheaper to cluster, and brings the total work down by a factor of roughly 10,000.

And there are about 100 obvious sets for our data: the decks’ faction combinations. Those are similar to deck colors in Magic or heroes in Hearthstone.
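
A quick back-of-the-envelope check of that speedup, assuming ~20,000 decks split into 100 equally sized groups:

n, groups = 20_000, 100

naive_cost = n ** 3
split_cost = groups * (n // groups) ** 3

# Each group is a million times cheaper, the total work about 10,000 times cheaper
print(f"Total work divided by ~{naive_cost / split_cost:,.0f}")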

Forcing all decks in an archetype to share the same factions does lose some granularity for decks that just splash 3-6 cards from another faction, but it’s actually something we want: the end goal is identifying the success of different strategies, and changing a splash in a deck is a new strategy.

The simplification also allowed me to drop the fast cutoff idea as I was working with much smaller lists of decks to cluster at each step.

from collections import defaultdict

# Creating faction groups
# We use lists and not sets because we want to keep the decks ordered by count
factions_decks = defaultdict(list)

for deck_code in decks_data:
    factions_decks[decks_data[deck_code].factions].append(deck_code)

print(f"Found {len(factions_decks):,} different factions")

max_faction = max(factions_decks, key=lambda x: len(factions_decks[x]))
print("The faction with the most decks is", max_faction)
Found 88 different factions
The faction with the most decks is ["faction_BandleCity_Name", "faction_ShadowIsles_Name"]

Distance and clusters#

Distance between two decks#

The distance between two decks is the number of card copies they do not share: 40 minus the size of the intersection of their card sets (the & operator).

This works because each copy (1-2-3) is defined as a unique string in our decklists.

Distance from a deck to a cluster#

The distance from a deck to a cluster is 0 if adding the new deck does not change the intersection of existing decks in the cluster. If it requires removing a card from the intersection to add the new deck, it’s 1, and so on and so forth.

I initially implemented it wrongly, which cost me a lot of time.

The right way to calculate the distance from a deck to a cluster is to use the cluster’s intersection and compare it to the decklist.
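
In code, both distances boil down to one-liners on the card sets. This is just a condensed sketch (the function names are illustrative; the Cluster classes below implement the real, cached versions), with DECK_SIZE being the constant defined earlier:

def decks_distance(cards_a: set, cards_b: set) -> int:
    # Number of card copies the two decks do not share
    return DECK_SIZE - len(cards_a & cards_b)


def cluster_distance(intersection: set, cards: set) -> int:
    # Number of cards that would have to leave the cluster's intersection to admit the deck
    return len(intersection) - len(intersection & cards)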

Implementation#

What really matters to us is the distance from a deck to a cluster, so it’s easiest to implement all of this as part of a Cluster class.

While we’re at it, we implement the aggregate decklist code in that class, as well as some basic stats handling and a beautiful display!

from abc import abstractmethod
from collections import Counter
from typing import Optional

type_cache = {}


def get_card_type(name: str) -> str:
    if name not in type_cache:
        card = (
            session.query(db.models.LorCard)
            .filter(db.models.LorCard.data["name"].astext == name)
            .first()
        )
        type_cache[name] = card.data["supertype"]

    return type_cache[name]


# A class we'll use to compute cluster stats and display them
class ClusterStats:
    def __init__(self, cluster) -> None:
        self.cluster = cluster

        cards_count = Counter()

        self.wins = 0
        self.games_count = 0

        for deck_code in self.cluster.decks:
            self.games_count += decks_data[deck_code].games_count
            self.wins += decks_data[deck_code].wins
            self.factions = decks_data[deck_code].factions

            for card in decks_data[deck_code].cards:
                # Instead of just using 1, we use the decklists' wins
                #   This makes it so not only more popular versions have more weight, but successful ones do too
                cards_count[card] += decks_data[deck_code].wins

        self.aggregated_decklist = [c for c, count in cards_count.most_common(40)]

        champions = set(
            c[:-2]
            for c in self.aggregated_decklist
            if get_card_type(c[:-2]) == "Champion"
        )

        self.title = f"""{" ".join(champions)} - {
            " ".join(n[9:-6] for n in self.factions.replace(" ", "")[1:-1].split(","))
        }"""

    @property
    def winrate(self):
        return self.wins / self.games_count

    def display(self):
        display(
            HTML(
                f"""<h3>{self.title}</h3>
    <ul>
    <li><b>{self.winrate*100:.2f}% winrate</b></li>
    <li>{self.games_count:,} games</li>
    <li>{len(self.cluster.decks)} decklists</li>
    <li>Aggregated decklist: {cards_table(self.aggregated_decklist)}</li></ul>"""
            )
        )


class Cluster:
    def __init__(self, center: str) -> None:
        self.decks = [center]

    def __contains__(self, item) -> bool:
        return item in self.decks

    def __len__(self) -> int:
        return len(self.decks)

    @abstractmethod
    def distance(self, deck_code: str) -> int:
        ...

    @abstractmethod
    def can_be_added_to_cluster(self, distance: int) -> Optional[bool]:
        # TODO Check if we actually need that
        # We will use a trilean here
        #   True -> can be added
        #   False -> cannot be added yet
        #   None -> will never be able to be added (useful for some distances)

        # We use the distance as argument so we don't compute it twice but to allow for different rules
        ...

    @abstractmethod
    def add(self, deck_code: str) -> None:
        ...

    def get_stats(self) -> ClusterStats:
        return ClusterStats(self)

Centered clusters#

Given a center deck and a list of candidates, we want to find the “best” cluster built around center.

To do that, we add decks one by one, taking the closest one to the cluster at each step.

There are a few possible optimisations:

  • If a deck has a distance 0 to the cluster, we should be able to add it instantly

    • Distance 0 should mean adding it doesn’t change our cluster

import copy
from typing import List
import time


def get_centered_cluster(
    center: str,
    candidates: List[str],
    cluster_class: Cluster,
) -> Cluster:

    # We start the cluster with the center
    cluster = cluster_class(center)

    # We remove it from the candidates
    candidates.remove(center)

    # We iterate until we aren't able to add an eligible deck or we've added them all
    while len(candidates) > 0:

        # Sentinel value, higher than any possible distance
        minimum_distance = float("inf")
        minimum_deck = None

        # We iterate on possible members
        # We copy the list because we want to be able to remove candidates during iteration
        for candidate in list(candidates):

            cluster_distance = cluster.distance(candidate)
            can_be_added = cluster.can_be_added_to_cluster(cluster_distance)

            # We check if the deck can be added to our cluster first
            if can_be_added is None:
                # We use a trilean to speed up the process with some algorithms
                candidates.remove(candidate)

            elif can_be_added is False:
                pass

            elif can_be_added is True:
                # If distance = 0 we can add the deck directly (happens for common cards clustering)
                if cluster_distance == 0:
                    cluster.add(candidate)
                    candidates.remove(candidate)
                    continue

                # We check if we found a new closest candidate and save it if that's the case
                if cluster_distance < minimum_distance:
                    minimum_distance = cluster_distance
                    minimum_deck = candidate

        # One step of iterations on candidates is over

        # If we didn't find a new minimum deck, we stop
        if minimum_deck is None:
            break

        # Adding the deck we found to our cluster
        cluster.add(minimum_deck)

        # We remove the deck from the remaining decks and continue iterating
        candidates.remove(minimum_deck)

    return cluster

Getting all clusters#

Then, we make a function for getting all clusters for a list of deck codes. This one’s pretty simple.

from typing import List


def get_all_clusters(deck_codes: List[str], cluster_class: Cluster) -> List[Cluster]:
    """Gets all clusters centered on each deck code

    Args:
        deck_codes (List[str]): a list of deck codes

    Returns:
        List[Cluster]: all clusters found centered on each deck code
    """
    clusters = []

    for center in deck_codes:
        # We copy the list of deck codes because we change it in our code above
        candidates = list(deck_codes)

        # We get the biggest cluster centered on deck
        clusters.append(get_centered_cluster(center, candidates, cluster_class))

    return clusters

Clustering a list of decks#

We now have all the necessary building blocks to make our last function. One that takes a list of deck codes and returns clusters containing all decks.

Pseudocode:

  • Call get_all_clusters to get all clusters centered on each remaining deck code

  • Order them by size

  • Add them in descending size order as long as they don’t contain any deck that’s already in our result

    • The biggest one will always get added

    • Smaller ones can get added directly if they have no overlap with the big clusters found in that step

  • Restart the process until all decks have been assigned to a cluster

def get_best_clusters(deck_codes: List[str], cluster_class: Cluster) -> List[Cluster]:
    """Gets the best clusters for the list of deck codes

    Args:
        deck_codes (List[str]): a list of deck codes

    Returns:
        List[Cluster]: the best clusters found
    """
    clusters = []
    remaining_decks = deck_codes

    def clusters_contains_deck(clusters: List[Cluster], deck) -> bool:
        # Return True if the deck is in any of the clusters already found
        return any(deck in c for c in clusters)

    while len(remaining_decks) > 0:
        possible_clusters = get_all_clusters(remaining_decks, cluster_class)

        for possible_cluster in sorted(
            possible_clusters, key=lambda x: len(x), reverse=True
        ):
            # If no other cluster already contains any of the decks in the current candidate cluster, we add it
            # This means we will always add the biggest cluster found at that step
            if all(not clusters_contains_deck(clusters, d) for d in possible_cluster.decks):
                clusters.append(possible_cluster)

            # Else we pass to the next possible cluster
            else:
                continue

        # We update our remaining decks
        remaining_decks = [
            d for d in deck_codes if not clusters_contains_deck(clusters, d)
        ]

    return clusters

Clustering based on common cards#

  • All decks in an archetype must share ARCHETYPE_SIZE cards

  • Any deck whose addition would make the cluster intersection go below ARCHETYPE_SIZE can directly be eliminated

    • Adding more decks will make the cluster more stringent with future additions

ARCHETYPE_SIZE = DECK_SIZE - 10

class CommonCardsCluster(Cluster):
    def __init__(self, center: str) -> None:
        super().__init__(center)

        # Saving the intersection of the decklists in our cluster will help speed up the process
        self.intersection = decks_data[center].cards

        self.archetype_size = ARCHETYPE_SIZE

    @staticmethod
    def deck_to_deck_distance(dc_1: str, dc_2: str):
        # A small static method used for validation
        return DECK_SIZE - len(
            # & is the intersection operator for sets
            decks_data[dc_1].cards
            & decks_data[dc_2].cards
        )

    def distance(self, deck_code: str) -> int:
        # If the cluster's intersection is fully contained in the deck's cards, adding the deck changes nothing and the distance is 0
        # This will always be >= 0 because at most there's full overlap between a deck and a cluster's intersection
        return len(self.intersection) - len(
            # & is the intersection operator for sets
            self.intersection
            & decks_data[deck_code].cards
        )

    def can_be_added_to_cluster(self, distance: int) -> Optional[bool]:
        # If adding the new deck would make the intersection go below the archetype size, it will never be able to be added
        # For example if we currently have an intersection of 30 and an archetype size of 28, the max distance to add is 2
        if distance > len(self.intersection) - self.archetype_size:
            return None

        else:
            return True

    def add(self, deck_code: str) -> None:
        self.decks.append(deck_code)
        # & is the intersection operator for sets
        self.intersection = self.intersection & decks_data[deck_code].cards
for deck_code in decks_data:
    # A deck code has 100% overlap with itself
    assert CommonCardsCluster.deck_to_deck_distance(deck_code, deck_code) == 0

    if deck_code != first_deck_code:
        # deck-to-deck distance is always > 0 as they're distinct and <= 40 as that's the maximum difference
        assert (
            0
            < CommonCardsCluster.deck_to_deck_distance(deck_code, first_deck_code)
            <= 40
        )

        # In this specific case, the distance to the cluster is the same as the distance to the deck:
        #   it's the number of non-overlapping cards
        cluster = CommonCardsCluster(first_deck_code)

        assert cluster.distance(deck_code) == CommonCardsCluster.deck_to_deck_distance(
            deck_code, first_deck_code
        )

# Validation of the centered clusters code
for deck in factions_decks[max_faction][:200]:

    # Only checking out of 200 decks for speed
    cluster = get_centered_cluster(
        deck,
        factions_decks[max_faction][:200],
        CommonCardsCluster,
    )

    # We check the number of cards common to all decks in the cluster
    if len(cluster.decks) > 1:
        # If we have more than one deck, their intersection is at most 39 cards
        assert 40 > len(cluster.intersection) >= ARCHETYPE_SIZE
    else:
        # A few decks simply have no good neighbors, so we check that even the closest candidate is too far away
        assert (
            min(
                cluster.distance(d)
                for d in factions_decks[max_faction][:200]
                if d != deck
            )
            > DECK_SIZE - ARCHETYPE_SIZE
        )

# Validation of the best clusters code
clusters = get_best_clusters(factions_decks[max_faction][:200], cluster_class=CommonCardsCluster)

# We clustered 200 decks, the sum of clusters lengths should be 200
assert sum(len(c) for c in clusters) == 200

# Checking we have the right intersection sizes
for cluster in clusters:
    assert 40 >= len(cluster.intersection) >= ARCHETYPE_SIZE

Putting it all together#

Parameter selection#

Selecting the right ARCHETYPE_SIZE is crucial, so I experimented with a few different values. Keep in mind we have ~30,000 decklists.

| ARCHETYPE_SIZE | Archetypes | Avg decks/archetype for top 100 |
| --- | --- | --- |
| 30 | 10132 | 53 |
| 28 | 8847 | 65 |
| 25 | 7032 | 80 |

ARCHETYPE_SIZE=28 looks like a sweet spot and is coherent with the intuition of what defines an archetype: it allows 12 cards to differ between decklists, which is 4 full playsets.

At the same time, the most popular archetype (Senna Veigar) contains over 500 different decks, which starts to obscure a lot of individual card choices that may be meaningful.

A dynamic ARCHETYPE_SIZE might be what’s needed, in particular for decks that have a lot of “flex” slots. Smaller changes in those popular decks could create new archetypes, and fringe decks could get grouped more easily to allow for easier analysis. This might be something I explore in the future.
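
Purely to illustrate the idea (nothing below is implemented in this post, and the thresholds are made up), a dynamic threshold could for example ask popular clusters to agree on more cards:

def dynamic_archetype_size(cluster_games_count: int) -> int:
    # Hypothetical rule: the more games a cluster already represents,
    # the more cards new decks must share with it
    if cluster_games_count < 10_000:
        return 28
    if cluster_games_count < 50_000:
        return 31
    return 34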

Analysing the results#

And finally we get to the fun part, running it and finding out what’s the best deck in Legends of Runeterra right now!

Or so I thought. Let’s take a look at the result.

# This will be a list of Cluster objects, each holding a group of deck codes
common_cards_cluster = []

# Progress visualization
from tqdm.notebook import tqdm

# Running the clustering process for all regions takes a few minutes
# It could be heavily optimized but it's good enough for now
for faction in tqdm(factions_decks):
    best_clusters = get_best_clusters(factions_decks[faction], CommonCardsCluster)
    common_cards_cluster.extend(best_clusters)
# We'll take a look at the 100 biggest archetypes (by number of decklists)
for cluster in sorted(common_cards_cluster, key=lambda x: len(x), reverse=True)[:100]:
    stats = cluster.get_stats()

    if "Miss Fortune" in stats.title and "Twisted Fate" in stats.title:
        stats.display()

Miss Fortune Twisted Fate - Bilgewater Noxus

  • 59.54% winrate
  • 127,094 games
  • 257 decklists
  • Aggregated decklist:
    3x Precious Pet  3x Marai Warden  3x Legion Saboteur  3x Decimate
    3x Noxian Fervor  3x Jagged Butcher  3x Legion Rearguard  3x Zap Sprayfin
    3x Legion Grenadier  3x Twisted Fate  3x Miss Fortune  3x Riptide Sermon
    2x Eye of Nagakabouros  2x Make it Rain

Miss Fortune Twisted Fate - Bilgewater Noxus

  • 58.01% winrate
  • 2,939 games
  • 51 decklists
  • Aggregated decklist:
    3x Marai Warden  3x Legion Saboteur  3x Legion Grenadier  3x Noxian Fervor
    3x Twisted Fate  3x Miss Fortune  3x Decimate  3x Legion Rearguard
    3x Zap Sprayfin  3x Eye of Nagakabouros  3x Precious Pet  3x Riptide Sermon
    2x Jagged Butcher  2x Make it Rain

Miss Fortune Twisted Fate - Bilgewater Noxus

  • 58.58% winrate
  • 1,137 games
  • 32 decklists
  • Aggregated decklist:
    3x Marai Warden  3x Legion Saboteur  3x Zap Sprayfin  3x Decimate
    3x Legion Grenadier  3x Noxian Fervor  3x Miss Fortune  3x Legion Rearguard
    3x Jagged Butcher  3x Riptide Sermon  3x Twisted Fate  2x Eye of Nagakabouros
    2x Make it Rain  2x Precious Pet  1x Tentacle Smash

Gangplank Miss Fortune Twisted Fate - Bilgewater Noxus

  • 56.78% winrate
  • 2,892 games
  • 31 decklists
  • Aggregated decklist:
    3x Legion Saboteur  3x Zap Sprayfin  3x Decimate  3x Imperial Demolitionist
    3x Noxian Fervor  3x Crackshot Corsair  3x Miss Fortune  3x Legion Rearguard
    3x Marai Warden  3x Precious Pet  3x Legion Grenadier  2x Island Navigator
    2x Twisted Fate  1x Gangplank  1x Jagged Butcher  1x Arachnoid Sentry

As we can see, we have 3 archetypes that are… pretty much the same MF-TF aggro deck. The differences in their aggregate decklists are minimal; they should not be 3 different archetypes, or should at least be split differently.

I think this is because players will try pretty much any change to a decklist, which means each individual card gets cut from the decklist at some point. Archetypes aren’t really defined by the cards they all have in common, but more by how many cards they have in common.

So it’s time to implement our second distance!

Clustering based on cards differences#

Let’s take the opposite approach:

  • The distance of a deck to a cluster is the maximum number of different cards it has with a deck in the cluster

  • We will add new decks by minimizing this maximum distance

  • We will set a limit on the maximum distance between two decks in an archetype

The goal is to add decks that differ slightly from each other, whichever cards the players decide to change.

Even though the number seems high at MAX_DIFFERENCE=15, it’s the maximum difference between two arbitrary decks in the cluster. As the clusters are formed iteratively, similar decks will get grouped together early, and hopefully it won’t catch too many “parallel” archetypes.

# This will be the maximum number of different cards 2 different decks can have while in the same cluster
MAX_DIFFERENCE = 15


# Global distance cache
decks_distance_cache = defaultdict(dict)


def deck_to_deck_distance(dc_1: str, dc_2: str):
    # This time we will use this distance heavily and add a distance cache

    # We sort the decks
    d1, d2 = sorted((dc_1, dc_2))

    if d2 not in decks_distance_cache[d1]:
        decks_distance_cache[d1][d2] = DECK_SIZE - len(
            # & is the intersection operator for sets
            decks_data[d1].cards
            & decks_data[d2].cards
        )

    return decks_distance_cache[d1][d2]


class DifferenceBasedCluster(Cluster):
    def __init__(self, center: str) -> None:
        super().__init__(center)

        self.max_difference = MAX_DIFFERENCE

        # Internal cluster distance cache to speed up max search
        # Each deck will point to its maximum distance to the cluster
        # Which means when iterating a deck, we only need to check what's biggest between the max distance in the cache and the one with the last deck added
        self.cluster_distance_cache = defaultdict(lambda: 0)

    def distance(self, deck_code: str) -> int:
        # We could do max(self.deck_to_deck_distance(d, deck_code) for d in self.decks)
        # But we want to speed things up and any deck with a distance > MAX_DIFFERENCE can be ruled out directly
        max_distance = max(self.cluster_distance_cache[deck_code], deck_to_deck_distance(deck_code, self.decks[-1]))

        # In this situation the deck we selected will be too far, we return the maximum value and we'll remove the deck
        if max_distance > self.max_difference:
            return float("inf")

        # Else we save the new value
        self.cluster_distance_cache[deck_code] = max_distance
        
        return max_distance

    def can_be_added_to_cluster(self, distance: int) -> Optional[bool]:
        # By now I realize this function was superfluous, and we could have relied on distance() returning float("inf") as our way to prune decks
        #   Refactoring notebooks is a huge pain though so I won't change it :D

        # If the distance is infinite, the deck is too far from at least one cluster member and will never be addable
        if distance == float("inf"):
            return None

        else:
            return distance < self.max_difference

    def add(self, deck_code: str) -> None:
        self.decks.append(deck_code)
for deck_code in decks_data:
    # A deck code has 100% overlap with itself
    assert deck_to_deck_distance(deck_code, deck_code) == 0

    if deck_code != first_deck_code:
        # deck-to-deck distance is always > 0 as they're distinct and <= 40 as that's the maximum difference
        assert 0 < deck_to_deck_distance(deck_code, first_deck_code) <= 40

        # In this specific case, the distance to the cluster is the same as the distance to the deck:
        #   it's the number of non-overlapping cards
        cluster = DifferenceBasedCluster(first_deck_code)

        # Our cluster distance returns float("inf") if deck is too far, so we need to check >= here
        assert cluster.distance(deck_code) >= deck_to_deck_distance(
            deck_code, first_deck_code
        )

# Validation of the centered clusters code
for deck in factions_decks[max_faction][:200]:

    # Only checking out of 200 decks for speed
    cluster = get_centered_cluster(
        deck,
        factions_decks[max_faction][:200],
        DifferenceBasedCluster,
    )

    assert len(cluster.decks)

# Validation of the best clusters code
clusters = get_best_clusters(
    factions_decks[max_faction][:200],
    cluster_class=DifferenceBasedCluster,
)

# We clustered 200 decks, the sum of clusters lengths should be 200
assert sum(len(c) for c in clusters) == 200

Checking results#

# This will be a list of Cluster objects, each holding a group of deck codes
difference_based_clusters = []

# Progress visualization
from tqdm.notebook import tqdm

# Running the clustering process for all regions takes ~1 hour with this algorithm
# It could be heavily optimized but let's say it's ok for now
for faction in tqdm(factions_decks):
    best_clusters = get_best_clusters(factions_decks[faction], DifferenceBasedCluster)
    difference_based_clusters.extend(best_clusters)
# We'll take a look at the 100 biggest archetypes (by number of decklists)
for cluster in sorted(difference_based_clusters, key=lambda x: len(x), reverse=True)[:100]:
    stats = cluster.get_stats()

    if stats.title.startswith("Miss Fortune Twisted Fate"):
        stats.display()

Miss Fortune Twisted Fate - Bilgewater Noxus

  • 59.46% winrate
  • 136,131 games
  • 506 decklists
  • Aggregated decklist:
    3x Legion Saboteur  3x Noxian Fervor  3x Decimate  3x Legion Rearguard
    3x Marai Warden  3x Zap Sprayfin  3x Precious Pet  3x Jagged Butcher
    3x Legion Grenadier  3x Twisted Fate  3x Miss Fortune  3x Riptide Sermon
    2x Eye of Nagakabouros  2x Make it Rain

Looks better! We now have a single pirate aggro archetype that properly captures the decklists from the 3 clusters we had before.

So after all this… Let’s take a look at the best decks and call it a day :D

# Let's take a look at the 100 biggest archetypes and print the 10 highest winrates amongst those
for cluster in sorted(
    sorted(difference_based_clusters, key=lambda x: len(x), reverse=True)[:100],
    key=lambda x: x.get_stats().winrate,
    reverse=True,
)[:10]:
    stats = cluster.get_stats()
    stats.display()

Miss Fortune Twisted Fate - Bilgewater Noxus

  • 59.45% winrate
  • 137,768 games
  • 506 decklists
  • Aggregated decklist:
    3x Legion Saboteur  3x Noxian Fervor  3x Decimate  3x Legion Rearguard
    3x Marai Warden  3x Zap Sprayfin  3x Precious Pet  3x Jagged Butcher
    3x Legion Grenadier  3x Twisted Fate  3x Miss Fortune  3x Riptide Sermon
    2x Eye of Nagakabouros  2x Make it Rain

Ziggs Katarina Annie - Noxus Shurima

  • 58.10% winrate
  • 4,881 games
  • 40 decklists
  • Aggregated decklist:
    3x Legion Saboteur  3x Treasure Seeker  3x Imperial Demolitionist  3x Decimate
    3x Legion Rearguard  3x Merciless Hunter  3x Legion Grenadier  3x Ruin Runner
    3x Ruinous Path  3x Baccai Reaper  3x Annie  2x Noxian Fervor
    2x Ziggs  2x Might  1x Katarina

Taric Poppy - Demacia MtTargon

  • 57.50% winrate
  • 25,699 games
  • 296 decklists
  • Aggregated decklist:
    3x Golden Aegis  3x Petricite Broadwing  3x Poppy  3x Esmus, Breath of the World
    3x Taric  3x Fleetfeather Tracker  3x Mountain Sojourners  3x Pale Cascade
    3x Laurent Protege  2x Relentless Pursuit  2x Tyari the Traveler  2x Guiding Touch
    2x Solari Soldier  1x Concerted Strike  1x Bastion  1x Ranger's Resolve
    1x Brightsteel Protector  1x Sharpsight

Annie Jhin - Jhin Noxus

  • 57.35% winrate
  • 69,715 games
  • 463 decklists
  • Aggregated decklist:
    3x Legion Saboteur  3x Boomcrew Rookie  3x Noxian Fervor  3x Jhin
    3x Annie  3x Tusk Speaker  3x Decimate  3x Legion Rearguard
    3x Imperial Demolitionist  3x Solari Sunhawk  3x The Stagehand  3x Doombeast
    3x Crackshot Corsair  1x Arachnoid Sentry

Gnar Sejuani - BandleCity Freljord

  • 57.34% winrate
  • 3,122 games
  • 57 decklists
  • Aggregated decklist:
    3x Teenydactyl  3x Pokey Stick  3x Inventive Chemist  3x Sejuani
    3x Gnar  3x Troll Chant  3x Chief Nakotak  3x Tusk Speaker
    3x Bitsy Lizard  3x Poison Dart  3x Ursine Spiritwalker  2x Three Sisters
    2x Spotted Toad  2x Minitee  1x Elixir of Iron

Elise Gwen - Noxus ShadowIsles

  • 57.14% winrate
  • 27,051 games
  • 287 decklists
  • Aggregated decklist:
    3x Decimate  3x Precious Pet  3x Noxian Fervor  3x Arachnoid Horror
    3x Doombeast  3x Stygian Onlooker  3x Elise  3x House Spider
    3x Frenzied Skitterer  3x Legion Saboteur  2x Imperial Demolitionist  2x Stalking Shadows
    2x Gwen  2x Boisterous Host  2x Phantom Butler

Gangplank Miss Fortune Twisted Fate - Bilgewater Noxus

  • 57.04% winrate
  • 3,054 games
  • 48 decklists
  • Aggregated decklist:
    3x Noxian Fervor  3x Miss Fortune  3x Legion Saboteur  3x Legion Rearguard
    3x Decimate  3x Crackshot Corsair  3x Imperial Demolitionist  3x Zap Sprayfin
    3x Marai Warden  3x Precious Pet  3x Legion Grenadier  2x Island Navigator
    2x Twisted Fate  2x Arachnoid Sentry  1x Gangplank

Viego Evelynn - Evelynn ShadowIsles

  • 56.95% winrate
  • 83,699 games
  • 610 decklists
  • Aggregated decklist:
    3x Camavoran Soldier  3x Viego  3x Evelynn  3x Sultur
    3x Glimpse Beyond  3x Hate Spike  3x Domination  3x Vile Feast
    3x Invasive Hydravine  3x Black Spear  2x Vengeance  2x Vora
    2x Neverglade Collector  2x Atrocity  2x Barkbeast

Gwen Elise Katarina - Noxus ShadowIsles

  • 56.83% winrate
  • 39,115 games
  • 249 decklists
  • Aggregated decklist:
    3x Boisterous Host  3x Phantom Butler  3x Gwen  3x Ravenous Flock
    3x Fallen Reckoner  3x Ruined Reckoner  3x House Spider  3x Legion Rearguard
    3x Vile Feast  2x Eternal Dancers  2x Arachnoid Sentry  2x The Harrowing
    2x Hate Spike  2x Katarina  2x Glimpse Beyond  1x Elise

Shyvana Aurelion Sol - Demacia MtTargon

  • 56.81% winrate
  • 5,497 games
  • 87 decklists
  • Aggregated decklist:
    3x Shyvana  3x Dragon Chow  3x Herald of Dragons  3x Dragon's Clutch
    3x Single Combat  3x Ruined Dragonguard  3x Screeching Dragon  3x Sharpsight
    3x Wounded Whiteflame  3x Dragonguard Lieutenant  3x Strafing Strike  2x Eclipse Dragon
    2x Aurelion Sol  2x Concerted Strike  1x Kadregrin the Ruined

Conclusion#

Because clustering is excruciatingly slow, I’m not really satisfied with this first approach. There has to be a faster algorithm that can handle this task.

But in the meantime, I’ll just rely on heuristics. For Legends of Runeterra, this can simply mean looking at champions and regions.
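
As a point of comparison, here is a minimal sketch of that heuristic, reusing the decks_data and get_card_type objects from earlier in this notebook: decks are simply grouped by their faction combination and the champions they play.

from collections import defaultdict

heuristic_archetypes = defaultdict(list)

for deck_code, deck in decks_data.items():
    champions = tuple(sorted({
        card[:-2] for card in deck.cards if get_card_type(card[:-2]) == "Champion"
    }))
    heuristic_archetypes[(deck.factions, champions)].append(deck_code)

print(f"Heuristic archetypes found: {len(heuristic_archetypes):,}")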

If you have any other ideas on how to implement clustering for card games in a more game-agnostic way, don’t hesitate to reach out! I’ll keep looking at the subject moving forward, even though this blog post, my first notebook-based one, already took me much more time than expected.