KM-301a · Module 3

AI-Assisted Taxonomy Generation

5 min read

When a knowledge base has accumulated hundreds or thousands of items without a coherent taxonomy, building one manually is a significant project. AI-assisted category discovery — using clustering and embedding techniques to surface natural groupings — can dramatically reduce the manual effort. The key word is "assisted." AI surfaces candidate categories; humans decide which ones are taxonomically meaningful.

  1. Embed the Content Corpus. Generate vector embeddings for each item in the knowledge base using a text embedding model. The embeddings capture semantic similarity — items that discuss similar concepts will have similar embedding vectors. Use titles, abstracts, or introductory paragraphs for embedding rather than full documents. Full-document embeddings pick up structural and formatting noise that obscures thematic similarity.
  2. Run Hierarchical Clustering. Apply hierarchical agglomerative clustering to the embedding space. Hierarchical clustering does not require you to specify the number of clusters in advance — it builds a dendrogram that you can cut at different depths to reveal coarse or fine-grained groupings. Cut at a low depth to see broad thematic areas; cut at a higher depth to see subcategories. The dendrogram is a discovery tool, not a final taxonomy.
  3. Label and Validate Clusters. For each cluster, ask an LLM to summarize what the items in the cluster have in common and propose a category name. Then validate: does the proposed name match how users think about this content? Are there items that clearly do not belong? Are any clusters too heterogeneous to have a coherent name? Cluster labels are hypotheses. Domain experts validate or reject them.
  4. Iterate on Cluster Resolution. Some clusters will be too broad; some will be too narrow. For over-broad clusters, run a second-pass clustering on just that cluster's items. For over-narrow clusters, merge similar clusters and find the more general label. The goal is not to follow the algorithm — it is to use the algorithm to surface groupings that human reviewers then accept, reject, or reshape.
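The dendrogram cut in step 2 can be sketched with SciPy's hierarchy tools: build the tree once, then cut it at two resolutions. The embeddings below are synthetic stand-ins (three tight groups around orthogonal directions), not real document vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in corpus: three tight groups of 10 items in 8 dims,
# centered on orthogonal directions so cosine distance separates them
centers = np.eye(3, 8)
embeddings = np.vstack([
    c + rng.normal(scale=0.05, size=(10, 8)) for c in centers
])

# Build the dendrogram once (average linkage over cosine distance)...
tree = linkage(embeddings, method="average", metric="cosine")

# ...then cut the same tree at two resolutions
broad = fcluster(tree, t=3, criterion="maxclust")  # broad thematic areas
fine = fcluster(tree, t=9, criterion="maxclust")   # finer subcategories
```

Cutting the same tree twice is the point: the expensive step (building the dendrogram) happens once, and reviewers can compare coarse and fine groupings cheaply before committing to a category count.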
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def discover_taxonomy_candidates(embeddings: np.ndarray, n_clusters: int = 20) -> dict:
    """
    Cluster knowledge base embeddings to surface natural category candidates.
    Returns cluster assignments and within-cluster similarity scores.
    """
    clustering = AgglomerativeClustering(
        n_clusters=n_clusters,
        metric='cosine',  # named 'affinity' before scikit-learn 1.2
        linkage='average'
    )
    labels = clustering.fit_predict(embeddings)

    clusters = {}
    for i in range(n_clusters):
        mask = labels == i
        cluster_embeddings = embeddings[mask]

        # Coherence score: mean pairwise cosine similarity within cluster
        if len(cluster_embeddings) > 1:
            sim_matrix = cosine_similarity(cluster_embeddings)
            np.fill_diagonal(sim_matrix, 0)
            coherence = sim_matrix.sum() / (len(cluster_embeddings) * (len(cluster_embeddings) - 1))
        else:
            coherence = 1.0

        clusters[i] = {
            'item_indices': np.where(mask)[0].tolist(),
            'size': int(mask.sum()),
            'coherence': round(float(coherence), 3),
            # Low coherence clusters are candidates for splitting
            'review_flag': coherence < 0.65
        }

    return clusters

# After clustering: pipe representative items per cluster to an LLM
# to propose category names, then route to domain expert for validation.
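One way to implement that hand-off, as a sketch: pick each cluster's most central items as exemplars and pack their titles into a labeling prompt. The helper names and prompt wording below are assumptions for illustration, not a fixed API.

```python
import numpy as np

def representative_indices(embeddings: np.ndarray,
                           item_indices: list, k: int = 5) -> list:
    """Pick the k items closest to the cluster centroid (by cosine)."""
    sub = embeddings[item_indices]
    centroid = sub.mean(axis=0)
    # Normalize rows so dot products against the centroid are cosines
    sub_n = sub / np.linalg.norm(sub, axis=1, keepdims=True)
    cen_n = centroid / np.linalg.norm(centroid)
    order = np.argsort(sub_n @ cen_n)[::-1][:k]
    return [item_indices[i] for i in order]

def labeling_prompt(titles: list) -> str:
    """Assemble an LLM prompt that asks for a category-name hypothesis."""
    bullets = "\n".join(f"- {t}" for t in titles)
    return (
        "These knowledge-base items were grouped together by clustering:\n"
        f"{bullets}\n"
        "Propose a short category name, and flag any items that do not fit."
    )
```

Centroid-nearest exemplars keep the prompt short and on-theme, but the LLM's answer remains a hypothesis — per step 3, a domain expert accepts or rejects the proposed name.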