KM-301a · Module 3
AI-Assisted Taxonomy Generation
When a knowledge base has accumulated hundreds or thousands of items without a coherent taxonomy, building one manually is a significant project. AI-assisted category discovery — using clustering and embedding techniques to surface natural groupings — can dramatically reduce the manual effort. The key word is "assisted." AI surfaces candidate categories; humans decide which ones are taxonomically meaningful.
- Embed the Content Corpus: Generate vector embeddings for each item in the knowledge base using a text embedding model. Embeddings capture semantic similarity: items that discuss similar concepts have similar vectors. Embed titles, abstracts, or introductory paragraphs rather than full documents, because full-document embeddings pick up structural and formatting noise that obscures thematic similarity.
- Run Hierarchical Clustering: Apply hierarchical agglomerative clustering to the embedding space. Hierarchical clustering does not require you to fix the number of clusters in advance. It builds a dendrogram that you can cut at different depths to reveal coarse or fine-grained groupings. Cut near the root (few, large clusters) to see broad thematic areas; cut deeper (many, small clusters) to see subcategories. The dendrogram is a discovery tool, not a final taxonomy.
- Label and Validate Clusters: For each cluster, ask an LLM to summarize what the items have in common and propose a category name. Then validate: does the proposed name match how users think about this content? Are there items that clearly do not belong? Are any clusters too heterogeneous to have a coherent name? Cluster labels are hypotheses. Domain experts validate or reject them.
- Iterate on Cluster Resolution: Some clusters will be too broad; some will be too narrow. For over-broad clusters, run a second-pass clustering on just that cluster's items. For over-narrow clusters, merge similar clusters and find the more general label. The goal is not to follow the algorithm. It is to use the algorithm to surface groupings that human reviewers then accept, reject, or reshape.
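The idea of cutting one dendrogram at several depths can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it uses SciPy's `linkage` and `fcluster` so the tree is built once and cut repeatedly, and the function name `cut_dendrogram`, the threshold values, and the synthetic demo data are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_dendrogram(embeddings: np.ndarray, thresholds: list[float]) -> dict[float, np.ndarray]:
    """Build one average-linkage tree and cut it at several cosine-distance thresholds.

    Hypothetical helper for illustration; assumes no embedding is a zero vector.
    """
    # linkage() computes pairwise cosine distances and the full merge tree once.
    tree = linkage(embeddings, method="average", metric="cosine")
    cuts = {}
    for t in thresholds:
        # criterion="distance" keeps only merges at or below distance t,
        # so a larger t yields fewer, broader clusters.
        cuts[t] = fcluster(tree, t=t, criterion="distance")
    return cuts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for real embeddings: three topic directions plus small noise.
    demo = np.vstack([c + 0.05 * rng.normal(size=(10, 8)) for c in np.eye(3, 8)])
    for t, labels in cut_dendrogram(demo, [1.5, 0.3]).items():
        print(f"threshold={t}: {len(set(labels))} clusters")
```

The same helper could serve the second-pass step: pass only the rows belonging to an over-broad cluster and cut at a tighter threshold. Suitable thresholds depend on the embedding model and corpus, so treat any specific value as something to tune by inspection.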
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity


def discover_taxonomy_candidates(embeddings: np.ndarray, n_clusters: int = 20) -> dict:
    """
    Cluster knowledge base embeddings to surface natural category candidates.
    Returns cluster assignments and within-cluster similarity scores.
    """
    # n_clusters fixes one cut of the tree; vary it to explore coarser or finer groupings.
    # The `metric` argument requires scikit-learn >= 1.2 (older releases call it `affinity`).
    clustering = AgglomerativeClustering(
        n_clusters=n_clusters,
        metric='cosine',
        linkage='average'
    )
    labels = clustering.fit_predict(embeddings)

    clusters = {}
    for i in range(n_clusters):
        mask = labels == i
        cluster_embeddings = embeddings[mask]
        n_items = len(cluster_embeddings)

        # Coherence score: mean pairwise cosine similarity within cluster
        if n_items > 1:
            sim_matrix = cosine_similarity(cluster_embeddings)
            np.fill_diagonal(sim_matrix, 0)  # exclude self-similarity from the mean
            coherence = sim_matrix.sum() / (n_items * (n_items - 1))
        else:
            # A singleton has no pairs to compare; treat it as fully coherent
            coherence = 1.0

        clusters[i] = {
            'item_indices': np.where(mask)[0].tolist(),
            'size': int(mask.sum()),
            'coherence': round(float(coherence), 3),
            # Low coherence clusters are candidates for splitting
            'review_flag': bool(coherence < 0.65)
        }
    return clusters


# After clustering: pipe representative items per cluster to an LLM
# to propose category names, then route to domain expert for validation.
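One way to choose the "representative items" mentioned above is to take the members closest to each cluster's centroid and list their titles in a prompt. The sketch below is illustrative only: `pick_representatives`, `build_label_prompt`, the default `k`, and the prompt wording are hypothetical, and the actual LLM call is left out because it depends on the provider you use.

```python
import numpy as np


def pick_representatives(embeddings: np.ndarray, item_indices: list[int], k: int = 5) -> list[int]:
    """Return up to k item indices whose embeddings are closest (by cosine) to the cluster centroid.

    Hypothetical helper; assumes no embedding is a zero vector.
    """
    members = embeddings[item_indices]
    # Normalize rows so that a dot product equals cosine similarity.
    unit = members / np.linalg.norm(members, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    scores = unit @ centroid
    # argsort is ascending, so reverse it to put the most central items first.
    order = np.argsort(scores)[::-1][:k]
    return [item_indices[i] for i in order]


def build_label_prompt(titles: list[str]) -> str:
    """Assemble a plain-text labeling prompt; the wording is an example, not a tested template."""
    bullet_list = "\n".join(f"- {title}" for title in titles)
    return (
        "These knowledge base items were grouped together by clustering.\n"
        f"{bullet_list}\n"
        "Summarize what they have in common and propose one short category name."
    )
```

A cluster entry's `item_indices` list can be passed directly as the second argument. The returned indices map back to item titles, which then go into the prompt; the proposed name still needs review by a domain expert.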