There are lots of caveats with hierarchical clustering, and it’s often used to draw unjustifiable conclusions. But I’ll save that discussion for later, because it’s at least occasionally useful. 🙂 So far, I’ve mainly used it to reorder the rows/columns of a covariance or correlation matrix. For example, I recently generated synthetic sparse precision matrices, and I wanted to make sure that the corresponding covariance/correlation matrices looked as I expected.
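Something along these lines works (a minimal sketch; the exact recipe is beside the point, and P here is the correlation matrix that the snippets below operate on):

import numpy as np

rng = np.random.default_rng(0)
d = 100
# Sparse symmetric factor; A @ A.T + I is positive definite, and its
# sparsity pattern loosely follows A's.
A = rng.normal(size=(d, d)) * (rng.random((d, d)) < 0.05)
prec = A @ A.T + np.eye(d)
# The covariance is the inverse of the precision; normalize to correlation.
cov = np.linalg.inv(prec)
s = np.sqrt(np.diag(cov))
P = cov / np.outer(s, s)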
However, there are two problems with the stock linkage/dendrogram routines. The first is that linkage is really slow in both Matlab and SciPy: the algorithms they use are O(n^3). fastcluster provides a drop-in replacement that runs in O(n^2), so all I had to do was replace this:
Z = sch.linkage(P, method="average", metric="correlation")
with this:
Z = fastcluster.linkage(P, method="average", metric="correlation")
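For reference, the imports these snippets assume (sch is just the usual alias for SciPy’s hierarchy module):

import scipy.cluster.hierarchy as sch
import fastcluster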
The second problem is that dendrogram does a lot of unnecessary work when all I want is the reordered leaf indices. Worse, SciPy implements it recursively, so on large inputs it fails with “RuntimeError: maximum recursion depth exceeded”. So the second change is to replace this:
dendr = sch.dendrogram(Z,no_plot=True,distance_sort=True)
ix = dendr["leaves"]
with this (credit to a good denizen of StackOverflow):
n = len(Z) + 1                 # number of original observations (leaves)
cache = dict()                 # cluster id -> list of the leaves it contains
for k in range(len(Z)):
    # Row k of Z merges clusters c1 and c2 into a new cluster with id n + k.
    c1, c2 = int(Z[k][0]), int(Z[k][1])
    # Ids below n are leaves; larger ids refer to earlier merges in the cache.
    c1 = [c1] if c1 < n else cache.pop(c1)
    c2 = [c2] if c2 < n else cache.pop(c2)
    cache[n+k] = c1 + c2
ix = cache[2*len(Z)]           # the final merge contains every leaf, in order
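One caveat: this walks the merges bottom-up in the order they appear in Z, so it won’t necessarily reproduce the distance_sort=True ordering of the dendrogram call it replaces. That doesn’t matter here, since any ordering that keeps each cluster’s leaves contiguous works for reordering a matrix. A quick sanity check that ix is a permutation of all the row indices:

assert sorted(ix) == list(range(len(P)))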
Then it’s as simple as:
from matplotlib import pyplot

Pclust = P[ix,:]        # reorder the rows
Pclust = Pclust[:,ix]   # reorder the columns to match
pyplot.imshow(Pclust, interpolation="nearest")
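As an aside, the two indexing steps can be fused with np.ix_, which builds the row/column mesh in one go:

Pclust = P[np.ix_(ix, ix)]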