There are lots of caveats with hierarchical clustering, and it’s often used to draw unjustifiable conclusions. But I’ll save that discussion for later, and it’s at least occasionally useful. 🙂 So far, I’ve mainly used it to reorder the columns/rows of a covariance or correlation matrix. For example, I recently generated synthetic sparse precision matrices, and I wanted to make sure that the corresponding covariance/correlation matrices were as I expected.

However, linkage/dendrogram in both Matlab and SciPy are really slow. In particular, the linkage algorithms they use are O(n^3). So instead we can use fastcluster, which is O(n^2). All I had to do was replace this:

Z = sch.linkage(P, method="average", metric="correlation")

with this:

Z = fastcluster.linkage(P, method="average", metric="correlation")

The second problem is that dendrogram does lots of unnecessary work when all I want is the reordered indices. SciPy actually implements it recursively, so it gives this error: “RuntimeError: maximum recursion depth exceeded”. So the second change is to replace this:

dendr = sch.dendrogram(Z,no_plot=True,distance_sort=True)

ix = dendr["leaves"]

with this (credit to a good denizen of StackOverflow):

n = len(Z) + 1

cache = dict()
for k in range(len(Z)):

c1, c2 = int(Z[k][0]), int(Z[k][1])

c1 = [c1] if c1 < n else cache.pop(c1)

c2 = [c2] if c2 < n else cache.pop(c2)

cache[n+k] = c1 + c2

ix = cache[2*len(Z)]

Then it’s as simple as:

Pclust = P[ix,:]

Pclust = Pclust[:,ix]

pyplot.imshow(Pclust, interpolation="nearest")