The causes of and impediments to corporate Machiavellianism

In Freda Utley’s fascinating memoir, Odyssey of a Liberal, she wrote about her youthful interest in Machiavelli:

“In my essay on Machiavelli, I argued that there was not really such a disparity as generally supposed between the Florentine’s advice to tyrants, as expressed in his “Prince,” and his eulogy of Republican Virtues in his “Commentaries on Livy” – the Roman classical historian. As I saw it, when fifteen years old, men are usually ready to condone, or even approve, actions taken by their state or country which they condemn when taken by an individual, so that what seemed admirable “virtue” in the Romans was regarded as wickedness in an individual Italian prince. I wish I still had this old essay of mine. All I can now remember is its main argument that Machiavelli’s precepts for Princes – his description of how tyrants maintain their power, which came to be called “Machiavellian,” – was not different in essence to the precepts and practices of the Roman Republic or modern nation states.”

That’s one incisive high school essay!

If certain unchanging principles apply to ancient Romans, Florentine princes, and modern nations, it isn’t a huge leap to believe that these apply to businesses too. On the other hand, corporate Machiavellianism is arguably even less popular than individual Machiavellianism. These two opposing factors help explain why corporations are only Machiavellian to a moderate degree. And your estimate of this degree is probably correlated with your level of disapproval of it.


On the unintended consequences of leadership training programs

On a weekly “open thread” for The Diff, I wrote:

Although the founders weren’t laid off, they founded Health-Ade Kombucha in 2012 while at GSK when it was going through a series of lay-offs. Daina Trout and Vanessa Dew were both in sales at GSK, and in the early 2010s it was struggling to adjust to competition from generics. GSK notoriously mismanaged the process in 2010 by laying off sales reps just days after promising no layoffs at their sales convention:

To improve morale and innovation, GSK started a rotational innovation / leadership program that Daina went through. However, after having this inspirational confidence-building experience, she was sent back to her old job in sales. The rotational program and repeated layoffs were both bad for employee morale by bringing “false dawns” followed by disappointment. Having said that, you might say that GSK’s choices did in fact build confidence and innovation, but only in ways that didn’t benefit GSK itself.

Daina Trout’s interview on the How I Built It podcast is great btw:

The demand-driven success of the American tech industry

Part 1: Why America Wins in the World of Bits

In The Diff, Byrne Hobart makes the following point:

Startup clusters don’t just happen in software: in Japan, there’s a growing number of [startups] in the materials and chemicals industries. Part of what creates a startup cluster is when there’s a big customer, or category of customers, that can be accessed by smaller ones. Silicon valley started out with a heavy defense slant, but eventually got plugged into the consumer economy. In Japan, the high value-added industrial export sector is relatively larger, so it’s a better category for new companies to sell into.

This makes me wonder whether the success of the American tech industry has actually been demand-driven. Even in the early days of the PC, the US had a large community of techie consumer hobbyists eager to spend money on a Mac 128K or IBM PS/2. This community also was happy to give feedback, and magazines like BYTE and PCMag also represented the “voice of the customer,” enabling and forcing PC makers to respond to consumer desires and complaints. In contrast, while Japan was a leading market the latest electronic hardware, neither Japan nor Europe had such a large number of early PC adopters.

This seems true for a lot of tech growth areas. In the early Internet-era, websites were generally able to extract more ad revenue (per impression and per click) from America than elsewhere, which meant the customer base of a web startup was concentrated in the US. Much of the demand for crypto coins and web3 products has come from libertarians, also a disproportionately-American population. The demand for enterprise software, SaaS, and cloud computing is disproportionately from tech startups which are setting up their tech stack and operations from a clean slate, and tech startups are also disproportionately American. And even big, established American companies (many of them grown-up-startups themselves) are disproportionately more open to trying out new enterprise systems.

Obviously, the US has also done well on the supply-side for tech startups (both producing founders and attracting immigrants who then become founders). But I think America’s lead as a market for new tech is an underappreciated aspect of our success.

Part 2: How to Win in the World of Bits — or Atoms

The uniqueness of the American tech market — large and profitable yet highly-competitive — suggests a non-intuitive path for foreign tech startups. If you’re a German government official or VC, it’s natural to think that the right initial niche for the Berlin tech ecosystem is to focus on meeting the unique needs and desires of German tech users, which Silicon Valley companies might tend to ignore. But if what really makes America special is that American customers are living in the future, you need to instead impose export discipline on your portfolio companies.

The preceding analysis offers an alternate explanation not only for why America wins at tech, but also for why tech wins in America. Peter Thiel has famously argued that innovators focus too much on the world of bits and not enough on the world of atoms, and that over-regulation further suppresses the supply of innovations in the world of atoms. But what if instead our problem is one of demand rather than supply?

Buying and selling in the world of bits is relatively depersonalized. Tech users and tech companies are low-loyalty: the only thing that matters is, “How much have you done for me lately?” Meanwhile, purchasing decision-makers in the world of defense, manufacturing, and infrastructure are heavily influenced by loyalty to longstanding suppliers and by personal relationships. They do not demand the greatest possible product at the best possible price, regardless of insider status. These sectors lack outsiders with new ideas and technical founders without glad-handing skills for a simple reason: such innovators are unwanted!

This also suggests a remedy for American founders in the world of atoms: go west! The most demanding, futuristic infrastructure customers are in Asia, so if you can make it there, you can make it anywhere — maybe even in New York. This is not patriotic advice, but product-market fit is a matter of survival for startups; you should compromise on market rather than on building a product you’re proud of.

The supply of innovation is driven by the existence of demand that is capable of choosing exit. A market consisting of customers willing and able to exit their existing relationships in favor of innovators will obtain the innovative products they desire, while simultaneously supporting a startup ecosystem. Aspiring innovators ought to follow the same path: they should exit and choose the customer base which demands excellence and innovation.

Excerpts from “On Some Tendencies in Geometric Investigations” by Corrado Segre

In 1904, Corrado Segre published On Some Tendencies in Geometric Investigations. In this essay, he sought to encourage young mathematicians, attracted to the then-trendy field of algebraic geometry, to pursue deeper investigations of important problems and to seek broader understanding of other, less popular, areas of mathematics. I have included excerpts from the English translation by J. W. Young (with minor edits), which I believe may be useful to fellow researchers in the now-trendy field of machine learning.


… Thus, owing to the exceeding growth of geometric research as well as of methods of transformation, grew also proportionately the facility mentioned by the great French geometer [Michel Chasles] of increasing without limit the number of new propositions, of generalizing and creating in geometry. And this facility which, at least apparently, is greater in this science than in analysis, in mathematical physics, etc., induces many young men, especially in Italy, to give geometry the preference over other branches of mathematics, some of which indeed are very sadly neglected by our students. 

But facility is a bad counsellor; and often the work to which it leads the beginner, while it may serve as training, as preparation for original research, will not deserve to see the light. In the innumerable multitude of scientific publications, geometric writings are not rare in which one would seek in vain for an idea at all novel, for a result which sooner or later might be of service, for anything in fact which might be destined to survive in the science. And one finds instead treatises on trivial problems or investigations on special forms which have absolutely no use, no importance, which have their origin not in the science itself but purely in the caprice of the author; or one finds applications of known methods which have already been made thousands of times; or generalizations from known results which are so easily made that the knowledge of the latter suffices to give at once the former; etc. 

Now such work is not merely useless: it is actually harmful because it produces a real incumbrance in the science and an embarrassment for more serious investigators, and because often it crowds out certain lines of thought which might well have deserved to be studied. Better, far better, that the student, instead of producing rapidly a long series of papers of such a nature, should work hard for a long time on the solution of a single problem, provided it is important: better one result fit to live than a thousand doomed to die at birth!1


But when is a question important? When does it deserve to be made the object of study? 

It is impossible to give a precise answer to this question. The importance of a result is largely relative, is judged differently by different men, and changes with the times and circumstances. It has often happened that great importance has been attached to a problem merely on account of the difficulties which it has presented; and indeed if for its solution it has been necessary to invent new methods, noteworthy artifices, etc., the science has gained more perhaps through these than through the final result. In general we may call important all investigations relating to things which in themselves are important; all those which have a large degree of generality, or which unite under a single point of view subjects apparently distinct, simplifying or elucidating them; all those which lead to results that promise to be the source of numerous consequences; etc. 

The study of the great masters is perhaps the best thing to recommend to the student who wishes to prepare himself to judge of the importance of problems. For it is precisely in the choice of these that the great minds have always shown themselves masters; and even when they have taken up very special problems, they have shown in what way those could become important. And here, in corroboration of the preceding, I will quote the words of Beltrami: “Students should learn to study at an early stage the great works of the great masters instead of making their minds sterile through the everlasting exercises of college, which are of no use whatever, except to produce a new Arcadia where indolence is veiled under the form of useless activity…. Hard study on the great models has ever brought out the strong; and of such must be our new scientific generation if it is to be worthy of the era to which it is born and of the struggles to which it is destined.” 

In such studies one should ever keep before him this other object: to broaden as much as possible his own knowledge. He who is interested only in works relating to the limited field which he is studying, will in the end give undue weight to questions which do not seem so important to another, who with broader knowledge looks at the subject from a higher point of view. It should be the aim of the student by an extended study of the best works in all branches, to attain a wider horizon with regard to the whole science.

1To students who are looking toward the doctorate in mathematics it is well to say frankly that science should not be thought of as a profession in which all can succeed. Though it be true that genius is no longer essential to produce useful results, still a certain aptitude is necessary; and he who knows himself to be without it should, with that veneration and sacrifice which science demands, renounce scientific research. Why should a young man, who could perhaps teach successfully the elementary mathematics and study thoroughly the numerous and important pedagogical questions which present themselves in his teaching, neglect such studies, in order to take up researches in higher mathematics which are not adapted to his type of mind?

Mapping between two Gaussians using optimal transport and the KL-divergence

Suppose you have two multivariate Gaussian distributions S and T, parameterized as N(\mu_S, \Sigma_S) and N(\mu_T, \Sigma_T). How do you linearly transform x \sim S so that the transformed vectors have distribution T? Is there an optimal way to do this? The field of optimal transport (OT) provides an answer. If we choose the transport cost as the type-2 Wasserstein distance W^2_2 between probability measures, then we apply the following linear function:

x \rightarrow \mu_T + A(x-\mu_S) = Ax + (\mu_T - A\mu_Q)


A =\Sigma^{-1/2}_S (\Sigma^{1/2}_S \Sigma_T \Sigma^{1/2}_S )^{1/2}\Sigma^{-1/2}_S = A^\top.

For more details, see Remark 2.31 in “Computational Optimal Transport” by Peyre & Cuturi (available on arXiv here).

But we might instead want to find the transformation which minimizes the Kullback-Leibler divergence between T and the transformed S. We will use the fact that the transformed vector will also come from a Gaussian distribution, with mean and covariance given by

E[Mx+b] = ME[x]+b = M\mu_S + b and Cov[Mx+b] = M Cov[x] M^\top = M\Sigma_S M^\top.

We then set up an optimization problem:

\min_{M, b} D_{KL}(N(\mu_T, \Sigma_T) || N(M\mu_S + b, M\Sigma_S M^\top))

This leads to the following nasty-looking objective:

\min_{M, b} \log(|M \Sigma_S M^\top|) - \log(|\Sigma_T|) + tr([M \Sigma_S M^\top]^{-1} \Sigma_T) + (M\mu_S + b - \mu_T)^\top [M \Sigma_S M^\top]^{-1} (M\mu_S + b - \mu_T)

But we don’t actually need to work through all this algebra, because the optimal transport solution also minimizes the KL-divergence. The KL-divergence D_{KL}(P || Q) reaches a minimum of 0 when P and Q are equal, so we only need to verify that the first optimal transport transformation produces samples with distribution T.

First checking the mean, we verify that E[\mu_T + A(x-\mu_S)] = \mu_T + A(\mu_S-\mu_S) = \mu_T. Next, checking the covariance, we have

Cov[\mu_T + A(x-\mu_S)] = A Cov[x] A^\top = A Cov[x] A = A \Sigma_S A = \\ \Sigma^{-1/2}_S (\Sigma^{1/2}_S \Sigma_T \Sigma^{1/2}_S )^{1/2}\Sigma^{-1/2}_S \Sigma_S \Sigma^{-1/2}_S (\Sigma^{1/2}_S \Sigma_T \Sigma^{1/2}_S )^{1/2}\Sigma^{-1/2}_S = \Sigma_T.

We’ve verified that \mu_T + A(x-\mu_S) \sim N(\mu_T, \Sigma_T), which means that our optimal transport solution also gives us the KL-divergence minimizer.

I’m using this fact in my ongoing research on domain adaptation under confounding. See the arXiv preprint here.

Mind-blindness strategy for Eliciting Latent Knowledge

The Alignment Research Center (ARC) has recently been studying the problem of Eliciting Latent Knowledge from AI agents. ARC has been holding contests to solicit proposals for solving this problem. What follows is my proposal to the February 15 version of the contest. [Update: this proposal received an honorable mention, and I’m definitely interested in exploring this further.]

Basic training strategy and why it might work

The overall strategy is to avoid training a “human simulator” reporter by regularizing its internal state to have mind-blindness. One could imagine training a “Human Simulator” that takes as input the “what’s going on” state, plus a question about what a human believes about the world, and is trained to maximize its accuracy at predicting the human’s answer explanation. Meanwhile, the Reporter is trained with dual objectives: in addition to (as before) answering questions about the world correctly, it should be penalized for containing internal state that is useful to the Human Simulator. To do this, an auxiliary “Human Simulator Detector” model is trained to predict the Human Simulator’s activations from the Reporter’s activations. The Reporter is trained adversarially to minimize the Human Simulator Detector’s accuracy. The reporter would be trained not merely to be bad at predicting the explanation — an evil reporter could pretend to be bad at predicting the explanation — but to not contain state that is predictive of “human simulator state”. This is something that the various defeated strategies are missing. A devious reporter can easily hide its full knowledge from its predicted answer, and may even choose weights to look like it does not contain a human simulator. But we argue that when operating on actual input data, a devious reporter will find it more challenging to run a simulation while pretending not to. 

This strategy is depicted below: 

More precise detail on strategy

The Reporter’s overall loss function to minimize takes the form: 


= reporter_prediction_loss – alpha * human_simulator_detector_loss.

How to find a strategy for choosing the hyperparameter alpha is an open question. But, if the Reporter can do just as well at reporting with and without containing a human simulator, then any positive alpha would be sufficient to choose not to contain a human simulator.

The Human Simulator assumes a form of data augmentation that allows us to train a model that knows how humans would interpret their observations of the world. In one embodiment depicted above, the human labeler will provide not only an answer, but an answer explanation. The Human Simulator would have the same inputs as the Reporter, but would predict this answer explanation rather than the answer. (Other possibilities can be envisioned. For example, the Human Simulator could be trained to predict both the answer and the answer explanation.)

The Human Simulator Detector takes in the Reporter’s internal state (activations) as input, and tries to predict the Human Simulator’s internal state (activations) as input (by training with MSE loss, for example). The Human Simulator Detector actually tries to minimize the 


= simulator_state_mse_loss + gamma * || human_simulator_detector.weights ||^2,

where the latter term can be replaced with some standard approach to minimizing its model complexity. If the Reporter contains a human simulator within itself identical to the Human Simulator, then the Human Simulator Detector will be able to predict the Human Simulator state by simply selecting such states (ie it will be the identity function for the portion of the Reporter that is its human simulator). Note that if the Reporter contains its human simulator activations in a permuted order from the Human Simulator, and if the Human Simulator Detector is a single-layer feedforward network, then it could simply learn a permutation matrix rather than the identity matrix.

On the other hand, if the Reporter does not contain a human simulator within itself, the Human Simulator Detector will either have high model complexity (simulating a human presumably requires a complex model — especially because it is not given “what’s going on” as input, but instead on the Reporter’s internal state, which contains less information according to the Data Processing Inequality), or it will have high simulator_state_mse_loss. Thus, the Reporter can be trained in an adversarial fashion, minimizing its own reporter_loss, while maximizing the human_simulator_detector_loss.


While I do not have a specific counterexample scenario that defeats this strategy, a “recipe” for a successful counterexample can be abstractly described. The scenario must be on in which it is easier for the Reporter to both learn its own simulator while also learning to disguise its simulator from the Human Simulator Detector, rather than learn to be an accurate reporter without learning a simulator. It would seem that standard regularization techniques applied to the Reporter would prefer the latter. So a counterexample would be one in which an accurate reporting would be difficult or impossible without human simulation.

Besides the possibility of counterexamples, this strategy could fail if one is unable to successfully train a Human Simulator Detector in the first place. To succeed we need to solve two problems that appear to be solvable yet practically challenging: training the Human Simulator, and training the Human Simulator Detector using the Human Simulator. 

With regards to training the Human Simulator, the key challenge is obtaining labeled data that forces a model to not only reason about the video input, but also reason about how humans reason about the video input. 

With regards to training the Human Simulator Detector, the model will have to map from a large input space to a large output space. However, because n_activations ~ sqrt(n_weights) in a feedforward neural network, the Human Simulator Detector would probably require roughly the same number of weights as the two other networks. We assume that the Human Simulator Detector can be trained to be permutation invariant with respect to Reporter activations. This is not as hard as it looks: as noted in the previous section, so long as the permutation of activations is the same across samples, then undoing this is a sparse linear transformation. If the permutation of activations varies among samples, then this would be harder.

DRYAD: an evolutionary alternative to DRY (“don’t repeat yourself”)

A growing number of programmers have observed that the object-oriented programming (OOP) dogma DRY (“don’t repeat yourself”) is sometimes actually harmful. A number of alternatives have been proposed, all of which involve thinking clearly and carefully. Think about the right level of abstraction at which shared implementation makes sense. Think about avoiding shared logic and ideas, rather than shared code. Think about bounded contexts, and avoid straying across them.

These are all great ideas, but they all require careful thought. Unfortunately, in many situations developers lack the ability (especially places that had massively duplicated code pre-DRY) or time (especially research settings) to think through these delicate questions.

Evolution offers a useful analogy to think through this problem in such settings. Genetic evolution enables adaptability through duplication. Quantitative traits are encoded across hundreds or thousands of parallel genetic variants. These polygenic traits are able to evolve much more rapidly than monogenic traits where a single genetic variant underlies all variation in a population. Sexual reproduction breaks up highly-correlated genetic variants (“linkage disequilibrium”) so that they can evolve independently to assist the fitness of a species. And the code for these traits is not only duplicated within a single individual’s genetic code: they are duplicated among all the individuals in a population.

Deep neural networks also evolve via a learning algorithm, such as stochastic gradient descent, and they utilize duplication to adapt their weights. In recent years, it has been discovered that wide “overparameterized networks”, with more activations than the number of samples, have better learning dynamics. This extra capacity is not strictly necessary to represent complex functions, but networks are more easily able to move their weights to such functions with this overcapacity.

In software development, when we tolerate duplicated code rather than force shared implementation, each copy can evolve in its natural direction, unencumbered by other use cases. Of course, this approach has a cost: code bloat. Fortunately, evolution offers us a solution: pruning. In biological evolution, natural selection filters out harmful genetic variants. And increasingly, pruning is used in deep learning to speed up inference of over-parameterized networks. Notably, pruning in both evolution and deep learning works very well with even simple strategies. It is often hard to find an accurate neural net, but once you have one, it’s easy to find hidden units that can be discarded. The evolutionary alternative to DRY requires only discipline rather than forethought: do repeat yourself and deduplicate (DRYAD).

DRYAD is a useful approach in R&D as well, at both an individual level and team level. When getting started on a new research project, it’s useful to copy-and-paste old code, recklessly changing it to meet one’s needs. As the project matures, one can readily identify duplicated code, and then start merging them. Similarly, it is hard for a development team to figure out and agree in advance on which code logic can rely on shared implementation. Yet, so long as all programmers have the discipline to “clean up after themselves” — to look out for old duplicated code and remove it — code quality can be maintained. Code deduplication (while it may not be fun) is actually intellectually easy, because it comes with the benefit of hindsight.

The Newton Schulz iteration for matrix inversion

The Newton Schulz method is well-known, and the proof of convergence is widely available on the internet. Yet the derivation of the method itself is more obscure. Here it is:

We seek the zero of f: \mathbb{R}^2 \rightarrow \mathbb{R}^2, defined as follows:
\begin{array}{l} f(X) = X^{-1} - A. \end{array}

The derivative of f at X, applied to matrix B, operates on B as follows:
\begin{array}{l} f'(X)[B] = -X^{-1} B X^{-1}. \end{array}

We can prove that f'^{-1} at X, applied to matrix B, operates on B as follows:
\begin{array}{l} f'^{-1}(X)[B] = -X B X. \end{array}

To see this, notice that
\begin{array}{l} B \\ = f'^{-1}(X)\Big[f'(X)[B]\Big] \\ = -X \Big[-X^{-1} B X^{-1}\Big] X \\ = B. \end{array}

The Newton method for root finding has at each iterate:
\begin{array}{l} X_{t+1} \\ = X_t - f'^{-1}(X_t)\Big[f(X_t)\Big] \\ = X_t - f'^{-1}(X_t)\Big[X^{-1} - A\Big] \\ = X_t - X_t[X^{-1}_t-A] X_t \\ = X_t - [-X_t + X_t A X_t] \\ = 2 X_t - X_t A X_t \end{array}

Thresholding sparse matrices in Matlab

Here are the methods I tried:

function [tA] = hard_threshold(A, t)

    tA = sparse(size(A));
    tA(abs(A) >= t) = A(abs(A) >= t);

    clear tA;
    tA = A;
    tA(abs(tA) < t) = 0;

    clear tA;
    tA = A;
    find_A = find(A);
    find_tA = find(abs(A) >= t);
    victim_tA = setdiff(find_A, find_tA);
    tA(victim_tA) = 0;

    fprintf('numel(A):%i nnz(A):%i nnz(tA):%i \n', numel(A), nnz(A), nnz(tA)');

I first tried a small sparse matrix with 100k elements, 1% sparsity, removing 50% of nonzeros:

A = sprand(1e5,1,0.01); tA = hard_threshold(A, 0.5);
Elapsed time is 0.128991 seconds.
Elapsed time is 0.007644 seconds.
Elapsed time is 0.003038 seconds.
numel(A):100000 nnz(A):995 nnz(tA):489

I next repeated with 1m elements:

A = sprand(1e6,1,0.01); tA = hard_threshold(A, 0.5);
Elapsed time is 15.456836 seconds.
Elapsed time is 0.082908 seconds.
Elapsed time is 0.018396 seconds.
numel(A):1000000 nnz(A):9966 nnz(tA):5019

With 100m elements, excluding the first, slowest, method:

A = sprand(1e8,1,0.01); tA = hard_threshold(A, 0.5);
Elapsed time is 16.405617 seconds.
Elapsed time is 0.259951 seconds.
numel(A):100000000 nnz(A):994845 nnz(tA):498195

The time differential is about the same even when the thresholded matrix is much sparser than the original:

A = sprand(1e8,1,0.01); tA = hard_threshold(A, 0.95);
Elapsed time is 12.980427 seconds.
Elapsed time is 0.238180 seconds.
numel(A):100000000 nnz(A):995090 nnz(tA):49950

The second method fails due to memory constraints for really large sparse matrices:

Error using < 
Requested 1000000000x1 (7.5GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may
take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information. Error in hard_threshold (line 10)
 tA(abs(tA) < t) = 0;

After excluding the second method, the third method gives:

A = sprand(1e9,1,0.01); tA = hard_threshold(A, 0.5);
Elapsed time is 1.894251 seconds.
numel(A):1000000000 nnz(A):9950069 nnz(tA):4977460

Are there any other approaches that are faster?