
· 2 min read
Gaurav Yadav

In the following report, we examine the strategies Taiwan may employ to protect both itself and its semiconductor industry.

We discuss four strategies:

1. Scorch the Fabs: Taiwan threatens to destroy, or destroys, its fabs to deter an invasion or to deny China their use. We conclude that this approach faces limited and likely intractable credibility problems.

2. Sabotage the Fabs: Taiwan exploits vulnerabilities in the semiconductor manufacturing process to subtly impair key fabrication equipment. We suggest this as a last-resort strategy and consider it a more credible alternative to demolition.

3. Boatlift Key Staff: Evacuating personnel who hold tacit knowledge could deny China effective control of the fabs. Success hinges on the evacuating party's ability to accurately identify irreplaceable personnel and on those personnel's willingness to return if needed. Given these logistical challenges, we consider this course of action unlikely.

4. Indirect Protection of Semiconductor Assets: Sanctions would probably be imposed following an invasion and would be economically damaging for all involved, but we argue that such efforts are unlikely to be effectively maintained. Likewise, while an invasion would violate international law, this is likely to be disregarded and unlikely to deter Chinese aggression.

Our report represents an attempt to map out the consequences of a Chinese invasion of Taiwan, a topic that appears to be neglected despite its clear significance to the future of AI research.

· 2 min read
Lucy Farnik

In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
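As a rough illustration of the general recipe behind this kind of reward comparison (canonicalise each reward to remove potential shaping, normalise, then take a distance between the results), here is a minimal sketch for tabular rewards. The specific canonicalisation (a least-squares projection that removes shaping terms of the form gamma*phi(s') - phi(s)) and the function names are illustrative assumptions, not the exact construction analysed in the paper.

```python
# Toy STARC-style comparison of two tabular reward functions R[s, a, s'].
# Recipe: 1) canonicalise away potential shaping, 2) normalise to unit norm,
# 3) measure the distance between the standardised rewards.
import numpy as np

def canonicalise(R, gamma=0.9):
    """Project out potential-shaping components gamma*phi(s') - phi(s) from R."""
    n_s, n_a, _ = R.shape
    # Columns of B span the shaping subspace, one column per state potential.
    B = np.zeros((n_s * n_a * n_s, n_s))
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                idx = (s * n_a + a) * n_s + s2
                B[idx, s2] += gamma
                B[idx, s] -= 1.0
    r = R.reshape(-1)
    phi, *_ = np.linalg.lstsq(B, r, rcond=None)
    return (r - B @ phi).reshape(R.shape)   # orthogonal complement of shaping

def starc_distance(R1, R2, gamma=0.9):
    """Toy STARC-style pseudometric between two reward tensors."""
    c1, c2 = canonicalise(R1, gamma), canonicalise(R2, gamma)
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    s1 = c1 / n1 if n1 > 0 else c1   # rewards equivalent to zero map to zero
    s2 = c2 / n2 if n2 > 0 else c2
    return np.linalg.norm(s1 - s2)

# A reward and a potential-shaped copy of it should have distance close to 0.
rng = np.random.default_rng(0)
R = rng.normal(size=(4, 2, 4))
phi = rng.normal(size=4)
shaped = R + 0.9 * phi[None, None, :] - phi[:, None, None]
print(starc_distance(R, shaped))                      # approximately 0
print(starc_distance(R, rng.normal(size=(4, 2, 4))))  # typically much larger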

· 2 min read
Aidan Ewart

One of the roadblocks to a better understanding of neural networks’ internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
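To make the approach concrete, here is a minimal sketch of the kind of sparse autoencoder described above: a single hidden layer with many more features than input dimensions, trained to reconstruct model activations under an L1 sparsity penalty. The hyperparameters, sizes, and training-loop details below are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sparse autoencoder sketch for decomposing language-model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def train_step(sae, acts, optimiser, l1_coeff=1e-3):
    """One optimisation step on a batch of activations of shape [n, d_model]."""
    x_hat, f = sae(acts)
    recon_loss = ((x_hat - acts) ** 2).mean()   # reconstruction error
    sparsity_loss = f.abs().mean()              # L1 penalty encourages sparsity
    loss = recon_loss + l1_coeff * sparsity_loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

d_model, expansion = 512, 8                 # illustrative sizes
sae = SparseAutoencoder(d_model, expansion * d_model)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, d_model)           # stand-in for real model activations
print(train_step(sae, acts, opt))
```

In practice the activations would be collected from a fixed layer of the language model over a large text corpus, and the learned dictionary of decoder directions is what gets inspected for interpretable, monosemantic features.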

· One min read
Gaurav Yadav

This report aims to develop a forecast for an open question from the analysis 'Prospects for AI safety agreements between countries' (Guest, 2023): Is there sufficient time for a 'risk awareness moment' to occur in both the US (along with its allies) and China before an international AI safety agreement can no longer meaningfully reduce extinction risks from AI?

Bottom line: My overall estimate/best guess is that there is at least a 40% chance there will be adequate time to implement a CHARTS agreement before it ceases to be relevant.