The date today is September 28th, 2023.
17 months ago DALL·E 2 was released. It was the first impressive and useful text-to-image AI model: a generative machine learning image model capable of generating images from textual descriptions.
It has been 13 months since Stable Diffusion v1.5 was first released, which was the first open source version of a high quality text-to-image generator.
It has been 10 months since the first release of ChatGPT (based on GPT3.5), the first impressive and useful chatbot: a generative machine learning language model capable of generating text and following user instructions.
It has been only 5 months since GPT4 was released, which is currently the state-of-the-art generative text model.
And it was only 1 month ago that LLaMA2 was released, the first GPT3.5 level open source language model.
In the upcoming several weeks, GPT4 will be augmented with vision capabilities and tight integration with DALL·E 3. This should be the first very large and highly capable multimodal vision-text chatbot.
If these names don’t mean anything to you, you are probably not in the right place.
The world of economically useful machine learning models has gone into high gear in the past ~1.5 years, with new capabilities added by multiple companies and research organizations on a monthly and weekly basis. These models are so impressive in their capabilities that I think it’s no longer prudent to call them just machine learning models. For the first time, in my opinion, they truly deserve the term AI, as they possess some form of fairly general and flexible intelligence as well as non-negligible creative capabilities.
Since the world is changing at such a fast pace, it’s very interesting to ask - where is this going?
For those of us who are thinking about what to do in the upcoming years and what to work on next, the question of “where the puck is going to be” is particularly important, as the answer might have substantial ramifications on our decisions.
This article is my attempt at describing the obvious next steps in AI research.
In the past several years, I personally have played around with and hacked extensively various computer vision and generative image models (and even trained a couple..), so although I was surprised every couple of months by the faster-than-expected pace of improvement in image modeling capabilities, I kinda saw it coming.
But the capabilities of GPT4? I most certainly did not see coming.
I was extremely surprised by that.
In my defense, it appears that almost all other machine learning practitioners that I know were caught by surprise as well (the clear caveat here is that I personally don’t know any ML folks from OpenAI or DeepMind).
The generality and capability of GPT3.5 and later GPT4 got my head racing for a while there, and I started shifting my focus to playing around with LLMs, and became more interested in LLM literature.
Recently, I feel the dust has finally settled in my head: I’ve made some sense of what is going on and what the near future will look like, and I wanted to share it.
I’ve organized the “things that are likely to happen soon” into 7 points below. After presenting these points I go into several paragraphs of pondering mode and ponder what all this means from an AGI-timelines perspective.
In particular, I ask - and attempt to answer - which major sources of uncertainty still remain, as opposed to what is merely the deterministic temporal dynamics of currently known and well understood trajectories.
Disclaimer/Warning: Note that this is not a grocery list. Rather, I’ve attempted to help anyone who reads this reach the conclusions I’ve reached myself - so I sometimes build an argument and often refer to the literature. I fully understand that although these points now sit inside my head as self-evident and completely obvious, they might not be so for all readers. Also note that this is an attempt to be somewhat compact, so many pieces of evidence and intermediate reasoning steps are left out for the sake of brevity.
Obvious next steps in AI research, in no particular order:
Multimodal capabilities natively inside the weights of a single model
The GATO model by DeepMind confirmed an interesting hypothesis - that a single transformer model can “pack” a huge number of unrelated tasks within itself. In that work, DeepMind inserted into a single tiny 1.2B-parameter transformer “language” model (with a measly 1024-token context window) the abilities to play Atari, control robots, chat, answer questions about images (and more…) by generating data from various specialist models and training the transformer to mimic them. This model was able to reach, on average, about 60% of the performance of the specialist models from which the training data was acquired. The 60% figure might not seem that impressive, but a 1.2B model is also tiny compared to typical language model sizes of today. The fact that a single transformer model can pack a whole lot of unrelated capabilities inside of it makes intuitive sense that we can all appreciate by just interacting with GPT. For example, the capability of writing poems in the style of Rudyard Kipling, and the ability to remember all the photo-receptor types in the mantis shrimp or the precise details of its punch strength, seem largely unrelated - despite being communicated in the same “textual modality”. Additionally, GPT’s abilities to write the byte sequence of the JPEG file format, or the dots-and-dashes of Morse code, or to play chess, are all basically independent capabilities inserted into GPT simply by being easily converted into sequences of characters and training one large transformer on all of these inputs.
Current-day non-generative computer vision technology has reached an extremely advanced state in the past couple of years, although it is somewhat fragmented. Today there are different models that can detect objects in images (Grounding DINO, YOLOv8), segment objects in images (Segment Anything), track objects in videos (Track Anything, DEVA), and recognize objects in images (OpenCLIP). Image generation capabilities are also very advanced (SDXL, Midjourney). All of these models pack immense understanding of the visual world inside them, and their total parameter count combined is less than 10B. Therefore, simply inserting the capabilities of these models inside a single ~200B-parameter language model via brute force doesn’t appear to be too difficult. Indeed, DALL·E 1 and Parti, with 12B and 20B parameters respectively, managed to “insert” state-of-the-art image generation capabilities inside a single transformer. Models such as BLIP2 and Flamingo (+Open Flamingo) demonstrate that this also succeeds at intermediate scale as part of a multimodal chat language model. Many additional works have attempted to pack all multi-modal capabilities inside a single model: Kosmos, CM3Leon, NExT-GPT, PaLM-E, etc. are all proofs of concept at various levels of maturity. Soon both OpenAI and DeepMind will be releasing large multimodal models - GPT4 image inputs should be available in the upcoming several weeks, along with tight DALL·E 3 integration, and DeepMind’s Gemini model should be released in a few months. With these two large multi-modal models in the hands of millions, we will get to examine the capabilities frontier of the first truly large multi-modal models.
There appear to be two main alternatives for how multi-modal models will eventually be trained. The first is the CM3Leon way, in which every modality is converted into a sequence of characters and a single “language” model is trained on a large corpus of text that also happens to contain sequences of characters of visual or auditory origin. The second is the NExT-GPT way: using pre-trained encoders and decoders for each modality as well as a pre-trained language model, and training only several adapter layers between all of these components. The “sequence of characters” approach appears slightly less efficient in the short term, but might prove to be the most scalable and performant in the long term. Using pre-trained encoders and decoders for each modality and training adapters seems more like a quick fix to get results by utilizing already pre-trained models than the end solution.
If we delve a bit deeper, the first approach uses a modality2string() encoding of new modalities into a sequence of characters, trains a regular LLM on that sequence, and uses string2modality() as a decoding mechanism when displaying the modality to the end user at the application layer (think of {image2string(), string2image(), audio2string(), string2audio(), …} as utility functions applied in every browser or any other application that interacts with humans).

Although the NExT-GPT example is not as clean or elegant as CM3Leon, it’s possible to generalize that approach into having multiple modality2vec() and vec2modality() functions, and using a “sequence of vectors” approach to model training, where the model attempts to predict the next vector in the sequence. This, by the way, can also include a text2vec() encoder and vec2text() decoder, making sure that small text chunks are also encoded as vectors - and getting rid of tokens altogether this way.
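To make the first approach a bit more tangible, here is a minimal toy sketch of what the modality2string()/string2modality() utilities could look like. The base64 serialization below is purely my own illustrative assumption - real systems would typically use a learned discrete tokenizer (e.g. VQ codes) rather than raw bytes - but it shows the shape of the interface: any modality becomes a character sequence that can sit inside a text corpus and gets decoded back only at the application layer.

```python
import base64

def image2string(image_bytes: bytes) -> str:
    """Encode raw image bytes as a character sequence an LLM could be trained on."""
    return "<img>" + base64.b64encode(image_bytes).decode("ascii") + "</img>"

def string2image(s: str) -> bytes:
    """Decode the character sequence back into image bytes for display."""
    return base64.b64decode(s.removeprefix("<img>").removesuffix("</img>"))

fake_image = bytes(range(16))          # stand-in for a real JPEG/PNG payload
as_text = image2string(fake_image)     # this string can live inside a plain text corpus
assert string2image(as_text) == fake_image
```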
There are many advantages to the “sequence of vectors” approach, but it involves a still-missing algorithmic piece: the ability to learn a probability distribution over predicted vectors instead of outputting the mean or mode of that distribution via MSE- or MAE-type losses. Because this piece is still missing, and the method of predicting the next discrete token is firmly in place (and working extremely well at the moment - several services have even sprung up that just require you to upload a text corpus and they spit out a model trained on it), it looks like the first method is the one that will produce the best results in the near future. Another interesting perk of the sequence-of-characters method is that contributing to the knowledge of all major language models becomes possible simply by constructing a “sequence of characters” dataset (AKA a text file) and uploading it to the internet (AKA a large pile of text files).
To conclude, a single pure multi-modal “language” model that is able to both generate and perceive visual and auditory modalities, and that was trained end to end so that its internal language representations are jointly developed with and strongly influenced by its visual/auditory representations, will add genuinely new capabilities to these models: understanding graphs and charts, understanding speech nuances, understanding sounds created by the environment (rain, a dog barking, a garbage truck, etc.).
The perception space of images and sounds is vastly greater than that of pure text. But more importantly, the action space of producing any image or any sound is even more substantial, since it enables a huge number of new applications to be performed by these models.

Larger “natural” datasets
It is commonly stated that current models are trained on “all of the internet” and therefore data will soon be exhausted. But the reality is that ~1T tokens (a typical dataset size for current LLMs) is only about 1B web pages and can fit on a hard drive of a few terabytes. Every academic paper contains somewhere between 10K-100K text tokens, and every year ~3M papers are published. From pure academic papers alone, we get around ~100B extremely high quality tokens each year. If we also include the diagrams and charts in those papers (which contain most of the actual results of most papers, rather than the interpretation of the results that lives in the pure text), it becomes more than that - perhaps 1-2 additional orders of magnitude of data. Books are an order of magnitude larger than the pure text in academic papers: about 1M books are published every year, each with around 100K-1M tokens, totaling ~1T high quality tokens each year. If we take into account the multimodal capabilities discussed in the previous point, let me also mention all the movies and TV shows produced every year, which include a vast amount of documentary and educational content, as well as YouTube, with its vast amount of educational and tutorial videos. Once we have native multimodal capabilities, the “natural” datasets become much, much larger and are very far from exhausted.
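For the skeptical reader, here is the back-of-envelope arithmetic behind these numbers, spelled out explicitly (the per-item token counts are my own rough assumptions within the ranges quoted above, not measured figures):

```python
# Rough yearly "natural data" estimates, using midpoints of the ranges quoted above.
papers_per_year, tokens_per_paper = 3_000_000, 30_000        # 10K-100K range
print(f"papers: ~{papers_per_year * tokens_per_paper / 1e9:.0f}B tokens/year")   # ~90B

books_per_year, tokens_per_book = 1_000_000, 300_000         # 100K-1M range
print(f"books: ~{books_per_year * tokens_per_book / 1e12:.1f}T tokens/year")     # ~0.3T

# And the size of a typical current pre-training corpus on disk:
corpus_tokens, bytes_per_token = 1e12, 4                      # ~1T tokens, a few bytes each
print(f"corpus: ~{corpus_tokens * bytes_per_token / 1e12:.0f} TB on disk")       # a few TB
```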
Also, natural datasets have an interesting quality to them - they tend to contain relatively little repetition. Let me clarify: a scientific paper that is not novel in at least some aspect will not be written. A book that tells an identical story to a book that was already written will not be written again. A movie or a TV show will not be remade unless there is at least a tiny novel component or some quality improvement in it (although some Hollywood studios are really stress-testing this assumption in recent years). Note that this is always in accordance with humanity’s sense of the required amount of quality and novelty. For example, if a scientific paper illustrating an idea is published, a second paper about the exact same idea will only be written if the first paper did a poor job of explaining or elaborating it for the sake of “humanity”. If “humanity” understood it the first time and already “took it in”, a subsequent paper will not be written. Put in different words, if a work was not clear enough, it will be elaborated upon in subsequent works. Did you notice how I just wrote the exact same thing multiple times in different ways? This same thing happens in science and literature and art all the time, particularly in the “hard to understand” areas.
To conclude, newly generated datapoints created by humans are typically either novel or of higher quality than what currently exists, or both. This is generally true for scientific works, literature, film, art, etc. There are quite a lot of humans on the planet currently, and some non-tiny fraction of them produce new datapoints every year in various forms. The data quantity is actually quite large if we think about images, audio and video, and the data quality continues to improve each year.
Also, 1B web pages is not actually “all of the data on the internet”, it’s just some small fraction of it (current estimates put the total at about 50B pages), and I’m not even mentioning all the private WhatsApp chats and Facebook groups (well, here, I mentioned it).

Force feeding models with information via synthetic datasets
Most of us implicitly assume that “trained on all of the internet” is a guarantee that the models will contain a lot of explanations and examples on all possible topics. This is not actually the case for things that require only a little bit of explanation for humans, like logical reasoning or math. The reality is that the internet is built for humans. Humans are extremely sample efficient, and need only a few examples to understand how to add numbers or perform logical derivations. Current transformer models are vastly less data efficient than humans, and so these models are likely extremely under-trained on all tasks that humans can grasp from only a few examples and a little bit of accompanying explanation, since the internet simply won’t contain more than what typical humans need.
But, fortunately, transformers are also much more patient than humans. So if we synthetically generate billions of examples of how to add/multiply numbers with all intermediate calculation steps, or of algebraic manipulations on equations, or of long chains of logical derivations, they might just get it. Indeed, have a look at this work demonstrating that LLMs can actually do arithmetic quite well if the proper training data is provided (in this case, 50M arithmetic exercises with full multi-step expanded solutions). The fact that GPT4 cannot reach the accuracy levels of even a 10M-parameter model trained on this arithmetic dataset means that “the internet” doesn’t naturally contain sufficient basic math data.
These types of holes can be plugged by generating synthetic datasets. There are things we can feed to LLMs that humans would never have the patience for, like solving 50 million basic math exercises, and fortunately we can programmatically generate these types of datasets and force language models to ingest them. Additional multi-modal examples could be showing the same image at all possible orientations to teach models the concept of rotation, or generating multiple 3D viewpoints of the same scene to teach them about 3D worlds, or synthetically generating bar plots and various charts and graphs so that they visually understand what expressions such as “y=sin(ax)” or “y=ax+bx^2+c” look like, or generating HTML code together with a snapshot image of the resulting rendered web page, etc etc…
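As a trivial, concrete illustration of the kind of programmatic generation meant here (a toy sketch of my own, not the actual dataset from the arithmetic work cited above), one can emit an essentially unlimited stream of addition exercises with every intermediate carry step spelled out, each one correct by construction:

```python
import random

def addition_with_steps(a: int, b: int) -> str:
    """Write out column-wise addition with explicit carries, digit by digit."""
    steps, carry, total, place = [], 0, 0, 1
    da, db = str(a)[::-1], str(b)[::-1]
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        steps.append(f"{x}+{y}+carry {carry}={s}, write {s % 10}, carry {s // 10}")
        total += (s % 10) * place
        carry, place = s // 10, place * 10
    if carry:
        total += carry * place
        steps.append(f"final carry {carry}")
    return f"{a}+{b}: " + "; ".join(steps) + f"; answer {total}"

random.seed(0)
for _ in range(3):                                  # scale this up to millions of examples
    a, b = random.randint(100, 9999), random.randint(100, 9999)
    line = addition_with_steps(a, b)
    assert line.endswith(str(a + b))                # every example is correct by construction
    print(line)
```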
Another interesting aspect of synthetic datasets is that it’s possible to distill noisy, low quality, web-scale datasets into smaller, high quality, densely packed datasets from which it might be much easier to learn. Nice examples of this avenue are displayed in the following two works: TinyStories and Textbooks Are All You Need.
To conclude, the internet is made for humans, and humans are vastly more data efficient, needing only a small number of examples for most things. This means the internet simply doesn’t contain enough data points for transformers to create the internal machinery that, e.g., multiplies numbers correctly. These gaps can be filled by synthetically generating datasets that specifically address them.
Although this is not precisely how people use the term “synthetic data”, I wish to stress that the essence of synthetic data is mostly about teaching large models things that are fairly simple to us (think of image data augmentation as the simplest possible case of a synthetically augmented dataset), or things that humanity has already scientifically discovered and “solved” (like arithmetic, or logic, or 3D graphics), and shoving them down the throat of a multi-modal LLM via brute force. Sorry for the descriptive language, but the point here is to highlight that we need not rely on the ability of transformer models to infer various things from a small number of data points. We can simply make sure they learn certain things by directly feeding those things to them. The internal machinery that forms inside the transformer after force feeding it with various types of information could potentially be useful for other downstream tasks via transfer learning, or it could not be. If it is useful, it’s pure gain. If it’s not useful, well, at least they will know how to multiply numbers. The essence of the word “synthetic” is in the contrast with the “natural” datasets that humans collect for their own use, which are things like newspaper articles, scientific papers, github repos, books, youtube videos, movies and TV shows, reddit threads, twitter threads, personal photo collections, etc.

To visualize this, consider the current situation of all data used for training large models, and compare it to all human knowledge. The data currently used for training is a strict subset of all human knowledge, and contains many, many holes.
Fortunately, some of these holes can be filled via procedurally generated synthetic datasets that can be added to data that is used for training.
Anything we insert into the model in sufficient quantities becomes, in some sense, “intuitive” to the model, as some internal part of the model - a sub-circuit - performs these operations. So if we brute-force feed it extreme amounts of long sequences of logical derivations, could the model use some of these logical circuits also in plain conversation, and simply become more logical?
Also, I will briefly mention that it’s possible to force feed language models with information that most humans don’t typically even engage with but that is easy to collect: vast amounts of electron microscopy (EM) images of various materials or biological tissues, light microscopy fluorescence images, space images collected by the Hubble or James Webb telescopes, weather data from every weather station on earth for the last 50 years, MRI scans, etc, etc.

Larger models
This point is fairly trivial and perhaps the most obvious one. Larger models are simply more capable, and are able to absorb more data and do so quicker (they are more sample efficient, somewhat paradoxically). There are of course diminishing returns, but a true plateau has not yet been reached. It is estimated that GPT4 cost slightly north of 100M dollars to train. It was trained sometime around September 2022, exactly 1 year ago today. Today, that same model would likely cost around ~50M dollars due to the reduction in compute prices. Furthermore, even 10-billion-dollar training runs are technically possible today - it’s just a matter of resource allocation. Since we are all curious to see what happens when this is tested, and the funding environment right now is such that AI models have finally reached true usefulness, it’s clearly something that will be tested in the near future. As hardware and the accompanying software drivers improve, all aspects of compute will improve: current models will become faster, the largest models will be able to grow further in size and run on smaller devices, and prices will drop. Larger models are the most obvious next step.

Bulk generation of 100s of tokens at a time instead of 1 token at a time
At first glance, this could sound like a pure efficiency and speed point. That is of course true - in order to act in real time, models must be much faster. But pure next-token prediction can also create 1000 tokens sequentially, and simply increasing speed 1000x on the single-token prediction task can (and likely will) happen due to other software and hardware advances without one-shot bulk generation, so why am I bothering to insert this as an additional point?
The simple reason is that it is possible and it is strictly better, therefore it will likely be done.
Why is it strictly better? It is likely that bulk generation will force the model to plan multiple steps ahead, as generating 100 tokens in one go is a much more difficult task than generating 1 at a time. Planning capabilities have notoriously been somewhat lacking in large language models, and this lack could simply be a consequence of predicting too little into the future at once. Of course, predicting 1000 tokens into the future might be hard to get exactly right even for very capable models, and therefore the feedback signal during training could be lost completely at some point. To address this, new loss functions will be required that attempt prediction far into the future in a more abstract space rather than in precise tokens. It is also possible that, since sounds and images are inherently more hierarchical and self-consistent than language, it will be much, much easier to predict the 100th token (which could represent, e.g., a 16x16 image patch at location (top, left) = (224, 296) of an image) in a mixed language-vision-audio dataset, and therefore a robust signal will keep flowing even when attempting to predict hundreds of tokens into the future in bulk. An additional important advantage of bulk generation is controllability. Autoregressive models are notoriously difficult to control due to the “butterfly effect”, or chaos-like, properties of dynamical systems. One-shot hierarchical generators can be controlled at all hierarchies and in many different ways. See, for example, the amazing cottage industry of literature around controlling the StyleGAN2 generative model, which demonstrates extreme levels of control over generated content.
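To make the bulk-prediction idea slightly more concrete, here is a minimal toy sketch (my own illustration, assuming PyTorch, and not any particular published method) of a head that predicts the next k tokens from a single hidden state instead of just one; the more abstract far-future losses discussed above would go beyond this simple setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BulkPredictionHead(nn.Module):
    """Toy sketch: predict the next k tokens from one hidden state instead of one token."""
    def __init__(self, d_model: int, vocab_size: int, k: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden: torch.Tensor, future_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, d_model), future_tokens: (batch, k) ground-truth token ids
        losses = [F.cross_entropy(head(hidden), future_tokens[:, i])
                  for i, head in enumerate(self.heads)]
        return torch.stack(losses).mean()   # average loss over the k future positions

torch.manual_seed(0)
d_model, vocab, k, batch = 64, 1000, 8, 4
head = BulkPredictionHead(d_model, vocab, k)
hidden = torch.randn(batch, d_model)                 # would come from a transformer trunk
targets = torch.randint(0, vocab, (batch, k))
print(head(hidden, targets))                         # scalar training loss
```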
Overall, the need for speed, the need for improved planning capabilities, and the need for controllability over generation will all drive towards bulk generation of batches of several hundred tokens in one shot instead of one token at a time.

Novel data generation via interaction with reality
I hope we have established by now, and that it is already generally clear, that if we are able to somehow assemble a dataset of billions of examples and feed it to a transformer, it will be able to absorb this dataset into its weights. And this is basically true no matter what the dataset consists of, as long as it contains regularities.
So, we can insert existing static knowledge into a model.
But an important question arises - is there any systematic way to expand the data in a way that expands the model’s capabilities? Is there any mechanism that allows the model to self-improve in some automatic way? Well, the answer is quite simple really: just allow the model to interact with reality and examine the feedback it receives from the world.
The simplest possible way to interact with reality is, for example, to run code that the model generated and see what happens. First source of feedback - does the code even run? If it doesn’t, the generated code is faulty, simple as that. If it runs, does it achieve the desired result? If we have some automatic test to check the validity of a piece of code, that is an additional source of feedback from reality.
Indeed, in this very interesting work, the authors did just that - the model proposes puzzles (test cases) and solutions (code). Fine-tuning a pre-trained language model on all generated code elevates the base performance of the model (but probably for uninteresting reasons, such as simply turning a language model into a code model). The interesting part is that fine-tuning only on the subset of the generated code that is verified to be correct boosts performance even more. An illustration of a figure from the paper can be found below.

This is an example of a model improving itself using feedback from interaction with the world. In this case, the world and the feedback are provided by running the code with a Python interpreter. Note that the authors of the paper refer to this as synthetic data, but I personally prefer the term novel data generation, since any interaction with reality could in principle, especially when applied iteratively, create an unbounded amount of novel data.
The authors of Code LLaMA use a similar technique to generate novel data points, which they term self-instruct, and demonstrate significant improvements on tasks just by adding a small set of 14K new high quality problems and solutions that are verified to be correct by running the code and checking that it passes the tests. An additional source of instruction data is collected using the “unnatural” instructions technique. The precise details are unclear, but this leads to an additional improvement - again, just by using self-prompting and getting feedback from running code.
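Both of these results boil down to the same loop: sample candidate code, execute it against a test, and keep only what passes as new fine-tuning data. Here is a minimal self-contained sketch of that loop; generate_candidate() is a hypothetical stand-in for sampling from an LLM, reduced to canned strings so the example actually runs:

```python
def generate_candidate(puzzle: str, attempt: int) -> str:
    """Hypothetical placeholder for an LLM proposing a solution to a puzzle."""
    canned = ["def f(x):\n    return x + 2\n", "def f(x):\n    return x * 2\n"]
    return canned[attempt % len(canned)]

def passes(puzzle_test: str, candidate_code: str) -> bool:
    """Feedback from reality: execute the candidate and run the test."""
    scope = {}
    try:
        exec(candidate_code, scope)          # does the code even run?
        exec(puzzle_test, scope)             # does it achieve the desired result?
        return True
    except Exception:
        return False

verified_finetuning_data = []
puzzle_test = "assert f(3) == 6 and f(5) == 10"
for attempt in range(4):
    code = generate_candidate(puzzle_test, attempt)
    if passes(puzzle_test, code):
        verified_finetuning_data.append({"puzzle": puzzle_test, "solution": code})

print(f"kept {len(verified_finetuning_data)} of 4 candidates")   # only verified solutions kept
```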
An additional interesting avenue of “interaction with reality” and novel data generation is collecting data via interaction with computer-simulated, but complex, environments.
A good example of such work is Voyager. In this work the authors were able to create an agent that continuously gains new skills (a “skill” being a correctly working JavaScript function in this case) by interacting with the game engine of Minecraft. They do not demonstrate the effectiveness of the novel data collection via an additional fine-tuning step, since it’s not currently possible for people outside OpenAI to fine-tune GPT4, but they at the very least demonstrate that a growing set of learned skills helps the agent continue to build on top of them, explore the world, and “discover” new aspects of the game via a clever in-context learning scheme. It would indeed be interesting to see whether an additional fine-tuning step on all the skills acquired during the training process would allow the agent to reach new heights in Minecraft.
Yet another avenue of “interaction with reality” could simply be regular human usage of AI products. E.g. consider the ChatGPT user interface by OpenAI. Suppose OpenAI decided to go over all the chats that users ever had on the platform and ask GPT4 whether the user was satisfied or pleased with each exchange. We know that GPT4 is very good at extracting sentiment from text, so this is well within its area of competence. Now suppose that the top 1% of conversations with the most positive user-sentiment scores are added to the pre-training dataset of GPT 5.
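Sketched out, that filtering step could look something like the following; score_satisfaction() is a hypothetical placeholder for asking a strong model to rate the user’s sentiment (here reduced to a trivial stub so the snippet runs on its own):

```python
def score_satisfaction(conversation: str) -> float:
    """Placeholder judge: in practice this would be an LLM call returning a 0-1 score."""
    return min(1.0, conversation.lower().count("thanks") / 2)

chats = [
    "User: fix my regex ... Assistant: ... User: thanks, works perfectly!",
    "User: summarize this paper ... Assistant: ... User: that is not what I asked.",
    "User: write a haiku about GPUs ... Assistant: ... User: thanks, love it, thanks!",
]
# Keep only the top slice (the "top 1%" at real scale) for the next pre-training set.
ranked = sorted(chats, key=score_satisfaction, reverse=True)
next_gen_pretraining_data = ranked[: max(1, len(chats) // 100)]
print(next_gen_pretraining_data)
```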
The key principle here is that understanding the feedback from the environment in which a generative model is operating (is the code passing a test? did the agent succeed in creating a diamond in Minecraft? is the user satisfied with the chatbot’s behavior?) is much simpler than performing the action itself (writing code, responding correctly to user requests, etc.), and can therefore be used as a robust way to assess the quality of the outcome of an interaction. If that outcome is positive, the new datapoint could be a good candidate for training. The only additional requirement for this datapoint to be useful is that it lies outside the current competence of the model, near the frontier of that competence.
The figure below illustrates how a model can grow the frontier of its competence if it has some access to the value of the outcomes of its actions. Since estimating whether an outcome is good or bad is usually not too difficult, this generally seems like a feasible avenue.

Note that although most researchers currently refer to this as synthetic data, I believe there is an important distinction between feeding already established knowledge into large models and coming up with a scheme that allows the model to acquire new knowledge by way of interaction with the world.
By the way, there are a few methods, such as Reflexion, Tree of Thoughts and Retrieval Augmented Generation, that allow a pre-trained model to improve its own performance without re-training, by “paying” with compute at inference time and/or using an external long-term memory. This is important because these types of methods could be ways to “extend beyond the competence frontier” of a trained model - not only to improve the current iteration of model capabilities, but also for the specific purpose of attaining novel datapoints that could be used to train the next model iteration.
Consider this illustration from the Cognitive Architectures for Language Agents paper, which nicely shows that it’s possible to complicate the brain of a pre-trained LLM to enhance its capabilities as an agent. I only wish to point out the obvious - that the model in C could create new training data for the LLM in A.

Operating embodied agents with brains built with pure Multimodal LLMs
Technically, this is a form of “novel data generation via interaction with reality”, but it so resembles an animal or a human that we had better put it into a separate category.
Think of a Boston Dynamics robot dog - Spot (if you don’t like paying 70K dollars for your robotic dog, there are also more affordable options like the Unitree Go2 and XGO2). Think of such a dog inside the homes of 1M households. Think of it continuously operating using a multimodal LLM brain. It doesn’t bump into walls, since it can see and recognize all typical objects in the world via its multimodal capabilities. It can move around without falling down, since Boston Dynamics has handled the low-level control of limb movements and balance and abstracted it for the LLM brain into simpler commands like “take 2 steps forward”, “move robotic arm to coordinates (x,y,z)”, “grab object”, etc. It can understand what the household members tell it to do thanks to its auditory and language capabilities, and it can talk back to them via speech synthesis. It is familiar with you and your specific home, since it has long-term memory and retrieves relevant past information before every action. You could tell it to go to the kitchen and turn on the kettle so that you can make some tea. Since this is challenging for a robot, it could succeed, and it could fail. You will subsequently definitely tell it whether it did a good job or a bad job, so it will be fairly easy to determine whether the outcome of the attempt was good or bad. Let’s suppose that initially, among the 1M robots that were told to go to the kitchen and turn on the kettle, 99.9% failed at this task but 0.1% succeeded. We can take those 0.1%, i.e. 1000 good data points, and upload the sequences of {(perception, action)} taken by those dogs to our data pile. Next month, an update to the brain of all 1M dogs will be issued, trained on the newly added good data points from the latest month. This will likely increase the probability of success on such tasks next time (let’s say from 0.1% to 1%).
Within a few months and after a few updates, it could learn how to do this with a 90% success rate. Similar things can happen in multiple domains simultaneously - e.g. household members asking the dog to tell a joke, and the dog recognizing whether it elicits a laugh. Also, some of the robots will sometimes fall down the stairs; those data points will definitely not be added to the data pile. I think the point is clear. A rapid evolutionary-like process can start this way. The word rapid here is somewhat of an understatement, as we are talking about incorporating the experiences of millions/billions of “animals” and creating a new “generation” of the animal species every few days/weeks/months.
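The flywheel described above is simple enough to caricature in a few lines. The following sketch is purely schematic (run_episode() and retrain() are invented stand-ins, and the fleet is scaled down from the 1M dogs in the text so it runs instantly), but it captures the keep-only-the-successes dynamic:

```python
import random

def run_episode(success_rate: float) -> tuple[list, bool]:
    trajectory = ["(perception, action) steps ..."]     # what would get uploaded to the data pile
    return trajectory, random.random() < success_rate

def retrain(data_pile_size: int) -> float:
    # Toy improvement model: more verified episodes -> higher success rate, saturating at 90%.
    return min(0.9, 0.001 * (1 + data_pile_size / 100))

random.seed(0)
fleet_size, success_rate, data_pile = 100_000, 0.001, []
for update in range(4):                                  # "a new generation every few weeks"
    successes = [t for t, ok in (run_episode(success_rate) for _ in range(fleet_size)) if ok]
    data_pile.extend(successes)                          # failed episodes (falling down stairs) are dropped
    success_rate = retrain(len(data_pile))
    print(f"update {update}: kept {len(successes)} episodes, new success rate {success_rate:.4f}")
```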
The frontier of robot capabilities will continuously grow as the robots encounter new situations in the world. Robot pets that also help out around the house are likely the simplest thing to imagine, and dogs in particular have undergone a similar human-guided evolution over the past thousands of years, becoming deeply embedded in human society as part of that guided evolutionary process of non-natural selection. Think about the dataset of {(vision, audition, limb movements)} of all the dogs that lived in the past 10,000 years and how much knowledge about the world of humans such a dataset would contain, simply by being “in the room where it happens”.
It’s not too difficult to also imagine robot lab assistants that help scientists perform experiments. These could start out occasionally wrecking pieces of equipment or needlessly wasting important lab resources, as typical first-year PhD students tend to do. But after a period of substantial training, these lab assistants could become more like super post-docs who know everything about every possible experimental technique and can deliberately go straight to the frontier of scientific knowledge and perform the experiment that will advance that frontier.
Several notable examples of works related to embodied robots can be found in the following links, for those who are not yet convinced that this is coming:
RT-2: Vision-Language-Action Models
TidyBot: Personalized Robot Assistance with Large Language Models
LSC: Language-guided Skill Coordination for open-vocabulary pick-and-place with Spot
RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation
LINGO-1: Exploring Natural Language for Autonomous Driving
Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control
Building Cooperative Embodied Agents Modularly with Large Language Models
ChatGPT for Robotics: Design Principles and Model Abilities
To conclude, consider our previous image of all the data used for training a huge multimodal LLM. Iterative interaction with actual reality could systematically grow the frontier of the purple area, and potentially even surpass the orange area, since the process we’ve described is completely open ended, similar to the process of evolution.
The thing is: all of the points discussed above are in many ways trivial.
The ideas in them exist in thousands of research papers, and initial steps in all of these directions have already been taken in dozens of research papers and engineering efforts (I’ve linked to many examples in this article).
Today, following DALL·E 2 and ChatGPT “moments”, everyone who is business oriented is paying attention, and resources are flowing into AI applications.
It is therefore clear that all of the above steps will be implemented with basically 100% certainty in the upcoming 1-4 years (most of them likely closer to the 1-2 year end than the 3-4 year end).
The only question remaining is - are the model architectures today data efficient enough at the frontier?
Meaning, are they good enough at learning from a small number of datapoints at the frontier of the data distribution, so that the data frontier can grow indefinitely and the models become capable of gaining new knowledge about the world - including, eventually, knowledge that we humans have not yet been able to reach via the scientific method?
Or will the data inefficiency prove to be too great a barrier, so that the models will basically be able to competently execute only what was demonstrated to them millions/billions of times before?
This is the question of self improvement.
The first scenario will lead to superhuman AGI.
The second scenario will lead to subhuman AGI.
The second scenario will still completely transform the world around us. To have a single object that contains a large fraction of all of humanity’s capabilities inside it is a big deal even if it cannot extend human knowledge. But it will not lead to superhuman AGI just yet; instead, we will have extremely capable assistants that will augment human economic and scientific output enormously.
Anyhow, the point of this essay is that the second scenario is basically 100% certain at this point.
Those of us who have not done so yet definitely need to mentally prepare for this world. Based on my conversations with many people of all shapes and sizes on this topic, I believe it’s not an easy thing to come to grips with, so it’s better to start the personal psychological preparation now. The only way to feel comfortable in that world is to visit it a few times in your imagination.
Regarding the first scenario - it is no longer impossible to imagine it coming to pass. Many low-hanging fruits still exist, and many things have not yet been fully explored.
The key point in favor of the first scenario is that “data efficiency at the frontier” is not the same as regular “data efficiency”.
It is crystal clear that these models are extremely data inefficient.
But at the frontier, the models already know a lot about the world, and a large amount of inductive bias has already been inserted into their weights and intermediate representations. Few-shot learning following a long pretraining phase has been demonstrated extensively in both language and image models.
I believe it is simply intellectually dishonest to state that the probability of this happening in the upcoming ~decade is lower than, say, ~30%. It is likely higher, and possibly sooner - especially if we account for the probability that a more data-efficient architecture will be invented in that time period.
But, it is also far from certain at this point.
A strong counterargument can be made by pointing at the goal of attaining self-driving vehicles. This is clearly a strictly simpler task than “operate in the world and advance human knowledge and understanding of the world”. It is also an attempt into which a lot of resources have been invested during the past decade.
And yet, still no cigar.
A few “excuses” can of course be made on behalf of the autonomous vehicle folks. First, the models of 3 years ago were not large enough, so no meaningful progress could be achieved to jump-start the data flywheel. Another potential “excuse” could be that the threshold for deployment is not merely high capability (“99%”) but near-perfect capability (“99.9999%”) for safety reasons, which has historically been extremely hard to guarantee with machine learning models; domains where bad outcomes are less deadly will allow imperfect solutions to be deployed more rapidly. E.g. even a 1% hit rate at creating a striking image can still be useful to an artist, whereas a car that doesn’t crash on only 99% of trips is not very useful to anyone. It is possible that if the data engine can start its progress from 1%, capabilities will progress at a much more rapid pace and from a much earlier stage.
Either way, for both scenarios above - the future seems bright.
Not without risk, but with great promise.
It appears that after a period of ~50 years of relative technological stagnation in the physical world[1], but with huge advances in the digital world[2], humanity’s investment in information processing could finally bear fruit.
In the past 50 years, humanity’s best and brightest spent their lives advancing information processing capabilities. At first glance, it sounds stupid for CS PhDs to improve ad placement at Google instead of attempting to cure cancer, but the over-sized returns from digital tech that made programming the most lucrative and sought-after profession might finally pay off - if those programmers actually deliver on AGI. If so, this could lead to potentially the biggest economic boom in the history of humanity. And for that, I cannot wait.
May we live in interesting times.
[1] “the world of atoms”, as they say: people of the 70s-80s lived in roughly the same houses, prepared food in similar ways, drove roughly the same cars, flew on roughly the same planes and used roughly the same medicine
[2] “the world of bits”: computers, the internet, phones - basically none of these even existed in the 70s from the consumer perspective, so tremendous progress in that area.