1st-level generative AI as applications that are directly based on X-to-Y models (foundation models that form a kind of operating system for downstream tasks) where X and Y can be text/code, image, segmented image, thermal image, speech/sound/music, avatar, depth, 3D, video, 4D (3D video, NeRF), IMU (Inertial Measurement Unit), amino acid sequences (AAS), 3D-protein structure, sentiment, emotions, gestures, etc., e.g.
- X = text, Y = text: LLM-based chatbots like ChatGPT (from OpenAI, based on the LLMs GPT-3.5 [4K context] or GPT-4 [8K/32K context]), Bing Chat (GPT-4), Bard (from Google, based on PaLM 2), Claude (from Anthropic [100K context]), Alpaca, Vicuna, OpenAssistant, HuggingChat (all based on LLaMA [GitHub] from Meta), OpenChatKit (based on EleutherAI’s GPT-NeoX-20B), CarperAI, Guanaco (other LLMs: MPT-7B from MosaicML [65K context, commercially usable]); coding assistants (like GitHub Copilot / OpenAI Codex, AlphaCode from DeepMind, CodeWhisperer from Amazon, Ghostwriter from Replit, StarCoder from the BigCode project led by Hugging Face, CodeT5+ from Salesforce, Gorilla); or writing assistants (like Jasper, Copy.AI), etc.
- X = text, Y = image: DALL-E (from OpenAI), Midjourney, Stable Diffusion (from Stability AI), Adobe Firefly, DeepFloyd IF (from DeepFloyd, [GitHub, HuggingFace]), Imagen and Parti (from Google)
- X = text, Y = 360° image: Skybox AI (from Blockade Labs)
- X = text, Y = avatar: D-ID, Synthesia, Colossyan, Hour One, Movio, YEPIC-AI, Elai.io
- X = speech + face video, Y = synched audio-visual: Lalamu
- X = text, Y = video: Gen-2 (from Runway Research), Imagen-Video (from Google), Make-A-Video (from Meta), and a text-to-video model from NVIDIA
- X = image, Y = text: GPT-4 (from OpenAI)
- X = image, Y = segmented image: Segment Anything Model (SAM by Meta)
- X = speech, Y = text: STT (speech-to-text engines) like Whisper (from OpenAI), MMS [GitHub] (from Meta)
- X = text, Y = speech: TTS (text-to-speech engines) like VALL-E (from Microsoft), ElevenLabs, or Bark
- X = text, Y = music: MusicLM (from Google), Riffusion
- X = text, Y = 3D: DreamFusion (from Google)
- X = text, Y = 4D: MAV3D (from Meta)
- X = image, Y = audio: ImageBind (from Meta, on GitHub)
- X = audio, Y = image: ImageBind (from Meta)
- X = text, Y = image & audio: ImageBind (from Meta)
- X = audio & image, Y = image: ImageBind (from Meta)
- X = IMU, Y = video: ImageBind (from Meta)
- X = AAS, Y = 3D-protein: AlphaFold (from DeepMind), RoseTTAFold (from Baker Lab), ESMFold (from Meta)
- X = 3D-protein, Y = AAS: ProteinMPNN (from Baker Lab)
- X = 3D structure (e.g. a binding target or motif), Y = 3D-protein backbone: RFdiffusion (from Baker Lab)
and 2nd-level generative AI that builds some kind of middleware and makes it possible to implement agents by simplifying the combination of LLM-based 1st-level generative AI with other tools via actions (like web search, semantic search [based on embeddings and vector databases like Pinecone], source code generation [REPL], calls to math tools like Wolfram Alpha, etc.) and by using special prompting techniques (like templates, Chain-of-Thought [CoT], Self-Consistency, Self-Ask, Tree of Thoughts) within action chains, e.g.
- ChatGPT Plugins (for simple chains)
- LangChain + LlamaIndex (for simple or complex chains)
we currently (April/May 2023) see a 3rd level of generative AI that implements agents that can solve complex tasks through the interaction of different LLMs in complex chains, e.g.
- Llama Lab (llama_agi, auto_llama)
- Camel, Camel-AutoGPT
- JARVIS (from Microsoft)
- Generative Agents
- ACT-1 (from Adept)
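To make the idea of "interaction of different LLMs" more concrete, here is a minimal sketch of two models taking turns, one handing out sub-tasks and one carrying them out. It does not reproduce how any of the projects listed above work. Both `planner_llm` and `worker_llm` are hypothetical stubs standing in for real LLM calls, and the `TASK:` / `DONE:` / `FINISHED` message format is an assumption made purely for illustration.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, reply text out


def planner_llm(message: str) -> str:
    """Stub for an instructing model: hands out one sub-task per turn."""
    steps = ["collect requirements", "write draft", "review draft"]
    done = message.count("DONE:")  # how many sub-tasks were already completed
    return f"TASK: {steps[done]}" if done < len(steps) else "FINISHED"


def worker_llm(message: str) -> str:
    """Stub for an executing model: pretends to complete the latest task."""
    task = message.rsplit("TASK: ", 1)[-1]
    return f"DONE: {task}"


def converse(planner: LLM, worker: LLM, goal: str,
             max_turns: int = 10) -> List[Tuple[str, str]]:
    """Alternate between the two models until the planner says FINISHED."""
    transcript: List[Tuple[str, str]] = []
    history = f"GOAL: {goal}"
    for _ in range(max_turns):
        instruction = planner(history)
        transcript.append(("planner", instruction))
        if instruction == "FINISHED":
            break
        result = worker(history + "\n" + instruction)
        transcript.append(("worker", result))
        # Both models see the growing shared history on the next turn.
        history += f"\n{instruction}\n{result}"
    return transcript


if __name__ == "__main__":
    for speaker, text in converse(planner_llm, worker_llm, "publish a report"):
        print(f"{speaker}: {text}")
```

The `max_turns` limit is a design choice worth keeping in any real variant, since two models talking to each other have no built-in guarantee of terminating.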
However, older publications like Cicero may also fall into this category of complex applications. Typically, these agent implementations are (currently) not built on top of the 2nd-level generative AI frameworks. But this is going to change.
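For comparison, the action chains of the 2nd level can be sketched in a few lines: a prompt template tells the model how to request a tool, the framework runs the tool, and the observation is fed back into the next prompt. This is an illustrative sketch only, not the API of LangChain, LlamaIndex, or ChatGPT Plugins. The `fake_llm` stub, the `Action: tool[input]` reply format, and the two toy tools are all assumptions.

```python
import re
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """Toy math tool: evaluates plain arithmetic (digits and + - * / ( ) . only)."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression, {"__builtins__": {}}, {}))


def lookup(term: str) -> str:
    """Toy lookup tool standing in for web or semantic search."""
    facts = {"PaLM-E": "an embodied multimodal language model"}
    return facts.get(term, "no entry found")


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "lookup": lookup}

PROMPT_TEMPLATE = (
    "Answer the question. To use a tool, reply 'Action: <tool>[<input>]'.\n"
    "When done, reply 'Final Answer: <answer>'.\n"
    "Question: {question}\n{scratchpad}"
)


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub: requests the calculator once, then answers."""
    match = re.search(r"Observation: (.+)", prompt)
    if match:
        return f"Final Answer: {match.group(1)}"
    return "Action: calculator[17 * 3]"


def run_chain(question: str, llm: Callable[[str], str], max_steps: int = 5) -> str:
    """Loop: prompt the model, execute the requested tool, feed back the result."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(PROMPT_TEMPLATE.format(question=question, scratchpad=scratchpad))
        if reply.startswith("Final Answer:"):
            return reply[len("Final Answer:"):].strip()
        action = re.fullmatch(r"Action: (\w+)\[(.*)\]", reply.strip())
        if action is None or action.group(1) not in TOOLS:
            raise ValueError(f"cannot parse model reply: {reply!r}")
        observation = TOOLS[action.group(1)](action.group(2))
        scratchpad += f"{reply}\nObservation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


if __name__ == "__main__":
    print(run_chain("What is 17 * 3?", fake_llm))
```

Swapping `fake_llm` for a call to a real model would leave the loop unchanged, which is roughly the convenience such middleware aims to provide.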
Other, simpler applications that just provide semantic search over private documents with a locally hosted LLM and local embedding generation, such as PrivateGPT, which is based on LangChain and Llama (functionality similar to OpenAI’s ChatGPT Retrieval Plugin), may also be of interest in this context. Applications that concentrate on the code generation ability of LLMs, like GPT-Code-UI, an open-source implementation of OpenAI’s ChatGPT Code Interpreter, are also worth noting.
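The semantic-search pattern behind such applications can be sketched as follows: documents are embedded, the query is embedded the same way, the closest documents are retrieved, and they are placed into the prompt of the LLM. This is not PrivateGPT's code. The `embed` function here is a toy bag-of-words stand-in (an assumption; real systems use a neural embedding model), and `TinyVectorStore` is a hypothetical in-memory substitute for a vector database.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: token -> weight


def embed(text: str) -> Vector:
    """Toy embedding: L2-normalised word counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two already normalised sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class TinyVectorStore:
    """Hypothetical in-memory replacement for a vector database."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Vector]] = []

    def add(self, document: str) -> None:
        self._items.append((document, embed(document)))

    def search(self, query: str, k: int = 1) -> List[Tuple[float, str]]:
        """Return the k most similar documents with their scores."""
        q = embed(query)
        scored = [(cosine(q, vec), doc) for doc, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question: str, store: TinyVectorStore, k: int = 1) -> str:
    """Put the retrieved documents into the prompt that would go to the LLM."""
    context = "\n".join(doc for _, doc in store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    store = TinyVectorStore()
    store.add("The invoice for March is due on April 15.")
    store.add("Our office cat is named Turing.")
    store.add("Backups run every night at 2 am.")
    print(build_prompt("When is the March invoice due?", store))
```

Because both the embedding step and the store run locally in this sketch, no document text leaves the machine, which is the main appeal of the locally hosted setups mentioned above.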
The next level may then be governed by embodied LLMs and agents (like PaLM-E with E for Embodied).