ITT – Stephan Seeger

Defining

1st-level generative AI as applications that are directly based on X-to-Y models (foundation models that build a kind of operating system for downstream tasks) where X and Y can be text/code, image, segmented image, thermal image, speech/sound/music/song, avatar, depth, 3D, video, 4D (3D video, NeRF), IMU (Inertial Measurement Unit), amino acid sequences (AAS), 3D-protein structure, sentiment, emotions, gestures, etc., e.g.

_{X = text, Y = text: LLM-based chatbots like ChatGPT (from OpenAI based on LLMs GPT-3.5 [4K context] or GPT-4 [8K/32K context]), Bing Chat (GPT-4), Bard (from Google, based on PaLM 2), Claude (from Anthropic [100K context]), Llama2 (from Meta), Falcon 180B (from Technology Innovation Institute), Alpaca, Vicuna, OpenAssistant, HuggingChat (all based on LLaMA [GitHub] from Meta), OpenChatKit (based on EleutherAI’s GPT-NeoX-20B), CarperAI, Guanaco, My AI (from Snapchat), Tingwu (from Alibaba based on Tongyi Qianwen), (other LLMs: MPT-7B and MPT-30B from Mosaic [65K context, commercially usable], Orca, Open-LLama-13b), or coding assistants (like GitHub Copilot / OpenAI Codex, AlphaCode from DeepMind, CodeWhisperer from Amazon, Ghostwriter from Replit, CodiumAI, Tabnine, Cursor, Cody (from Sourcegraph), StarCoder from Big Code Project led by Hugging Face, CodeT5+ from Salesforce, Gorilla, StableCode from Stability.AI, Code Llama from Meta), or writing assistants (like Jasper, Copy.AI), etc.}
_{X = text, Y = image: Dall-E (from OpenAI), Midjourney, Stable Diffusion (from Stability.AI), Adobe Firefly, DeepFloyd-IF (from Deep Floyd, [GitHub, HuggingFace]), Imagen and Parti (from Google), Perfusion (from NVIDIA)}
_{X = text, Y = 360° image: Skybox AI (from Blockade Labs)}
_{X = text, Y = 3D avatar: Tafi}
_{X = text, Y = avatar lip sync: Ex-Human, D-ID, Synthesia, Colossyan, Hour Once, Movio, YEPIC-AI, Elai.io}
_{X = speech + face video, Y = synched audio-visual: Lalamu}
_{X = text, Y = video: Gen-2 (from Runway Research), Imagen-Video (from Google), Make-A-Video (from Meta), or from NVIDIA}
_{X = text, Y = video game: Muse & Sentis (from Unity)}
_{X = image, Y = text: GPT-4 (from OpenAI), LLaVA}
_{X = image, Y = segmented image: Segment Anything Model (SAM by Meta)}
_{X = speech, Y = text: STT (speech-to-text engines) like Whisper (from OpenAI), MMS [GitHub] (from Meta), Conformer-2 (from AssemblyAI)}
_{X = text, Y = speech: TTS (text-to-speech engines) like VALL-E (from Microsoft), Voicebox (from Meta), SoundStorm (from Google), ElevenLabs, Bark, Coqui}
_{X = text, Y = music: MusicLM (from Google), RIFFUSION, AudioCraft (MusicGen, AudioGen, EnCodec from Meta), Stable Audio (from Stability.ai)}
_{X = text, Y = song: Voicemod}
_{X = text, Y = 3D: DreamFusion (from Google)}
_{X = text, Y = 4D : MAV3D (from Meta)}
_{X = image, Y = 3D : CSM}
_{X = image, Y = audio: ImageBind [1] (from Meta, on GitHub)}
_{X = audio, Y = image: ImageBind [2] (from Meta)}
_{X = music, Y = image: MusicToImage}
_{X = text, Y = image & audio: ImageBind [3] (from Meta)}
_{X = audio & image, Y = image: ImageBind [4] (from Meta)}
_{X = IMU, Y = video: ImageBind (from Meta)}
_{X = AAS, Y = 3D-protein: AlphaFold (from Google), RoseTTAFold (from Baker Lab), ESMFold (from Meta)}
_{X = 3D-protein, Y = AAS: ProteinMPNN (from Baker Lab)}
_{X = 3D structure, Y = AAS: RFdiffusion (from Baker Lab)}

and 2nd-level generative AI that builds some kind of middleware and allows to implement agents by simplifying the combination of LLM-based 1st-level generative AI with other tools via actions (like web search, semantic search [based on embeddings and vector databases like Pinecone, Chroma, Milvus, Faiss], source code generation [REPL], calls to math tools like Wolfram Alpha, etc.), by using special prompting techniques (like templates, Chain-of-Thought [COT], Self-Consistency, Self-Ask, Tree Of Thoughts, ReAct [Reason + Act], Graph of Thoughts) within action chains, e.g.

_{ChatGPT Plugins (for simple chains)}
_{LangChain + LlamaIndex (for simple or complex chains)}
_ToolFormer

we currently (April/May/June 2023) see a 3rd-level of generative AI that implements agents that can solve complex tasks by the interaction of different LLMs in complex chains, e.g.

_BabyAGI
_Auto-GPT
_{Llama Lab (llama_agi, auto_llama)}
_{Camel, Camel-AutoGPT}
_{JARVIS (from Microsoft)}
_{Generative Agents}
_{ACT-1 (from Adept)}
_Voyager
_SuperAGI
_{GPT Engineer}
_Parsel
_MetaGPT

However, older publications like Cicero may also fall into this category of complex applications. Typically, these agent implementations are (currently) not built on top of the 2nd-level generative AI frameworks. But this is going to change.

Other, simpler applications that just allow semantic search over private documents with a locally hosted LLM and embedding generation, such as e.g. PrivateGPT which is based on LangChain and Llama (functionality similar to OpenAI’s ChatGPT-Retrieval plugin), may also be of interest in this context. And also applications that concentrate on the code generation ability of LLMs like GPT-Code-UI and OpenInterpreter, both open-source implementations of OpenAI’s ChatGPT Code Interpreter/AdvancedDataAnalysis (similar to Bard’s implicit code execution; an alternative to Code Interpreter is plugin Noteable), or smol-ai developer (that generates the complete source code from a markup description) should be noticed.
There is a nice overview of LLM Powered Autonomous Agents on GitHub.

The next level may then be governed by embodied LLMs and agents (like PaLM-E with E for Embodied).

OpenAI released GPT-4 within ChatGPT on March 14, 2023, described in detail in a 98-pages paper (summarized on youtube).

Available to ChatGPT-Plus subscribers (currently with a cap that is changing over time, e.g. 100 messages every 4 hours, or 25 messages every 3 hours).
Still based on training data that cuts off Sept 2021.
It still does not learn from its experience.
Still no internet access.
The training was already finalized in Aug 2022.
Fine-tuned via RLHF (Reinforcement Learning with Human Feedback).
API waitlist is open (so no API access yet for everyone)
API prices (for comparison: GPT-3.5-turbo $0.002 per 1k tokens):
- gpt-4: 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens.
- gpt-4-32k: 32K context window (about 52 pages of text) will cost $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens.
The number of parameters and size of the training data set have both not been published. So competitors are not encouraged to replicate these performance ingredients but are referred to a freely available benchmark (OpenAI Evals) that measures the real performance.
GPT-4 ranks in the 10% best of the bar exam and 0.5% best of biology olympiad.
GPT-4 can handle contexts of over 25,000 words.
GPT-4 can access images as inputs and can generate captions, classifications, and analyses. However, this image-to-text functionality is not yet publicly available.
Microsoft Bing was already using an early version of GPT-4 in the last few weeks.

An excellent overview by Greg Brockman, President and co-founder of OpenAI, can be found on youtube.

Microsoft released Visual ChatGPT on March 08, 2023, in a paper and with source code on GitHub and Hugging Face. Although this does not seem to be GPT-4-based, it demonstrates similar image capabilities via a combination of pre-existing technologies (generate/modify [text-to-image], and describe [image-to-text]).

Two days after the GPT-4 release, Microsoft announced on March 16, 2023, the integration of GPT-4 into their Office products as a feature they called Copilot. Copilot is not yet available for general use, but Microsoft plans to roll it out gradually to selected customers in the coming months.

Tag: ITT

3rd-Level of Generative AI

OpenAI releases GPT-4

Recent Posts

Recent Comments

Archives

Categories