Category: Text to Image

OpenAI DevDay Announcements

November 7, 2023 / admin / 1 Comment

At its DevDay, OpenAI announced a range of updates and new features [blog post, keynote recording]. Here is a brief rundown:

Recap: ChatGPT was released on Nov 30, 2022 with GPT-3.5, followed by GPT-4 in March 2023. Since then: voice input/output, vision input with GPT-4V, text-to-image with DALL-E 3, and ChatGPT Enterprise with enterprise-grade security, higher-speed access, and longer context windows. 2M developers, 92% of Fortune 500 companies building products on top of GPT, 100M weekly active users.
New GPT-4 Turbo: OpenAI’s most advanced AI model, 128K context window, knowledge up to April 2023. Reduced pricing: $0.01/1K input tokens (3x cheaper), $0.03/1K output tokens (2x cheaper). Improved function calling (multiple functions in a single message, improved accuracy in returning the right function parameters), a JSON mode that ensures valid JSON output, and more deterministic model output via the reproducible outputs beta feature. Access via gpt-4-1106-preview, stable release pending.
GPT-3.5 Turbo Update: Enhanced gpt-3.5-turbo-1106 model with 16K default context. Lower pricing: $0.001/1K input, $0.002/1K output. Fine-tuning available, reduced token prices for fine-tuned usage (input token prices 75% cheaper to $0.003/1K, output token prices 62% cheaper to $0.006/1K). Improved function calling, reproducible outputs feature.
Assistants API: Beta release for creating AI agents in applications. Supports natural language processing, coding, planning, and more. Enables persistent Threads, includes Code Interpreter, Retrieval, Function Calling tools. Playground integration for no-code testing.
Multimodal Capabilities: GPT-4 Turbo accepts image inputs in the Chat Completions API via gpt-4-vision-preview. DALL·E 3 is available for image generation via the Images API. A new text-to-speech (TTS) model with six voices was introduced.
Customizable GPTs in ChatGPT: New feature called GPTs allowing integration of instructions, data, and capabilities. Enables calling developer-defined actions, control over user experience, streamlined plugin to action conversion. Documentation provided for developers.
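To make the pricing figures above concrete, here is a minimal sketch that estimates the cost of a single request from the per-1K-token prices listed in this post. The function names and the example token counts are assumptions chosen for illustration; only the prices and model names come from the rundown above.

```python
# Prices in USD per 1K tokens, copied from the rundown above.
PRICES_PER_1K = {
    "gpt-4-1106-preview": {"input": 0.01, "output": 0.03},
    "gpt-3.5-turbo-1106": {"input": 0.001, "output": 0.002},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one request."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]


def previous_price(new_price: float, percent_cheaper: float) -> float:
    """Back out the old price from a new price and a 'percent cheaper' figure."""
    return new_price / (1 - percent_cheaper / 100)


# Hypothetical request: 10,000 tokens in, 1,000 tokens out.
for model_name in PRICES_PER_1K:
    print(model_name, round(estimate_cost(model_name, 10_000, 1_000), 4))

# Implied earlier fine-tuned input price, derived from the "75% cheaper" figure.
print(round(previous_price(0.003, 75), 4))
```

With these assumed token counts, the same request costs roughly ten times more on GPT-4 Turbo than on GPT-3.5 Turbo, and the 75% reduction implies an earlier fine-tuned input price of about $0.012/1K.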

AI race is heating up: Announcements by Google/DeepMind, Meta, Microsoft/OpenAI, Amazon/Anthropic

September 22, 2023 / admin / 0 Comments

After weeks of “less exciting” news in the AI space since the release of Llama 2 by Meta on July 18, 2023, there were a bunch of announcements in the last few days by major players in the AI space:

Google/DeepMind: Bard extensions and multimodal LLM Gemini
OpenAI: DALL-E 3 and GPT-Vision in ChatGPT, Gobi
Microsoft: Windows Copilot with DALL-E 3 access
Amazon: Generative AI in Alexa, $4B investment in Anthropic
Meta: Meta AI, Ray-Ban, Emu, AI studio

Here are some links to the news of the last weeks:

Sep 28, 2023, Amazon: Securely customize CodeWhisperer
Sep 27, 2023, Meta: Meta AI assistant, Ray-Ban smart glasses, Emu, AI studio
Sep 25, 2023, ChatGPT can now see, hear, and speak
Sep 25, 2023, Amazon invests $4B in Anthropic, Claude in Bedrock
Sep 25, 2023, Spotify clones voices and translates them
Sep 21, 2023, Announcing Microsoft Copilot
Sep 20, 2023, Amazon brings generative AI to Alexa
Sep 20, 2023, OpenAI Announces DALL·E 3 in Research Preview
Sep 20, 2023, GitHub Copilot Chat beta now available for all individuals
Sep 19, 2023, Google Bard September update: App extensions
Sep 19, 2023, OpenAI’s multimodal LLM GPT-Vision to beat Google Gemini
Sep 16, 2023, DeepMind: LLMs can optimize their own prompts
Sep 15, 2023, Google nears release of AI software Gemini
Sep 15, 2023, Google Gemini: What We Know So Far
Sep 13, 2023, Stable Audio by Stability AI for music & sound generation
Sep 07, 2023, Anthropic introduces Claude Pro
Sep 06, 2023, Falcon 180B
Aug 31, 2023, Baidu launches Ernie chatbot
Aug 29, 2023, Duet AI for Google Workspace now generally available
Aug 28, 2023, Meta plans to take on GPT-4 with a rumored Llama 3
Aug 28, 2023, Introducing ChatGPT Enterprise
Aug 27, 2023, Google Gemini Smashes GPT-4 By 5X
Aug 24, 2023, Introducing Code Llama
Aug 22, 2023, GPT-3.5 Turbo fine-tuning and API updates
Aug 22, 2023, ElevenLabs releases Eleven Multilingual v2
Aug 21, 2023, MidJourney Adds Inpainting Feature
Aug 16, 2023, Adobe Express with AI Firefly app is released worldwide
Aug 10, 2023, ChatGPT expands its ‘custom instructions’ feature
Aug 08, 2023, Announcing StableCode — Stability AI
Aug 05, 2023, Tim Cook says Apple is building AI into ‘every product’
Aug 03, 2023, Every single Amazon team is working on generative AI
Aug 02, 2023, AudioCraft by Meta
Jul 31, 2023, ChatGPT for Android in all countries

3rd-Level of Generative AI

April 11, 2023 / admin / 0 Comments

Defining 1st-level generative AI as applications that are directly based on X-to-Y models (foundation models that build a kind of operating system for downstream tasks) where X and Y can be text/code, image, segmented image, thermal image, speech/sound/music/song, avatar, depth, 3D, video, 4D (3D video, NeRF), IMU (Inertial Measurement Unit), amino acid sequences (AAS), 3D-protein structure, sentiment, emotions, gestures, etc., e.g.

– X = text, Y = text: LLM-based chatbots like ChatGPT (from OpenAI, based on the LLMs GPT-3.5 [4K context] or GPT-4 [8K/32K context]), Bing Chat (GPT-4), Bard (from Google, based on PaLM 2), Claude (from Anthropic [100K context]), Llama 2 (from Meta), Falcon 180B (from Technology Innovation Institute), Alpaca, Vicuna, OpenAssistant, HuggingChat (all based on LLaMA [GitHub] from Meta), OpenChatKit (based on EleutherAI’s GPT-NeoX-20B), CarperAI, Guanaco, My AI (from Snapchat), Tingwu (from Alibaba, based on Tongyi Qianwen) (other LLMs: MPT-7B and MPT-30B from Mosaic [65K context, commercially usable], Orca, Open-LLama-13b), or coding assistants (like GitHub Copilot / OpenAI Codex, AlphaCode from DeepMind, CodeWhisperer from Amazon, Ghostwriter from Replit, CodiumAI, Tabnine, Cursor, Cody (from Sourcegraph), StarCoder from the BigCode Project led by Hugging Face, CodeT5+ from Salesforce, Gorilla, StableCode from Stability AI, Code Llama from Meta), or writing assistants (like Jasper, Copy.AI), etc.
– X = text, Y = image: DALL-E (from OpenAI), Midjourney, Stable Diffusion (from Stability AI), Adobe Firefly, DeepFloyd-IF (from Deep Floyd, [GitHub, HuggingFace]), Imagen and Parti (from Google), Perfusion (from NVIDIA)
– X = text, Y = 360° image: Skybox AI (from Blockade Labs)
– X = text, Y = 3D avatar: Tafi
– X = text, Y = avatar lip sync: Ex-Human, D-ID, Synthesia, Colossyan, Hour One, Movio, YEPIC-AI, Elai.io
– X = speech + face video, Y = synced audio-visual: Lalamu
– X = text, Y = video: Gen-2 (from Runway Research), Imagen-Video (from Google), Make-A-Video (from Meta), or from NVIDIA
– X = text, Y = video game: Muse & Sentis (from Unity)
– X = image, Y = text: GPT-4 (from OpenAI), LLaVA
– X = image, Y = segmented image: Segment Anything Model (SAM by Meta)
– X = speech, Y = text: STT (speech-to-text engines) like Whisper (from OpenAI), MMS [GitHub] (from Meta), Conformer-2 (from AssemblyAI)
– X = text, Y = speech: TTS (text-to-speech engines) like VALL-E (from Microsoft), Voicebox (from Meta), SoundStorm (from Google), ElevenLabs, Bark, Coqui
– X = text, Y = music: MusicLM (from Google), RIFFUSION, AudioCraft (MusicGen, AudioGen, EnCodec from Meta), Stable Audio (from Stability AI)
– X = text, Y = song: Voicemod
– X = text, Y = 3D: DreamFusion (from Google)
– X = text, Y = 4D: MAV3D (from Meta)
– X = image, Y = 3D: CSM
– X = image, Y = audio: ImageBind [1] (from Meta, on GitHub)
– X = audio, Y = image: ImageBind [2] (from Meta)
– X = music, Y = image: MusicToImage
– X = text, Y = image & audio: ImageBind [3] (from Meta)
– X = audio & image, Y = image: ImageBind [4] (from Meta)
– X = IMU, Y = video: ImageBind (from Meta)
– X = AAS, Y = 3D-protein: AlphaFold (from Google), RoseTTAFold (from Baker Lab), ESMFold (from Meta)
– X = 3D-protein, Y = AAS: ProteinMPNN (from Baker Lab)
– X = 3D structure, Y = AAS: RFdiffusion (from Baker Lab)

and 2nd-level generative AI as a kind of middleware that makes it possible to implement agents by simplifying the combination of LLM-based 1st-level generative AI with other tools via actions (like web search, semantic search [based on embeddings and vector databases like Pinecone, Chroma, Milvus, Faiss], source code generation [REPL], calls to math tools like Wolfram Alpha, etc.), and by using special prompting techniques (like templates, Chain-of-Thought [CoT], Self-Consistency, Self-Ask, Tree of Thoughts, ReAct [Reason + Act], Graph of Thoughts) within action chains, e.g.

– ChatGPT Plugins (for simple chains)
– LangChain + LlamaIndex (for simple or complex chains)
– Toolformer

we currently (April/May/June 2023) see a 3rd level of generative AI that implements agents that can solve complex tasks through the interaction of different LLMs in complex chains, e.g.

– BabyAGI
– Auto-GPT
– Llama Lab (llama_agi, auto_llama)
– Camel, Camel-AutoGPT
– JARVIS (from Microsoft)
– Generative Agents
– ACT-1 (from Adept)
– Voyager
– SuperAGI
– GPT Engineer
– Parsel
– MetaGPT

However, older publications like Cicero may also fall into this category of complex applications. Typically, these agent implementations are (currently) not built on top of the 2nd-level generative AI frameworks. But this is going to change.
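The 2nd-level pattern described above (an LLM picks an action, a tool returns an observation, and the loop repeats until an answer is reached) can be sketched roughly as follows. This is a toy illustration in the spirit of ReAct, not the code of any framework named here: the tools, the `Action: tool[input]` text format, and the `scripted_llm` stand-in (which returns canned steps instead of calling a real model) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple


def calculator(expression: str) -> str:
    """Toy 'math tool': evaluates a sum of integers such as '2+3+4'."""
    return str(sum(int(part) for part in expression.split("+")))


def lookup(term: str) -> str:
    """Toy 'search tool' backed by a hard-coded dictionary."""
    facts = {"Llama 2 release": "July 18, 2023"}
    return facts.get(term, "unknown")


# Hypothetical tool registry: action name -> callable.
TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "lookup": lookup}


def scripted_llm(history: List[str]) -> str:
    """Stand-in for an LLM call: picks a canned step from the history so far."""
    observations = [line for line in history if line.startswith("Observation:")]
    if not observations:
        return "Action: calculator[2+3+4]"
    return "Final Answer: " + observations[-1][len("Observation: "):]


def parse_action(step: str) -> Tuple[str, str]:
    """Split 'Action: tool[input]' into (tool, input)."""
    body = step[len("Action: "):]
    tool, _, rest = body.partition("[")
    return tool, rest.rstrip("]")


def run_chain(question: str, llm: Callable[[List[str]], str], max_steps: int = 5) -> str:
    """Alternate between model steps and tool calls until a final answer appears."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm(history)
        history.append(step)
        if step.startswith("Final Answer:"):
            return step[len("Final Answer: "):]
        tool, tool_input = parse_action(step)
        history.append(f"Observation: {TOOLS[tool](tool_input)}")
    return "no answer within step limit"


print(run_chain("What is 2+3+4?", scripted_llm))
```

Swapping `scripted_llm` for a real model call is where a middleware framework would come in; the loop structure itself stays the same.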

Other, simpler applications may also be of interest in this context, for example those that just allow semantic search over private documents with a locally hosted LLM and local embedding generation, such as PrivateGPT, which is based on LangChain and Llama (functionality similar to OpenAI’s ChatGPT Retrieval plugin). Also worth noting are applications that concentrate on the code generation ability of LLMs, like GPT-Code-UI and OpenInterpreter, both open-source implementations of OpenAI’s ChatGPT Code Interpreter / Advanced Data Analysis (similar to Bard’s implicit code execution; an alternative to Code Interpreter is the plugin Noteable), or smol-ai developer (which generates the complete source code from a markup description).
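As a rough illustration of embedding-based search over private documents, the sketch below ranks documents by cosine similarity to a query. It is not how PrivateGPT or any vector database is implemented: the `embed` function is a deliberately naive word-count stand-in so the example runs without a model (it captures word overlap, not meaning), and the sample documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: List[str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, document) pairs, most similar first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented sample documents standing in for a private collection.
DOCUMENTS = [
    "invoice for cloud hosting in march",
    "meeting notes about the product roadmap",
    "travel policy for conference trips",
]

print(search("product roadmap notes", DOCUMENTS))
```

In a real setup, the retrieved passages would then be passed to the locally hosted LLM as context for answering the question.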
There is a nice overview of LLM Powered Autonomous Agents on GitHub.

The next level may then be governed by embodied LLMs and agents (like PaLM-E with E for Embodied).

Google announces PaLM API release

March 15, 2023 / admin / 0 Comments

On the same day that OpenAI released GPT-4 (March 14, 2023), Google announced the availability of the PaLM API for developers on Google Cloud [video]. Google said it is now providing access to foundation models on Google Cloud’s Vertex AI platform, initially for generating text and images, and over time also for audio and video. In addition, with the Generative AI App Builder, it introduced a way to quickly build AI-powered chat interfaces and digital assistants.

Finally, Google also made generative AI features within Google Workspace (Gmail and Google Docs) available to a limited set of trusted test users.

RIFFUSION: Stable Diffusion for Real-Time Music Generation

December 16, 2022 / admin / 0 Comments

By using the Stable Diffusion v1.5 model without any modifications, just fine-tuned on images of spectrograms paired with text, the software RIFFUSION (RIFF + diffusion) generates remarkably interesting music from text input. By interpolating in latent space, it is possible to transition smoothly from one text prompt to the next. You can try out the model here.
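To give a feel for what interpolating in latent space means, here is a small sketch of spherical linear interpolation (slerp) between two vectors, a common choice for blending latents in diffusion models. Whether RIFFUSION uses exactly this formulation is an assumption here, and the two-dimensional vectors are placeholders for real latent tensors.

```python
import math
from typing import List, Sequence


def slerp(t: float, v0: Sequence[float], v1: Sequence[float]) -> List[float]:
    """Spherical linear interpolation between v0 (t=0) and v1 (t=1)."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Vectors are (anti-)parallel: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Placeholder "latents" for two prompts; real latents have many more dimensions.
latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]
for step in range(5):
    t = step / 4
    print(t, [round(x, 3) for x in slerp(t, latent_a, latent_b)])
```

Each intermediate vector would be decoded into a spectrogram, so sweeping `t` from 0 to 1 moves the audio gradually from one prompt to the other.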

The authors provide source code on GitHub for an interactive web app and an inference server. A model checkpoint is available on Hugging Face.

There is a nice video about RIFFUSION by Alan Thompson on YouTube.

Even more surprising than getting great results from diffusion on spectrograms may be a paper by Google Research published on Dec 15, 2022. The authors render text as an image and train their model with a contrastive loss alone, calling it CLIP-Pixels Only (CLIPPO). It is a joint model that processes images and text with a single ViT (Vision Transformer) and achieves astonishing performance.
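For readers unfamiliar with the term, the sketch below shows a generic CLIP-style symmetric contrastive loss over a small batch of paired embeddings. It is not the CLIPPO authors' code: the embeddings are hand-written placeholders rather than ViT outputs, and the temperature of 0.07 is an assumed value.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(
    image_emb: Sequence[Sequence[float]],
    text_emb: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Symmetric contrastive loss: pair i of images and texts should match each other."""
    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    n = len(images)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: row i should assign high probability to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Placeholder embeddings: correctly paired vs. swapped pairs.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mixed up, which is the only training signal such a model needs.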

Stable Diffusion 2.0 officially released

November 24, 2022 / admin / 0 Comments

Stable Diffusion 2.0 has officially been released:

– New Text-to-Image diffusion models (improved quality, 512×512 and 768×768 image sizes by default)
– Super-resolution up-scaler (up to 4x upscaling for 2048×2048+ images)
– Depth-to-Image diffusion model
– Updated Inpainting diffusion model

GitHub
HuggingFace

DreamBooth: Synthesizing given subjects in new contexts

October 30, 2022 / admin / 0 Comments

Google presents DreamBooth, a technique to synthesize a subject (defined by 3-5 images) in new contexts defined by text input.

The method is based on Google’s pre-trained text-to-image model Imagen, which is not publicly available. However, source code based on Stable Diffusion already exists on GitHub.

Imagic: Manipulating images by text

October 27, 2022 / admin / 0 Comments

Google researchers present Imagic, a method to edit images by text input.

The method is based on Google’s pre-trained image generator Imagen, which is not publicly available. However, source code based on Stable Diffusion already exists on GitHub.

A recent publication by the University of California, Berkeley [InstructPix2Pix] goes in a similar direction and shows even more impressive results.