Tag: STT

OpenAI DevDay Announcements

November 7, 2023 / admin / 1 Comment

OpenAI rolled out on its DevDay an array of transformative updates and features [blog post, keynote recording]. Here’s a succinct rundown:

Recap: ChatGPT release Nov 30, 2022 with GPT-3.5. GPT-4 release in March 2023. Voice input/output, vision input with GPT-4V, text-to-image with DALL-E 3, ChatGPT Enterprise with enterprise security, higher speed access, and longer context windows. 2M developers, 92% of Fortune 500 companies building products on top of GPT, 100M weekly active users.
New GPT-4 Turbo: OpenAI’s most advanced AI model, 128K context window, knowledge up to April 2023. Reduced pricing: $0.01/1K input tokens (3x cheaper), $0.03/1K output tokens (2x cheaper). Improved function calling (multiple functions in single message, always return valid functions with JSON mode, improved accuracy on returning right function parameters). More deterministic model output via reproducible outputs beta. Access via gpt-4-1106-preview, stable release pending.
GPT-3.5 Turbo Update: Enhanced gpt-3.5-turbo-1106 model with 16K default context. Lower pricing: $0.001/1K input, $0.002/1K output. Fine-tuning available, reduced token prices for fine-tuned usage (input token prices 75% cheaper to $0.003/1K, output token prices 62% cheaper to $0.006/1K). Improved function calling, reproducible outputs feature.
Assistants API: Beta release for creating AI agents in applications. Supports natural language processing, coding, planning, and more. Enables persistent Threads, includes Code Interpreter, Retrieval, Function Calling tools. Playground integration for no-code testing.
Multimodal Capabilities: GPT-4 Turbo supports visual inputs in Chat Completions API via gpt-4-vision-preview. Integration with DALL·E 3 for image generation via Image generation API. Text-to-speech (TTS) model with six voices introduced.
Customizable GPTs in ChatGPT: New feature called GPTs allowing integration of instructions, data, and capabilities. Enables calling developer-defined actions, control over user experience, streamlined plugin to action conversion. Documentation provided for developers.

OpenAI launches ChatGPT app for iOS

May 19, 2023 / admin / 0 Comments

OpenAI has officially launched the ChatGPT app for iOS users in the US. The app comes with a range of notable features:

Free of Charge: The ChatGPT app can be downloaded and used free of cost.
Sync Across Devices: Users can maintain their chat history consistently across multiple devices.
Voice Input via Whisper: The app includes integration with Whisper, OpenAI’s open-source speech-recognition system, allowing users to input via voice commands.
Exclusive Benefits for ChatGPT Plus Subscribers: Those who subscribe to ChatGPT Plus can utilize GPT-4’s enhanced capabilities. They also receive early access to new features and benefit from faster response times.
Initial US Rollout: The app is initially launching in the US, with a plan to expand its availability to other countries in the upcoming weeks.
Android Version Coming Soon: OpenAI has confirmed that Android users can expect to see the ChatGPT app on their devices in the near future. Further updates are expected soon.

3rd-Level of Generative AI

April 11, 2023 / admin / 0 Comments

Defining

1st-level generative AI as applications that are directly based on X-to-Y models (foundation models that build a kind of operating system for downstream tasks) where X and Y can be text/code, image, segmented image, thermal image, speech/sound/music/song, avatar, depth, 3D, video, 4D (3D video, NeRF), IMU (Inertial Measurement Unit), amino acid sequences (AAS), 3D-protein structure, sentiment, emotions, gestures, etc., e.g.

_{X = text, Y = text: LLM-based chatbots like ChatGPT (from OpenAI based on LLMs GPT-3.5 [4K context] or GPT-4 [8K/32K context]), Bing Chat (GPT-4), Bard (from Google, based on PaLM 2), Claude (from Anthropic [100K context]), Llama2 (from Meta), Falcon 180B (from Technology Innovation Institute), Alpaca, Vicuna, OpenAssistant, HuggingChat (all based on LLaMA [GitHub] from Meta), OpenChatKit (based on EleutherAI’s GPT-NeoX-20B), CarperAI, Guanaco, My AI (from Snapchat), Tingwu (from Alibaba based on Tongyi Qianwen), (other LLMs: MPT-7B and MPT-30B from Mosaic [65K context, commercially usable], Orca, Open-LLama-13b), or coding assistants (like GitHub Copilot / OpenAI Codex, AlphaCode from DeepMind, CodeWhisperer from Amazon, Ghostwriter from Replit, CodiumAI, Tabnine, Cursor, Cody (from Sourcegraph), StarCoder from Big Code Project led by Hugging Face, CodeT5+ from Salesforce, Gorilla, StableCode from Stability.AI, Code Llama from Meta), or writing assistants (like Jasper, Copy.AI), etc.}
_{X = text, Y = image: Dall-E (from OpenAI), Midjourney, Stable Diffusion (from Stability.AI), Adobe Firefly, DeepFloyd-IF (from Deep Floyd, [GitHub, HuggingFace]), Imagen and Parti (from Google), Perfusion (from NVIDIA)}
_{X = text, Y = 360° image: Skybox AI (from Blockade Labs)}
_{X = text, Y = 3D avatar: Tafi}
_{X = text, Y = avatar lip sync: Ex-Human, D-ID, Synthesia, Colossyan, Hour Once, Movio, YEPIC-AI, Elai.io}
_{X = speech + face video, Y = synched audio-visual: Lalamu}
_{X = text, Y = video: Gen-2 (from Runway Research), Imagen-Video (from Google), Make-A-Video (from Meta), or from NVIDIA}
_{X = text, Y = video game: Muse & Sentis (from Unity)}
_{X = image, Y = text: GPT-4 (from OpenAI), LLaVA}
_{X = image, Y = segmented image: Segment Anything Model (SAM by Meta)}
_{X = speech, Y = text: STT (speech-to-text engines) like Whisper (from OpenAI), MMS [GitHub] (from Meta), Conformer-2 (from AssemblyAI)}
_{X = text, Y = speech: TTS (text-to-speech engines) like VALL-E (from Microsoft), Voicebox (from Meta), SoundStorm (from Google), ElevenLabs, Bark, Coqui}
_{X = text, Y = music: MusicLM (from Google), RIFFUSION, AudioCraft (MusicGen, AudioGen, EnCodec from Meta), Stable Audio (from Stability.ai)}
_{X = text, Y = song: Voicemod}
_{X = text, Y = 3D: DreamFusion (from Google)}
_{X = text, Y = 4D : MAV3D (from Meta)}
_{X = image, Y = 3D : CSM}
_{X = image, Y = audio: ImageBind [1] (from Meta, on GitHub)}
_{X = audio, Y = image: ImageBind [2] (from Meta)}
_{X = music, Y = image: MusicToImage}
_{X = text, Y = image & audio: ImageBind [3] (from Meta)}
_{X = audio & image, Y = image: ImageBind [4] (from Meta)}
_{X = IMU, Y = video: ImageBind (from Meta)}
_{X = AAS, Y = 3D-protein: AlphaFold (from Google), RoseTTAFold (from Baker Lab), ESMFold (from Meta)}
_{X = 3D-protein, Y = AAS: ProteinMPNN (from Baker Lab)}
_{X = 3D structure, Y = AAS: RFdiffusion (from Baker Lab)}

and 2nd-level generative AI that builds some kind of middleware and allows to implement agents by simplifying the combination of LLM-based 1st-level generative AI with other tools via actions (like web search, semantic search [based on embeddings and vector databases like Pinecone, Chroma, Milvus, Faiss], source code generation [REPL], calls to math tools like Wolfram Alpha, etc.), by using special prompting techniques (like templates, Chain-of-Thought [COT], Self-Consistency, Self-Ask, Tree Of Thoughts, ReAct [Reason + Act], Graph of Thoughts) within action chains, e.g.

_{ChatGPT Plugins (for simple chains)}
_{LangChain + LlamaIndex (for simple or complex chains)}
_ToolFormer

we currently (April/May/June 2023) see a 3rd-level of generative AI that implements agents that can solve complex tasks by the interaction of different LLMs in complex chains, e.g.

_BabyAGI
_Auto-GPT
_{Llama Lab (llama_agi, auto_llama)}
_{Camel, Camel-AutoGPT}
_{JARVIS (from Microsoft)}
_{Generative Agents}
_{ACT-1 (from Adept)}
_Voyager
_SuperAGI
_{GPT Engineer}
_Parsel
_MetaGPT

However, older publications like Cicero may also fall into this category of complex applications. Typically, these agent implementations are (currently) not built on top of the 2nd-level generative AI frameworks. But this is going to change.

Other, simpler applications that just allow semantic search over private documents with a locally hosted LLM and embedding generation, such as e.g. PrivateGPT which is based on LangChain and Llama (functionality similar to OpenAI’s ChatGPT-Retrieval plugin), may also be of interest in this context. And also applications that concentrate on the code generation ability of LLMs like GPT-Code-UI and OpenInterpreter, both open-source implementations of OpenAI’s ChatGPT Code Interpreter/AdvancedDataAnalysis (similar to Bard’s implicit code execution; an alternative to Code Interpreter is plugin Noteable), or smol-ai developer (that generates the complete source code from a markup description) should be noticed.
There is a nice overview of LLM Powered Autonomous Agents on GitHub.

The next level may then be governed by embodied LLMs and agents (like PaLM-E with E for Embodied).

OpenAI releases ChatGPT and Whisper APIs

March 2, 2023 / admin / 0 Comments

On March 01, 2023, OpenAI announced the releases of APIs for ChatGPT (published on Nov 30, 2022) and the automatic speech recognition (ASR) engine Whisper for speech-to-text (STT) transcription (and translation) that was open-sourced in Sept 2022.

The ChatGPT model family is called gpt-3.5-turbo and costs just $0.002 per 1k tokens, which is 10 times cheaper than the existing GPT-3.5 models. Instead of consuming unstructured text as traditionally done by GPT, the ChatGPT models consume a sequence of messages with metadata following a new format called Chat Markup Language (ChatML). The number of tokens (tokens in prompt + tokens in response as available via response[‘usage’][‘total_tokens’]) is restricted to 4096. Notice that there is no possibility to fine-tune gpt-3.5-turbo models.

For Whisper the large-v2 model is now available through an API for a price of $0.006 per minute. The API contains endpoints for transcriptions (transcribes in source language) and translations (transcribes into English).

In addition, the possibility of dedicated instances for professional users was announced that can make economical sense beyond ~450M tokens per day.

A significant change that was made in the Terms of Service and Usage Polices is that data submitted to the API is no longer used for service improvements (e.g. model training) unless an organization opts in. Before it was necessary to opt-out.

OpenAI releases exceptional ASR

September 29, 2022 / admin / 0 Comments

OpenAI open-sources Whisper, an Automatic Speech Recognition (ASR) software for Speech To Text (STT) transcription with exceptional performance.

Tag: STT

OpenAI DevDay Announcements

OpenAI launches ChatGPT app for iOS

3rd-Level of Generative AI

OpenAI releases ChatGPT and Whisper APIs

OpenAI releases exceptional ASR

Recent Posts

Recent Comments

Archives

Categories