Episodes

  • o3 breaks (some) records, but AI becomes pay-to-win
    Apr 25 2025

    A green card, o3 vs Gemini 2.5, 6 Benchmarks and a whole bunch of my thoughts on what on earth is happening in AI, from here to 2030. Plus, how AI is becoming pay-to-win, and why. Crazy times, 14 mins probably wasn’t enough.

    https://app.grayswan.ai/ai-explained

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:33 - FictionLiveBench
    01:37 - PHYBench
    02:14 - SimpleBench
    02:54 - Virology Capabilities Test
    03:13 - Mathematics Performance
    04:29 - Vision Benchmarks
    05:43 - V* and how o3 works
    06:44 - Revenue and costs for you
    08:54 - Expensive RL and trade-offs
    09:40 - How to spend the OOMs
    13:27 - Gray Swan Arena

    Green Card: https://techcrunch.com/2025/04/25/an-openai-researcher-who-worked-on-gpt-4-5-had-their-green-card-denied/
    PHYBench: https://arxiv.org/pdf/2504.16074
    Virologytest: https://www.virologytest.ai/
    How o3 Vision Works: https://arxiv.org/pdf/2312.14135 https://x.com/sainingxie/status/1912570624523829573
    Visual puzzles: https://neulab.github.io/VisualPuzzles/
    Fiction Bench: https://x.com/ficlive/status/1912863028141244850
    https://geobench.org/
    https://simple-bench.com/
    AIME 2025: https://openai.com/index/introducing-o3-and-o4-mini/
    USAMO: https://x.com/mbalunovic/status/1914398518896193747
    NaturalBench: https://linzhiqiu.github.io/papers/naturalbench/
    Where’s Waldo: https://uk.pinterest.com/pin/492792384225896298/
    IMO and AlphaProof: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
    Crazy Revenue: https://www.theinformation.com/articles/openai-forecasts-revenue-topping-125-billion-2029-agents-new-products-gain?rc=sy0ihq
    Number of Users: https://www.theinformation.com/briefings/googles-gemini-user-numbers-revealed-court?rc=sy0ihq
    Subscriptions pay to win: https://www.forbes.com/sites/paulmonckton/2025/04/23/google-leak-reveals-new-gemini-ai-subscription-levels/
    GPU Trade-offs: https://x.com/sama/status/1915098951067554030
    RL Scale-up Amodei: https://www.darioamodei.com/post/on-deepseek-and-export-controls
    Log-linear Returns: https://x.com/bobmcgrewai/status/1895228291981943265
    2030 Scaling: https://epoch.ai/blog/can-ai-scaling-continue-through-2030
    Model Size: https://x.com/slow_developer/status/1874554473256997201
    Adam on AGI: https://x.com/TheRealAdamG/status/1913998366632968381
    Papers on Patreon: https://arxiv.org/pdf/2502.01839
    https://arxiv.org/pdf/2504.13837
    Chollet Quote: https://x.com/fchollet/status/1912934762580447447
    OpenSim: https://opensim.stanford.edu/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    15 m
  • o3 and o4-mini - they’re great, but easy to over-hype
    Apr 16 2025

    Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini. Not just the system cards, benchmarks, and my own tests, but some you may not have seen before. Yes, they can whip up amazing front-end in a few seconds, but you always have to ask what is in their data. Either way, they prove the gains from RL are just beginning…

    https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained

    AI Insiders ($9!): https://www.patreon.com/AIExplained


    Chapters:
    00:00 - o3 and o4-mini


    https://simple-bench.com/

    Plus, Teams and Pro, plus token count: https://x.com/btibor91/status/1912568994512662679

    System Card: https://openai.com/index/o3-o4-mini-system-card/

    Release Notes: https://openai.com/index/introducing-o3-and-o4-mini/

    https://deepmind.google/technologies/gemini/pro/

    https://x.com/DeryaTR_/status/1912558350794961168

    https://x.com/polynoamial/status/1912564068168450396

    API Pricing: https://openai.com/api/pricing/

    https://aider.chat/docs/leaderboards/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    14 m
  • ‘Speaking Dolphin’ to AI Data Dominance, 4.1 + Kling 2: 7 Developments Critically Analysed
    Apr 16 2025

    This pod won’t just be about the release of GPT 4.1 in the last 48 hours, o3 build-up, Kling 2.0, a sneak peek at the next OpenAI model, or even the new Dolphin language tool. It will be about 7 such stories that contextualise where we are in AI and what is happening.

    https://www.emergentmind.com/


    Chapters:

    00:00 - Introduction

    00:30 - Kling 2.0

    01:35 - GPT 4.1

    05:25 - o3 Build-up

    07:37 - ‘Product Company’

    09:31 - Safe Superintelligence

    10:54 - DolphinGemma

    13:16 - Data Dominance?


    Kling 2.0: https://app.klingai.com/global/release-notes


    Dolphin Gemma: https://blog.google/technology/ai/dolphingemma/?s=09


    https://openai.com/index/gpt-4-1/


    OpenAI o3 Build-up The Information: https://www.theinformation.com/articles/openais-latest-breakthrough-ai-comes-new-ideas?rc=sy0ihq


    Physical reasoning: https://x.com/a_karvonen/status/1911839968990814503


    Fiction Live.bench: https://x.com/ficlive/status/1911853409847906626


    Altman Ted: https://www.youtube.com/watch?v=5MWT_doo68k


    https://simple-bench.com/try-yourself


    https://aider.chat/docs/leaderboards/


    4.5: https://www.youtube.com/watch?v=6nJZopACRuQ


    Geospatial reasoning: https://research.google/blog/geospatial-reasoning-unlocking-insights-with-generative-ai-and-multiple-foundation-models/


    Pioneers: https://x.com/OpenAIDevs/status/1910017976256119151

    Evals: https://www.youtube.com/watch?v=scsW6_2SPC4

    Anthropic Updates: https://www.bloomberg.com/news/articles/2025-04-15/anthropic-is-readying-a-voice-assistant-feature-to-rival-openai?srnd=phx-ai

    https://x.com/sethsaler/status/1912188383457059301


    https://techcrunch.com/2025/04/12/openai-co-founder-ilya-sutskevers-safe-superintelligence-reportedly-valued-at-32b/

    https://ai.meta.com/blog/llama-4-multimodal-intelligence/

    https://deepmind.google/technologies/gemini/pro/

    https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/

    https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/

    OpenAI Documentary: https://www.patreon.com/posts/one-machine-to-121940490

    20 m
  • AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax +‘Superintelligence in 2027’...
    Apr 7 2025

    The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Amodei’s cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well.

    Weights & Biases: https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained


    DeepSeek Doc: https://www.patreon.com/posts/openai-is-not-r1-125869969

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:47 - Stock Crash
    02:28 - Llama 4
    10:55 - o3 News
    11:59 - OpenAI non-profit?
    13:13 - AI 2027

    Llama 4 Release: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

    Dario Amodei Comments: https://www.youtube.com/watch?v=esCSpbDPJik

    Knowledge Cut-off: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/

    Aider Polyglot: https://aider.chat/docs/leaderboards/

    Gemini 1.5: https://arxiv.org/pdf/2403.05530

    Fiction-LiveBench: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

    OpenAI Valuation: https://www.nytimes.com/2025/03/31/technology/openai-valuation-300-billion.html?login=smartlock&auth=login-smartlock

    OpenAI Cybersecurity: https://www.bloomberg.com/news/articles/2024-01-16/openai-working-with-us-military-on-cybersecurity-tools-for-veterans

    Deep research System Card: https://cdn.openai.com/deep-research-system-card.pdf

    https://openai.com/index/paperbench/

    AI 2027: https://ai-2027.com/

    METR Paper: https://arxiv.org/pdf/2503.14499

    OpenAI non-profit: https://openai.com/index/nonprofit-commission-guidance/

    NYT Piece: https://www.nytimes.com/2025/04/03/technology/ai-futures-project-ai-2027.html?unlocked_article_code=1.804._yKi.QhwOp15Q3tcU&smid=url-share&s=09

    Kokotajlo predictions 2021: https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like

    https://simple-bench.com/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

    24 m
  • Gemini 2.5 Pro - It’s a Smart Chatbot … (New Simple High Score)
    Mar 28 2025

    Gemini gets a new record on Simple Bench, and several other benchmarks. I’ll go deep to explore its nuances, including how it deceptively reverse engineers answers, does better on certain coding benchmarks than others, may have a universal ‘conceptual language’ …

    https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained

    … and more. Plus practical tips, a note on security and Kling vs Veo 2 guest appearance.


    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:36 - Fiction Bench
    02:41 - Practicality - YouTube urls + Security - cut-off date
    03:42 - Coding
    06:22 - WeirdML Bench
    07:01 - Simple Bench Record High
    11:23 - Reverse Engineering!
    13:22 - Anthropic Paper
    17:49 - 3 Caveats

    Gemini 2.5 Updated: https://deepmind.google/technologies/gemini/

    Fiction Live Bench: https://fiction.live/stories/Fiction-liveBench-Feb-19-2025/oQdzQvKHw8JyXbN87

    https://simple-bench.com/

    WeirdML: https://htihle.github.io/weirdml.html
    https://x.com/htihle/status/1905014058228625542

    Anthropic Thoughts: https://www.anthropic.com/research/tracing-thoughts-language-model
    https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-cot

    https://aistudio.google.com/prompts/new_chat

    Search Study: https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

    Live bench: https://livebench.ai/#/
    Paper: https://arxiv.org/pdf/2406.19314

    LiveCode Bench: https://livecodebench.github.io/

    SWE-Verified: https://arxiv.org/pdf/2310.06770


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    21 m
  • Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI
    Mar 25 2025

    Gemini 2.5 is out, on the same day as the new DeepSeek V3 (which should power DeepSeek R2). Do both models prove AI is being commoditized? Let’s find out, on this blockbuster day of AI releases. Plus exclusives from The Information, Simple indications, Vista Bench, LM Arena and more…

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:15 - Gemini 2.5 Benchmarks
    05:46 - Long Context, Simple indication
    07:08 - New DeepSeek V3-0324
    09:11 - Microsoft MAI
    11:48 - 90% of code but new Claude jobs

    ‘World’s most powerful model’: https://x.com/OfficialLoganK/status/1904580368432586975

    Gemini 2.5 Release Notes: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

    ‘Commoditized’: https://the-decoder.com/microsoft-ceo-satya-nadella-says-ai-models-are-getting-commoditized/

    Microsoft Information report: https://www.theinformation.com/articles/microsofts-ai-guru-wants-independence-from-openai-thats-easier-said-than-done?rc=sy0ihq

    LMarena: https://x.com/lmarena_ai/status/1904581128746656099/photo/1

    Free for now: https://x.com/btibor91/status/1904578053537476628

    Vista Bench: https://scale.com/leaderboard/visual_language_understanding

    DeepSeek V3: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

    Claude Plays Pokemon: https://www.twitch.tv/claudeplayspokemon
    Amodei: 100% Coding: https://www.youtube.com/watch?v=esCSpbDPJik&t=3017s

    Anthropic Jobs: https://job-boards.greenhouse.io/anthropic/jobs/4020717008

    Microsoft Money from Onslaught: https://www.972mag.com/microsoft-azure-openai-israeli-army-cloud/

    https://simple-bench.com/

    Release Date Comments: https://x.com/zacharynado/status/1904647277861318979


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    14 m
  • Manus AI - The Calm Before the Hypestorm … (vs Deep Research + Grok 3)
    Mar 13 2025

    Is Manus AI the memecoin of the AI world, or legit? I’ll compare it to OpenAI’s Deep Research, Operator, Grok 3 DeepSearch and more to find out. I’ll also let you in on some of the secrets of what makes a good hype campaign, the estimated costs of Manus AI, and where it is strong. Other news (yes, Gemini image editing and research hacking, I mean you), will have to wait for a few more hours, as millions enquire about Manus AI.

    https://app.grayswan.ai/arena

    AI Insiders ($9!): https://www.patreon.com/AIExplained
    Patreon Vid: https://www.patreon.com/posts/4-ai-trends-in-123857767

    Chapters:
    00:00 - Introduction
    00:46 - Hype Campaign
    02:40 - Single, Public Benchmark
    03:12 - What is Manus AI?
    04:22 - Test 1
    05:12 - Cost and Rate Limits
    06:15 - Test 2 vs Deep Research + Grok 3 DeepSearch
    08:24 - Test 3 (not AGI)
    11:10 - 4 Trends in AI in 2025
    11:37 - Hype Works

    Manus AI: https://manus.im/app

    Xiao Hong Interview: https://www.chinatalk.media/p/manus-chinas-latest-ai-sensation

    Gaia Benchmark: https://openreview.net/pdf?id=fibxvahvs3
    MIT Report: https://www.technologyreview.com/2025/03/11/1113133/manus-ai-review/

    Information Report: https://www.theinformation.com/articles/anthropics-claude-drives-strong-revenue-growth-while-powering-manus-sensation?rc=sy0ihq

    Hype Examples: https://x.com/Saboo_Shubham_/status/1898425707401031940
    https://x.com/EHuanglu/status/1899110687902978373
    https://x.com/AJs_AI/status/1898756132384178291

    Mistakes: https://x.com/TheXeophon/status/1898737178273829220

    Tools and Code: https://x.com/peakji/status/1898994802194346408

    https://operator.chatgpt.com/




    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

    13 m
  • GPT 4.5 - not so much wow
    Feb 28 2025

    GPT 4.5 is here, and do you remember when AI lab CEOs like Sam Altman and Dario Amodei were betting everything on scaling up base models like this one? Well let’s find out what would have happened if the future of AI rested on models like GPT 4.5. You’ll see all the benchmarks, highlights of the paper, emotional intelligence and humor tests, Simple Bench results (reddit was an unreliable source), and why it’s not all bad news for OpenAI.

    https://www.emergentmind.com/

    AI Insiders (now $9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:04 - Details and Benchmarks
    03:04 - Emotional intelligence?
    08:37 - Creative writing?
    11:40 - Visual reasoning and Pricing
    12:41 - Simple Performance
    16:01 - End of Pretraining Scaling?
    17:03 - CEO Hype
    18:11 - System Card Highlights
    23:32 - Karpathy Reaction

    GPT 4.5 System card: https://cdn.openai.com/gpt-4-5-system-card-2272025.pdf
    Release Notes: https://openai.com/index/gpt-4-5-system-card/
    Altman Hype: https://x.com/sama/status/1891533802779910471
    Details: https://openai.com/index/introducing-gpt-4-5/ https://x.com/OpenAI/status/1895219596317335792
    End of an Era: https://x.com/wgussml/status/1895187231666774377
    Anthropic Original Claim: https://techcrunch.com/2023/04/06/anthropics-5b-4-year-plan-to-take-on-openai/
    Smell: https://x.com/rapha_gl/status/1895213014699385082
    Bob McGrew: https://x.com/bobmcgrewai/status/1895228291981943265
    Deep Research System Card: https://cdn.openai.com/deep-research-system-card.pdf
    Reddit: https://www.reddit.com/r/singularity/comments/1izu1t7/gpt45_crushes_simple_bench/
    API Pricing: https://openai.com/api/pricing/
    LiveStream: https://www.youtube.com/watch?v=cfRYp0nItZ8&t=1s
    https://simple-bench.com/


    Karpathy Comparison: https://x.com/karpathy/status/1895213020982472863
    https://x.com/karpathy/status/1895337579589079434


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    25 m