4 Million Watch 4th China AIGC Industry Summit: Key Insights from AI Leaders

Keynote Highlights

Fang Han, Chairman and CEO of Kunlun Tech

Fang Han discussed how individuals and enterprises should respond to the impact of Agents. He noted that closed-loop, error-tolerant skills are easily replaced by AI, while skills that rest on judgment and taste will endure. Token consumption has become a key metric: ordinary employees use millions to tens of millions of tokens per month, AI coding and technical staff use hundreds of millions to billions, and heavy Agent users easily consume tens of billions. Tokens, he said, are now the 'electricity consumption' of the AI era. AI also compresses the career ladder: the middle rungs disappear, leaving only beginners and experts. He named five types of people AI can never replace: storytellers, idea creators, those who define beauty, system builders, and paradigm reformers. For most industries adopting AI, it is safer to be second than first; in IT, however, only first place survives. In his view, AI brings everyone back to the same starting line.

Yi Zhengchao, CEO of Funshion Online

Yi Zhengchao discussed AI programming and AI video, arguing that crowd creation is the core lever of AI productivity. In his view, AI has significantly changed the entertainment industry in several ways: costs and barriers are lower and supply is richer, but the challenges are greater; content takes more diverse forms, including web novels, IP images, video, interactive content, and games; AI empowers business operations as well as creation; content creation emphasizes filtering imaginative ideas, which makes crowd creation inevitable; and AI creation blurs the boundary between creation and consumption. As an AI application company, Funshion Online follows five principles: trust AI but do not touch models; AI comics are hot, but they are not all of AI video; success comes more from AI programming than from AI video; amplifying individuals is important, but amplifying the organization is more important; and Agents are strong, but crowd creation is the lever. Crowd creation, he argued, is the social structure of the AI era. A company is no longer a container for a super employee or super Agents, but an organizer of intelligent resources that draws on external collective wisdom. The crowd creation network of employees, digital employees, and external partners forms a socialized ecosystem. AI amplifies execution but also self-indulgence; the antidote is to deliver results.

Lin Dahua, Executive Vice President and Chief Scientist of SenseTime

Lin Dahua spoke on the topic 'From Multimodal Unification to Spatial Intelligence: Towards a New AI Frontier That Can Perceive, Generate, and Act.' He emphasized that long-term vision determines how far we go. In enterprise AI deployment, the model itself is not the key; the real bottleneck is connecting varied forms of data such as charts, Excel files, images, videos, web pages, and knowledge bases, work that often accounts for over 70% of AI costs. Agents are the engine of this era, but their effectiveness in real scenarios depends on multimodal capability. SenseTime's Xiaohuanxiong (Raccoon) achieved rapid growth by closing the loop end to end, from messy data to deliverable results. Beyond the digital world lies physical space. Even the best multimodal models are fragile in real physical spaces, which is the core bottleneck for general-purpose robots. To achieve spatial intelligence, we must re-understand the world from first principles. The key is to integrate language modeling with visual understanding and generation in a single model. SenseTime's new model, SenseNova U1, unifies understanding, reasoning, and generation on a new foundation, enabling seamless switching between language and vision. Unification opens new expressive possibilities, infusing thought into image generation and imagination into reasoning models. True agents of the future should analyze digital space and act in physical space simultaneously, within a single brain.

Deng Yafeng, Vice President of Shanghai Giant Group and CEO of EverMind

Deng Yafeng shared insights on self-evolution driven by long-term memory, tracing the path from tool AI to digital productivity systems. He called Lobster (an AI Agent) the iPhone 4 of the Agent era: it defines a product paradigm in which users feel they have an AI J.A.R.V.I.S. that works 72 hours non-stop. It is not perfect, however, and needs constant iteration. In his view, Claude 4 marks a key milestone in Agents' move toward autonomy, shifting the paradigm from chat to agent, enabling Anthropic to surpass OpenAI, and revolutionizing SaaS. Instead of delivering processes and interfaces, SaaS now delivers through messages. Agents have two key features: autonomy and self-evolution. Long-term memory supports both by doing three things: abstracting and summarizing rapidly expanding context; remembering who the user is, including their preferences, goals, and values; and proactively predicting what the user might need on that basis. As models grow stronger, memory becomes the differentiated business asset that is easiest to accumulate. If AI truly knows you well, it becomes a new entry point for intent distribution. Personal memory should be syncable across different Agents such as Codex, Claude Code, and Lobster.
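
The three jobs assigned to long-term memory above (compacting a growing context, remembering the user, anticipating needs) can be illustrated with a minimal sketch. This is not EverMind's implementation: the class LongTermMemory and its methods are hypothetical, truncation stands in for a model-written summary, and a simple frequency count stands in for real intent prediction.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LongTermMemory:
    """Toy long-term memory; every name here is hypothetical."""
    max_turns: int = 4  # raw turns kept before compaction
    summary: list = field(default_factory=list)
    recent: list = field(default_factory=list)
    profile: dict = field(default_factory=dict)
    topic_counts: Counter = field(default_factory=Counter)

    def observe(self, turn: str, topic: str) -> None:
        """Record one turn and compact the context when it grows too long."""
        self.recent.append(turn)
        self.topic_counts[topic] += 1
        if len(self.recent) > self.max_turns:
            # Stand-in for a model-written summary: keep only the first
            # five words of the oldest turn and drop its full text.
            oldest = self.recent.pop(0)
            self.summary.append(" ".join(oldest.split()[:5]))

    def remember(self, key: str, value: str) -> None:
        """Store a durable fact about the user (preference, goal, value)."""
        self.profile[key] = value

    def predict_next_topic(self) -> Optional[str]:
        """Guess what the user may need next: the most frequent topic so far."""
        if not self.topic_counts:
            return None
        return self.topic_counts.most_common(1)[0][0]


memory = LongTermMemory(max_turns=2)
memory.remember("preferred_language", "Python")
memory.observe("Please refactor the billing module for me", topic="coding")
memory.observe("Now add unit tests", topic="coding")
memory.observe("Book a flight to Shanghai", topic="travel")
print(memory.summary, memory.recent, memory.predict_next_topic())
```

Because the state is plain data (a summary, a profile, and counts), it could in principle be serialized and shared between Agents, which is the kind of portability the talk calls for.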

Wang Xiaoye, Technical Director of Product Technology at AWS China

Wang Xiaoye discussed bridging the gap to enterprise-grade AI Agents. He noted that building a Lobster for personal use differs from building one for enterprise use: an enterprise must make Agents safe, trustworthy, and stable. Lobster shows what is possible, but enterprises need a bridge to production deployment. AWS believes building Agentic AI requires five layers, from bottom to top: inference compute, multi-model selection, enterprise data and knowledge, an Agent building platform, and ready-to-use Agent applications. Coding Agents are mature; Working Agents are the next breakthrough. AWS's answer is Amazon Quick, which lets employees use Agents securely, flexibly, and freely. Agents pose new challenges for data management: memory must support sharing, isolation, and coexistence, while erroneous knowledge, outdated information, and contradictions impair judgment. Complaints about token cost often stem from feeding models useless information. In the Agent scenario, 'Harness' refers to all the software infrastructure other than the model. Think of the model as a CPU; the Harness provides the usable operating system; the final Agent appears as a complete application. Amazon Bedrock AgentCore is such a Harness, allowing users to focus on business value.
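
The CPU and operating system analogy can be made concrete with a small sketch. It does not use or represent Amazon Bedrock AgentCore; the Harness class, the stub_model function, and the "CALL:" / "FINAL:" reply convention are all assumptions made up for illustration. The point is only that tool access, step limits, and logging live in the Harness, not in the model.

```python
from typing import Callable

# The "CPU": any function mapping a prompt to text.
Model = Callable[[str], str]


def stub_model(prompt: str) -> str:
    """Hypothetical model: requests a tool until a tool result is in the prompt."""
    if "TOOL_RESULT:" in prompt:
        return "FINAL: " + prompt.split("TOOL_RESULT:", 1)[1].strip()
    return "CALL: word_count"


class Harness:
    """The "operating system": tools, limits, and the control loop around the model."""

    def __init__(self, model: Model, tools: dict, max_steps: int = 3):
        self.model = model
        self.tools = tools          # allowlist of callable tools
        self.max_steps = max_steps  # guard against runaway loops
        self.log = []               # simple audit trail of model replies

    def run(self, task: str) -> str:
        prompt = task
        for _ in range(self.max_steps):
            reply = self.model(prompt)
            self.log.append(reply)
            if reply.startswith("FINAL:"):
                return reply[len("FINAL:"):].strip()
            if reply.startswith("CALL:"):
                name = reply[len("CALL:"):].strip()
                if name not in self.tools:
                    return f"refused: unknown tool {name!r}"
                result = self.tools[name](task)
                # Feed back only the tool result, not the whole history.
                prompt = f"{task}\nTOOL_RESULT: {result}"
        return "stopped: step limit reached"


agent = Harness(stub_model, tools={"word_count": lambda text: str(len(text.split()))})
answer = agent.run("count the words in this sentence")
print(answer)
```

Swapping stub_model for a real model client would leave the Harness unchanged, which is the separation the analogy describes.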

GenAI Talk: Dialogue with Shen Yujun, Chief Scientist of Ant Group's Lingbo Technology

Shen Yujun discussed the shift from AIGC to AIGA (AI Generated Action) in the second half of AI 2.0. He pointed out that large models have capitalized on decades of internet data, whereas physical-world data for robots is largely missing. The key is to move data from the digital world to the physical one. For a universal robot brain in the physical world, spatial perception is critical: it converts sensor input into more useful information. On the debate between VLA and world models, he said neither will be the final answer; as robot data accumulates, the two will merge. His prediction: in 1–2 years, benchmark examples will emerge; in 2–3 years, these examples will be replicated across industries; after that, robots will enter the consumer market and eventually reach homes. The ChatGPT moment for embodied AI will come when anyone can generate data for robots.

Qiu Xipeng, Distinguished Professor at Fudan University, Assistant Dean of Shanghai Innovation Institute, Founder of MOSS Intelligence

Qiu Xipeng presented on MOSS multimodal models and reasoning optimization. He said the next direction for AI is multimodal: an era of generalized context awareness in which models understand a broader context. Real-time interactive multimodal models need to handle longer contexts, process complex visual and audio information, and meet stricter real-time inference requirements. Multimodal token consumption far exceeds that of text and coding. MOSS-VL uses cross-attention to support streaming video input while the text model fetches video information on demand. MOSS-Audio aims to understand speech content, scenes, complex reasoning, and music in a broader context, and it has reached the level of leading specialized ASR models. MOSS-TTS covers voice synthesis, lightweight deployment, sound design, and real-time performance, using a pure Transformer architecture. After it was open-sourced, MOSS-TTS downloads exceeded 1 million. Future models will unify visual understanding, speech understanding, and speech output in one end-to-end model for contextual interaction.
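
The idea of a text model fetching video information on demand through cross-attention can be sketched generically. The code below is plain scaled dot-product cross-attention in pure Python, without the learned projections a real model would have; it is not MOSS-VL's architecture, and the function names and toy vectors are assumptions. Queries come from text tokens, while keys and values come from whichever video frames have arrived so far.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (text token) takes a weighted average of the values (frames)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this text token and every frame seen so far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Toy stream: two frames have arrived; appending to these lists models new frames.
frame_keys = [[1.0, 0.0], [0.0, 1.0]]
frame_values = [[10.0, 0.0], [0.0, 10.0]]
text_query = [[4.0, 0.0]]  # a text token that "asks about" the first frame

out = cross_attention(text_query, frame_keys, frame_values)
print(out)
```

Since the frame lists are only read when a query is issued, new frames can be appended between queries without reprocessing the text, which is one way to picture streaming input.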

Hu Weiqi, ToB Commercialization Lead for MiniMax China

Hu Weiqi shared MiniMax's exploration of bringing AI to everyone. For MiniMax, AGI means intelligence with everyone: AI that is affordable for all. This requires developing models and applications in parallel and giving ToC and ToB equal weight. AI companies must first close the loop internally: they should not skimp on token subsidies and should let employees use Agents to build automation workflows. That usage also feeds back into model R&D. His advice is to join in instead of feeling anxious; the most efficient approach for enterprises is to try AI directly, starting with the tasks employees least want to do, which are usually the most valuable and meet the least internal resistance. AI will flatten organizations: product teams can produce demos directly, then hand them over to R&D for mass production. In 2–3 years, AI will be deeply integrated with all industries, transforming productivity.