DeepSeek’s AI Shockwave: Why Its Own Platform Sinks

DeepSeek, the Chinese large language model (LLM) creator, sent ripples through the global AI landscape when it publicly released its R1 model just over 150 days ago (as of July 2025). R1 was notable as the first publicly available model to genuinely rival OpenAI’s reasoning capabilities. Beyond technical prowess, DeepSeek dramatically disrupted pricing, listing R1 at a fraction of the cost of then-leading models – over 90% cheaper per output token than a competitor like o1. This aggressive pricing forced major players, including OpenAI, to swiftly slash their flagship model prices, with one cut reaching 80%. DeepSeek’s bold entry immediately raised questions about the commoditization of AI models.

Initially, DeepSeek saw a surge in consumer app traffic and market share following the R1 launch. The model also received continuous post-release development that significantly improved its capabilities, particularly in areas like coding. This ongoing refinement is a hallmark of the rapid AI development cycle we now observe. However, the story of DeepSeek’s user engagement has since become a paradox.

The DeepSeek Paradox: Booming Usage, Sinking Platform

While DeepSeek’s models, including R1 and the later V3, have gained significant traction overall, where that usage happens tells a crucial story. Despite an initial spike, DeepSeek’s market share and traffic on its own hosted services – its web app and API – have declined consistently. Web browser traffic data shows an absolute decrease for DeepSeek’s platform since launch, a stark contrast to the impressive growth seen by most other major AI providers over the same period. This decline occurred even as DeepSeek updated its models to be both more capable and cheaper than their initial release versions. By May 2025, data indicated that DeepSeek’s own platform served only 16% of the total usage of DeepSeek models.

Yet, simultaneously, the aggregate usage of DeepSeek R1 and V3 on third-party hosting platforms has exploded. Data from services like OpenRouter shows this external usage growing rapidly, increasing by nearly 20 times since R1’s debut. So, despite DeepSeek’s own service offering seemingly low prices, users are abandoning it in favor of accessing the same models elsewhere.

Unpacking AI Economics: Beyond Price Per Token

To understand this counterintuitive trend, we need to look beyond the simple price tag and delve into the economics of serving AI models, known as “tokenomics.” Tokens are the fundamental units that AI models process and generate. Like a factory, an AI service earns revenue as Price multiplied by Quantity (P × Q). Unlike traditional manufacturing, however, the price per token is not a fixed cost but an output determined by a range of operational decisions and trade-offs.

Merely comparing models based on their listed price per million tokens ($/Mtok) is insufficient and potentially misleading. It ignores critical performance indicators that directly impact the user experience and the practical utility of the model for specific tasks.
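To make that arithmetic concrete, here is a minimal Python sketch of the per-request cost implied by a listed $/Mtok rate; the two listings, their prices, and the token counts are hypothetical. On price alone the cheaper listing looks like the obvious choice, which is exactly why the performance indicators below matter.

```python
# Illustrative only: these prices and token counts are made up,
# not any provider's actual rates.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one request under simple per-token pricing (P x Q)."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Two hypothetical listings for the same open-weights model.
cheap = request_cost(4_000, 1_000, input_price_per_mtok=0.50, output_price_per_mtok=2.00)
pricey = request_cost(4_000, 1_000, input_price_per_mtok=1.25, output_price_per_mtok=4.00)

print(f"cheap listing:  ${cheap:.4f} per request")   # $0.0040
print(f"pricey listing: ${pricey:.4f} per request")  # $0.0090
```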

Key Performance Indicators (KPIs) in Tokenomics

Model providers must balance several key KPIs when serving an LLM. Manipulating these factors allows them to tune the price per token, often at the expense of user experience:

Latency (Time-to-First-Token): This measures how long it takes for the model to begin generating its response after receiving the user’s input. It includes the time needed for the prefill stage (processing the input) before the first output token is produced. Lower latency means a faster initial response, which is crucial for interactive applications.
Interactivity (Tokens per Second): This measures the speed at which the model generates subsequent tokens after the first one. It’s often expressed as tokens per second per user. Higher interactivity means the response streams out faster, leading to a quicker overall completion time. While humans read at 3-5 words per second, most model providers aim for output speeds of 20-60 tokens per second for a smoother user experience.
Context Window: This refers to the amount of information (measured in tokens) that the model can simultaneously hold in its ‘memory’ during a conversation or task. A larger context window allows the model to process and refer back to extensive inputs, which is vital for complex tasks like analyzing large documents or entire codebases.

For any given AI model, a provider can adjust serving settings (such as batch size) that manipulate these three KPIs. Tuning them directly shapes the user experience while also letting the provider hit a target cost per token: a very low price may come at the expense of high latency or a small context window.
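A minimal sketch of how the first two KPIs combine into the wait a user actually feels. The two operating points are hypothetical, chosen only to show how a cheap, heavily batched service can take several times longer end to end than a responsive one serving the same model.

```python
# Hypothetical operating points, not measurements of DeepSeek or any other provider.

def end_to_end_seconds(time_to_first_token_s: float,
                       output_tokens: int,
                       tokens_per_second: float) -> float:
    """Total wait for a complete answer: prefill latency plus streaming time."""
    return time_to_first_token_s + output_tokens / tokens_per_second

# The same model served cheaply (heavy batching) vs. served for responsiveness.
budget = end_to_end_seconds(time_to_first_token_s=25.0, output_tokens=1_000, tokens_per_second=20.0)
premium = end_to_end_seconds(time_to_first_token_s=1.0, output_tokens=1_000, tokens_per_second=60.0)

print(f"budget operating point:  {budget:.0f} s")   # ~75 s
print(f"premium operating point: {premium:.0f} s")  # ~18 s
```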

DeepSeek’s Strategic Trade-Offs

DeepSeek’s self-hosted service demonstrates a clear set of trade-offs designed to achieve an extremely low price per token, even if it sacrifices user experience. Analysis of DeepSeek’s R1 service shows that its low cost is tied to significantly higher latency compared to other providers offering the same model. Users on DeepSeek’s platform often experience delays of many seconds before the first token appears. In contrast, third-party hosts like Parasail or Friendli can offer near-zero latency for similar or only slightly higher prices ($3-4 per million tokens for minimal delay). Even providers like Microsoft Azure, though 2.5 times more expensive than DeepSeek’s listing, offer substantially faster initial responses (up to 25 seconds less latency). This data suggests DeepSeek’s official service isn’t even the cheapest option when comparing services at similar latency levels.

Another key trade-off DeepSeek makes on its own platform is a smaller context window. DeepSeek R1 runs there with a 64K token context window. While substantial, this is among the smaller offerings from leading model providers. A smaller context window limits the model’s usefulness in use cases that demand extensive memory, such as coding tasks that involve understanding large codebases. At DeepSeek’s price point, other providers offer context windows more than 2.5 times larger.

These specific compromises in latency and context window size on DeepSeek’s own platform are, according to analysis, a direct result of their operational strategy. Model providers can reduce the cost per token by increasing “batching” – serving more users simultaneously on the same hardware (GPUs). Higher batch sizes increase the total wait time for each user (higher latency, slower interactivity) but decrease the overall compute cost per token.
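The toy model below sketches that relationship. The GPU price, throughput curve, and saturation constant are assumptions picked purely for illustration, not measurements of DeepSeek’s or any other provider’s serving stack; the point is the shape of the trade-off, where growing the batch size drives down cost per million tokens while each individual user’s streaming speed drops.

```python
# A deliberately simplified batching model. All constants are assumptions for
# illustration, not measurements of any provider's hardware or software.

GPU_COST_PER_HOUR = 2.50      # assumed hourly price of one GPU
PEAK_THROUGHPUT = 2_400.0     # assumed max aggregate output tokens/s per GPU
HALF_SATURATION = 8.0         # batch size at which half of peak throughput is reached

def aggregate_tokens_per_second(batch_size: int) -> float:
    """Total tokens/s the GPU emits; rises with batch size but saturates."""
    return PEAK_THROUGHPUT * batch_size / (batch_size + HALF_SATURATION)

def per_user_tokens_per_second(batch_size: int) -> float:
    """Each concurrent request gets a slice of the aggregate throughput."""
    return aggregate_tokens_per_second(batch_size) / batch_size

def cost_per_million_tokens(batch_size: int) -> float:
    """Higher aggregate throughput spreads the fixed GPU cost over more tokens."""
    tokens_per_hour = aggregate_tokens_per_second(batch_size) * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

for batch in (1, 8, 64):
    print(f"batch={batch:>2}  "
          f"{per_user_tokens_per_second(batch):6.1f} tok/s per user  "
          f"${cost_per_million_tokens(batch):5.2f} per Mtok")
# batch= 1   266.7 tok/s per user  $ 2.60 per Mtok
# batch= 8   150.0 tok/s per user  $ 0.58 per Mtok
# batch=64    33.3 tok/s per user  $ 0.33 per Mtok
```

Under these assumed numbers, pushing the batch size from 1 to 64 cuts the cost per million tokens by roughly 8x while cutting each user’s streaming speed by roughly the same factor, which is the lever DeepSeek appears to be pulling hard on its own platform.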

An Intentional Strategy: AGI Over User Experience

The evidence suggests that DeepSeek’s poor user experience on its own service is not an oversight but an intentional decision. The company is not primarily focused on maximizing revenue or external user satisfaction through its chat app or API. Instead, their singular goal is achieving Artificial General Intelligence (AGI). This objective requires vast amounts of compute resources for research and development. By aggressively batching and sacrificing external user experience, DeepSeek minimizes the compute needed for its public-facing inference service, keeping the maximum amount of hardware available for internal R&D.

This strategic choice is also influenced by geopolitical factors, including export controls that have limited China’s access to the high-end chips necessary for large-scale AI inference infrastructure. By open-sourcing their models and allowing third parties to host them, DeepSeek can gain global mind share and foster an ecosystem for its models without needing to invest heavily in its own external-facing compute capacity. This allows them to win adoption while conserving their limited internal compute for training and research.

Compute Constraints: A Shared Challenge (Anthropic Included)

The challenge of compute scarcity isn’t unique to DeepSeek; it’s a fundamental bottleneck in the AI race. Companies like Anthropic also face significant compute constraints. Anthropic has seen considerable success, particularly in coding applications, with their models adopted by popular tools like Cursor and their own Claude Code terminal tool.

The high and often token-intensive usage generated by coding tasks places significant stress on Anthropic’s available compute resources. Evidence of this pressure can be seen in the performance of models like Claude 4 Sonnet on the API. Since its launch, Sonnet’s output speed has reportedly decreased by 40%, dropping to around 45 tokens per second. This slowdown is attributed to increased batching – a necessary measure to manage high demand with limited compute, mirroring DeepSeek’s situation. Competitors like OpenAI and Google, with significantly larger compute infrastructures, often maintain much faster speeds. Anthropic is actively working to secure more compute, including major deals like the one with Amazon for Trainium chips and renting significant capacity from Google Cloud (TPUs and now GPUs, including some rented to OpenAI).

The Race for Efficiency: Intelligence Per Token

While Claude’s raw output speed might be slower than some competitors’, its user experience is often perceived as better than DeepSeek’s. This is partly due to lower latency than DeepSeek’s service, but a more significant factor is Claude’s remarkable token efficiency. Claude models often need substantially fewer tokens than other models, including DeepSeek R1 and Gemini 2.5 Pro, to produce a comparable answer to a given question. In some benchmarks, Claude uses less than a third as many tokens for the same task.

This means that despite a lower tokens-per-second rate, the total number of tokens that need to be generated is smaller, leading to a faster end-to-end response time for the user. This focus on “intelligence per token” – delivering more value and conciseness with fewer words – is becoming a critical dimension of competition, alongside raw model capability and speed. The AI race is not just about building smarter models but also more efficient ones.
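A small sketch of that arithmetic, using hypothetical token counts and speeds: an answer a third as long can finish sooner even when it streams at half the rate.

```python
# Illustrative comparison of "intelligence per token": the token counts and
# speeds below are assumptions, not benchmark results for any specific model.

def completion_time(latency_s: float, answer_tokens: int, tokens_per_second: float) -> float:
    """Time until the full answer has streamed to the user."""
    return latency_s + answer_tokens / tokens_per_second

# A concise model at a modest speed vs. a verbose model at a high speed.
concise = completion_time(latency_s=1.0, answer_tokens=600, tokens_per_second=45.0)
verbose = completion_time(latency_s=1.0, answer_tokens=1_800, tokens_per_second=90.0)

print(f"concise model: {concise:.1f} s")   # ~14.3 s
print(f"verbose model: {verbose:.1f} s")   # ~21.0 s
```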

The Geopolitical Dimension: A Gathering Storm

DeepSeek’s emergence has not occurred in a vacuum; it’s viewed within the broader context of geopolitical competition, particularly between the US and China. Concerns about national security and technological dominance have intensified. The introduction of the “Decoupling America’s Artificial Intelligence Capabilities from China Act of 2025” in the US Senate highlights this tension.

This proposed bill aims to make it illegal for “U.S. Persons” (including individuals, companies, and institutions) to use Chinese AI products. Proponents argue this is necessary to prevent China from harvesting American data, gaining military advantage, and engaging in IP theft (citing DOJ data showing a high nexus to China in espionage and trade secret cases). They frame DeepSeek as a “Sputnik moment,” signaling the need for aggressive measures. The bill proposes severe criminal penalties, including up to 20 years in prison and substantial fines, even for accessing banned Chinese AI via means like a VPN. There are also reports of investigations into whether DeepSeek might have used models from US companies like OpenAI or Microsoft during its training.

Such legislation, if enacted, would profoundly impact the AI landscape, potentially disrupting access to models like DeepSeek for US users and businesses, regardless of where they are hosted. It underscores the fact that the future of AI model usage is intertwined not just with technical performance and economics but also with international policy and security concerns.

The Rise of Inference Clouds

The success of companies like Cursor, Perplexity, and others that build user-facing applications on top of various AI models has fueled the growth of “inference clouds.” These platforms aggregate access to multiple models from different providers, offering developers and users flexibility and choice. This trend aligns perfectly with DeepSeek’s strategy of open-sourcing and relying on third-party hosting. By making its models available on these platforms, DeepSeek ensures widespread access and adoption without bearing the full infrastructure cost itself. This shift towards a model-as-a-service ecosystem is changing how AI capabilities are delivered and consumed.

Frequently Asked Questions

Why is DeepSeek’s own platform traffic declining despite its model’s popularity?

DeepSeek’s internal platform traffic is declining because the company intentionally sacrifices user experience (high latency, smaller context windows) to minimize compute usage on its public service. This allows DeepSeek to save vital compute resources for its primary goal: internal AGI research and development. By contrast, third-party hosting platforms optimize for better user experience by making different tokenomics trade-offs, attracting users away from DeepSeek’s less responsive official service, even if the listed price is slightly higher or comparable at better performance levels.

Where are DeepSeek models primarily being used if not on their own service?

If not on their own web app or API, DeepSeek models are increasingly being used on third-party hosting platforms, often referred to as “inference clouds” or model marketplaces like OpenRouter. These platforms aggregate access to various AI models from different providers. DeepSeek’s strategy of open-sourcing allows these third parties to host and serve its models, leading to nearly 20x growth in usage on these external services since R1’s launch, while usage on DeepSeek’s official platform has fallen significantly.

What are the potential risks or challenges for Americans using Chinese AI models like DeepSeek?

For Americans, using Chinese AI models like DeepSeek faces potential risks primarily due to geopolitical tensions and national security concerns. Proposed US legislation, such as the Decoupling America’s Artificial Intelligence Capabilities from China Act of 2025, aims to ban the use of such products, citing fears of data harvesting by the Chinese government, intellectual property theft, and supporting China’s military advancements. Violations could carry severe criminal penalties, including prison time and significant fines. This makes the future availability and legal status of using Chinese AI models uncertain for US individuals and entities.

Conclusion

More than 150 days after its impactful debut, DeepSeek’s R1 model continues to shape the AI market, particularly through its influence on pricing and the rise of third-party hosting. The paradox of booming third-party usage against declining traffic on DeepSeek’s own platform highlights the intricate relationship between tokenomics, compute constraints, and strategic goals. DeepSeek’s intentional trade-offs prioritize internal AGI research over external user experience, a decision influenced by both ambition and geopolitical realities like export controls. As the AI race accelerates, factors like compute efficiency (“intelligence per token”) are becoming as crucial as raw speed or capability. Furthermore, the increasing politicization of AI, exemplified by proposed US legislation targeting Chinese models, adds another layer of complexity to the global AI landscape. The rise of inference clouds reflects the market’s adaptation, providing access to powerful models like DeepSeek through platforms optimized for user experience, even as the models’ original creators pursue different priorities. Understanding these dynamics is essential for anyone navigating the rapidly evolving world of artificial intelligence.
