
Crowdsourcing Gameplay Data
Reinforcement learning (RL) faces a fundamental tradeoff: exploration versus exploitation. Agents must explore their environment to discover high-reward strategies while also exploiting known successful actions. This balance is challenging, as too much exploration can be inefficient, while too little can lead to suboptimal policies.
One way to alleviate these challenges is to crowdsource gameplay data. By leveraging human players to explore environments, we can collect high-quality datasets that are richer in reward signal. Human intuition and creativity often uncover valuable strategies more efficiently than agents exploring blindly, giving RL models a head start in learning.
Exploration vs Exploitation
Exploration involves trying unfamiliar actions to uncover new strategies and better long-term outcomes. It’s how agents gather information about parts of the environment they haven’t seen. Exploitation, by contrast, means selecting actions that the agent currently believes will lead to the highest returns — based on its learned value estimates or policy — favoring known strategies over uncertain ones.
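One common way to manage this tradeoff is an epsilon-greedy rule: exploit the best-known action most of the time, but explore at random with a small probability. Below is a minimal sketch; the function, Q-values, and epsilon setting are purely illustrative rather than taken from any particular system.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        # Explore: pick a uniformly random action to gather new information.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest current value estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With these toy value estimates, the agent exploits action 2 about 90% of
# the time and spends the remaining 10% trying the other actions.
print(epsilon_greedy([0.1, 0.5, 0.9], epsilon=0.1))
```

Tuning epsilon is exactly the balance described above: a larger value explores more broadly but wastes steps, while a smaller one converges faster but risks missing better strategies.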
Figure 1: Illustration of how much an agent explored the state-action space (regions labeled "Total Space" and "Explored")
An agent that only exploits may become stuck in a narrow slice of the environment — repeating familiar actions without discovering better alternatives. The first diagram illustrates this: the total state-action space is vast, yet without the right incentives to branch out, an agent may explore only a small region of it.
This raises a natural question: why not just explore randomly? The issue is that most environments have structure. Progress often requires solving intermediate challenges or acquiring skills that open access to new areas. Random actions rarely accomplish this. Agents must exploit just enough to overcome early bottlenecks, enabling further exploration.
Figure 2: Animation comparing random exploration, pure exploitation, and guided exploration strategies
The second diagram (purely illustrative) highlights this dynamic. Random exploration (left) leads to scattered, inefficient movement. Pure exploitation (middle) results in minimal coverage beyond a narrow trajectory. But guided exploration (right) — such as that from a human — balances both, expanding coverage while making meaningful progress.
We believe human-guided exploration is key to building high-quality datasets quickly. People instinctively navigate environments in ways that combine curiosity and goal-seeking behavior.
Even if individual human players tend to exploit known strategies, the aggregate of many players produces diverse trajectories — effectively combining broad exploration with high-quality actions. In this way, crowdsourcing gives us the best of both worlds.
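As a rough sketch of what this pooling looks like in practice, the snippet below flattens trajectories from many players into a single offline dataset. The Trajectory container and its fields are hypothetical names chosen for illustration, not a specific data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record of one player's session: each step is a
# (state, action, reward, next_state, done) transition.
@dataclass
class Trajectory:
    player_id: str
    steps: List[Tuple[object, int, float, object, bool]] = field(default_factory=list)

def build_offline_dataset(trajectories: List[Trajectory]) -> list:
    """Flatten many players' trajectories into one transition dataset.

    Any single player may stick to a narrow strategy, but pooling across
    players yields broad coverage of the state-action space.
    """
    dataset = []
    for traj in trajectories:
        dataset.extend(traj.steps)
    return dataset
```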
Online vs Offline RL
To better understand how human-generated data fits into the training process, it helps to distinguish between online and offline reinforcement learning:
- Online RL learns by interacting with the environment in real time, updating its policy iteratively based on new experience.
- Offline RL leverages pre-collected datasets, learning from past experiences without real-time interactions.
Offline RL is particularly well-suited for leveraging human gameplay. When the data is high-quality and reward-rich — as is often the case with human-generated trajectories — agents can achieve strong performance without needing to engage in extensive trial-and-error exploration.
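As a minimal sketch of what training on such data can look like, the snippet below runs tabular Q-learning over a fixed set of logged transitions, such as the pooled human trajectories described above. It assumes small discrete state and action spaces, and it omits the extra machinery practical offline RL methods use to stay close to the dataset's action distribution.

```python
import numpy as np

def offline_q_learning(dataset, n_states, n_actions, gamma=0.99, lr=0.1, epochs=10):
    """Tabular Q-learning over a pre-collected set of transitions.

    dataset: iterable of (state, action, reward, next_state, done) tuples,
    with states and actions encoded as small integers. No environment
    interaction happens here; we only replay logged experience.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            # Bootstrap from the best next action unless the episode ended.
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += lr * (target - q[s, a])
    return q

# Toy example: two states, two actions, two logged human transitions.
toy_data = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
print(offline_q_learning(toy_data, n_states=2, n_actions=2))
```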
This human-in-the-loop approach enables faster learning, especially in complex or hard-to-explore environments. However, it comes with a critical limitation: human gameplay data is inherently less scalable than data generated autonomously by agents. There’s a natural bottleneck to how much human experience we can collect.
Crowdsourced data tends to benefit reinforcement learning more than imitation learning. In imitation learning, conflicting human strategies can be averaged into indecisive or ineffective behavior. But in RL, these “inconsistencies” are actually valuable—representing diverse exploration paths that help the agent better understand the environment and learn a more robust policy.
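To illustrate the difference, suppose half of the players dodge an obstacle by going left and the other half by going right, and both routes succeed. The numbers below are invented purely for this toy example: a naive behavior-cloning regressor averages the conflicting demonstrations into the one action nobody took, while a value-based view keeps both good options.

```python
import numpy as np

# Conflicting but successful human choices at the same state:
# action 0 = left, 1 = straight, 2 = right (toy encoding).
human_actions = np.array([0, 2, 0, 2, 0, 2])
human_returns = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Imitation learning with a continuous action target averages the
# conflicting demonstrations into an indecisive midpoint ("straight").
bc_action = human_actions.mean()  # -> 1.0, an action no player chose

# A value-based RL view instead estimates the return of each action
# taken by players and keeps either of the genuinely good options.
action_values = {a: human_returns[human_actions == a].mean()
                 for a in np.unique(human_actions)}
rl_action = max(action_values, key=action_values.get)  # -> left or right

print(bc_action, action_values, rl_action)
```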
To make this value concrete, the next section presents real gameplay examples—highlighting how human players reach parts of the environment that random agents struggle to access.
Example: Randomly Initialized Agents vs Human Players
To see this in action, consider a multiplayer game environment where both agents and humans are navigating the same map. We begin by looking at randomly initialized agents—no prior knowledge, just pure exploration.
These agents often get stuck in one area of the map. They lack the understanding or coordination needed to progress—never discovering the actions or sequences required to reach more rewarding regions. As a result, their exploration remains shallow and repetitive.
In contrast, here’s what it looks like when human players interact with the same environment:
Humans intuitively combine exploration and goal-seeking behavior. They advance through the environment, solve intermediate tasks, and interact with key elements—reaching state-action pairs that random agents fail to discover.
Because the game is multiplayer, we also benefit from multiple humans generating trajectories simultaneously.
This clear contrast reinforces the value of human gameplay: it not only covers more meaningful parts of the environment, but also unlocks areas that agents alone might never reach—especially early in training.
Conclusion
Crowdsourcing human gameplay provides a practical way to overcome the limitations of agent-only reinforcement learning. While agents often struggle with exploration, human players naturally cover diverse, reward-rich trajectories—especially in multiplayer settings where data can be collected at scale.
By combining this data with offline RL, we can train agents more efficiently and effectively, accelerating learning without requiring exhaustive trial-and-error. The result is a system that benefits from human intuition while retaining the strengths of RL optimization.
As environments grow more complex, this human-in-the-loop approach offers a scalable path toward building capable agents.