
Beating PPO-LSTM Without Backpropagation: How signal-adaptive trust regions turn binary spiking policy search into a fast, stable alternative for continuous control
Jinhao Li et al.
May 10, 2026

Overview
PPO-LSTM has long been one of the most reliable baselines for recurrent reinforcement learning. It is stable, well understood, and strong enough to solve high-dimensional continuous-control tasks such as Humanoid, Hopper, and Walker2d.
So here is the surprising result:
A recurrent spiking neural network, trained without backpropagation-through-time and without surrogate gradients, can match or outperform PPO-LSTM under matched wall-clock training budgets.
This is the main story behind Signal-Adaptive Trust Regions, or SATR. Instead of training a recurrent spiking policy by differentiating through non-differentiable spike dynamics, SATR takes a different route: it treats the spiking network as a black box and optimizes a distribution over binary recurrent connectivity.
The result is not just a biologically inspired policy that works in toy environments. On high-dimensional Brax control benchmarks, SATR-RSNN reaches competitive or higher final returns than PPO-LSTM under matched runtime, while preserving the advantages that make spiking networks attractive in the first place: binary events, sparse computation, and hardware-friendly execution.
On Humanoid, for example, runtime-matched PPO-LSTM runs reach roughly 11.3k to 12.4k return, while SATR-RSNN* reaches roughly 13.1k to 13.9k depending on population size. The paper also reports that SATR achieves about 5× faster training than PPO-LSTM at matched control performance, and up to 8.9× speedup over ES in the reward-runtime comparison.
That raises the question:
How can a gradient-free binary spiking policy compete with PPO-LSTM at all?
The answer is that SATR fixes the main failure mode of population-based RSNN training. Population methods are attractive because they avoid backpropagating through spikes, but their update estimates can be noisy, especially when the population size is small. A noisy update can move the entire sampling distribution too far, causing abrupt changes in network behavior and unstable learning.
SATR solves this with a simple principle:
Trust the population signal only as much as the signal deserves.
When the population update is strong and coherent, SATR allows a larger distributional step. When the update is weak or noise-dominated, SATR automatically contracts the trust region. Combined with a bitset implementation that exploits binary spikes and binary weights, this turns recurrent spiking policy search into a practical alternative to PPO-LSTM.
The Alternative: Optimize the Distribution, Not the Network
To get there, we made one key design choice:
Stop trying to backpropagate through the spiking network.
Instead of treating recurrent spiking RL as a differentiable sequence-modeling problem, SATR treats the RSNN as a black-box policy. A candidate network is sampled, rolled out in the environment, ranked by return, and then used to update the distribution that produced it.
This changes the training problem from:
How do we differentiate through spikes and recurrent membrane dynamics?
to:
How do we search efficiently over binary recurrent spiking policies?
That shift matters because RSNNs are naturally binary and event-driven. Spikes are binary. Connectivity can be binary. The policy does not need to be made smooth just to satisfy the training algorithm.
The starting point is Evolving Connectivity, or EC. In EC, each possible synapse is represented as a Bernoulli random variable:
θᵢ ~ Bernoulli(ρᵢ)
Here, θᵢ is the sampled binary connection, and ρᵢ is the probability that the connection exists.
Training does not directly optimize one fixed network. Instead, it optimizes the vector of probabilities ρ. At each generation, we sample a population of binary RSNNs from this Bernoulli distribution, evaluate them, and update ρ so that high-performing connectivity patterns become more likely.
The loop is simple:
sample binary RSNNs
→ run them in the environment
→ rank them by return
→ update the Bernoulli connectivity distribution
This is a good match for spiking control because it avoids backpropagation-through-time entirely. There are no surrogate gradients, no differentiating through spike events, and no need to store long recurrent computational graphs.
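The sampling step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's JAX implementation: the function names `sample_connectivity` and `sample_population` and the four-synapse example are assumptions made for the sketch.

```python
import random

def sample_connectivity(rho, rng):
    """Draw one binary pattern theta with theta[i] ~ Bernoulli(rho[i])."""
    return [1 if rng.random() < p else 0 for p in rho]

def sample_population(rho, pop_size, rng):
    """Draw pop_size independent binary connectivity patterns."""
    return [sample_connectivity(rho, rng) for _ in range(pop_size)]

rng = random.Random(0)
rho = [0.5, 0.9, 0.1, 0.5]          # hypothetical connection probabilities
population = sample_population(rho, 8, rng)
for theta in population:
    print(theta)
```

Each printed row is one candidate network's connectivity. In the real system every row would be turned into an RSNN and rolled out in the environment.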
But EC also exposes the central problem that SATR is designed to solve.
The Real Bottleneck: Noisy Population Updates
Population-based training sounds simple, but it depends heavily on the quality of the population signal.
With an infinite population, the update direction would be reliable. In practice, the population is finite. Sometimes very finite.
That means the update estimate can be noisy. A few lucky or unlucky samples can push the search distribution in the wrong direction. This becomes especially damaging in high-dimensional control tasks, where each rollout is expensive and the population budget cannot be increased without limit.
For RSNNs trained with Bernoulli connectivity, the problem is even sharper. The optimizer is not just changing a weight vector. It is changing a sampling distribution over binary networks.
A bad update does not merely perturb one policy. It changes which policies will be sampled next.
This is where the geometry of Bernoulli distributions becomes important. When a Bernoulli probability ρᵢ is near 0.5, changing it slightly is usually safe. But when ρᵢ is near 0 or 1, the same numerical change can represent a much larger movement in distribution space.
For a Bernoulli variable, the Fisher information is:
F(ρᵢ) = 1 / [ρᵢ(1 − ρᵢ)]
As ρᵢ approaches 0 or 1, the denominator goes to zero. The local curvature becomes large. In KL-divergence terms, the distribution becomes very sensitive near the boundaries.
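A quick numerical check makes the boundary effect concrete. The sketch below compares the exact Bernoulli KL divergence with the local Fisher approximation for the same small shift at several starting probabilities. The shift of 0.005 and the chosen probabilities are illustrative assumptions.

```python
import math

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def fisher(p):
    """Fisher information of a Bernoulli variable with parameter p."""
    return 1.0 / (p * (1.0 - p))

delta = 0.005  # same numerical shift applied at every starting point
for p in (0.5, 0.1, 0.01):
    exact = bernoulli_kl(p, p + delta)
    local = 0.5 * fisher(p) * delta ** 2
    print(f"rho={p:<5} exact KL={exact:.2e}  local approx={local:.2e}")
```

The same shift costs far more KL near 0.01 than near 0.5, which is exactly the sensitivity the Fisher term describes.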
So the failure mode looks like this:
small population
→ noisy update estimate
→ overly aggressive change in Bernoulli probabilities
→ large KL jump in the sampling distribution
→ abrupt change in sampled RSNN behavior
→ unstable learning
This is why simply using a population-based method is not enough. We need a way to control how far the sampling distribution moves.
But a fixed trust region is also not ideal.
Why a Fixed Trust Region Is Not Enough
A natural solution is to borrow the trust-region idea from policy optimization.
In methods such as TRPO, we constrain the KL divergence between the old and new policy so that one update cannot change behavior too drastically. SATR uses the same philosophy, but applies it to a different object:
the distribution over sampled RSNN connectivities.
In other words, SATR does not ask:
How far did the action policy move?
It asks:
How far did the sampling distribution over binary networks move?
The distributional change is measured by:
D_KL(p_ρ || p_ρ′)
where p_ρ is the current Bernoulli connectivity distribution and p_ρ′ is the updated distribution.
A fixed KL budget would say:
Every generation is allowed to move the same distance.
But population-based training does not work that way. Some generations produce a strong, coherent signal. Other generations are mostly noise. Treating them equally is the wrong bias.
When the signal is strong, a fixed trust region can be too conservative. When the signal is weak, the same trust region can be too permissive.
SATR replaces this fixed budget with a signal-adaptive one.
The rule is:
strong population signal → allow a larger distributional move
weak population signal → shrink the trust region
The signal strength is measured by the squared norm of the population update estimate:
E = ||g||²
Here, g is computed from the sampled connectivity patterns and their centered-rank normalized returns.
This gives the core SATR principle:
The trust region should scale with the amount of useful signal in the population update.
This is the main algorithmic difference. SATR is not just a smaller learning rate, and it is not just a fixed KL constraint. It adapts the allowed distributional movement generation by generation.
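To make the signal energy tangible, here is a small sketch of how g and E = ||g||² could be computed from a population. The centered-rank convention used here (ranks spread evenly over [-0.5, 0.5]) is a common choice in evolution strategies and is an assumption. The paper's exact normalization may differ, and the helper names are hypothetical.

```python
def centered_ranks(returns):
    """Map returns to evenly spaced weights in [-0.5, 0.5] by rank (needs n >= 2)."""
    n = len(returns)
    order = sorted(range(n), key=lambda k: returns[k])
    weights = [0.0] * n
    for rank, k in enumerate(order):
        weights[k] = rank / (n - 1) - 0.5
    return weights

def population_update(thetas, returns, rho):
    """g = 1/N * sum_n R~(n) * (theta(n) - rho), computed per coordinate."""
    n = len(thetas)
    weights = centered_ranks(returns)
    return [
        sum(weights[k] * (thetas[k][i] - rho[i]) for k in range(n)) / n
        for i in range(len(rho))
    ]

def signal_energy(g):
    """E = ||g||^2, the squared norm of the population update."""
    return sum(x * x for x in g)

thetas = [[1, 0], [0, 1], [1, 1], [0, 0]]   # toy population of 2-synapse networks
returns = [3.0, 1.0, 4.0, 0.5]              # toy episodic returns
rho = [0.5, 0.5]
g = population_update(thetas, returns, rho)
print(g, signal_energy(g))
```

In this toy population the networks with the first connection switched on score higher, so the first coordinate of g is positive and larger than the second.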
SATR for Bernoulli Connectivity
For Bernoulli connectivity, the SATR update becomes especially clean.
The local KL approximation for a factorized Bernoulli distribution is:
D_KL(p_ρ || p_{ρ+Δρ}) ≈ 1/2 Σᵢ (Δρᵢ)² / [ρᵢ(1 − ρᵢ)]
This equation shows exactly why ordinary EC updates can be unstable near the boundaries. If ρᵢ is close to 0 or 1, then ρᵢ(1 − ρᵢ) is tiny, so even a small Δρᵢ can create a large KL displacement.
SATR fixes this by scaling each coordinate with the square root of the Bernoulli variance:
Δρ = η sqrt(ρ ⊙ (1 − ρ)) ⊙ g
where:
ρ = Bernoulli connectivity probabilities
g = population update estimate
η = learning rate
⊙ = element-wise multiplication
This update has two useful effects.
First, it is boundary-aware. When ρᵢ is near 0 or 1, the factor sqrt(ρᵢ(1 − ρᵢ)) becomes small. The update automatically slows down near deterministic connectivity decisions.
Second, it is signal-adaptive. Plugging the update into the local KL approximation gives:
D_KL(p_ρ || p_{ρ+Δρ}) ≈ (η² / 2) · ||g||²
So the KL movement is proportional to the signal energy. If the population update is weak, SATR moves conservatively. If the population update is strong, SATR allows a larger step.
This is the whole method in one line:
ρ ← ρ + η sqrt(ρ ⊙ (1 − ρ)) ⊙ g
That one scaling term is doing most of the work. It prevents boundary-driven KL blow-ups and makes the trust region respond to the actual population signal.
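The identity above is easy to verify numerically. The sketch below applies the SATR step to a small example and compares its local KL with that of an unscaled step ρ + ηg, which is used here only as a stand-in for an update without the variance factor. The clipping constant `eps` and all example values are assumptions added for the illustration.

```python
import math

def satr_step(rho, g, eta, eps=1e-6):
    """rho <- rho + eta * sqrt(rho * (1 - rho)) * g, clipped to (eps, 1 - eps)."""
    out = []
    for p, gi in zip(rho, g):
        p_new = p + eta * math.sqrt(p * (1.0 - p)) * gi
        out.append(min(max(p_new, eps), 1.0 - eps))
    return out

def plain_step(rho, g, eta, eps=1e-6):
    """Unscaled step rho + eta * g, for comparison only."""
    return [min(max(p + eta * gi, eps), 1.0 - eps) for p, gi in zip(rho, g)]

def local_kl(rho, rho_new):
    """Second-order KL approximation for a factorized Bernoulli distribution."""
    return 0.5 * sum((q - p) ** 2 / (p * (1.0 - p)) for p, q in zip(rho, rho_new))

rho = [0.5, 0.9, 0.02]
g = [0.2, -0.1, 0.3]
eta = 0.1

satr_kl = local_kl(rho, satr_step(rho, g, eta))
plain_kl = local_kl(rho, plain_step(rho, g, eta))
predicted = 0.5 * eta ** 2 * sum(gi * gi for gi in g)
print(f"SATR KL={satr_kl:.6f}  predicted={predicted:.6f}  unscaled KL={plain_kl:.6f}")
```

The SATR step lands on the predicted value, while the unscaled step spends most of its KL on the coordinate that sits near the boundary.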
The paper’s ablation supports this design. A fixed-KL EC+TR variant can improve over EC in some settings, but its best KL budget changes across population sizes. SATR is much less sensitive because it does not rely on one fixed KL radius working everywhere. On Humanoid, SATR outperforms EC and the tested fixed-trust-region variants across population sizes 256, 512, and 1024.
The Training Loop
The resulting training algorithm is compact.
Initialize Bernoulli connectivity probabilities ρ.
For each generation:
- Sample N binary connectivity patterns: θ(1), …, θ(N) ~ Bernoulli(ρ)
- Instantiate one RSNN policy from each sampled θ.
- Roll out each RSNN in the environment.
- Compute episodic returns.
- Convert returns into centered ranks.
- Estimate the population update:
g = 1/N Σn R̃(n) (θ(n) − ρ)
- Apply the SATR update:
ρ ← ρ + η sqrt(ρ ⊙ (1 − ρ)) ⊙ g
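The steps above can be sketched end to end on a toy problem. This is not the paper's setup: the environment rollout is replaced by a stand-in fitness function that counts how many bits match a hidden target pattern, and the hyperparameters (`pop_size`, `eta`, the clipping constant `eps`), the centered-rank convention, and all function names are assumptions chosen for the illustration.

```python
import math
import random

def centered_ranks(returns):
    """Map returns to evenly spaced weights in [-0.5, 0.5]; ties broken by index."""
    n = len(returns)
    order = sorted(range(n), key=lambda k: returns[k])
    weights = [0.0] * n
    for rank, k in enumerate(order):
        weights[k] = rank / (n - 1) - 0.5
    return weights

def toy_return(theta, target):
    """Stand-in for an environment rollout: number of bits matching the target."""
    return sum(1 for t, s in zip(theta, target) if t == s)

def expected_return(rho, target):
    """Expected toy return of a network sampled from Bernoulli(rho)."""
    return sum(p if s == 1 else 1.0 - p for p, s in zip(rho, target))

def train(target, pop_size=64, generations=100, eta=2.0, eps=1e-3, seed=0):
    rng = random.Random(seed)
    dim = len(target)
    rho = [0.5] * dim                       # initialize connectivity probabilities
    for _ in range(generations):
        # Sample N binary connectivity patterns from Bernoulli(rho).
        thetas = [[1 if rng.random() < p else 0 for p in rho]
                  for _ in range(pop_size)]
        # "Roll out" each candidate and convert returns to centered ranks.
        weights = centered_ranks([toy_return(th, target) for th in thetas])
        # Population update g = 1/N * sum_n R~(n) * (theta(n) - rho).
        g = [sum(weights[k] * (thetas[k][i] - rho[i]) for k in range(pop_size))
             / pop_size for i in range(dim)]
        # SATR update, clipped so probabilities stay strictly inside (0, 1).
        rho = [min(max(p + eta * math.sqrt(p * (1.0 - p)) * gi, eps), 1.0 - eps)
               for p, gi in zip(rho, g)]
    return rho

target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
rho = train(target)
print(f"expected return before: {expected_return([0.5] * 16, target):.2f}")
print(f"expected return after:  {expected_return(rho, target):.2f}")
```

The loop contains no gradients of the fitness function at all. Only sampled patterns and their ranked returns flow into the update.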
There is no backpropagation-through-time.
There are no surrogate gradients.
There is no dense recurrent policy being optimized by PPO.
The policy class stays spiking and binary. The optimizer simply searches over the distribution that generates useful recurrent connectivity.
This is what makes SATR-RSNN a real alternative to PPO-LSTM rather than just another recurrent RL baseline. It attacks the problem from a different direction: not by making spiking networks easier to differentiate, but by making gradient-free spiking policy search stable enough to compete.
Binary Networks Also Deserve Binary Execution
The algorithmic part explains why SATR trains more stably. But the runtime advantage comes from another observation:
If both spikes and connectivity are binary, we should not execute them like dense floating-point matrices.
In an RSNN trained with Bernoulli connectivity, the sampled connectivity pattern is binary. The spiking activity is also binary. This means synaptic integration can be implemented using bit operations instead of standard matrix multiplication.
For a presynaptic spike vector sₜ and a binary connectivity mask mⱼ, the binary dot product can be computed as:
mⱼᵀ sₜ = Σ_b popcount(Mⱼ,b & Sₜ,b)
where:
& = bitwise AND
popcount = count the number of set bits
Mⱼ,b = packed connectivity word
Sₜ,b = packed spike word
Instead of multiplying and accumulating dense floating-point values, the implementation packs binary vectors into machine words and uses AND + popcount.
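As a rough illustration of the idea, the sketch below packs binary vectors into 64-bit words and computes the dot product with AND and popcount. It is plain Python rather than the paper's backend, and the word size, helper names, and test vectors are assumptions. A production version would use hardware popcount instructions on packed arrays.

```python
WORD_BITS = 64  # assumed machine word size

def pack_words(bits):
    """Pack a list of 0/1 values into a list of WORD_BITS-wide integers."""
    words = []
    for start in range(0, len(bits), WORD_BITS):
        word = 0
        for offset, bit in enumerate(bits[start:start + WORD_BITS]):
            if bit:
                word |= 1 << offset
        words.append(word)
    return words

def popcount(word):
    """Count the number of set bits in a non-negative integer."""
    return bin(word).count("1")

def binary_dot(mask_words, spike_words):
    """Sum over words b of popcount(M[b] & S[b])."""
    return sum(popcount(m & s) for m, s in zip(mask_words, spike_words))

# Toy example: 200 presynaptic neurons, one postsynaptic neuron j.
mask = [1 if i % 3 == 0 else 0 for i in range(200)]    # connectivity into neuron j
spikes = [1 if i % 2 == 0 else 0 for i in range(200)]  # presynaptic spikes at time t

dense = sum(m * s for m, s in zip(mask, spikes))       # reference dot product
packed = binary_dot(pack_words(mask), pack_words(spikes))
print(dense, packed)
```

Both paths count the same active connected inputs. The packed version touches four words instead of 200 individual entries.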
This does not change the learning rule. It does not change the objective. It does not change the policy class.
It only makes rollout execution faster.
That matters because population-based methods spend most of their time evaluating sampled policies. Faster rollouts directly improve the reward-runtime trade-off.
On Humanoid, the bitset backend reduces runtime substantially. For example, at population size 8192, runtime drops from 51,098 seconds for the non-bitset binary RSNN implementation to 18,668 seconds with bitset acceleration, a 2.74× speedup. Across tested population sizes, the reported speedup ranges from about 1.6× to 2.75×.
This is why the final system needs both pieces:
SATR update → stable gradient-free optimization
bitset backend → fast binary RSNN execution
Together, they turn recurrent spiking policy search into something that is not only biologically or hardware motivated, but practically competitive.
The Result: A Spiking Policy on the Reward–Runtime Frontier
This is where the PPO-LSTM story comes back.
The headline result is not simply that SATR can train recurrent spiking neural networks. The headline is that a gradient-free binary spiking policy can move onto a reward–runtime frontier that PPO-LSTM no longer clearly dominates.
In the paper, we report results on three Brax continuous-control benchmarks: Humanoid, Hopper, and Walker2d. These are not toy tasks. Humanoid, in particular, is a high-dimensional locomotion problem with 17 degrees of freedom, making it a strong test case for recurrent control. All methods are implemented in JAX and evaluated with Brax under a unified evaluation protocol, with final performance measured by undiscounted episodic return.
The most important plot is Figure 1: final return versus end-to-end wall-clock training time. The bitset-accelerated version of our method, denoted SATR-RSNN*, consistently lands on a better reward–runtime trade-off than the population-based baselines, and reaches performance competitive with or higher than PPO-LSTM under matched training time.
This matters because wall-clock time is the right comparison for this kind of system. Population-based methods may evaluate many policies, but those evaluations are embarrassingly parallel and, in our case, can exploit binary execution. PPO-LSTM uses a mature gradient-based training pipeline, but it relies on dense floating-point recurrent computation. The question is not which method looks cleaner on paper. The question is:
Given the same training time, which agent reaches better control performance?
Under that comparison, SATR-RSNN becomes a serious alternative.
Against PPO-LSTM
On Humanoid, the comparison is especially clear.
The paper evaluates PPO-LSTM under runtime-matched budgets. A 500k-iteration PPO-LSTM run matches the runtime of SATR-RSNN* with population size 4096. An 850k-iteration PPO-LSTM run matches the runtime of SATR-RSNN* with population size 8192. The paper also includes a longer 2.2M-iteration PPO-LSTM run as a stronger full-budget baseline.
The reported Humanoid returns are:
PPO-LSTM, 500k iterations: 11,326.8
PPO-LSTM, 850k iterations: 12,396.8
PPO-LSTM, 2.2M iterations: 12,876.0
SATR-RSNN*, population 4096: 13,072
SATR-RSNN*, population 8192: 13,860
So the gap is not explained by an under-trained PPO-LSTM baseline. The appendix explicitly checks this by comparing the PPO-LSTM baseline to a tuned PPO result reported by a Brax maintainer on Humanoid, around 11,300 average return over three seeds. The 500k PPO-LSTM run already reaches that level, and longer training improves it further. Even then, SATR-RSNN* reaches higher returns in the matched-runtime Humanoid comparisons.
This is the core message of the blog:
We do not need to make recurrent spiking networks look like LSTMs to make them competitive with LSTMs.
SATR takes the opposite route. It keeps the policy binary and spiking, avoids backpropagation-through-time, avoids surrogate gradients, and optimizes the sampling distribution directly. The result is a recurrent spiking agent that can outperform a strong PPO-LSTM baseline under the wall-clock comparison that actually matters.
The Small-Population Regime Is Where SATR Really Shows Up
The large-population results are important, but they are not the most diagnostic.
The real test is what happens when the population budget shrinks.
Population-based optimization becomes fragile when the population is small. The update estimate becomes noisy, and a noisy update can move the connectivity distribution in a bad direction. This is exactly the failure mode SATR was designed to fix.
On Humanoid, the difference is dramatic:
Population 512:
ES-RSNN: 922 ± 16
EC-RSNN: 5,462 ± 434
SATR-RSNN: 8,928 ± 119
Population 128:
ES-RSNN: 750 ± 42
EC-RSNN: 765 ± 36
SATR-RSNN: 4,908 ± 32
At population size 512, SATR gets about 1.6× the return of EC. At population size 128, EC essentially collapses, while SATR still learns a meaningful Humanoid policy.
This is the strongest empirical evidence for the signal-adaptive trust-region idea.
If SATR were only improving large-population performance, we might interpret it as a better optimizer in a generic sense. But the fact that it helps most when the population is small tells a more specific story:
small population
→ weaker and noisier update signal
→ ordinary EC oversteps
→ SATR contracts the distributional step
→ training remains stable
That is exactly the behavior we wanted.
The same pattern appears on Hopper and Walker2d. Across the reported ES / EC / SATR comparisons in Table 1, SATR achieves the highest terminal returns for every listed task and population setting.
This Is Not Just a Learning-Rate Trick
A natural reaction is to ask whether SATR is just using a more conservative step size.
It is not.
A smaller learning rate can reduce instability, but it also slows learning when the signal is strong. A fixed trust region can help, but it introduces another hyperparameter: the KL budget. Worse, the best KL budget may change with population size.
The paper tests this directly with an ablation called EC+TR, which adds a fixed-KL trust region to the EC baseline. This is the obvious alternative to SATR: instead of adapting the trust region to the population signal, choose a fixed KL radius and normalize the update.
The result is mixed. EC+TR can improve over EC in some settings, but its performance is sensitive to the choice of KL budget. A setting that helps at one population size does not reliably transfer to another. At large budgets, the method can still collapse.
SATR avoids this by removing the need for one fixed KL budget to work everywhere.
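The contrast can be sketched under the local KL approximation. The fixed-KL rule below is one possible reading of "choose a fixed KL radius and normalize the update", not the paper's EC+TR implementation: it rescales the variance-scaled direction so that every step spends exactly the same KL budget. The function names, the budget of 1e-3, and the example vectors are assumptions.

```python
import math

def local_kl(rho, delta):
    """Second-order KL approximation for a step delta on Bernoulli parameters."""
    return 0.5 * sum(d * d / (p * (1.0 - p)) for p, d in zip(rho, delta))

def satr_delta(rho, g, eta):
    """SATR step: the KL spent scales with ||g||^2."""
    return [eta * math.sqrt(p * (1.0 - p)) * gi for p, gi in zip(rho, g)]

def fixed_kl_delta(rho, g, kl_budget):
    """Assumed fixed-radius variant: rescale so the local KL equals kl_budget."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return [0.0] * len(rho)
    scale = math.sqrt(2.0 * kl_budget) / norm
    return [scale * math.sqrt(p * (1.0 - p)) * gi for p, gi in zip(rho, g)]

rho = [0.5, 0.8, 0.1]
strong = [0.30, -0.20, 0.10]           # coherent population signal
weak = [0.1 * gi for gi in strong]     # same direction, ten times weaker

for name, g in (("strong", strong), ("weak", weak)):
    kl_fixed = local_kl(rho, fixed_kl_delta(rho, g, kl_budget=1e-3))
    kl_satr = local_kl(rho, satr_delta(rho, g, eta=0.1))
    print(f"{name:6s} fixed-KL step: {kl_fixed:.2e}   SATR step: {kl_satr:.2e}")
```

The fixed-radius rule moves the same distance whether the signal is strong or weak, while the SATR step shrinks by a factor of one hundred when the signal is ten times weaker.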
On Humanoid, the ablation shows:
Population 256:
EC: 3,155 ± 2,508
SATR: 6,656 ± 1,643
Population 512:
EC: 5,462 ± 434
SATR: 8,928 ± 119
Population 1024:
EC: 7,729 ± 1,028
SATR: 10,520 ± 457
Across these settings, SATR also outperforms the tested fixed-trust-region variants.
This is important because it changes the role of the trust region. SATR is not saying:
Always move less.
It is saying:
Move as far as the population signal justifies.
That is a much better fit for population-based RL, where update reliability changes from generation to generation.
The Runtime Gain Comes from Taking Binary Computation Seriously
SATR improves the optimizer. The bitset backend improves the system.
Both are needed.
Under the same population size and number of generations, SATR evaluates the same number of rollouts as EC. So SATR does not get a wall-clock speedup by secretly doing less work. The speedup comes from changing how the binary RSNN is executed.
The paper reports the Humanoid runtime comparison between a non-bitset binary RSNN and the bitset-accelerated RSNN*:
Population 512:
RSNN*: 3,168 seconds
RSNN w/o bitset: 5,212 seconds
Speedup: 1.64×
Population 8192:
RSNN*: 18,668 seconds
RSNN w/o bitset: 51,098 seconds
Speedup: 2.74×
Across tested population sizes, bitset acceleration gives roughly 1.6× to 2.75× faster rollout execution.
This matters because rollout execution dominates population-based training. Every generation requires evaluating many sampled policies. If each policy rollout becomes faster, the whole training loop becomes faster.
The broader system-level picture is:
EC-style binary connectivity
→ binary sampled networks
RSNN spiking dynamics
→ binary spike activity
binary connectivity + binary spikes
→ AND + popcount execution
faster rollout execution
→ better reward–runtime trade-off
The paper reports that EC already gives a speedup over ES due to more efficient population-based optimization. Combined with the bitset backend, SATR achieves up to 8.9× speedup over ES in the reward–runtime comparison, and about 5× faster training than PPO-LSTM at matched control performance.
This is why the result is more than an algorithmic win. It is an algorithm–system co-design result.
SATR makes the update stable enough.
Bitsets make the rollouts fast enough.
Together, they make recurrent spiking RL competitive enough.
A Compact Policy Class, Not Just a Fast Training Trick
There is another detail that makes the PPO-LSTM comparison more interesting: the models are parameter-matched.
The RSNN uses a recurrent spiking layer with 256 neurons and about 193K parameters. The PPO-LSTM baseline uses a hidden size of 128 and about 191K parameters. So the comparison is not between a huge spiking model and a tiny LSTM. The recurrent policy sizes are deliberately kept comparable.
But the storage and computation look very different.
The SATR-RSNN* policy uses binary connectivity and binary spikes, giving a compact 1-bit representation with a model footprint of about 24 KB in the paper’s architecture table. The PPO-LSTM baseline uses FP32 parameters and dense matrix multiplications, with a footprint of about 764 KB.
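The two footprints follow from the parameter counts and the bits stored per parameter. The arithmetic below uses the rounded counts quoted above and assumes decimal kilobytes (1 KB = 1000 bytes). The helper name is hypothetical.

```python
def footprint_kb(num_params, bits_per_param):
    """Storage in kilobytes (1 KB = 1000 bytes) for num_params stored values."""
    return num_params * bits_per_param / 8 / 1000

rsnn_kb = footprint_kb(193_000, 1)    # about 193K binary connections, 1 bit each
lstm_kb = footprint_kb(191_000, 32)   # about 191K FP32 parameters, 32 bits each

print(f"SATR-RSNN*: {rsnn_kb:.1f} KB")
print(f"PPO-LSTM:   {lstm_kb:.1f} KB")
```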
That means SATR-RSNN is not merely faster because of implementation tuning. It is using a different computational regime:
PPO-LSTM:
dense FP32 recurrent computation
SATR-RSNN*:
binary spikes
binary connectivity
bitwise AND + popcount
This is the kind of difference that matters for deployment on edge devices and neuromorphic hardware.
The Energy Story Is Promising, but We Should Be Careful
The paper also includes an analytical estimate of neuromorphic energy consumption using published Loihi operation-energy numbers.
For Humanoid, the estimated on-chip energy for SATR-based RSNN training ranges from roughly 5.7 kJ to 45.6 kJ, depending on population size. The measured GPU energy for PPO-LSTM training is reported as 18.4 MJ. This suggests a potential energy reduction on the order of hundreds to thousands of times.
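The "hundreds to thousands" range is just the ratio of the reported numbers. The sketch below only redoes that division. It adds no new measurements, and the variable names are illustrative.

```python
ppo_lstm_gpu_joules = 18.4e6        # reported measured GPU energy, 18.4 MJ
satr_low_joules = 5.7e3             # reported analytical estimate, 5.7 kJ
satr_high_joules = 45.6e3           # reported analytical estimate, 45.6 kJ

largest_ratio = ppo_lstm_gpu_joules / satr_low_joules
smallest_ratio = ppo_lstm_gpu_joules / satr_high_joules

print(f"estimated reduction: {smallest_ratio:.0f}x to {largest_ratio:.0f}x")
```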
But this should be framed carefully.
These are analytical estimates, not direct measurements on neuromorphic hardware. The paper is explicit about this limitation. The numbers are useful for understanding the possible order-of-magnitude advantage of sparse event-driven computation, but actual energy and latency need to be validated on real neuromorphic deployments.
So the honest version is:
SATR-RSNN has the right computational structure for large energy gains, but hardware measurements are still future work.
That is still a strong story. The point is not that the energy question is fully solved. The point is that SATR gives us an RSNN training method whose performance is finally strong enough to make the hardware question worth asking seriously.


