Affected
Partial outage from 5:01 PM to 1:10 AM
One GPU worker entered a bad state; it was restarted and returned to normal operation.
A mechanism to detect and auto-restart such states was developed and deployed.
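For illustration only, a detect-and-restart mechanism of this kind might look like the minimal sketch below. It is not the deployed implementation: the function name `watchdog_pass`, the `is_healthy` and `restart` callables, the worker names, and the threshold of three consecutive failed checks are all hypothetical.

```python
from typing import Callable, Dict, List


def watchdog_pass(
    workers: List[str],
    is_healthy: Callable[[str], bool],   # hypothetical health probe
    restart: Callable[[str], None],      # hypothetical restart hook
    failures: Dict[str, int],            # consecutive-failure counts, kept between passes
    threshold: int = 3,                  # assumed value, not from the incident report
) -> List[str]:
    """Run one health-check pass and restart workers that keep failing.

    A worker is restarted only after `threshold` consecutive failed checks,
    so a single slow response does not trigger a restart.
    """
    restarted = []
    for worker in workers:
        if is_healthy(worker):
            failures[worker] = 0  # any healthy check resets the streak
            continue
        failures[worker] = failures.get(worker, 0) + 1
        if failures[worker] >= threshold:
            restart(worker)
            failures[worker] = 0  # give the restarted worker a clean slate
            restarted.append(worker)
    return restarted
```

In such a setup the function would be called on a fixed interval, with `is_healthy` wrapping whatever check reveals the bad state (for example a small test inference with a timeout).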
Some requests to the nightly endpoint (experimental in app) are returning errors.