Hudson jobs missing after a crash!? Restore them from the ashes
This morning, the VM containing the master of our Hudson instance froze. I "powered it off" and then restarted...ah, Hudson restarted and the list of jobs appeared. No, oh no, not all the jobs! Some of them vanished. Noooooo...
Wait, don't panic, this blog post provides an explanation of how to raise Hudson jobs from their ashes!
How to find the issue
On the main Hudson page, go to Manage Hudson --> System Log --> All Hudson logs
You will find this kind of error:
SEVERE: Failed Loading job XXX-JOBNAME-XXX
hudson.util.IOException2: /var/lib/hudson/jobs/XXX-JOBNAME-XXX/nextBuildNumber doesn't contain a number
    at hudson.model.Job.onLoad(Job.java:369)
    at hudson.model.AbstractProject.onLoad(AbstractProject.java:342)
    at hudson.model.BaseBuildableProject.onLoad(BaseBuildableProject.java:102)
    at hudson.model.Items.load(Items.java:117)
    at hudson.model.Hudson$13.run(Hudson.java:2368)
    at org.jvnet.hudson.reactor.TaskGraphBuilder$TaskImpl.run(TaskGraphBuilder.java:146)
    at org.jvnet.hudson.reactor.Reactor.runTask(Reactor.java:259)
    at hudson.model.Hudson$4.runTask(Hudson.java:698)
    at org.jvnet.hudson.reactor.Reactor$2.run(Reactor.java:187)
    at org.jvnet.hudson.reactor.Reactor$Node.run(Reactor.java:94)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Caused by: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:493)
    at java.lang.Integer.parseInt(Integer.java:514)
    at hudson.model.Job.onLoad(Job.java:366)
    ... 12 more
Sure enough, the nextBuildNumber file mentioned in the stack trace is empty.
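If you would rather not dig through the logs, the same check can probably be done directly on disk. Here is a minimal Python sketch, assuming the jobs live under a directory like the /var/lib/hudson/jobs path shown in the error above. The function name find_broken_jobs is mine, not something Hudson provides.

```python
from pathlib import Path


def find_broken_jobs(jobs_dir):
    """Return the names of jobs whose nextBuildNumber file does not hold an integer.

    Hypothetical helper: it only looks at <jobs_dir>/<job>/nextBuildNumber.
    """
    broken = []
    for job in sorted(Path(jobs_dir).iterdir()):
        counter = job / "nextBuildNumber"
        if not counter.is_file():
            # Not a job directory, or a job without a counter file: skip it.
            continue
        try:
            int(counter.read_text().strip())
        except ValueError:
            # Empty or garbled content, like the "" in the stack trace.
            broken.append(job.name)
    return broken
```

Calling find_broken_jobs("/var/lib/hudson/jobs") should give back the same job names that show up as "Failed Loading job" in the log. Adjust the path to your own Hudson home.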
How to restore jobs
It is really simple (but can be really tedious if you have a lot of missing jobs).
For each job mentioned in the logs, put a number in its nextBuildNumber file.
Be sure to use a number higher than the number of that job's last build.
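If you have many jobs to fix, the two steps above could be scripted. The sketch below is only that, a sketch: it assumes each build directory under a job's builds folder contains a build.xml with a <number> element, which you should check on your own installation before trusting it. The function names are hypothetical, and you can always pass the number yourself instead of letting the script guess.

```python
import re
from pathlib import Path

# Assumption: build.xml stores the build number as <number>N</number>.
NUMBER_TAG = re.compile(r"<number>(\d+)</number>")


def last_build_number(job_dir):
    """Return the highest <number> found in builds/*/build.xml, or 0 if none."""
    highest = 0
    for build_xml in Path(job_dir).glob("builds/*/build.xml"):
        match = NUMBER_TAG.search(build_xml.read_text(errors="ignore"))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def repair_next_build_number(job_dir, next_number=None):
    """Write a valid number into nextBuildNumber and return the number written.

    If next_number is not given, use the last build number found plus one.
    """
    if next_number is None:
        next_number = last_build_number(job_dir) + 1
    (Path(job_dir) / "nextBuildNumber").write_text(f"{next_number}\n")
    return next_number
```

For example, repair_next_build_number("/var/lib/hudson/jobs/XXX-JOBNAME-XXX") would write the guessed number, while repair_next_build_number(path, 500) forces 500. Since the stack trace shows the file being read when the job is loaded, the job will most likely reappear only after Hudson reloads its configuration or restarts.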
It would be nice to have the job visible in the job list even if the nextBuildNumber file is corrupted. Perhaps we could put a warning decorator on this job, and also put an arbitrary value in this file from the UI.
I suppose the files were corrupted because the jobs were running when the VM crashed. So, it seems that this file is empty when a job is running. Could we prevent it from being "empty" for too long, and so minimize the opportunity for corruption in the event of a crash?
Add your 2 cents!
To continue this discussion, come on over to the Eclipse forum and share your ideas. I opened a topic just for this :-)
Note: you might also have issues with fingerprints, but it shouldn't be a blocking point. See this topic on the Eclipse forum for more details.