Posts by mdettweiler

1) Message boards : RALPH@home bug list : Bug reports for rosetta_beta_5.77 and rosetta_5.69 (Message 3336)
Posted 30 Aug 2007 by mdettweiler
Post:
The work unit 550011 is stuck at about 94% complete, with 10 minutes left to run. The BOINC Manager says it's running but there is no CPU use.

It's on a Linux box, and the work unit was suspended, but I didn't notice whether that was the direct cause. The box was also rebooted.

Should it be aborted?

NO. Rosetta bases its progress bar and time-to-completion estimates on your preferred run time. For some of the larger workunits that seem to be very common nowadays, that is less (sometimes drastically less) than the time actually required to complete one model, which is the minimum needed to complete a WU. Thus, if the workunit goes over your preferred run time, the estimate will stick at about 10 minutes left, then count that down and raise the % done very slowly, because it really has no idea how long the workunit is going to take. The % done and time left to completion, at least for Rosetta/RALPH workunits, are just rough estimates. With the new, bigger workunits, if you have a lower run time set (which is recommended for RALPH anyway), most of your workunits will probably go over unless you have a very fast, modern CPU.

Long story short, this is normal, so don't abort the workunit; let it run. Some workunits can take up to 4 hours (a couple close to 5, even) per model on my P4 3.2 GHz HT, so in my case they'll take at the very least that long, no matter what time preferences are set. Rosetta doesn't know ahead of time how long they'll take, so once a workunit goes over your preferred run time, all it can do is underestimate the % done so people don't freak out at seeing it go over 100%. :-)


I don't think it's normal for the BOINC Manager to report the work unit as running but the CPU to be inactive.

Oh! Sorry. I made a blooper: I didn't notice that you said the CPU wasn't active at all. If the CPU were being used, with the progress and time to completion as you described, then what I said would be correct. But since the workunit is not using any CPU time at all, I would recommend that you abort the WU.

Sorry! :-(
2) Message boards : RALPH@home bug list : Bug reports for rosetta_beta_5.77 and rosetta_5.69 (Message 3333)
Posted 29 Aug 2007 by mdettweiler
Post:
The work unit 550011 is stuck at about 94% complete, with 10 minutes left to run. The BOINC Manager says it's running but there is no CPU use.

It's on a Linux box, and the work unit was suspended, but I didn't notice whether that was the direct cause. The box was also rebooted.

Should it be aborted?

NO. Rosetta bases its progress bar and time-to-completion estimates on your preferred run time. For some of the larger workunits that seem to be very common nowadays, that is less (sometimes drastically less) than the time actually required to complete one model, which is the minimum needed to complete a WU. Thus, if the workunit goes over your preferred run time, the estimate will stick at about 10 minutes left, then count that down and raise the % done very slowly, because it really has no idea how long the workunit is going to take. The % done and time left to completion, at least for Rosetta/RALPH workunits, are just rough estimates. With the new, bigger workunits, if you have a lower run time set (which is recommended for RALPH anyway), most of your workunits will probably go over unless you have a very fast, modern CPU.

Long story short, this is normal, so don't abort the workunit; let it run. Some workunits can take up to 4 hours (a couple close to 5, even) per model on my P4 3.2 GHz HT, so in my case they'll take at the very least that long, no matter what time preferences are set. Rosetta doesn't know ahead of time how long they'll take, so once a workunit goes over your preferred run time, all it can do is underestimate the % done so people don't freak out at seeing it go over 100%. :-)
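
To make the behaviour described above concrete, here is a minimal Python sketch of an estimator that behaves this way. It is not Rosetta's actual code: the function name `estimate`, the 600-second floor, and the formula used once the floor is reached are all assumptions chosen only to reproduce the "stuck at about 10 minutes left" pattern. For simplicity the sketch holds the time left constant instead of counting it down slowly.

```python
def estimate(elapsed_s, preferred_s, floor_s=600.0):
    """Return (fraction_done, seconds_left) for a hypothetical estimator.

    Assumption: while more than `floor_s` seconds of the preferred run
    time remain, progress is simply elapsed / preferred. After that, the
    time left is pinned at `floor_s` and the fraction done creeps toward
    (but never reaches) 1.0.
    """
    remaining = preferred_s - elapsed_s
    if remaining > floor_s:
        # Normal case: still comfortably inside the preferred run time.
        return elapsed_s / preferred_s, remaining
    # At or past the floor: the true finish time is unknown, so keep
    # reporting `floor_s` left and let the percentage rise ever more slowly.
    return elapsed_s / (elapsed_s + floor_s), floor_s


# A 1-hour preference with a model that is still running after 2.5 hours:
fraction, left = estimate(elapsed_s=9000, preferred_s=3600)
print(f"{fraction:.1%} done, {left / 60:.0f} minutes left")
```

Under these assumed numbers the sketch reports roughly 94% done with 10 minutes left, which is the same picture as in the report above even though the workunit may still need hours.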
3) Message boards : RALPH@home bug list : Bug reports for 5.66-5.68 (Message 3203)
Posted 15 Jun 2007 by mdettweiler
Post:
ZoneAlarm used to have a 'changes frequently' option that you could give to a program so it was always allowed internet access. Don't know if that was just ZA Pro or not though...


With each new application release, the Rosetta application's executable file name changes--the version number is part of the filename. Thus, there's no way a firewall would be able to keep track of the changes automatically.
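
As a rough illustration of why a rule tied to the exact executable name breaks on every release, here is a small Python sketch. The file names are hypothetical (built from the version numbers in this thread's title, not the real executable names), and `allowed_exact` and `allowed_pattern` are made-up helpers, not any firewall's API. Whether a given firewall supports pattern rules at all is an assumption.

```python
from fnmatch import fnmatch


def allowed_exact(exe_name, allowed_names):
    """Rule keyed to exact file names: each new version is an unknown program."""
    return exe_name in allowed_names


def allowed_pattern(exe_name, pattern="rosetta_beta_*"):
    """Hypothetical wildcard rule that ignores the version suffix."""
    return fnmatch(exe_name, pattern)


approved = {"rosetta_beta_5.69"}   # hypothetical name approved earlier
new_release = "rosetta_beta_5.77"  # hypothetical name after an update

print(allowed_exact(new_release, approved))  # the exact-name rule no longer matches
print(allowed_pattern(new_release))          # the wildcard rule still matches
```

A wildcard rule trades safety for convenience, since anything matching the pattern would be let through.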
4) Message boards : RALPH@home bug list : Bug reports for 5.65 (Message 3147)
Posted 24 May 2007 by mdettweiler
Post:
Trying to get more work, amidst the “no work from project” messages I see:
Wed 23 May 19:35:45 2007|ralph@home|Sending scheduler request to http://ralph.bakerlab.org/ralph_cgi/cgi
Wed 23 May 19:35:45 2007|ralph@home|Reason: To fetch work
Wed 23 May 19:35:45 2007|ralph@home|Requesting 172800 seconds of new work
Wed 23 May 19:35:50 2007|ralph@home|Scheduler request succeeded
Wed 23 May 19:35:50 2007|ralph@home|Message from server: Project encountered internal error: shared memory
Wed 23 May 19:35:50 2007|ralph@home|Project is down



I'm getting the same "shared memory" error. What could be causing this?
5) Message boards : RALPH@home bug list : Bug reports for 5.65 (Message 3124)
Posted 22 May 2007 by mdettweiler
Post:
I got an error for this workunit. Here's what my BOINC client logged about the error:

5/22/2007 5:56:27 PM|ralph@home|Deferring communication for 1 min 0 sec
5/22/2007 5:56:27 PM|ralph@home|Reason: Unrecoverable error for result CNTRL_01RELAXNATIVE_SAVE_ALL_OUT_-1n0u_-_2064_9_0 ( - exit code -1073741819 (0xc0000005))
5/22/2007 5:56:28 PM|ralph@home|Computation for task CNTRL_01RELAXNATIVE_SAVE_ALL_OUT_-1n0u_-_2064_9_0 finished
5/22/2007 5:56:28 PM|ralph@home|Output file CNTRL_01RELAXNATIVE_SAVE_ALL_OUT_-1n0u_-_2064_9_0_0 for task CNTRL_01RELAXNATIVE_SAVE_ALL_OUT_-1n0u_-_2064_9_0 absent
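
For anyone wondering how the two forms of the exit code in that log relate: the negative number is just the signed 32-bit reading of the hex value. A minimal Python sketch of the conversion follows (`to_unsigned32` is a helper name made up for this example). On Windows, 0xC0000005 is the status code for an access violation.

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return code & 0xFFFFFFFF


exit_code = -1073741819  # value reported in the BOINC log above
print(hex(to_unsigned32(exit_code)))  # 0xc0000005
```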


The odd thing is, after it was done, my firewall told me that the Ralph application needed to access the internet. According to my firewall's logs, it sent back a couple of megabytes worth of information to the Ralph server after I clicked to allow internet access for the Ralph application. I've noticed that sometimes Ralph (and Rosetta, for that matter) workunits will oddly need to send back tons of data to the server if there is an error and the workunit has to stop. Is this because BOINC otherwise won't send back any data if the workunit errors out, and the Rosetta/Ralph admins want to see more error data than BOINC sends back?
6) Message boards : RALPH@home bug list : bug report for version 5.64 (Message 3099)
Posted 14 May 2007 by mdettweiler
Post:
First time I've seen one, and I wish I'd had more time to do a screen capture, but this unit http://ralph.bakerlab.org/result.php?resultid=509878 displayed the freaky blue dot syndrome.


What's the blue dot syndrome?



see this message: http://ralph.bakerlab.org/forum_thread.php?id=295&nowrap=true#2872


Oh, I see now. Thanks!
7) Message boards : RALPH@home bug list : bug report for version 5.64 (Message 3095)
Posted 13 May 2007 by mdettweiler
Post:
First time I've seen one, and I wish I'd had more time to do a screen capture, but this unit http://ralph.bakerlab.org/result.php?resultid=509878 displayed the freaky blue dot syndrome.


What's the blue dot syndrome?
8) Message boards : RALPH@home bug list : Bug reports for 5.60-5.62 (Message 3033)
Posted 1 May 2007 by mdettweiler
Post:
There is checkpointing for that type of work unit. It just happens less frequently. It may be that your R@h workunit displayed 0% complete after a restart, but there is a bug in the % complete display, so I would ignore it and go by the CPU run time. This minor bug is fixed in this RALPH version.


Yes, I figured there might be a problem with the progress reporting--but even though the WU was on the 12th model or so before, it had now jumped back to the first one, so I knew a checkpoint must have fallen through.
9) Message boards : RALPH@home bug list : Bug reports for 5.60-5.62 (Message 3030)
Posted 1 May 2007 by mdettweiler
Post:
The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though.


So, are you saying that checkpointing is a no-go for this type of workunit? It sure looks like it from what I'm seeing.

I'm also having a similar problem with a regular Rosetta@Home workunit (a 10 hour one); except that it checkpointed fine until it was, like, 65-70 percent done, and it appears that one of the checkpoints fell through and it started from the beginning. (See the "problems with version 5.59" thread over there for details).
10) Message boards : RALPH@home bug list : Bug reports for 5.60-5.62 (Message 3027)
Posted 30 Apr 2007 by mdettweiler
Post:
Anonymous,

That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment.


So...do you mean that the checkpointing is worthless in this particular task? Or, was that a problem but now fixed, after I already got that workunit?
11) Message boards : RALPH@home bug list : Bug reports for 5.60-5.62 (Message 3025)
Posted 30 Apr 2007 by mdettweiler
Post:
I got this workunit a couple of days ago, and it appears that after shutting down and restarting BOINC, it always picks up from the beginning of the model (which is model 1, since it normally doesn't get to do more in the default run time of 1 hour). This is despite the fact that sometimes it's been running for more than an hour on end--which is telling me that the checkpointing may not be working (although I wouldn't have suspected this at first, because the progress reporting was fine).

Does anyone know if the problem is with the checkpoints, or with something else? Should I abort the WU?

Edit: I forgot to mention this, but yes, I am sure that this workunit does use the new 5.60 application, so it's not an old holdover.
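
To illustrate what I suspect is happening, here is a minimal Python sketch of how checkpoint spacing determines how much work a restart throws away. It is a toy model, not the Rosetta checkpointing code: `resume_point` and `lost_work` are hypothetical helpers, and the minute values are made up for the example.

```python
def resume_point(work_done, checkpoint_interval):
    """Work preserved after a restart when a checkpoint is written every
    `checkpoint_interval` units of work (toy model)."""
    return (work_done // checkpoint_interval) * checkpoint_interval


def lost_work(work_done, checkpoint_interval):
    """Work that has to be redone after a restart."""
    return work_done - resume_point(work_done, checkpoint_interval)


# Assumed numbers: a model that needs 70 minutes, interrupted after 65.
print(lost_work(65, checkpoint_interval=70))  # checkpoint only per model: 65 lost
print(lost_work(65, checkpoint_interval=5))   # checkpoint every 5 minutes: 0 lost
```

If checkpoints are only written at model boundaries and one model takes longer than the run between restarts, every restart lands back at the start of model 1, which matches what I'm seeing.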





