Message boards : RALPH@home bug list : Bug reports for 5.60-5.62
Author | Message |
---|---|
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
This update contains the following: 1. a fix for the percent complete going back to zero after restarts 2. checkpointing for new pose and jumping jobs 3. optimization in jumping jobs which were having issues with long and variable run times. I'll queue up some test jobs soon. |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
First workunit I got promptly crashed: https://ralph.bakerlab.org/result.php?resultid=496686 <core_client_version>5.8.15</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> ERROR:: Unable to obtain total_residue & sequence. start pdb file must be provided. ERROR:: Exit from: .input_pdb.cc line: 2944 # cpu_run_time_pref: 3600 </stderr_txt> ]]> with 0 CPU time.... this does not bode well... |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
I accidentally submitted a few bad jobs that will fail. Please ignore these. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I've got a search pairings WU. It was 3hrs 48min into the run, on the 19th model, and had just crossed into the second stage. I exited BOINC, restarted, and % completed still looks good and I only lost 2.5min of work! So, the checkpointing must be working too! |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I did another end and restart of BOINC on the same WU and it ended upon restart when it had 8.5hrs more to go to reach target CPU time. Upon restart it was initializing for 45 seconds or so and then ended. No indication of why in the result and the messages tab just says computation finished like normal. Looks like the 76 models it had completed were preserved and reported though, and credit was granted. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Another search pairings WU doesn't seem to be displaying the sidechains properly in the graphic. Running on Windows XP Pro. They just appear as little dots, and the ribbons of the backbone are almost translucent. Using "C" to change coloration just changes the color of the dots. Never had such occur before. |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Odd issues. For the display, did you try 'b', which changes the backbone display, or 's', which changes the sidechain display? Your description sounds like abnormal behavior. I'll have to look into what may have caused the premature ending. Thanks for your help! |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Found the cause of the premature ending and will have to fix it. It's not a bad bug though, since the jobs return successfully. Thanks! |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Just did another end and restart and had another end prematurely, so I can't try hitting other characters. Perhaps I had hit one of the other characters by mistake. Here is a screenshot of this WU with the funky sidechains (or lack thereof). |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
That doesn't look right. Does it happen often? The premature ending can actually be fixed without a new compile. The problem is that, after a restart, the job goes by an nstruct argument we give (which should never be used) instead of the cpu run time pref. |
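To make the premature-ending bug concrete, here is a minimal sketch of the two stopping rules described in the post above. It is not the actual Rosetta code: the function name `should_stop`, the flag `use_nstruct`, and the nstruct value of 50 are all hypothetical, chosen only to illustrate how a job could end well short of its target CPU time.

```python
def should_stop(models_done: int,
                cpu_seconds: float,
                cpu_run_time_pref: float,
                nstruct: int,
                use_nstruct: bool) -> bool:
    """Decide whether a job should end (hypothetical sketch).

    use_nstruct=True mimics the buggy post-restart path, where a model
    count limit is consulted instead of the CPU run time preference.
    """
    if use_nstruct:
        # Buggy path: stop as soon as the model count reaches nstruct.
        return models_done >= nstruct
    # Intended path: stop only once the target CPU time has been used.
    return cpu_seconds >= cpu_run_time_pref


# 76 models done, far short of the CPU target (nstruct=50 is made up).
print(should_stop(76, 1000.0, 36000.0, nstruct=50, use_nstruct=True))   # True
print(should_stop(76, 1000.0, 36000.0, nstruct=50, use_nstruct=False))  # False
```

Under these assumed numbers, the restarted job ends immediately on the buggy path, while the intended path keeps running until the CPU run time preference is reached.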
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I just got this one down to the same host, and it is happening there as well. Rosetta is running on the other thread of the HT CPU and looks fine. Does it happen often?? No, these two WUs are the only times I've seen this happen. [edit] I forgot to mention! I only lost 30 seconds of runtime on that last restart! The new checkpointing must be working well. |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I got this one on another host and it seems to have the same issue with the graphic. |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
The checkpoint interval for these test jobs is set at 60 seconds but it is also limited by the disk write interval user preference. |
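The interaction between the 60-second job setting and the disk write preference can be sketched as follows. This is an illustration only, assuming the simplest reading of the post above (the larger of the two intervals wins); the function and parameter names are hypothetical and do not come from the Rosetta or BOINC sources.

```python
def effective_checkpoint_interval(job_interval_s: float,
                                  disk_write_interval_s: float) -> float:
    """Shortest spacing between checkpoints (hypothetical sketch).

    The job asks to checkpoint every job_interval_s seconds, but the
    user's disk write preference can only make checkpoints rarer.
    """
    return max(job_interval_s, disk_write_interval_s)


def should_checkpoint(now_s: float,
                      last_checkpoint_s: float,
                      job_interval_s: float,
                      disk_write_interval_s: float) -> bool:
    """True once enough time has passed since the last checkpoint."""
    interval = effective_checkpoint_interval(job_interval_s,
                                             disk_write_interval_s)
    return now_s - last_checkpoint_s >= interval


# A 60 s job setting with a 300 s disk write preference (made-up value).
print(effective_checkpoint_interval(60.0, 300.0))  # 300.0
```

Under this reading, a volunteer with a long disk write interval would lose more work on a restart than the 60-second job setting alone suggests.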
Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
Error <core_client_version>5.8.11</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1) </message> <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 21600 ERROR:: Unable to obtain total_residue & sequence. start pdb file must be provided. ERROR:: Exit from: input_pdb.cc line: 2944 https://ralph.bakerlab.org/result.php?resultid=496662 https://ralph.bakerlab.org/result.php?resultid=496727 Also 0 stderr out <core_client_version>5.8.11</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1) </message> <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 21600 ERROR:: Unable to determine sequence length from starting structure coordinate file ERROR:: Exit from: input_pdb.cc line: 2962 https://ralph.bakerlab.org/result.php?resultid=498608 |
mdettweiler Joined: 4 Apr 07 Posts: 11 Credit: 1,010 RAC: 0 |
I got this workunit a couple of days ago, and it appears that after shutting down and restarting BOINC, it always picks up from the beginning of the model (which is model 1, since it normally doesn't get to do more in the default run time of 1 hour). This is despite the fact that sometimes it's been running for more than an hour on end--which is telling me that the checkpointing may not be working (although I wouldn't have suspected this at first, because the progress reporting was fine). Does anyone know if the problem is with the checkpoints, or with something else? Should I abort the WU? Edit: I forgot to mention this, but yes, I am sure that this workunit does use the new 5.60 application, so it's not an old holdover. |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
Anonymous, That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment. |
mdettweiler Joined: 4 Apr 07 Posts: 11 Credit: 1,010 RAC: 0 |
Anonymous, So...do you mean that the checkpointing is worthless in this particular task? Or, was that a problem but now fixed, after I already got that workunit? |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Rosetta has several different modes, i.e. different types of tasks. They have had better checkpointing for some types of tasks for some time. They've now added better checkpointing for some additional specific types of tasks... but they aren't done yet adding checkpointing to all of the different types of work that Rosetta is capable of doing. David Kim, is the plan to roll out the next release with only the improvements to the pose/jumping checkpointing as it is? Or will you have all task types enhanced prior to a Rosetta release? |
dekim Volunteer moderator Project administrator Project developer Project scientist Joined: 20 Jan 06 Posts: 250 Credit: 543,579 RAC: 0 |
The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though. |
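The difference in checkpoint "resolution" discussed above can be illustrated with a small sketch. It assumes, as a simplification, that the older method only saves state at model boundaries while the newer method saves at a fixed interval within a model; the function name and the numbers are hypothetical.

```python
from typing import Optional


def work_lost_on_restart(seconds_into_model: float,
                         checkpoint_interval_s: Optional[float]) -> float:
    """Seconds of work redone after a restart (hypothetical sketch).

    checkpoint_interval_s=None models checkpointing only at model
    boundaries: the whole current model is restarted. Otherwise state is
    assumed to be saved exactly every checkpoint_interval_s seconds.
    """
    if checkpoint_interval_s is None:
        # Per-model checkpointing: everything since the model began is lost.
        return seconds_into_model
    # Interval checkpointing: only time since the last save is lost.
    return seconds_into_model % checkpoint_interval_s


# 3030 s into a long model (made-up figure).
print(work_lost_on_restart(3030.0, None))  # 3030.0
print(work_lost_on_restart(3030.0, 60.0))  # 30.0
```

With a decoy that takes over an hour and a default run time of one hour, the per-model case means a restart always returns to the start of model 1, which matches the behavior reported earlier in the thread.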
mdettweiler Joined: 4 Apr 07 Posts: 11 Credit: 1,010 RAC: 0 |
Quoting dekim: "The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though." So, are you saying that checkpointing is a no-go for this type of workunit? It sure looks like it from what I'm seeing. I'm also having a similar problem with a regular Rosetta@Home workunit (a 10 hour one); except that it checkpointed fine until it was, like, 65-70 percent done, and it appears that one of the checkpoints fell through and it started from the beginning. (See the "problems with version 5.59" thread over there for details). |
©2024 University of Washington
http://www.bakerlab.org