Bug reports for 5.60-5.62

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3011 - Posted: 26 Apr 2007, 19:02:33 UTC

This update contains the following:

1. a fix for the percent complete going back to zero after restarts
2. checkpointing for new pose and jumping jobs
3. an optimization for jumping jobs, which were having issues with long and variable run times.
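
As an illustration of item 1, one way to keep the percent complete from dropping to zero after a restart is to derive it from CPU time that is saved in the checkpoint. This is only a sketch under that assumption, not the actual fix; `fraction_done` and its parameters are hypothetical names.

```python
def fraction_done(cpu_s_from_checkpoint, cpu_s_this_session, target_cpu_s):
    """Progress in [0, 1] based on total CPU time.

    Hypothetical helper: cpu_s_from_checkpoint is the CPU time restored
    from the last checkpoint, so a restart does not reset the result.
    """
    if target_cpu_s <= 0:
        raise ValueError("target_cpu_s must be positive")
    total = cpu_s_from_checkpoint + cpu_s_this_session
    # Cap at 1.0 in case the job overruns its target.
    return min(1.0, total / target_cpu_s)
```

Right after a restart `cpu_s_this_session` is 0, but the restored time keeps the reported progress where it was.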

I'll queue up some test jobs soon.

Mike Gelvin

Joined: 17 Feb 06
Posts: 50
Credit: 55,397
RAC: 0
Message 3012 - Posted: 26 Apr 2007, 20:51:06 UTC

First workunit I got promptly crashed:

https://ralph.bakerlab.org/result.php?resultid=496686

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Unable to obtain total_residue & sequence.
start pdb file must be provided.
ERROR:: Exit from: .input_pdb.cc line: 2944
# cpu_run_time_pref: 3600

</stderr_txt>
]]>




with 0 CPU time.... this does not bode well...


Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3013 - Posted: 26 Apr 2007, 21:04:03 UTC

I accidentally submitted a few bad jobs that will fail. Please ignore these.

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3014 - Posted: 27 Apr 2007, 1:41:12 UTC

I've got a search pairings WU. It was 3 hrs 48 min into the run, on the 19th model, and had just crossed into the second stage.

I exited BOINC, restarted, and % completed still looks good and I only lost 2.5min of work! So, the checkpointing must be working too!

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3015 - Posted: 27 Apr 2007, 13:29:42 UTC

I did another end and restart of BOINC on the same WU and it ended upon restart when it had 8.5hrs more to go to reach target CPU time. Upon restart it was initializing for 45 seconds or so and then ended. No indication of why in the result and the messages tab just says computation finished like normal. Looks like the 76 models it had completed were preserved and reported though, and credit was granted.

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3016 - Posted: 27 Apr 2007, 14:51:19 UTC

Another search pairings WU doesn't seem to be displaying the sidechains properly in the graphic. Running on Windows XP Pro. They just appear as little dots, and the ribbons of the backbone are almost translucent. Using "C" to change coloration just changes the color of the dots. I've never had this occur before.

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3017 - Posted: 27 Apr 2007, 19:04:53 UTC

Odd issues.

For the display, did you try 'b', which changes the backbone display, or 's', which changes the sidechain display? Your description sounds like abnormal behavior.

I'll have to look into what may have caused the premature ending.

Thanks for your help!

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3018 - Posted: 27 Apr 2007, 19:17:39 UTC

Found the cause of the premature ending and will have to fix it. It's not a bad bug though, since the jobs return successfully. Thanks!

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3019 - Posted: 27 Apr 2007, 19:32:53 UTC

Just did another end and restart and had another end prematurely, so I can't try hitting other characters. Perhaps I had hit one of the other characters by mistake.

Here is a screenshot of this WU with the funky sidechains (or lack thereof).

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3020 - Posted: 27 Apr 2007, 19:49:29 UTC

That doesn't look right. Does it happen often? The premature ending can actually be fixed without a new compile. The problem is that, after a restart, the job goes by an nstruct argument we give (which should never be used) instead of the CPU run time preference.
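
A minimal sketch of the intended stop condition, assuming hypothetical names (`should_stop`, `nstruct_cap`, and the other parameters); it is not the actual Rosetta code. The idea is that the CPU run time preference decides when a job ends, while the nstruct value is only a safety cap that should never be reached, including after a restart:

```python
def should_stop(models_done, cpu_seconds_used, cpu_run_time_pref, nstruct_cap):
    """Return True when the job should finish and report its models.

    All names are hypothetical. cpu_seconds_used is assumed to include
    the CPU time restored from the checkpoint, so it survives a restart.
    """
    # Safety cap only: set high enough that it should never trigger.
    if models_done >= nstruct_cap:
        return True
    # Always produce at least one model before stopping.
    if models_done == 0:
        return False
    # Normal path: stop once the user's target CPU time is reached.
    return cpu_seconds_used >= cpu_run_time_pref
```

With a large cap, a job that has finished 76 models but is still hours short of its target CPU time keeps running after a restart.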

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3021 - Posted: 27 Apr 2007, 19:55:18 UTC
Last modified: 27 Apr 2007, 19:58:33 UTC

I just got this one on the same host, and it is happening there as well. Rosetta is running on the other thread of the HT CPU and looks fine.

Does it happen often?? No, these two WUs are the only times I've seen this happen.

[edit] I forgot to mention! I only lost 30 seconds of runtime on that last restart! The new checkpointing must be working well.

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3022 - Posted: 27 Apr 2007, 20:21:35 UTC

I got this one on another host and it seems to have the same issue with the graphic.

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3023 - Posted: 27 Apr 2007, 21:27:45 UTC

The checkpoint interval for these test jobs is set at 60 seconds but it is also limited by the disk write interval user preference.
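
A rough sketch of how the two limits could combine, assuming the larger of the two intervals wins; the function and parameter names are hypothetical and this is not the BOINC or Rosetta API:

```python
def effective_checkpoint_interval(job_interval_s, disk_write_interval_s):
    """Seconds between checkpoints when both limits apply.

    Assumption: a checkpoint is written only when both the job's own
    interval and the user's disk write interval have elapsed.
    """
    return max(job_interval_s, disk_write_interval_s)


def time_to_checkpoint(now_s, last_checkpoint_s, job_interval_s, disk_write_interval_s):
    """Return True if enough time has passed to write a new checkpoint."""
    interval = effective_checkpoint_interval(job_interval_s, disk_write_interval_s)
    return now_s - last_checkpoint_s >= interval
```

Under this assumption, a 60 second job interval with a 300 second disk write preference checkpoints every 300 seconds, while a 30 second preference leaves the job at its own 60 seconds.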

Profile Conan

Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3024 - Posted: 28 Apr 2007, 7:07:14 UTC
Last modified: 28 Apr 2007, 7:09:32 UTC

Error

<core_client_version>5.8.11</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 21600
ERROR:: Unable to obtain total_residue & sequence.
start pdb file must be provided.
ERROR:: Exit from: input_pdb.cc line: 2944

https://ralph.bakerlab.org/result.php?resultid=496662
https://ralph.bakerlab.org/result.php?resultid=496727

Also
0
stderr out

<core_client_version>5.8.11</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 21600
ERROR:: Unable to determine sequence length from starting structure coordinate file
ERROR:: Exit from: input_pdb.cc line: 2962

https://ralph.bakerlab.org/result.php?resultid=498608

mdettweiler

Joined: 4 Apr 07
Posts: 11
Credit: 1,010
RAC: 0
Message 3025 - Posted: 30 Apr 2007, 5:10:44 UTC
Last modified: 30 Apr 2007, 5:11:43 UTC

I got this workunit a couple of days ago, and it appears that after shutting down and restarting BOINC, it always picks up from the beginning of the model (which is model 1, since it normally doesn't get to do more in the default run time of 1 hour). This is despite the fact that sometimes it's been running for more than an hour on end--which is telling me that the checkpointing may not be working (although I wouldn't have suspected this at first, because the progress reporting was fine).

Does anyone know if the problem is with the checkpoints, or with something else? Should I abort the WU?

Edit: I forgot to mention this, but yes, I am sure that this workunit does use the new 5.60 application, so it's not an old holdover.

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3026 - Posted: 30 Apr 2007, 17:14:35 UTC

Anonymous,

That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment.
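
To illustrate what checkpoint resolution means for lost work, here is a small sketch. It assumes, based on the report above rather than on the Rosetta source, that the standard method only saves state when a model finishes; `work_lost_on_restart` is a hypothetical helper.

```python
def work_lost_on_restart(elapsed_s, checkpoint_times_s):
    """CPU seconds lost if the job is restarted at elapsed_s.

    checkpoint_times_s lists the CPU times at which checkpoints were
    written. Work done since the latest one at or before elapsed_s is
    lost; with no checkpoint yet, everything is lost.
    """
    earlier = [t for t in checkpoint_times_s if t <= elapsed_s]
    return elapsed_s - max(earlier, default=0)


# Coarse resolution: one checkpoint when a 65 minute model finishes.
per_model = [65 * 60]
# Fine resolution: a checkpoint every 60 seconds.
every_minute = list(range(60, 65 * 60 + 1, 60))
```

With these inputs, a restart one hour into the first model loses the whole hour under `per_model`, but at most about a minute under `every_minute`.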

mdettweiler

Joined: 4 Apr 07
Posts: 11
Credit: 1,010
RAC: 0
Message 3027 - Posted: 30 Apr 2007, 19:58:51 UTC - in response to Message 3026.  
Last modified: 30 Apr 2007, 19:59:18 UTC

Anonymous,

That work unit uses the standard checkpointing method which doesn't have the same resolution as the new checkpointing for pose/jumping jobs. I also noticed that this particular job runs quite long, just over an hour per decoy on average. Unfortunately, the long run time is required for this particular experiment.


So...do you mean that the checkpointing is worthless in this particular task? Or, was that a problem but now fixed, after I already got that workunit?

Profile feet1st

Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3028 - Posted: 30 Apr 2007, 21:44:38 UTC

Rosetta has several different modes, for different types of tasks. They have had better checkpointing for some types of tasks for some time. They've now added better checkpointing for some additional specific types of tasks... but they aren't done yet adding checkpointing to all of the different types of work that Rosetta is capable of doing.

David Kim, is the plan to roll out the next release with only the improvements to the pose/jumping checkpointing as it is? Or will you have all task types enhanced prior to a Rosetta release?

Profile dekim
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 20 Jan 06
Posts: 250
Credit: 543,579
RAC: 0
Message 3029 - Posted: 1 May 2007, 0:29:23 UTC

The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though.

mdettweiler

Joined: 4 Apr 07
Posts: 11
Credit: 1,010
RAC: 0
Message 3030 - Posted: 1 May 2007, 4:14:26 UTC - in response to Message 3029.  

The new checkpointing is just for pose/jumping jobs. The recent long jobs, such as the one anonymous pointed out (FRA_a011_IG9_hom001_1_a011_1_bfac_S_00001_0000495_0.pdb), use the older method of checkpointing. I'll discuss the possibility of improving the checkpointing for these jobs with Bin, who is the one who developed it and is running these long jobs. I don't think we'll be able to add this to the next release though.


So, are you saying that checkpointing is a no-go for this type of workunit? It sure looks like it from what I'm seeing.

I'm also having a similar problem with a regular Rosetta@Home workunit (a 10 hour one); except that it checkpointed fine until it was, like, 65-70 percent done, and it appears that one of the checkpoints fell through and it started from the beginning. (See the "problems with version 5.59" thread over there for details).