Bug reports for Ralph 5.52-5.54

Message boards : RALPH@home bug list : Bug reports for Ralph 5.52-5.54

anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2887 - Posted: 19 Mar 2007, 5:20:20 UTC - in response to Message 2884.  

Bug?

I have one WU (https://ralph.bakerlab.org/result.php?resultid=463216)
that is now at 7H 15min on a runtime preference of 4H.
It is at 76.9% and has been stuck there for at least the last 2H.
I can't get it to show graphics.
I have 2 cores on that Mac, and on the other core there is a Rosetta task running
where I can see graphics OK.

I'll let it run and see how it turns out.

Anders n

[edit] 8H 15min, still at 76.9% [/edit]

After 16H 28min it was switched out for Rosetta. It was finished when I woke up
this morning and reported 3H 5min.

Anders n

ID: 2887
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2888 - Posted: 19 Mar 2007, 15:24:17 UTC

anders n, do the messages indicate that the task was preempted and restarted at all during the night? Almost sounds like it was removed from memory and reverted back to a prior checkpoint.
ID: 2888
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2889 - Posted: 19 Mar 2007, 15:36:09 UTC - in response to Message 2888.  

It was preempted but not removed from memory.

Anders n

ID: 2889
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2890 - Posted: 19 Mar 2007, 23:48:04 UTC - in response to Message 2885.  

Sorry, more work coming now!


rosetta_beta_5.52_i686-pc-linux-gnu: error while loading shared libraries: libstdc++.so.6: cannot open shared object file: No such file or directory



Looks like my computer tried to get some Ralph work done last night: at least 15 of those errors.
I have installed libstdc++.so.6.0.3 (there is no official package for SuSE 9.3, but I found a third party package that happened to include this library because they needed it too).
Of course now the project is out of work, so I don't know whether that would have solved the problem or if there are other shared libraries that are missing as well.


ID: 2890
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2891 - Posted: 20 Mar 2007, 0:23:33 UTC - in response to Message 2890.  
Last modified: 20 Mar 2007, 0:32:43 UTC

On second thought, I'm a little worried about our recent Linux build based on your comment. I think we changed the way libraries are used in the build -- let me see if I can fix this.



ID: 2891
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2892 - Posted: 20 Mar 2007, 1:33:04 UTC - in response to Message 2891.  

OK 5.54 should have the old style build with static libraries -- let's see how it goes. I'm seeing at least one host that had consistent success up through 5.51, then has been giving shared library errors in 5.52 and 5.53. So I'll see if it returns good results with 5.54.


ID: 2892
Thomas Leibold
Joined: 25 Feb 07
Posts: 27
Credit: 77,464
RAC: 0
Message 2893 - Posted: 20 Mar 2007, 2:52:25 UTC - in response to Message 2890.  

Sorry, more work coming now!

Thanks! I got one and it appears to be running fine using Ralph 5.53 with the newly installed libstdc++.so.6 which means that this was the only library that was missing.

I saw the news that Ralph 5.54 fixes the library issue, but haven't gotten that version of the client yet.
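
A quick way to check whether any other shared libraries are still missing is to run `ldd` on the executable and look for `not found` entries. The sketch below is only an illustration in Python: the helper names and the sample output are made up for this example, not taken from this thread, and it assumes a Linux host with `ldd` on the PATH.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved dependencies as "<name> => not found"
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> List[str]:
    """Run ldd on a binary (hypothetical helper; requires ldd to be installed)."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)


# Illustrative ldd output, not captured from a real host.
SAMPLE = """\
\tlibstdc++.so.6 => not found
\tlibm.so.6 => /lib/tls/libm.so.6 (0x40028000)
\tlibc.so.6 => /lib/tls/libc.so.6 (0x4004b000)
"""

print(missing_libraries(SAMPLE))
```

For a statically linked build, such as the 5.54 build described above, `ldd` typically reports that the file is not a dynamic executable, so the returned list would simply be empty.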
ID: 2893
Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 2894 - Posted: 20 Mar 2007, 17:59:00 UTC - in response to Message 2893.  

Hi Thomas, good to hear your client is working again. I also haven't seen any more
libstdc++.so.6 errors after the release of 5.54. So hopefully that's fixed for other Linux users who
aren't as library-savvy as you. Thanks for posting!



ID: 2894
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2895 - Posted: 20 Mar 2007, 21:49:10 UTC

Just for giggles, I ended and started BOINC while a 5.54 RNA task was in progress... lost all of my progress on model #2, over 30min of work. The first model took just over 40min to complete, so it should have been some 3/4 through with model 2.

Do the RNA WUs have any checkpointing? Or is progress only saved upon model completion?
ID: 2895
genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,300
RAC: 0
Message 2896 - Posted: 21 Mar 2007, 11:18:54 UTC
Last modified: 21 Mar 2007, 11:23:31 UTC

No problems so far... :) Returned 12 valid 5.54, 1 valid 5.53, and 50 valid 5.52 WUs, all on Windows boxes. No errors yet.
ID: 2896
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2898 - Posted: 21 Mar 2007, 15:58:27 UTC - in response to Message 2887.  
Last modified: 21 Mar 2007, 15:59:06 UTC


A new WU is "stuck": https://ralph.bakerlab.org/result.php?resultid=467564

It is now at 16H 33min and 69.7%.
I have stopped other projects so it will not be preempted this time.
Let's hope for the best.

Anders n



ID: 2898
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2899 - Posted: 22 Mar 2007, 4:49:06 UTC - in response to Message 2898.  




It is now at 30H 19 min.
The watchdog should have done its work by now.
What do you want me to do?

Anders n

ID: 2899
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2900 - Posted: 22 Mar 2007, 13:28:32 UTC
Last modified: 22 Mar 2007, 13:29:14 UTC

Anders n, what is your runtime preference? The watchdog should step in within about 15min of crossing 4x the preferred runtime.

Has it made no progress on model numbers in that time?
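
The rule of thumb above can be written out as a small calculation. This is only a sketch of the arithmetic as described in this post (4x the preferred runtime plus roughly 15 minutes), not the actual watchdog code, and the function names are hypothetical.

```python
WATCHDOG_MULTIPLIER = 4    # "4x the preferred runtime", per the post above
WATCHDOG_GRACE_MIN = 15    # "within about 15min", an approximate figure


def watchdog_deadline_min(preferred_runtime_h):
    """Approximate run time, in minutes, by which the watchdog should act."""
    return preferred_runtime_h * 60 * WATCHDOG_MULTIPLIER + WATCHDOG_GRACE_MIN


def is_overdue(elapsed_h, elapsed_min, preferred_runtime_h):
    """True if the task has run past the approximate watchdog deadline."""
    elapsed_total_min = elapsed_h * 60 + elapsed_min
    return elapsed_total_min > watchdog_deadline_min(preferred_runtime_h)


# A 4-hour preference gives a deadline of about 16H 15min (975 minutes).
print(watchdog_deadline_min(4))
# The task reported earlier in the thread at 30H 19min is well past that.
print(is_overdue(30, 19, 4))
```

With the 4-hour preference mentioned earlier in the thread, the watchdog would be expected to act at roughly 16H 15min, so a task still running at 30H 19min is well outside that window.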
ID: 2900
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2901 - Posted: 22 Mar 2007, 15:53:53 UTC - in response to Message 2900.  



It is now at 40H 50 min.

Still at 69.7%.

My preferred runtime is 4H.

I cannot open graphics on this WU.

Anders n

ID: 2901
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2902 - Posted: 22 Mar 2007, 16:09:51 UTC

I'd suggest you suspend it, then resume it again. There are a few possible outcomes, and knowing which one you get may be of use to the Project Team.

1) It may end with errors.

2) It may restart the same model and end up in the same state of not being able to complete the model. (I'd only let it run 4 hrs this time. I've not seen any tasks that should take that long to complete a model on a 2 GHz CPU; you would see a model complete by the % completed changing.)

3) It may then complete the model normally in <2 hrs.
ID: 2902
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2903 - Posted: 22 Mar 2007, 16:24:41 UTC - in response to Message 2902.  
Last modified: 22 Mar 2007, 16:25:40 UTC



Test 1: Suspend and resume

- The task continues where it was.

Test 2: Restarting BOINC

- It resumes at 2H 47min and I can now view graphics.

Anders n
ID: 2903
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2904 - Posted: 22 Mar 2007, 19:20:58 UTC

Oops, yep, you must be keeping the application in memory (a good thing). I wasn't thinking. Yes, ending BOINC and restarting... basically we're forcing the WU to pick up from the last checkpoint.

Curious, what % done did it say upon restart?
ID: 2904
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2905 - Posted: 22 Mar 2007, 19:27:00 UTC - in response to Message 2904.  
Last modified: 22 Mar 2007, 19:27:48 UTC



69.7%. It started decoy 10.

Anders n
ID: 2905
feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 2906 - Posted: 22 Mar 2007, 22:18:12 UTC

I've got several that seem to have ended normally (lasted through to my 24hr time preference), but the reported WU shows a "No heartbeat from core client for 31 sec - exiting" message.

https://ralph.bakerlab.org/result.php?resultid=459533 v5.52
https://ralph.bakerlab.org/result.php?resultid=466179 v5.52
https://ralph.bakerlab.org/result.php?resultid=467168 v5.54
ID: 2906
anders n
Joined: 16 Feb 06
Posts: 166
Credit: 131,419
RAC: 0
Message 2908 - Posted: 24 Mar 2007, 9:31:46 UTC

I may be on to what is happening with my Mac.
A Ralph task was preempted, Rosetta continued, and all worked as usual.
The Rosetta task finished and Ralph was to continue.
For several seconds the Ralph task showed as running but with no ticking on the
CPU time. Then the time started to go up, but I could not view graphics on
that task again.
I restarted BOINC and everything was back to normal.

Anders n
ID: 2908



©2024 University of Washington
http://www.bakerlab.org