Posts by David@home

1) Message boards : RALPH@home bug list : removed from memory by benchmark (Message 819)
Posted 5 Mar 2006 by David@home
Post:
When the benchmark ran it forced RALPH out of memory. Not sure how this can be managed better.


This was a problem with BOINC not with the science apps, I'm not sure which version fixed it (it might be in the development version) but try updating to the current version for your computer.

--Nathan


Cool, if a newer dev version of BOINC handles this better then that is good. Part of the alpha test should be to give feedback on the BOINC infrastructure as well when it highlights an issue, but it sounds like this one is already covered.

2) Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors (Message 818)
Posted 5 Mar 2006 by David@home
Post:
This WU had an unrecoverable error (result 13723).

in BOINC log:

05/03/2006 20:37:03|ralph@home|Unrecoverable error for result BARCODE_30_1c8cA_236_4_0 ( - exit code -1073741819 (0xc0000005))

XP Pro SP2, Intel P4 single CPU no HT. BOINC 5.2.13
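
The two numbers in that log line are the same value: BOINC prints the exit code as a signed 32-bit integer and then as unsigned hex. 0xC0000005 is the Windows status code for an access violation. A minimal sketch of the conversion follows (the function name is made up for illustration):

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned hex."""
    # Masking with 0xFFFFFFFF maps a negative value onto its
    # two's-complement unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # 0xc0000005
```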

3) Message boards : RALPH@home bug list : removed from memory by benchmark (Message 816)
Posted 5 Mar 2006 by David@home
Post:
The way to solve it is to manually do a benchmark when a RALPH or Rosetta WU is not in the cache if at all possible

That is NOT a solution - that is a work-around.

A solution would be to fix the Rosetta client application so that it does not lose the work when suspended and removed from memory for the scheduled benchmark calibration.


Exactly. For test projects it is to be expected that you spend time doing things to help, but for production experiments you just want to let BOINC do its stuff. You should not have to micromanage BOINC. Look at the success of the BBC climate project. It must be the fastest growing project, which in part will be due to the ease of running it: download, install, enter an email address to register and that's it.

BOINC was developed initially alongside SETI@home, which has a 60 second checkpoint cycle. Newer projects that use longer checkpoint intervals do not fit well. E.g. a user who only runs it as a screen saver, or only when the PC is idle, could lose a significant amount of elapsed time making up the CPU time done since the last checkpoint. The BOINC infrastructure needs to manage running the benchmark differently. BOINC should not need to remove clients from memory, e.g. could it check for free RAM before running the benchmark and just suspend clients? Can Rosetta use a different checkpoint algorithm?
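
To put rough numbers on the checkpoint argument, here is a back-of-envelope sketch. It assumes the application is removed from memory at a uniformly random point in the checkpoint cycle, and that `cpu_share` is the fraction of wall-clock time the application actually gets the CPU (e.g. a PC that only crunches when idle). Both the function and the one hour interval in the second call are hypothetical. Only the 60 second SETI@home figure comes from the post.

```python
def lost_elapsed_seconds(checkpoint_interval_s: float, cpu_share: float = 1.0):
    """Return (expected, worst_case) elapsed seconds lost per removal from memory."""
    if not 0 < cpu_share <= 1:
        raise ValueError("cpu_share must be in (0, 1]")
    # Worst case: removed just before the next checkpoint would have been written.
    worst = checkpoint_interval_s / cpu_share
    # Uniformly random removal loses half an interval on average.
    return worst / 2, worst


print(lost_elapsed_seconds(60))          # (30.0, 60.0)
print(lost_elapsed_seconds(3600, 0.25))  # (7200.0, 14400.0)
```

With a 60 second cycle the loss is negligible. With a hypothetical one hour interval on a machine that crunches a quarter of the time, each removal costs two hours of elapsed time on average.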



4) Message boards : RALPH@home bug list : removed from memory by benchmark (Message 809)
Posted 3 Mar 2006 by David@home
Post:
More an observation, but one of concern after reading the FAQ about checkpoint times and that it is best to keep Rosetta in memory etc. When the benchmark ran it forced RALPH out of memory. Not sure how this can be managed better.


2006-03-03 09:29:51 [ralph@home] Resuming computation for result BARCODE_30_2ci2I_237_4_0 using rosetta_beta version 4.92
2006-03-03 09:31:38 [---] Suspending computation and network activity - running CPU benchmarks
2006-03-03 09:31:38 [ralph@home] Pausing result BARCODE_30_2ci2I_237_4_0 (removed from memory)
2006-03-03 09:31:39 [---] request_reschedule_cpus: process exited
2006-03-03 09:31:40 [---] Running CPU benchmarks
2006-03-03 09:32:37 [---] Benchmark results:
2006-03-03 09:32:37 [---] Number of CPUs: 1
2006-03-03 09:32:37 [---] 1369 double precision MIPS (Whetstone) per CPU
2006-03-03 09:32:37 [---] 2854 integer MIPS (Dhrystone) per CPU
2006-03-03 09:32:37 [---] Finished CPU benchmarks
2006-03-03 09:32:37 [---] Resuming computation and network activity
2006-03-03 09:32:37 [---] schedule_cpus: must schedule
2006-03-03 09:32:37 [ralph@home] Restarting result BARCODE_30_2ci2I_237_4_0 using rosetta_beta version 4.92
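
For anyone wanting to check their own logs for the same behaviour, a small sketch that picks out results paused with "(removed from memory)" is shown below. It assumes the log line layout above (`date time [project] message`). The function name and regular expressions are illustrative and not part of BOINC.

```python
import re

# "2006-03-03 09:31:38 [ralph@home] message text"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]*)\] (.*)$")
PAUSE = re.compile(r"^Pausing result (\S+) \((removed from|left in) memory\)$")


def removed_from_memory(log_text: str):
    """Return (timestamp, project, result) for each result paused out of memory."""
    hits = []
    for raw in log_text.splitlines():
        line = LINE.match(raw.strip())
        if not line:
            continue  # skip blank or unrecognised lines
        stamp, project, message = line.groups()
        pause = PAUSE.match(message)
        if pause and pause.group(2) == "removed from":
            hits.append((stamp, project, pause.group(1)))
    return hits


SAMPLE = """
2006-03-03 09:31:38 [---] Suspending computation and network activity - running CPU benchmarks
2006-03-03 09:31:38 [ralph@home] Pausing result BARCODE_30_2ci2I_237_4_0 (removed from memory)
2006-03-03 09:31:39 [---] request_reschedule_cpus: process exited
2006-03-03 09:31:40 [---] Running CPU benchmarks
"""

for hit in removed_from_memory(SAMPLE):
    print(hit)
```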
5) Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors (Message 807)
Posted 3 Mar 2006 by David@home
Post:
This WU finished using v 4.92 and claimed credit but contained some interesting messages so worth a look by the experts:

# Exception caught in nstruct loop ii=1 i=40
# num_decoys:39 attempts:40 cpu_run_time:26311.8

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C910E03 write attempt to address 0x00000000

# cpu_run_time_pref: 28800

WU result is resultid 14051
6) Message boards : RALPH@home bug list : Nice work and new core-clients, but NO WORK !! (Message 794)
Posted 2 Mar 2006 by David@home
Post:
What should i do ??
Without work, no tests ... :(


Just leave the project running; the BOINC core client will keep trying to connect and download work. RALPH is a test project and releases work when something needs to be tested. I am in the same situation: I aborted my 4.90 WUs and downloaded 4.91 but there is no work. Which is a shame, as with SETI@home being down I could have crunched quite a bit of RALPH work.

7) Message boards : RALPH@home bug list : application not staying in memory (Message 782)
Posted 2 Mar 2006 by David@home
Post:

Hmmm, perhaps Windows decided it was time to run one of those findfast-Utilities that scan your harddisks?


I disabled the indexing service on my PC a long time ago, as fast search is a pointless CPU-wasting activity IMHO (My Computer > Drive letter > right mouse click > Properties > General tab, then uncheck "Allow Indexing Service to index this disk for fast file searching"). No Google or MSN desktop search either :-) The PC would only have been running SETI at the time. :-(

I have aborted the 4.90 WUs as per the news. Does anybody know if v 4.91 has any updates to try to address this issue? The project seems to carry on from the last checkpoint, but the loss of credit would be an issue in the production environment. E.g. if this were to happen one hour from the end of a 10 hour run you would only get credit for the last hour of CPU time. Looking at the result returned, this WU dropped out of memory three times, so this would be a common problem in production, at least on my PC.

http://ralph.bakerlab.org/result.php?resultid=12783


8) Message boards : RALPH@home bug list : application not staying in memory (Message 773)
Posted 1 Mar 2006 by David@home
Post:
Alas not good news.

I updated to BOINC v 5.2.13 and I am currently running Rosetta Beta 4.90. I am still getting the client dropping out of memory when it is swapped for another project, even though the reside-in-memory preference is set on.

e.g.

28/02/2006 23:43:04|ralph@home|Result HOMSdi_homDB018_1di2__228_10_0 exited with zero status but no 'finished' file
28/02/2006 23:43:04|ralph@home|If this happens repeatedly you may need to reset the project.

When this happens you lose all credit for the work done up to that point, and credit is calculated afresh from when the client becomes active again. Not an issue for RALPH, but one which would stop me running it on the live Rosetta system.

The PC was only running SETI@home at the time above, no user activity, no backup, no antivirus etc was running. The PC has 1GB of RAM so there is no issue with physical memory availability.



Any ideas?
9) Message boards : RALPH@home bug list : Rosetta does not give up CPU time to cleanmgr.exe (Message 571)
Posted 24 Feb 2006 by David@home
Post:
It kindof defeats the purpose to disable the process, if I WANT to run it yeah?



No. RALPH is a test project, so this test will show whether the compress-files check is what is causing the issue with RALPH. You can restore it afterwards as per the instructions.
10) Message boards : RALPH@home bug list : application not staying in memory (Message 561)
Posted 24 Feb 2006 by David@home
Post:
I can update BOINC Mgr and see if this helps.
11) Message boards : RALPH@home bug list : Rosetta does not give up CPU time to cleanmgr.exe (Message 560)
Posted 24 Feb 2006 by David@home
Post:
This runs cleanmgr exactly the same way. The first thing cleanmgr does is look for the amount of disk space that can be saved by compressing old files. This check can be disabled using the information in the Microsoft article.

Try creating a backup of this registry key, then delete it, then check RALPH@home.

After the test you can always restore the registry key using the backup you made.

This is all explained in the Microsoft article.

12) Message boards : RALPH@home bug list : Rosetta does not give up CPU time to cleanmgr.exe (Message 543)
Posted 23 Feb 2006 by David@home
Post:
cleanmgr.exe runs at normal priority. I have not seen this on my XP Pro SP2 system. Cleanmgr only uses a very small amount of CPU when it is checking for space that can be saved by compressing old files.

If you are happy using regedit there is a registry key that you can set to stop XP from running this compress space check which is very slow. I have done this on my system.

This url has more info

http://support.microsoft.com/?id=812248


13) Message boards : RALPH@home bug list : application not staying in memory (Message 541)
Posted 23 Feb 2006 by David@home
Post:
I have noticed that RALPH WUs regularly fall out of memory when the client is in paused state.

e.g. from the log file:

23/02/2006 17:17:06|ralph@home|Restarting result BARCODE_30_1cc8A_215_22_0 using rosetta_beta version 4.86
23/02/2006 17:17:06|SETI@home|Pausing result 23dc00aa.19627.28578.1009650.1.195_0 (left in memory)
23/02/2006 18:17:06|ralph@home|Pausing result BARCODE_30_1cc8A_215_22_0 (left in memory)
23/02/2006 18:17:06|SETI@home|Resuming result 23dc00aa.19627.28578.1009650.1.195_0 using setiathome version 4.11
23/02/2006 18:36:56|ralph@home|Result BARCODE_30_1cc8A_215_22_0 exited with zero status but no 'finished' file
23/02/2006 18:36:56|ralph@home|If this happens repeatedly you may need to reset the project.
23/02/2006 18:36:56||request_reschedule_cpus: process exited


As a project reset will delete all files associated with RALPH it would not make sense to do this if this failure to remain in memory is something to do with the new client under test.

Using Windows XP Pro SP2, BOINC v 4.45, Intel P4 single core no hyperthreading. Client preferences set to leave in memory. Sharing two applications, RALPH@home and SETI@home.





When you say "Client is in paused state", are you saying -
1) That the rosetta client application has been swapped out by BOINC to run another project application
2) You have paused the workunit from the work tab
3) you have suspended BOINC client activities from the BOINC menu
4) You have suspended the Ralph project in the projects tab in BOINC Manager.

If you are talking about 1, 2 or 3 then there is a problem; if you are talking about 4 then it might be normal.



I am referring to 1, i.e. the normal swapping of applications by the BOINC core. I have noticed it happen several times. E.g. in the sample log RALPH had been paused to allow SETI to run, but in this case, 19 minutes after being paused, the RALPH application just dropped out of memory for no reason. It was no longer visible in Windows Task Manager. I am using a one hour switch time.
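
The 19 minute figure can be read straight off the log timestamps. A minimal sketch follows, assuming the pipe-delimited `DD/MM/YYYY HH:MM:SS|project|message` layout shown in the sample log (function names are illustrative only):

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def parse_line(line: str):
    """Split 'DD/MM/YYYY HH:MM:SS|project|message' into its three fields."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def paused_to_exit(lines: Iterable[str], project: str = "ralph@home") -> Optional[timedelta]:
    """Time between a project's last pause and its unexpected exit, if both are logged."""
    paused_at = None
    for line in lines:
        when, proj, message = parse_line(line.strip())
        if proj != project:
            continue
        if message.startswith("Pausing result"):
            paused_at = when
        elif "exited with zero status" in message and paused_at is not None:
            return when - paused_at
    return None


LOG = [
    "23/02/2006 18:17:06|ralph@home|Pausing result BARCODE_30_1cc8A_215_22_0 (left in memory)",
    "23/02/2006 18:17:06|SETI@home|Resuming result 23dc00aa.19627.28578.1009650.1.195_0 using setiathome version 4.11",
    "23/02/2006 18:36:56|ralph@home|Result BARCODE_30_1cc8A_215_22_0 exited with zero status but no 'finished' file",
]

print(paused_to_exit(LOG))  # 0:19:50
```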

14) Message boards : RALPH@home bug list : application not staying in memory (Message 538)
Posted 23 Feb 2006 by David@home
Post:
I have noticed that RALPH WUs regularly fall out of memory when the client is in paused state.

e.g. from the log file:

23/02/2006 17:17:06|ralph@home|Restarting result BARCODE_30_1cc8A_215_22_0 using rosetta_beta version 4.86
23/02/2006 17:17:06|SETI@home|Pausing result 23dc00aa.19627.28578.1009650.1.195_0 (left in memory)
23/02/2006 18:17:06|ralph@home|Pausing result BARCODE_30_1cc8A_215_22_0 (left in memory)
23/02/2006 18:17:06|SETI@home|Resuming result 23dc00aa.19627.28578.1009650.1.195_0 using setiathome version 4.11
23/02/2006 18:36:56|ralph@home|Result BARCODE_30_1cc8A_215_22_0 exited with zero status but no 'finished' file
23/02/2006 18:36:56|ralph@home|If this happens repeatedly you may need to reset the project.
23/02/2006 18:36:56||request_reschedule_cpus: process exited


As a project reset will delete all files associated with RALPH it would not make sense to do this if this failure to remain in memory is something to do with the new client under test.

Using Windows XP Pro SP2, BOINC v 4.45, Intel P4 single core no hyperthreading. Client preferences set to leave in memory. Sharing two applications, RALPH@home and SETI@home.





15) Message boards : RALPH@home bug list : Discussion of the "1% Hang" issue (Message 335)
Posted 19 Feb 2006 by David@home
Post:
I was out for about 1 hour to do love.
I'm back, now.
WU still undisturbed, suspended into RAM.
Anything to do ?


Hi,

There is a request from dekim a few posts down in this thread:

http://ralph.bakerlab.org/forum_thread.php?id=1#328p

My best guess is that this is for both of us.


16) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 332)
Posted 19 Feb 2006 by David@home
Post:
Can you restart boinc and see if it continues on?



Restarted BOINC. The WU appears to have gone back to the start: according to the graphic it is at Model 1 step 78 (and incrementing). If it gets stuck again, is there anything we can do to help pin this down?

From the log:

19/02/2006 19:42:20||request_reschedule_cpus: project op
19/02/2006 19:42:40|ralph@home|Restarting result BARCODE_30_256bA_NATIVE_210_24_0 using rosetta_beta version 4.84
19/02/2006 19:42:40|SETI@home|Pausing result 14au00aa.7506.496.234660.1.92_2 (left in memory)





OK, this time it went through to completion and credit was granted:

http://ralph.bakerlab.org/workunit.php?wuid=3325

Interestingly this is about 30 mins of CPU time, it had done 30 minutes previously before it hung. These test WUs typically take just under an hour on my PC. It is as if it has only claimed credit for the second 30 mins of CPU but carried on the calculations from where it got stuck. Should it have claimed the credit for both periods of CPU activity?

Any comments on the credit and the fact that it did not hang the second time?

It would be interesting to hear any ideas. Thanks.



17) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 329)
Posted 19 Feb 2006 by David@home
Post:
Can you restart boinc and see if it continues on?



Restarted BOINC. The WU appears to have gone back to the start: according to the graphic it is at Model 1 step 78 (and incrementing). If it gets stuck again, is there anything we can do to help pin this down?

From the log:

19/02/2006 19:42:20||request_reschedule_cpus: project op
19/02/2006 19:42:40|ralph@home|Restarting result BARCODE_30_256bA_NATIVE_210_24_0 using rosetta_beta version 4.84
19/02/2006 19:42:40|SETI@home|Pausing result 14au00aa.7506.496.234660.1.92_2 (left in memory)


18) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 306)
Posted 19 Feb 2006 by David@home
Post:
The people with hung work units in memory waiting for instructions:

I will send a note to David Kim to get his attention to this thread and provide you with further instructions on what to do. As I write this it is 7:00 am Sunday on the West coast, so assuming he checks his mail on Sunday mornings he should get back to you soon. The information you can provide him is valuable, so please hang in there till he gets back to you.



Many thanks for the update. I just checked and for some reason the WU has dropped out of memory. Even though the project was suspended and BOINC Manager still shows the work unit as preempted, it is no longer in Windows Task Manager, and in the BOINC Manager log there is this info:

19/02/2006 10:59:19|ralph@home|Result BARCODE_30_256bA_NATIVE_210_24_0 exited with zero status but no 'finished' file
19/02/2006 10:59:19|ralph@home|If this happens repeatedly you may need to reset the project.
19/02/2006 10:59:19||request_reschedule_cpus: process exited


It is not clear why it did this after being happily suspended for several hours. The other project was not doing anything other than crunching its work unit at this time, so it was not a side effect of the other project.

My understanding is that the CC will retry this WU once again when I unsuspend the client. I will wait to hear from the devs before doing this.

Edit >> Hmmm, interesting, I just checked something... the last antispyware scan I ran was at around 11:00. Maybe Windows Defender kicked the binary in memory, which caused it to fail as above.


19) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 303)
Posted 19 Feb 2006 by David@home
Post:
Sorry just seen your request for step number etc.

There is a screen shot of the graphic at

http://mercury.walagata.com/w/appetiser/ralph.gif
20) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 302)
Posted 19 Feb 2006 by David@home
Post:
I'd be curious to know the model# and Step number it's frozen at, but don't want you to lose the possibility of them asking you to do something first. This data is on the graphic.
Is it a 4.83, 4.85??
What's your switch between projects time?
Are you doing more than one project?
Is this a Hyperthreading host?
CPU type?


WU is BARCODE_30_256bA_NATIVE_210_24_0
Application is rosetta_beta 4.84
3 projects: RALPH and SETI active (+ Rosetta suspended)
Switch interval 60 minutes
No hyperthreading unfortunately
CPU: Pentium 4 2.5GHz
OS is Windows XP Pro SP2


Full Proc specs:

Intel(R) Processor Frequency ID Utility
Version: 5.5.20030402
Time Stamp: 2006/02/19 07:58:58
Number of processors in system: 1
Current processor: #1
Processor Name: Intel(R) Pentium(R) 4 CPU 2.53GHz
Type: 0
Family: F
Model: 2
Stepping: 4
Revision: 1E
L1 Trace Cache: 12 Kµops
L1 Data Cache: 8 KB
L2 Cache: 512 KB
Packaging: FC-PGA2
MMX(TM): Yes
SIMD: Yes
SIMD2: Yes
NetBurst(TM) Microarchitecture: Yes
Expected Processor Frequency: 2.53 GHz
Reported Processor Frequency: 2.53 GHz
Expected System Bus Frequency: 533 MHz
Reported System Bus Frequency: 533 MHz






©2024 University of Washington
http://www.bakerlab.org