Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors
Author | Message |
---|---|
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
This thread is for reporting work unit errors that are NOT related to hangs at 1% or application swapping. |
Mathias Becher Joined: 16 Feb 06 Posts: 1 Credit: 2,970 RAC: 0 |
This WU: https://ralph.bakerlab.org/result.php?resultid=5959 errored out when my router was reconnecting to the internet. (Here in Germany, DSL connections are reconnected every 24 hours.) Hope that helps debugging this issue. PS: I was looking at the screensaver when that happened. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
https://ralph.bakerlab.org/result.php?resultid=5734 Don't know what happened; I was busy doing other things. The computer that ran this WU is a network server with no users, always on (24/365), with no keyboard, mouse, or monitor. *Internet link is eth1 (rtl 8139c), half duplex, 10 Mbit/s, MTU 1500, operating at 2 Mbit/s, with SYN cookies enabled and ECN disabled. |
Rebirther Joined: 17 Feb 06 Posts: 9 Credit: 1,491 RAC: 0 |
The currently running "Barcode.xxx" WU was at 23% after 2h (Model 10) and has now fallen back to 9% (Model 11). What now? |
Snake Doctor Joined: 16 Feb 06 Posts: 37 Credit: 998,880 RAC: 0 |
"The currently running "Barcode.xxx" WU was at 23% after 2h (Model 10) and has now fallen back to 9% (Model 11). What now?" Did you change your time setting or something? Regards Phil |
Rebirther Joined: 17 Feb 06 Posts: 9 Credit: 1,491 RAC: 0 |
"The currently running "Barcode.xxx" WU was at 23% after 2h (Model 10) and has now fallen back to 9% (Model 11). What now?" Yes, from the default to 1 day, because it looks like it will take very long. |
Rebirther Joined: 17 Feb 06 Posts: 9 Credit: 1,491 RAC: 0 |
The WU has now jumped from 11% to 100% after 4.5h. Don't know how; it only finished Model 15. I don't understand the time settings. Are they only for getting new WUs? Can anybody help me with this? |
Snake Doctor Joined: 16 Feb 06 Posts: 37 Credit: 998,880 RAC: 0 |
"The WU has now jumped from 11% to 100% after 4.5h. Don't know how; it only finished Model 15. I don't understand the time settings. Are they only for getting new WUs? Can anybody help me with this?" The time setting is only used to determine how long a work unit will run. But the jump in % complete you have reported may be normal for the time setting you have. There is more information in the Rosetta FAQ here about this. It explains how the % could jump like you say it did. Here is another one that describes exactly what you said happened. It looks like you adjusted the time preference to less time than the work unit had already processed, so when it finished the model it was working on, it jumped to 100% and reported. Here is the data from the work unit's stderr out: <core_client_version>5.2.13</core_client_version> <stderr_txt> # random seed: 3993914 # cpu_run_time_pref: 28800 # cpu_run_time_pref: 86400 # cpu_run_time_pref: 7200 # DONE :: 1 starting structures built 6 (nstruct) times # This process generated 15 decoys from 15 attempts </stderr_txt> Regards Phil |
Rebirther Joined: 17 Feb 06 Posts: 9 Credit: 1,491 RAC: 0 |
"The WU has now jumped from 11% to 100% after 4.5h. Don't know how; it only finished Model 15. I don't understand the time settings. Are they only for getting new WUs? Can anybody help me with this?" Thx, the second link answered my question. It was all a little bit confusing, but I learned something ;) |
doc :) Joined: 16 Feb 06 Posts: 46 Credit: 4,437 RAC: 0 |
this WU, result ended with the following error: 22/02/2006 19:51:37|ralph@home|Unrecoverable error for result BARCODE_30_1c8cA_215_24_0 ( - exit code -1073741811 (0xc000000d)) i had the graphics window open in the background, it was not getting removed from memory, i was browsing some websites when i noticed the graphics were gone and saw that error in my boincmanager. no clue what happened. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
Access Violation - Rosetta 4.82 https://boinc.bakerlab.org/rosetta/result.php?resultid=11863895 This is the new Rosetta client for Windows, updated 18 Feb 2006. |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
"Access Violation - Rosetta 4.82" Carlos, in looking at this I see your system only has 256 MB of memory. When I look at your results, many of them that did not actually fail have memory access violation and exception errors. On the home page, the system requirements for Rosetta/Ralph specify 512 MB of memory. While it is possible to run the project software with less than the specified minimum memory, that can cause work units to have errors of exactly the type you are having. I suspect that may be why you are seeing the errors. The work units that are having the errors are all of the larger work unit types, which also makes this look like a memory issue. Remember the work must be kept in memory between checkpoints, so it is possible that as time passes the system has less and less memory to work with until it can checkpoint. It looks like the work units that are failing are doing so before they finish the first model, but I am only guessing based on the work unit type and the CPU time reported, because there is incomplete error data on many of them. So you will have to see if that is the case by checking on your end. Do keep reporting the bad ones with a link, and I will try to get the project guys to look at your work unit errors. |
genes Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
I just posted this over at Rosetta: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1143 The machine in question has been failing Rosetta WUs, and just failed a Ralph WU the same way. I will try setting the run time lower to see if that helps. They all failed with (0xc000000d) errors. [edit] setting run time to 4 hours for now. (2 hours won't do too much on a P3.) [/edit] [edit] the failed WU: https://ralph.bakerlab.org/result.php?resultid=6153 [/edit] |
hob Joined: 17 Feb 06 Posts: 3 Credit: 33,852 RAC: 0 |
2 jobs failed today on an mp2200 dual processor machine; both had run for just over 8 hours. The machine is not used for anything else except DC work. From the messages view: 24/02/2006 08:59:39 AM|ralph@home|Unrecoverable error for result BARCODE_30_1a68__219_2_0 (<file_xfer_error> <file_name>BARCODE_30_1a68__219_2_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>) 24/02/2006 02:56:22 PM|ralph@home|Unrecoverable error for result BARCODE_30_5croA_219_3_0 (<file_xfer_error> <file_name>BARCODE_30_5croA_219_3_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>) Both seem to have failed with error code -131 (whatever that is). This machine has run Rosetta for over 2 months without producing any errors. |
Aglarond Joined: 16 Feb 06 Posts: 11 Credit: 1,094 RAC: 0 |
Hi, I found this on Rosetta and have already posted it on the Rosetta forums. I have a picture of the screensaver where the actual Accepted Energy is running somewhere above the box where it should remain. It is here: RosettaScreen01.jpg The last message in BOINC is: 18.2.2006 14:49:27|rosetta@home|Resuming task PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_1fna__311_26_0 using rosetta version 481 and it is Result ID 11551703 and Work unit ID 9376478. Hope it helps to repair it. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
stuck at 58.37% https://ralph.bakerlab.org/result.php?resultid=9280 load average: 0.00, 0.00, 0.02 crobertp [/home/boinc/BOINC] > ps xu USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND boinc 15528 0.0 1.2 5732 2996 ? S 14:21 0:05 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme boinc 27815 0.3 19.0 96728 47196 ? SN 16:02 1:41 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string boinc 27816 0.0 19.0 96728 47196 ? SN 16:02 0:00 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string boinc 27817 0.0 19.0 96728 47196 ? SN 16:02 0:00 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string boinc 16136 0.0 1.0 7204 2484 ? S 22:20 0:00 /usr/sbin/sshd boinc 16139 0.0 0.9 3476 2336 pts/0 S 22:20 0:00 -bash boinc 21561 0.0 0.2 2548 668 pts/0 R 23:44 0:00 ps xu crobertp [/home/boinc/BOINC] > Restarting boinc ... |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
"Access Violation - Rosetta 4.82" An access violation is a security issue; it has nothing to do with the amount of RAM that the PC has. It only means the program is trying to read/write pages that it does not own, like a pupil at school trying to write in another pupil's exercise book instead of their own. *The most probable cause is that the C libs of the PC are incompatible with the C libs used to compile the app. Statically compiled programs are not subject to incompatible C lib versions; however, they are somewhat bigger ... which would increase even more the need for RAM, or increase swap activity ... decreased performance ... less work done in the same time ... This is directly related to low RAM ... However, low RAM does *not* cause any errors! I will attach this PC to another project ... this is better than letting it continue producing wrong results for Rosetta. PS: none of the SIMAP WUs this PC crunched has any errors http://boinc.bio.wzw.tum.de/boincsimap/results.php?hostid=2396 however almost every WU it crunched for Rosetta has an "access violation" https://boinc.bakerlab.org/rosetta/results.php?hostid=118809 |
Moderator9 Volunteer moderator Joined: 16 Feb 06 Posts: 251 Credit: 0 RAC: 0 |
You are correct that it is a protection fault. But these can be caused by the system running out of legal memory locations inside the partitioned memory space for the application and attempting to write into a protected memory area. When the system detects this, the program fails with an access violation error. |
doc :) Joined: 16 Feb 06 Posts: 46 Credit: 4,437 RAC: 0 |
another one, can not answer if it was rosettas or my fault this time though. i had a 4.89 wu running close to completion when i started some game (a mod for enemy territory), minimized it while connecting to a server and then the game and the ralph wu crashed at the same time (it is very very rare for that game here to crash). never seen that error before, and now i am stuck at my daily quota again :) 25/02/2006 12:56:36|ralph@home|Unrecoverable error for result BARCODE_30_1c9oA_225_11_0 ( - exit code -529697949 (0xe06d7363)) WU - result |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
I too have experienced an error with 4.87 on the upload. Note: this was on my host 103, Celeron 500, 256 MB RAM, Win98SE, BOINC CC V5.3.22 2/25/06 8:45:00 AM||Rescheduling CPU: application exited 2/25/06 8:45:00 AM|ralph@home|Computation for task BARCODE_30_1ctf__219_2_0 finished 2/25/06 8:45:00 AM|ralph@home|Output file BARCODE_30_1ctf__219_2_0_0 for task BARCODE_30_1ctf__219_2_0 exceeds size limit. 2/25/06 8:45:00 AM|ralph@home|File size: 26504055.000000 bytes. Limit: 25000000.000000 bytes 2/25/06 8:45:03 AM|ralph@home|Starting task BARCODE_30_1iibA_219_2_0 using rosetta_beta version 487 2/25/06 8:45:03 AM|ralph@home|Unrecoverable error for result BARCODE_30_1ctf__219_2_0 (<file_xfer_error> <file_name>BARCODE_30_1ctf__219_2_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>) 2/25/06 8:45:03 AM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds 2/25/06 8:45:05 AM|ralph@home|Started upload of file BARCODE_30_1ctf__219_2_0_1 2/25/06 8:45:11 AM|ralph@home|Finished upload of file BARCODE_30_1ctf__219_2_0_1 2/25/06 8:45:11 AM|ralph@home|Throughput 20455 bytes/sec |
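
Reading the error lines quoted in this thread: the messages show each exit code as a signed decimal next to its hex form, and the last log reports an output file over its size limit. The following is only a rough Python sketch for checking both by hand, not project or BOINC code. The function names are made up for illustration, and the size-limit pattern assumes the message looks exactly like the "File size: ... Limit: ..." line quoted above.

```python
import re

def exit_code_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"

# Assumed format, copied from the log line quoted in this thread.
SIZE_RE = re.compile(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes")

def size_overage(line: str):
    """Return how many bytes the file exceeds the limit by, or None if no match."""
    match = SIZE_RE.search(line)
    if match is None:
        return None
    size, limit = (int(float(value)) for value in match.groups())
    return size - limit

if __name__ == "__main__":
    # Exit codes reported in the posts above.
    for code in (-1073741811, -529697949):
        print(code, "->", exit_code_hex(code))
    # Size-limit message from the last log.
    line = "File size: 26504055.000000 bytes. Limit: 25000000.000000 bytes"
    print("bytes over limit:", size_overage(line))
```

Run on the values above, this gives 0xc000000d and 0xe06d7363 for the two exit codes, and an overage of 1,504,055 bytes for the output file. On Windows, 0xc000000d is the status code documented as STATUS_INVALID_PARAMETER, and 0xe06d7363 is the code the Microsoft C++ runtime uses for an unhandled C++ exception; what triggered either one inside the application cannot be told from the code alone.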
©2024 University of Washington
http://www.bakerlab.org