Report - Previously Unclassified Work Unit Errors

Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors

Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 330 - Posted: 19 Feb 2006, 19:47:59 UTC
Last modified: 19 Feb 2006, 19:48:25 UTC

This thread is for reporting work unit errors that are NOT related to hangs at 1% or application swapping.
Moderator9
RALPH@home FAQs
RALPH@home Guidelines
Moderator Contact
ID: 330

Mathias Becher
Joined: 16 Feb 06
Posts: 1
Credit: 2,970
RAC: 0
Message 469 - Posted: 22 Feb 2006, 12:38:52 UTC

This WU:
https://ralph.bakerlab.org/result.php?resultid=5959

errored out when my router was reconnecting to the internet. (Here in Germany, DSL connections are reconnected every 24 hours.)

Hope that helps debugging this issue.

PS: I was looking at the screensaver when that happened
ID: 469

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 479 - Posted: 22 Feb 2006, 15:14:43 UTC

https://ralph.bakerlab.org/result.php?resultid=5734

Don't know what happened; I was busy doing other things.

The computer that ran this WU is a network server with no users, always on (24/365),
with no keyboard, mouse, or monitor.

*Internet link is eth1 (rtl 8139c) half duplex, 10mbit/s, mtu 1500,
operating at 2mbit/s, has syn cookies enabled, ecn disabled.
Click signature for global team stats
ID: 479

Rebirther
Joined: 17 Feb 06
Posts: 9
Credit: 1,491
RAC: 0
Message 487 - Posted: 22 Feb 2006, 16:51:40 UTC

The currently running "Barcode.xxx" WU reached 23% after 2h (Model 10) and has now fallen back to 9% (Model 11). What now?
ID: 487

Snake Doctor
Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 495 - Posted: 22 Feb 2006, 18:33:27 UTC - in response to Message 487.  

Did you change your time setting or something?

Regards
Phil

ID: 495

Rebirther
Joined: 17 Feb 06
Posts: 9
Credit: 1,491
RAC: 0
Message 496 - Posted: 22 Feb 2006, 18:36:26 UTC - in response to Message 495.  

Yes, from the default to 1 day, because it looks like it will take very long.
ID: 496

Rebirther
Joined: 17 Feb 06
Posts: 9
Credit: 1,491
RAC: 0
Message 501 - Posted: 22 Feb 2006, 19:42:46 UTC

The WU has now jumped from 11% to 100% after 4.5h. I don't know how; it only finished Model 15. I don't understand the time settings. Are they only for getting new WUs? Can anybody help me with this?
ID: 501

Snake Doctor
Joined: 16 Feb 06
Posts: 37
Credit: 998,880
RAC: 0
Message 502 - Posted: 22 Feb 2006, 19:48:55 UTC - in response to Message 501.  
Last modified: 22 Feb 2006, 19:57:53 UTC

The time setting is only used to determine how long a work unit will run. But the jump in % complete you have reported may be normal for the time setting you have. There is more information about this in the Rosetta FAQs here. It explains how the % could jump like you say it did.

Here is another one that describes exactly what you said happened.

It looks like you adjusted the time preference to less than the CPU time the work unit had already used, so when it finished the model it was working on, it jumped to 100% and reported. Here is the data from the work unit:

stderr out
<core_client_version>5.2.13</core_client_version>
<stderr_txt>
# random seed: 3993914
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 7200
# DONE :: 1 starting structures built 6 (nstruct) times
# This process generated 15 decoys from 15 attempts

</stderr_txt>

Regards
Phil
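
As a rough illustration of why the percentage moved the way it did, here is a minimal Python sketch. It assumes (nothing in this thread confirms it) that the progress shown is simply CPU time used divided by the run time preference, and that the application only checks the preference when a model finishes. The function names are made up for the example; the three preference values are the ones in the stderr output above.

```python
def percent_done(cpu_seconds, run_time_pref):
    """Assumed progress model: CPU time used as a share of the preferred run time."""
    return min(100.0, 100.0 * cpu_seconds / run_time_pref)


def stops_after_current_model(cpu_seconds, run_time_pref):
    """Assumed rule: once the preference is used up, the current model is the last."""
    return cpu_seconds >= run_time_pref


HOUR = 3600
# cpu_run_time_pref values from the stderr output: 28800 s, 86400 s, 7200 s
print(percent_done(2 * HOUR, 28800))    # about 2 h in, 8 h preference
print(percent_done(2 * HOUR, 86400))    # same CPU time, 1 day preference
print(percent_done(4.5 * HOUR, 7200))   # 4.5 h in, 2 h preference
print(stops_after_current_model(4.5 * HOUR, 7200))
```

Under those assumptions, two hours of CPU time is 25% of an 8-hour preference but only about 8% of a 1-day preference, which is roughly the fall from 23% to 9% reported earlier, and 4.5 hours against a 2-hour preference is already past 100%.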

ID: 502

Rebirther
Send message
Joined: 17 Feb 06
Posts: 9
Credit: 1,491
RAC: 0
Message 503 - Posted: 22 Feb 2006, 19:57:53 UTC - in response to Message 502.  

Thx, the second link answered my question. It is all a little bit confusing, but I learned something ;)

ID: 503

doc :)
Joined: 16 Feb 06
Posts: 46
Credit: 4,437
RAC: 0
Message 516 - Posted: 23 Feb 2006, 4:07:28 UTC

this WU, result
ended with the following error:
22/02/2006 19:51:37|ralph@home|Unrecoverable error for result BARCODE_30_1c8cA_215_24_0 ( - exit code -1073741811 (0xc000000d))
I had the graphics window open in the background (it was not getting removed from memory). I was browsing some websites when I noticed the graphics were gone and saw that error in my BOINC Manager. No clue what happened.
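
The negative exit code and the hex value in that message are the same 32-bit number; the signed form is simply printed first. A minimal Python sketch of the conversion follows (the function name is made up for illustration). On Windows, 0xC000000D is the NTSTATUS value STATUS_INVALID_PARAMETER, which by itself does not say which call received the bad parameter.

```python
def exit_code_to_hex(signed_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08x}".format(signed_code & 0xFFFFFFFF)


print(exit_code_to_hex(-1073741811))  # the exit code from the message above
```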
ID: 516

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 519 - Posted: 23 Feb 2006, 7:54:27 UTC
Last modified: 23 Feb 2006, 7:55:19 UTC

Access Violation - Rosetta 4.82
https://boinc.bakerlab.org/rosetta/result.php?resultid=11863895
This is the new Rosetta client for Windows, updated 18 Feb 2006.
ID: 519

Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 539 - Posted: 23 Feb 2006, 22:24:12 UTC - in response to Message 519.  

Carlos,

Looking at this, I see your system only has 256 MB of memory. When I look at your results, even many of those that did not actually fail show memory access violation and exception errors. The system requirements for Rosetta/Ralph on the home page call for 512 MB of memory.

While it is possible to run the project software with less than the specified minimum memory, that can cause work units to have errors of exactly the type you are having. I suspect that may be why you are seeing the errors.

The work units that are having the errors are all of the larger work unit types, which also makes this look like a memory issue. Remember that the work must be kept in memory between checkpoints, so it is possible that as time passes the system has less and less memory to work with until it can checkpoint.

It looks like the work units that are failing are doing so before they finish the first model, but I am only guessing based on the work unit type and the CPU time reported, because there is incomplete error data on many of them. So you will have to see if that is the case by checking on your end.

Do keep reporting the bad ones with a link, and I will try to get the project guys to look at your work unit errors.

ID: 539

genes
Joined: 16 Feb 06
Posts: 45
Credit: 43,706
RAC: 20
Message 554 - Posted: 24 Feb 2006, 3:26:38 UTC
Last modified: 24 Feb 2006, 3:34:24 UTC

I just posted this over at Rosetta:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1143

The machine in question has been failing Rosetta WUs, and just failed a Ralph WU the same way. I will try setting the run time lower to see if that helps.

They all failed with (0xc000000d) errors.
[edit]
setting run time to 4 hours for now. (2 hours won't do too much on a P3.)
[/edit]
[edit]
the failed WU:
https://ralph.bakerlab.org/result.php?resultid=6153
[/edit]
ID: 554

hob
Joined: 17 Feb 06
Posts: 3
Credit: 33,852
RAC: 0
Message 582 - Posted: 24 Feb 2006, 20:35:01 UTC

Two jobs failed today on an mp2200 dual-processor machine; both had run for just over 8 hours. The machine is not used for anything except DC work.

From the messages view:

24/02/2006 08:59:39 AM|ralph@home|Unrecoverable error for result BARCODE_30_1a68__219_2_0 (<file_xfer_error> <file_name>BARCODE_30_1a68__219_2_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>)




24/02/2006 02:56:22 PM|ralph@home|Unrecoverable error for result BARCODE_30_5croA_219_3_0 (<file_xfer_error> <file_name>BARCODE_30_5croA_219_3_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>)


Both seem to have failed with error code -131 (whatever that is). This machine has run Rosetta for over 2 months without producing any errors.
ID: 582

Aglarond
Joined: 16 Feb 06
Posts: 11
Credit: 1,094
RAC: 0
Message 596 - Posted: 25 Feb 2006, 1:16:49 UTC

Hi, I found this on Rosetta and have already posted it on the Rosetta forums:
Hi, I have a picture of the screensaver where the current Accepted Energy graph runs somewhere above the box where it should remain. It is here: RosettaScreen01.jpg
The last message in BOINC is:
18.2.2006 14:49:27|rosetta@home|Resuming task PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_1fna__311_26_0 using rosetta version 481
and it is Result ID 11551703
and Work unit ID 9376478
Hope it helps to repair it.
ID: 596

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 600 - Posted: 25 Feb 2006, 2:38:37 UTC

stuck at 58.37%
https://ralph.bakerlab.org/result.php?resultid=9280

load average: 0.00, 0.00, 0.02

crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 15528 0.0 1.2 5732 2996 ? S 14:21 0:05 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 27815 0.3 19.0 96728 47196 ? SN 16:02 1:41 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string
boinc 27816 0.0 19.0 96728 47196 ? SN 16:02 0:00 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string
boinc 27817 0.0 19.0 96728 47196 ? SN 16:02 0:00 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B -abrelax -string
boinc 16136 0.0 1.0 7204 2484 ? S 22:20 0:00 /usr/sbin/sshd
boinc 16139 0.0 0.9 3476 2336 pts/0 S 22:20 0:00 -bash
boinc 21561 0.0 0.2 2548 668 pts/0 R 23:44 0:00 ps xu
crobertp [/home/boinc/BOINC] >

Restarting boinc ...
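
To pick the Rosetta processes out of a listing like the one above without reading it by eye, a small Python sketch such as this could be used. It assumes the 11-column `ps xu` layout shown in the header row; the function name and the `rosetta` prefix filter are choices made for the example, and the sample is abridged from the listing above.

```python
def rosetta_processes(ps_output):
    """Return (pid, %cpu, state) for each process whose command starts with 'rosetta'."""
    found = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the 11th field keeps the whole command line
        if len(fields) == 11 and fields[10].startswith("rosetta"):
            found.append((int(fields[1]), float(fields[2]), fields[7]))
    return found


sample = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 15528 0.0 1.2 5732 2996 ? S 14:21 0:05 ./boinc -redirectio
boinc 27815 0.3 19.0 96728 47196 ? SN 16:02 1:41 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B
boinc 27816 0.0 19.0 96728 47196 ? SN 16:02 0:00 rosetta_beta_4.84_i686-pc-linux-gnu cc 1scj B
"""
for pid, cpu, state in rosetta_processes(sample):
    print(pid, cpu, state)
```

Note that %CPU in `ps` is averaged over the life of the process, so a low figure there only hints at a stall; the near-zero load average is the stronger sign.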

ID: 600

Carlos_Pfitzner
Joined: 16 Feb 06
Posts: 182
Credit: 22,792
RAC: 0
Message 607 - Posted: 25 Feb 2006, 6:54:40 UTC - in response to Message 539.  
Last modified: 25 Feb 2006, 7:03:52 UTC

An access violation is a security issue;
it has nothing to do with the amount of RAM the PC has.
It only means the program is trying to read or write pages that it does not own.
It is like a pupil at school trying to write in another pupil's exercise book
instead of writing in their own.

*The most probable cause is that the C libs on the PC are incompatible
with the C libs used to compile the app.


Statically compiled programs are not subject to incompatible C lib versions;
however, they are somewhat bigger,

... which would increase the need for RAM even more,
or increase swap activity ...

decreased performance ... less work done in the same time ...
That is directly related to having little RAM ...
However, little RAM does *not* cause any errors!

I will attach this PC to another project ...
That is better than letting it continue producing wrong results for Rosetta.

PS:
None of the SIMAP WUs this PC crunched has any errors:
http://boinc.bio.wzw.tum.de/boincsimap/results.php?hostid=2396
However, almost every WU it crunched for Rosetta has an "access violation":
https://boinc.bakerlab.org/rosetta/results.php?hostid=118809
ID: 607

Moderator9
Volunteer moderator
Joined: 16 Feb 06
Posts: 251
Credit: 0
RAC: 0
Message 610 - Posted: 25 Feb 2006, 7:30:48 UTC



You are correct that it is a protection fault. But these can be caused by the system running out of legal memory locations inside the partitioned memory space for the application and then attempting to write into a protected memory area. When the system detects this, the program fails with an access violation error.

ID: 610

doc :)
Joined: 16 Feb 06
Posts: 46
Credit: 4,437
RAC: 0
Message 619 - Posted: 25 Feb 2006, 12:43:41 UTC

Another one, though I cannot say whether it was Rosetta's fault or mine this time. I had a 4.89 WU running close to completion when I started a game (a mod for Enemy Territory) and minimized it while connecting to a server; then the game and the Ralph WU crashed at the same time (it is very, very rare for that game to crash here).
I have never seen that error before, and now I am stuck at my daily quota again :)

25/02/2006 12:56:36|ralph@home|Unrecoverable error for result BARCODE_30_1c9oA_225_11_0 ( - exit code -529697949 (0xe06d7363))

WU - result
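
For what it's worth, 0xe06d7363 is the code the Microsoft Visual C++ runtime uses when it raises a C++ exception: the low three bytes spell "msc" in ASCII. So this exit code most likely means a C++ exception went unhandled somewhere in the application, though it does not say which one. A minimal Python sketch of the decoding (the function name is made up for illustration):

```python
def low_bytes_as_ascii(signed_code):
    """Return the low three bytes of a 32-bit exit code decoded as ASCII text."""
    unsigned = signed_code & 0xFFFFFFFF
    return (unsigned & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")


code = -529697949  # the exit code from the message above
print(hex(code & 0xFFFFFFFF), low_bytes_as_ascii(code))
```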
ID: 619

Astro
Joined: 16 Feb 06
Posts: 141
Credit: 32,977
RAC: 0
Message 621 - Posted: 25 Feb 2006, 14:03:44 UTC

I too have experienced an error with 4.87 on upload. Note: this was on my host 103 (Celeron 500, 256 MB RAM, Win98SE, BOINC CC v5.3.22).

2/25/06 8:45:00 AM||Rescheduling CPU: application exited
2/25/06 8:45:00 AM|ralph@home|Computation for task BARCODE_30_1ctf__219_2_0 finished
2/25/06 8:45:00 AM|ralph@home|Output file BARCODE_30_1ctf__219_2_0_0 for task BARCODE_30_1ctf__219_2_0 exceeds size limit.
2/25/06 8:45:00 AM|ralph@home|File size: 26504055.000000 bytes. Limit: 25000000.000000 bytes
2/25/06 8:45:03 AM|ralph@home|Starting task BARCODE_30_1iibA_219_2_0 using rosetta_beta version 487
2/25/06 8:45:03 AM|ralph@home|Unrecoverable error for result BARCODE_30_1ctf__219_2_0 (<file_xfer_error> <file_name>BARCODE_30_1ctf__219_2_0_0</file_name> <error_code>-131</error_code> <error_message></error_message></file_xfer_error>)
2/25/06 8:45:03 AM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
2/25/06 8:45:05 AM|ralph@home|Started upload of file BARCODE_30_1ctf__219_2_0_1
2/25/06 8:45:11 AM|ralph@home|Finished upload of file BARCODE_30_1ctf__219_2_0_1
2/25/06 8:45:11 AM|ralph@home|Throughput 20455 bytes/sec
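
The two "size limit" lines in this log give the numbers behind the failure: the output file was larger than the maximum size allowed for it, so it was rejected. To my knowledge, error -131 is named ERR_FILE_TOO_BIG in the BOINC source, which fits. A minimal Python sketch that pulls the sizes out of the message text and works out the overshoot; the regular expression and function name are assumptions made for this example, not anything from BOINC itself:

```python
import re


def size_overshoot(message):
    """Return (size, limit, bytes over the limit) parsed from a BOINC size message."""
    match = re.search(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes", message)
    if match is None:
        return None
    size, limit = (int(float(value)) for value in match.groups())
    return size, limit, size - limit


line = "File size: 26504055.000000 bytes. Limit: 25000000.000000 bytes"
print(size_overshoot(line))
```

Here the file was about 1.5 MB (roughly 6%) over the 25,000,000-byte limit.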

Formerly mmciastro. Name and avatar changed for a change.

The New Online Helpsystem: help is just a call away.
ID: 621



©2024 University of Washington
http://www.bakerlab.org