Release 2.8.5

SeeCwriter
Posts: 605
Joined: Mon May 12, 2008 10:55 am

Release 2.8.5

Post by SeeCwriter »

This post is mainly informational, and to see whether anyone else has experienced something similar with v2.8.5 of the NNDK. I know it's early, but surely others are using this version.

Around June 6 I upgraded the development suite from v2.8.3 to v2.8.4. v2.8.4 turned out to have a semaphore issue, which was fixed in v2.8.5; I installed that in early August. About two weeks ago, during system testing, we noticed that the MOD5441X modules would randomly crash and reboot about once a day. Sometimes it took two days or more, but most units crashed at least once in any 24-hour period. The crash occurred on all units (we had four running in the same system), at random times. Most would reboot, as is typical of the modules when they crash, but at least one would always crash into the alternate boot monitor and stay there until it was power-cycled.

My initial reaction was that I had introduced a bug into my application, and that has not been ruled out yet. I went through the changes I made, several times, and was unable to find anything obvious. (It's always "not obvious" until you discover the bug.) I used the WinAddr2line utility to try to determine where in the code it was crashing. Every crash was at a different location but in the same thread (Main), and was one of three errors: access error, invalid opcode, or unknown opcode. Only one program counter value resolved to a line of code, in file __strtod__, which I'm guessing is being called from a printf statement somewhere. The rest returned "??".

Three days ago I recompiled my application with v2.8.3 and installed it on nine units. So far not a single unit has crashed or rebooted. I am not declaring victory yet: we are going to keep running these units into next week, and install the firmware on additional units for more testing.

I'm willing to concede that the problem is my code, and that perhaps the different compiler version has just changed the program layout enough to avoid a crash.

I am open to suggestions on how to troubleshoot this.
sulliwk06
Posts: 118
Joined: Tue Sep 17, 2013 7:14 am

Re: Release 2.8.5

Post by sulliwk06 »

If walking through the program counters on the task stack where the trap occurred doesn't point you to one of your own functions, I have sometimes found that I simply need to walk further up the stack. I actually moved the smarttrap functions from the library code into my own code so that I could change how deep into the stack they look. It's also a good way to add your own diagnostic information to the trap printouts. I've added numerous trap handlers to each section of my code so that when an error occurs I know exactly what was going on and what went wrong.
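
As a rough illustration of "walking further up the stack", the sketch below follows a chain of saved frame pointers to a caller-chosen depth. It is a simulation only: `Frame` and `WalkStack` are hypothetical names, not the NetBurner smarttrap code, and on the real target each frame pointer would have to be range-checked before it is dereferenced.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame record: the saved previous frame pointer followed by the
// return address, which is the layout a LINK A6 prologue leaves on the stack.
struct Frame {
    const Frame* prev;        // caller's frame, nullptr at the top of the stack
    std::uint32_t return_pc;  // address execution resumes at in the caller
};

// Collect up to max_depth return addresses, starting at the innermost frame.
std::vector<std::uint32_t> WalkStack(const Frame* fp, std::size_t max_depth) {
    std::vector<std::uint32_t> pcs;
    while (fp != nullptr && pcs.size() < max_depth) {
        pcs.push_back(fp->return_pc);
        fp = fp->prev;
    }
    return pcs;
}
```

Raising `max_depth` is the whole point here: a shallow walk can stop inside library code before it ever reaches the application function that made the call.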
TomNB
Posts: 538
Joined: Tue May 10, 2016 8:22 am

Re: Release 2.8.5

Post by TomNB »

A couple of other things to try:

Does it behave the same way with a debug build? Debug builds turn optimization off.

You can try turning off code stripping in the project (gcc calls it garbage collection). It's in Properties, C/C++ build, settings, GNU C/C++ linker, optimization (I know, a long way down).

Have you tried turning on stack checking?
SeeCwriter
Posts: 605
Joined: Mon May 12, 2008 10:55 am

Re: Release 2.8.5

Post by SeeCwriter »

Most of the units that I ran over the weekend rebooted multiple times a day all weekend, beginning about 9pm Friday night, after running for three days with no issues. Go figure. In any case, that points to my code rather than the NNDK.

The errors were the same (2, 3, & 4), with a new one, "Unimplemented line-f opcode (11)".

Again, using WinAddr2line returned nothing.

I tried turning off code stripping, but I get a bunch of linker errors for functions I'm not using, so I turned it on again.

I thought I had SmartTraps enabled, but I didn't. So I will make a build with that enabled. If there's another method of stack checking, I'm not aware of it.
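
One generic form of stack checking that does not depend on any particular library is the high-water-mark trick: fill the task's stack with a known pattern before the task starts, then count how much of the pattern is still untouched. The sketch below assumes you own the stack array and that the stack grows downward from the end of it; `kStackFill`, `PaintStack`, and `UnusedStackWords` are hypothetical names, not NNDK functions.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kStackFill = 0xDEADBEEF;  // arbitrary pattern (assumption)

// Paint the whole stack array before the task is created.
void PaintStack(std::uint32_t* stack, std::size_t words) {
    for (std::size_t i = 0; i < words; ++i) stack[i] = kStackFill;
}

// The stack grows downward from the end of the array, so untouched words sit
// at the low indices. Count them to see how much headroom was never used.
std::size_t UnusedStackWords(const std::uint32_t* stack, std::size_t words) {
    std::size_t unused = 0;
    while (unused < words && stack[unused] == kStackFill) ++unused;
    return unused;
}
```

A result of zero means the task has reached, or possibly gone past, the bottom of its stack at some point since it was painted.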
SeeCwriter
Posts: 605
Joined: Mon May 12, 2008 10:55 am

Re: Release 2.8.5

Post by SeeCwriter »

I rebuilt my application with SmartTraps enabled and loaded it. The application just crashed and the crash dump on MTTY doesn't look any different than what it was before I enabled SmartTraps. So where is this additional information I'm supposed to get? And WinAddr2line continues to report nothing.

Here is the crash dump:

-------------------Trap information-----------------------------
Exception Frame/A7 =80002be8
Trap Vector =Illegal Instruction (4)
Format =04
Status register SR =2000
Fault Status =00
Faulted PC =40045df8

-------------------Register information-------------------------
A0=80000c92 A1=80000a24 A2=4032f854 A3=400157b4
A4=40015b34 A5=40045978 A6=80002c18 A7=80002be8
D0=00000009 D1=00000010 D2=00000001 D3=00000002
D4=40063390 D5=4032d8fe D6=40034d3c D7=4004d750
SR=2000 PC=40045df8
-------------------RTOS information-----------------------------
The OSTCBCur current task control block = 80000a24
This looks like a valid TCB
The current running task is: Main#32
-------------------Task information-----------------------------
Task | State |Wait| Call Stack
Idle#3f|Ready | |40061baa,40061874,0
Main#32|Running | |40045df8,40016952,40061874,0
TCPD#28|Semaphore |0008|40062562,4006ee86,40061874,0
IP#27|Fifo |0002|40062958,400651de,40061874,0
Enet#26|Fifo |0028|40062958,40058de4,40061874,0
HTTP#2d|Semaphore |000a|40062562,400727e6,40070ca8,40061874,0
User,#2f|Fifo |FRVR|40062958,40067a1a,40080c9c,40061874,0
User,#30|Fifo |FRVR|40062958,40067a1a,4003b1e4,40061874,0
FTPD#2e|Semaphore |0014|40062562,400727e6,4006ac14,40061874,0
User,#31|Timer |247f|40061f82,40049d28,40061874,0

-------------------End of Trap Diagnostics----------------------
sulliwk06
Posts: 118
Joined: Tue Sep 17, 2013 7:14 am

Re: Release 2.8.5

Post by sulliwk06 »

So the address highlighted below doesn't point to anything?

Main#32|Running | |40045df8,40016952,40061874,0
SeeCwriter
Posts: 605
Joined: Mon May 12, 2008 10:55 am

Re: Release 2.8.5

Post by SeeCwriter »

Instead of just putting the Faulted PC value in the Address box of WinAddr2Line, I put in the entire call stack list for Main#32 from the "Task information" section of the crash dump in my post above. This is what I get now:

SerialInterface()
C:\Projects\nburn\O2\Release/..\Protocol.cpp:335
UserMain
C:\Projects\nburn\O2\Release/..\main.cpp:968
TopOfStackKillfunction()
C:\nburn\system/ucosmcfc.cpp:67
??
??:0
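
Pulling the addresses out of a "Task information" row by hand gets tedious across many dumps. A minimal sketch of doing it in code, assuming the row format shown in the crash dump above (`ParseCallStack` is a hypothetical helper, and a malformed field will make `std::stoul` throw):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Extract the comma-separated hex addresses from the last '|' field of a row
// such as "Main#32|Running | |40045df8,40016952,40061874,0".
std::vector<std::uint32_t> ParseCallStack(const std::string& row) {
    std::vector<std::uint32_t> addrs;
    const std::size_t bar = row.rfind('|');
    std::size_t pos = (bar == std::string::npos) ? 0 : bar + 1;
    while (pos < row.size()) {
        std::size_t comma = row.find(',', pos);
        if (comma == std::string::npos) comma = row.size();
        const std::string field = row.substr(pos, comma - pos);
        if (!field.empty()) {
            const auto value =
                static_cast<std::uint32_t>(std::stoul(field, nullptr, 16));
            if (value != 0) addrs.push_back(value);  // the trailing 0 ends the list
        }
        pos = comma + 1;
    }
    return addrs;
}
```

The resulting addresses are in innermost-first order, the same order the lookup output is listed in.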

If I'm reading it right, the crash supposedly occurred at line 335 of file Protocol.cpp. Here is that line of code:

  sockets_available = TOTAL_PROTOCOL_LINKS - FIRST_TCP_LINK - sockets_connected;
Where:

#define TOTAL_PROTOCOL_LINKS  13
#define FIRST_TCP_LINK  4

int sockets_available, sockets_connected;
Note that there was no TCP communication with the unit; the only communication was through the unit's web page.

In any case, how could integer math, regardless of the values of the parameters, and assigning the result to an integer, cause a crash?

Here is the section of code to give some context to line 335:

  // Count active sockets, dump the idle ones.
  sockets_connected=0;
  for ( int i = FIRST_TCP_LINK; i < TOTAL_PROTOCOL_LINKS; i++ )
    if ( mandc[i].port < 0 ) continue; // open slot.
    else if ( mandc[i].idle->Expired() ) CloseSocket(i); // close idle tcp sockets.
    else ++sockets_connected;  // track used sockets

  sockets_available = TOTAL_PROTOCOL_LINKS - FIRST_TCP_LINK - sockets_connected;

  // Was full but slots freed?  Start listening.
  if ( (listen_fd < 0) && (sockets_available > 0) )
    {
    listen_fd = listen( INADDR_ANY, 2000, 1 );
    return;
    }
Since the Faulted PC value is never the same, I don't believe the above is the actual crash location.
SeeCwriter
Posts: 605
Joined: Mon May 12, 2008 10:55 am

Re: Release 2.8.5

Post by SeeCwriter »

Here's the next crash dump. This time WinAddr2Line gives the following:

??
??:0
UserMain
C:\Projects\nburn\O2\Release/..\main.cpp:968
TopOfStackKillfunction()
C:\nburn\system/ucosmcfc.cpp:67
??
??:0

So, is it line 968 that is crashing? That's a function call, which, if true, suggests the previous function stomped on the stack.

    ...
    SerialInterface();
    CheckPendingUdpPackets();  // <-- line 968
    ...
Notice that function SerialInterface is the function that was first in the previous readout of WinAddr2Line.


-------------------Trap information-----------------------------
Exception Frame/A7 =80002be8
Trap Vector =Unimplemented line-f opcode (11)
Format =04
Status register SR =2000
Fault Status =00
Faulted PC =0b040bf9

-------------------Register information-------------------------
A0=80000c92 A1=80000a24 A2=4032f854 A3=400157b4
A4=40015b34 A5=40045978 A6=80002c18 A7=80002be8
D0=00000009 D1=00000010 D2=00000001 D3=00000002
D4=40063390 D5=4032d8fe D6=40034d3c D7=4004d750
SR=2000 PC=0b040bf9
-------------------RTOS information-----------------------------
The OSTCBCur current task control block = 80000a24
This looks like a valid TCB
The current running task is: Main#32
-------------------Task information-----------------------------
Task | State |Wait| Call Stack
Idle#3f|Ready | |40061bb2,40061874,0
Main#32|Running | |0b040bf9,40016952,40061874,0
TCPD#28|Semaphore |009c|40062562,4006ee86,40061874,0
IP#27|Fifo |0004|40062958,400651de,40061874,0
Enet#26|Fifo |0027|40062958,40058de4,40061874,0
HTTP#2d|Semaphore |0010|40062562,400727e6,40070ca8,40061874,0
User,#2f|Fifo |FRVR|40062958,40067a1a,40080c9c,40061874,0
User,#30|Fifo |FRVR|40062958,40067a1a,4003b1e4,40061874,0
FTPD#2e|Semaphore |000a|40062562,400727e6,4006ac14,40061874,0
User,#31|Timer |1e2e|40061f82,40049d28,40061874,0

-------------------End of Trap Diagnostics----------------------
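
A quick sanity check on this dump: the faulted PC 0b040bf9 is an odd address and is nowhere near the other call-stack addresses, which is consistent with execution having jumped through a corrupted return address rather than with a real crash site. A minimal sketch of that check follows; the code bounds are assumptions picked only because the valid call-stack addresses in these dumps fall inside them, and the real values should come from the map file.

```cpp
#include <cstdint>

// Assumed bounds of the application's code (hypothetical values, see above).
constexpr std::uint32_t kCodeStart = 0x40000000;
constexpr std::uint32_t kCodeEnd   = 0x40100000;

// ColdFire instructions are 16-bit aligned, so an odd PC can never be valid.
bool IsPlausiblePc(std::uint32_t pc) {
    const bool aligned = (pc & 1u) == 0;
    const bool in_code = pc >= kCodeStart && pc < kCodeEnd;
    return aligned && in_code;
}
```

With these bounds, 40045df8 from the earlier dump passes both tests and 0b040bf9 fails both.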
sulliwk06
Posts: 118
Joined: Tue Sep 17, 2013 7:14 am

Re: Release 2.8.5

Post by sulliwk06 »

Typically when I have a crash where the faulted program counter is invalid, it's either because of buffer overflow or mis-handling a pointer, which can do things like mess with your stack. So I would look in your CheckPendingUdpPackets or SerialInterface functions for something like that.
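
For the buffer-overflow case, the usual defensive pattern is to clamp every copy to the destination size and report when something was dropped. A minimal sketch, with `SafeCopy` as a hypothetical helper rather than anything from the NNDK:

```cpp
#include <cstddef>
#include <cstring>

// Copy at most dst_size bytes and report whether the source fit, instead of
// letting an oversized message run past the end of the buffer.
bool SafeCopy(char* dst, std::size_t dst_size,
              const char* src, std::size_t src_len) {
    const std::size_t n = (src_len < dst_size) ? src_len : dst_size;
    std::memcpy(dst, src, n);
    return src_len <= dst_size;  // false means data was truncated
}
```

Logging every `false` return is often enough to reveal which input is larger than the code expected.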
ecasey
Posts: 164
Joined: Sat Mar 26, 2011 9:34 pm

Re: Release 2.8.5

Post by ecasey »

I had a similar problem when going from 2.81 or 2.82 to 2.83. A program that ran perfectly on the earlier version crashed periodically, though sometimes it ran for days. It turned out to be a buffer overflow that was latent in the earlier version because it overflowed into another buffer space with no consequence. In 2.83, the buffers did not align the same way and the overflow was trapped. The trap appeared as an illegal instruction or bad interrupt vector. The buffer was completely unrelated to the code where the trap happened, which made it hard to debug. I spent a lot of time chasing dead ends because I focused on the code at the traps.
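
Because an overflow like this shows up far from the code that caused it, one way to narrow it down is to bracket suspect buffers with guard words and check them at known points. The sketch below is a generic illustration with hypothetical names (`kGuard`, `GuardedBuffer`); choosing N as a multiple of 4 avoids padding between the data and the trailing guard.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kGuard = 0xA5A5A5A5;  // arbitrary marker (assumption)

// Hypothetical wrapper that brackets a buffer with guard words so that an
// overrun or underrun can be noticed before it shows up as an unrelated trap.
template <std::size_t N>
struct GuardedBuffer {
    std::uint32_t front = kGuard;
    char data[N] = {};
    std::uint32_t back = kGuard;

    // True while neither guard word has been overwritten.
    bool Intact() const { return front == kGuard && back == kGuard; }
};
```

Calling `Intact()` after each function that writes to the buffer points at the offending call, rather than at wherever the damage later causes a trap.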