Problem with network occasionally failing (MOD 5270)
-
- Posts: 9
- Joined: Mon Dec 15, 2008 11:05 pm
Hi,
We have an application in a solar power plant with around 150 Netburner Mod5270s installed on a network.
Every so often, unpredictably, but maybe two or three times a week, one of the units will stop responding to Network traffic. The only way to restore communication is to restart the unit. When the unit is restarted, everything returns to normal.
During the outage, the unit continues to function in other ways and responds to communication on the serial port, for example. (Normally, there is no serial port in our hardware, but we have installed a development board on site, and I was lucky enough to catch one). The link lights on the network socket stay on, and the unit can still detect the link, and its absence if the cable is removed.
Because the unit stops communicating, it is very difficult to gather more information about what is happening! I have also been unable to reproduce the condition in a controlled situation with a small number of Netburners in a lab network.
I suspect that a rare event, such as a particular type of packet, packet fragmentation, or packet error, is causing this problem. It does seem to happen more frequently at times when there would be increased network traffic. It also seems to happen more frequently to units connected with longer cables (but that may be because there are more of them).
I think that the problem is in receiving rather than transmitting frames (the rx_frames counter stops incrementing but the tx_frames counter goes up). Once I noticed that the number of free buffers while the error was occurring was higher than it normally is (but that may have been because I was communicating via serial instead of via IP).
The units are regularly polled on TCP (using a Modbus protocol), have HTTP interfaces, use UDP for logging, and also use NTP to synchronise clocks.
Does anyone have any ideas what this problem might be or what to look for?
Thanks,
Matthew
- Chris Ruff
- Posts: 222
- Joined: Thu Apr 24, 2008 4:09 pm
- Location: topsail island, nc
- Contact:
Re: Problem with network occasionally failing (MOD 5270)
do ALL units occasionally experience this failure? Or is it a certain set of devices?
does the NB know it has lost comm?
you could have the 5270 DO something when the comm is lost, like toggle an LED or transmit a UDP packet (assuming that it still can) once a second to some diagnostic IP address, so that you could catch the NB in the broken state for further diagnostics (try to connect to a/the listening socket, etc.)
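For the unit to "know" it has lost comm, one option is to watch the frame counters mentioned in the first post (rx_frames frozen while tx_frames keeps climbing). The following is only a minimal sketch in plain C, assuming the application can sample both counters at a fixed interval; the type and function names are hypothetical and are not part of the NetBurner API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receive-stall detector; not a NetBurner type. */
typedef struct {
    uint32_t last_rx;          /* rx frame counter at the previous sample */
    uint32_t last_tx;          /* tx frame counter at the previous sample */
    uint32_t stalled_samples;  /* consecutive samples with tx progress but no rx */
    uint32_t threshold;        /* samples needed before declaring a stall (>= 1) */
} rx_stall_detector;

void rx_stall_init(rx_stall_detector *d, uint32_t rx, uint32_t tx,
                   uint32_t threshold)
{
    d->last_rx = rx;
    d->last_tx = tx;
    d->stalled_samples = 0;
    d->threshold = threshold;
}

/* Feed one sample of both counters. Returns true once `threshold`
 * consecutive samples showed tx advancing while rx stayed frozen. */
bool rx_stall_sample(rx_stall_detector *d, uint32_t rx, uint32_t tx)
{
    bool rx_moved = (rx != d->last_rx);
    bool tx_moved = (tx != d->last_tx);

    if (rx_moved) {
        d->stalled_samples = 0;   /* traffic is arriving again */
    } else if (tx_moved) {
        d->stalled_samples++;     /* we transmit but hear nothing */
    }
    /* Neither counter moved: the link is idle, so leave the count alone. */

    d->last_rx = rx;
    d->last_tx = tx;
    return d->stalled_samples >= d->threshold;
}
```

When rx_stall_sample() returns true, the application could toggle the LED, print a message on the serial port, or attempt the diagnostic UDP packet. The threshold has to be chosen against the polling interval of the plant network, which is not known here.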
If it is all units occasionally failing, the problem would be one or more of (in order of likelihood):
-your code
-the switch
-nb code
-nb hardware/power events/overheating
If only certain units do it:
-the switch
-nb hardware/EMI events/power events/overheating
-your code (there is something different about the messaging of these units)
-nb code
Good Hunting,
Chris
Real Programmers don't comment their code. If it was hard to write, it should be hard to understand
Re: Problem with network occasionally failing (MOD 5270)
Possibly some logic in your code is not freeing sockets under certain conditions, so that you eventually run out of sockets? The free buffer check was a good idea, but you could still have plenty of buffers and be unable to accept connections due to a half-open socket type issue.
Re: Problem with network occasionally failing (MOD 5270)
we experience the same type of problem, but we're using the 5282,
though adding 5270 devices as i type.
i'll be following this link and hopefully you'll find and post the
magic bullet that solved your problem.
what version of the nndk are you using?
if you know someone should be talking to you and you don't
get anything after awhile, could you kill the stack and then
start it again?
how about pinging yourself? normal ip address and also
127.0.0.1? any luck with that?
Re: Problem with network occasionally failing (MOD 5270)
Are you using ReadWithTimeout()? If so, are you getting 0 bytes every time it times out, or has something happened and the loop has died? If you are doing a Read that waits forever, I would suggest changing to ReadWithTimeout so that you can get better diagnostic info. You could then add a diagnostic page to the web server and just do an Ajax request to get this info on a periodic basis. This could all be done in situ and should have minimal performance impact when the diagnostic web page isn't displayed.
Also does the incoming data get passed off to another task? Are you communicating via Mailboxes and Semaphores? Do you check all the return codes for OS_NO_ERR?
Re: Problem with network occasionally failing (MOD 5270)
This can be a symptom of several problems...
The debugging flow chart should look like:
Can you still find the devices with IP setup or do they respond to a ping?
If the answer is no then you either have run out of buffers or have a dead Ethernet or IP task.
Report the number of buffers
#include <buffers.h>
report this value.... GetFreeCount();
If you're having this sort of problem, the count will be decreasing toward zero long before the world hangs up.
The most common error eating buffers is to set up a UDP listening socket with a FIFO and then never read from it.
It accumulates packets waiting to be read until there are none left.
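A slow leak like this is easier to spot as a trend than as a single reading. Below is a minimal sketch in plain C, assuming the application samples GetFreeCount() periodically and passes the value in; the struct, the function names, and the run_limit and min_free parameters are hypothetical, not NetBurner API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker for the free-buffer count; not a NetBurner type. */
typedef struct {
    uint16_t low_water;    /* lowest free count seen so far */
    uint16_t previous;     /* free count at the previous sample */
    uint32_t falling_run;  /* drops seen since the count last rose */
} buffer_trend;

void buffer_trend_init(buffer_trend *t, uint16_t first_sample)
{
    t->low_water = first_sample;
    t->previous = first_sample;
    t->falling_run = 0;
}

/* Feed one sample (in the application, the value of GetFreeCount()).
 * Returns true if the count has dropped run_limit times without ever
 * recovering, or if it is already below min_free. */
bool buffer_trend_sample(buffer_trend *t, uint16_t free_count,
                         uint32_t run_limit, uint16_t min_free)
{
    if (free_count < t->previous) {
        t->falling_run++;          /* another step down */
    } else if (free_count > t->previous) {
        t->falling_run = 0;        /* buffers came back: not a steady leak */
    }
    /* Equal to the previous sample: a plateau, keep the current run. */

    if (free_count < t->low_water) {
        t->low_water = free_count;
    }
    t->previous = free_count;

    return t->falling_run >= run_limit || free_count < min_free;
}
```

Normal traffic makes the count fluctuate, so run_limit and min_free would need tuning on a healthy unit first. The low_water field is worth reporting on a diagnostic page even when no alarm fires.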
If you aren't losing buffers and have an Ethernet link light, then it's possible you had a warm start that
put the Ethernet PHY in a bad state. This was a bug that has been fixed in the latest release.
The bug was such that sometimes after a warm reboot Ethernet comm would not work.
The other side is that if you can still ping or find the device with IPSetup, then the problem is at the application level;
it probably means you're out of TCP connections because you forgot to close (32 times), or your code is not doing what you expect.
The good news is that if IPSETUP works you can use taskscan to figure out what is going on internally.
Hope that helps.
Re: Problem with network occasionally failing (MOD 5270)
in our case, we can't ping the hung unit and ipsetup doesn't see it.
i believe the link light is on though it is rare so i don't see it much.
we were using 1.99 and 2.2 when we've seen the problem. we will be
shipping with 2.4 soon - i looked at the phy changes - thanks. (it
got me to take more control of a marvell gige phy on our board
rather than letting it just do its own thing.)
every time i've looked at the buffer count it has been ok but haven't seen it
when the hang has been reported. maybe i'll periodically look at the
counters and save them to flash for a possible post mortem.
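The periodic-snapshot idea could be sketched as a small ring buffer that always holds the most recent samples. This is plain C with hypothetical names throughout; writing the log to flash is platform-specific and deliberately left out, and the fields are only a guess at which counters would be useful.

```c
#include <stddef.h>
#include <stdint.h>

#define SNAPSHOT_SLOTS 8   /* arbitrary size chosen for this sketch */

/* One periodic sample of the counters worth keeping for a post mortem. */
typedef struct {
    uint32_t tick;          /* time of the sample, in whatever unit is handy */
    uint32_t rx_frames;
    uint32_t tx_frames;
    uint16_t free_buffers;
} snapshot;

/* Fixed-size ring: once full, each new snapshot overwrites the oldest. */
typedef struct {
    snapshot slots[SNAPSHOT_SLOTS];
    size_t next;    /* slot that the next push will write */
    size_t count;   /* number of valid snapshots, at most SNAPSHOT_SLOTS */
} snapshot_log;

void snapshot_log_init(snapshot_log *log)
{
    log->next = 0;
    log->count = 0;
}

void snapshot_log_push(snapshot_log *log, snapshot s)
{
    log->slots[log->next] = s;
    log->next = (log->next + 1) % SNAPSHOT_SLOTS;
    if (log->count < SNAPSHOT_SLOTS) {
        log->count++;
    }
}

/* Index 0 is the oldest retained snapshot. Returns NULL if out of range. */
const snapshot *snapshot_log_get(const snapshot_log *log, size_t index)
{
    if (index >= log->count) {
        return NULL;
    }
    size_t start = (log->next + SNAPSHOT_SLOTS - log->count) % SNAPSHOT_SLOTS;
    return &log->slots[(start + index) % SNAPSHOT_SLOTS];
}
```

Since the serial port still works during a hang, dumping the ring over serial may be enough. Copying it to flash only when a hang is suspected would also limit flash wear.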
paul, would the killstack() and startstack() get everything going again if we
"have a dead Ethernet or IP task". actually for our apps, doing a warm
start isn't too bad of a corrective action if we can determine that it is our
problem and not just nothing talking to us.
Re: Problem with network occasionally failing (MOD 5270)
Rev 2.2 has the phy hang up problem where sometimes the phy comes up in isolation mode on warm starts...
It only happens after warm starts such as AutoUpdate or IPSetup changes, or possibly when reset is pulled but the power is not cycled.
If you can't ping and the buffer count is OK (I'd still double-check that you have no unread UDP FIFOs),
then it's either the Ethernet PHY going into isolation mode or some kind of software fault.
This could be a stack overflow, a bad pointer, some kind of trap, etc.
I would not use stop stack/start stack (it does not get exercised a lot).
You seemed to indicate that the serial port and other functions were running OK, so assuming the buffer count is OK,
that sort of narrows it down to the Ethernet chip being in isolation mode, or something in your code calling USER_ENTER_CRITICAL or OSLOCK and never calling USER_EXIT_CRITICAL or OSUNLOCK.
Paul
-
- Posts: 9
- Joined: Mon Dec 15, 2008 11:05 pm
Re: Problem with network occasionally failing (MOD 5270)
Thanks all for your suggestions, I will look into them.
For the record, we are using Rev 2.2 so we will upgrade to 2.4 ASAP.
When hung, the unit responds to neither pings nor Autosetup, nor can it ping other addresses.
It cannot open a TCP socket to another unit.
I have not tried pinging itself but that is a good suggestion and I will try it if I get the chance.
Is there a way to test if the ethernet phy is in isolation mode? (I don't actually know what that means, but I will look it up).
Matthew
Re: Problem with network occasionally failing (MOD 5270)
If you look in the latest Ethernet drivers...
You will find...
phy_data = GetMII ( PHY_addr, 0x0 ); // read PHY control register 0
if ( (phy_data & 0x400) != 0 ) // if PHY is in isolation mode, restart auto-negotiation
{
    SetMII ( PHY_addr, 0x0, 0x3900 ); // set PHY to power down
    OSTimeDly (2);
    SetMII ( PHY_addr, 0x0, 0x3300 ); // re-auto-negotiate
}
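The magic numbers in that snippet are bit fields of MII register 0, the PHY control register, assuming the PHY follows the standard IEEE 802.3 clause 22 layout. The Isolate bit (0x0400) tells the PHY to disconnect its data path from the MAC, which would fit the symptom of a link light staying on while no frames arrive. The sketch below is plain C with hypothetical helper names, not code from the NetBurner driver, and only decodes a register value that has already been read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of MII register 0 (control register), standard clause 22 layout. */
#define BMCR_RESTART_AUTONEG 0x0200u  /* bit 9: restart auto-negotiation */
#define BMCR_ISOLATE         0x0400u  /* bit 10: isolate PHY from the MII */
#define BMCR_POWER_DOWN      0x0800u  /* bit 11: power down */
#define BMCR_AUTONEG_ENABLE  0x1000u  /* bit 12: auto-negotiation enabled */

/* True if the value read from register 0 has the Isolate bit set. */
bool phy_is_isolated(uint16_t bmcr)
{
    return (bmcr & BMCR_ISOLATE) != 0;
}

/* True if the value has the Power Down bit set. */
bool phy_is_powered_down(uint16_t bmcr)
{
    return (bmcr & BMCR_POWER_DOWN) != 0;
}

/* True if the value requests a restart of auto-negotiation. */
bool phy_restarts_autoneg(uint16_t bmcr)
{
    return (bmcr & BMCR_RESTART_AUTONEG) != 0;
}

/* True if auto-negotiation is enabled in the value. */
bool phy_autoneg_enabled(uint16_t bmcr)
{
    return (bmcr & BMCR_AUTONEG_ENABLE) != 0;
}
```

Read this way, 0x3900 is power down with auto-negotiation enabled, and 0x3300 clears power down and sets restart auto-negotiation. To test a hung unit, the application could print the value returned by GetMII ( PHY_addr, 0x0 ) on the serial port and check it with phy_is_isolated().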