Bug 875 - firmware error under load - requires reboot or ACPI sleep/wake to fix (not rmmod)
Status: ASSIGNED
Product: IPW2200
Component: firmware error
Version: 1.0.8
Platform: IBM Debian
Importance: P1 normal
Reported: 2006-01-01 11:50
Modified: 2007-02-02 14:07


Attachments
data from syslog just before lock up, until module unload (6.70 KB, application/gzip)
2006-01-05 15:47, Vivek Dasmohapatra
data from syslog from before lockup to after card acpi sleep/wake (284.07 KB, text/plain)
2006-04-09 08:09, Vivek Dasmohapatra



Description From 2006-01-01 11:50:25
I don't think this is the same as the other firmware error messages in 
the bug db as this one requires a reboot to fix, not just a module reload:

From time to time (often when copying large chunks of data over nfs or 
viewing video, eg via google video) the ipw2200 driver (1.0.6+2.3, 1.0.8+2.4)
will emit a "firmware error detected. restarting" message.

At this point, all network connectivity stops. rmmod/modprobe of the module 
does not fix this.

Once it has happened, firmware error messages keep cropping up at irregular
intervals, and the card can scan only very unreliably. (iwlist scan will 
show a random selection of networks. Interestingly, the card becomes unable
to pick up the network it was part of when the error occurred: weaker
neighbouring networks are seen but the original associated network is not 
even visible.)

So far nothing short of a reboot has ever brought the network back.

Heavy nfs/video-data style load is not necessary to trigger the bug: It 
will happen once or twice a week regardless of load, but intensive 
nfs/video traffic will almost always trigger it.
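
To put numbers on "once or twice a week", the restart messages can be counted
per day from syslog. This is only a sketch: it assumes the classic syslog
timestamp layout (month, day, time) and matches the message text quoted above
case-insensitively. The function name count_fw_errors and the log path in the
usage comment are illustrative assumptions, not anything shipped with the driver.

```shell
#!/bin/sh
# Count "firmware error detected" lines per day in a syslog-style file.
# Assumes each line starts with "Mon DD HH:MM:SS", as classic syslog writes it.
count_fw_errors() {
    grep -i 'firmware error detected' "$1" |
        awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' |
        sort    # plain text sort, so days are not in strict calendar order
}

# Usage (the path is an assumption; adjust to wherever kernel messages land):
#   count_fw_errors /var/log/syslog
```

Each output line is a month, a day of the month and the number of matching
messages logged on that day.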

Kernel version: 2.6.11.12
modinfo ipw2200: 
version:        1.0.8 (also 1.0.6)
vermagic:       2.6.11.12 preempt PENTIUMM 4KSTACKS gcc-3.3
modinfo ieee80211:
version:        1.1.6

[apologies if the module won't work with the above kernel options: 
 I don't recall seeing any warnings to that effect]
------- Comment #1 From 2006-01-01 12:24:21 -------
Have you tried the various stack corruption and similar patches that have been
made for bugs similar to the one you're posting?

[Search the bug DB for slab corruption and you'll find the first relevant patch
I'm thinking of]
------- Comment #2 From 2006-01-01 13:46:17 -------
I will look into this, thanks for the tip. I'll post the results either way.
------- Comment #3 From 2006-01-02 11:16:24 -------
Assuming you meant attachment #615 from bug #821, that made no difference.
I don't think I'm seeing the "firmware error detected. restarting" message,
but the behaviour is otherwise identical.
------- Comment #4 From 2006-01-04 15:45:20 -------
In fact, I am still seeing firmware restarts. What debug mask should I set to
get some useful data to attach to this bug?

[ If anything, I'd say 1.0.8 fails ~5x as often as 1.0.6 - 1.0.6 locked up
  once or twice a week (typically), so far 1.0.8 has locked up every day ]
------- Comment #5 From 2006-01-05 07:08:10 -------
Could you attach a log with debug=0x43fff?

Thanks!

-- Ben --
------- Comment #6 From 2006-01-05 15:47:16 -------
Created an attachment (id=645)
data from syslog just before lock up, until module unload

Syslog just before the bug was triggered, up to the point (after the failure)
when the module was rmmod'ed
------- Comment #7 From 2006-01-06 06:48:48 -------
Scratch that comment about 1.0.8 failing more often. I've been trying both 
today and they've been failing at least once an hour. Whatever is triggering
it is clearly happening more often now :(
------- Comment #8 From 2006-01-09 03:37:56 -------
I don't know if this is significant, but since changing the channel on 
my AP the lockups haven't happened again (yet). It's too soon to tell if
the spike in lockups was an unrelated blip though. There was one other 
network nearby on the default channel but it was significantly weaker
than my AP in signal strength.
------- Comment #9 From 2006-01-24 05:12:22 -------
Ok, the firmware just locked up again. This time I could see my AP in an iwlist
scan, but was unable to reestablish contact with it. (DHCP failed, no response)
I could see one neighbouring AP on channel 11, with a hidden ESSID and encryption
turned on, and my own AP on channel 6 with encryption off. It does seem to happen 
a lot less often now I'm not on the default channel though.
------- Comment #10 From 2006-02-04 08:05:55 -------
Workaround: putting the chipset to sleep and waking it up resets it properly,
where unloading/reloading the module has no effect.

echo -n 3 > /sys/class/net/eth1/device/power/state; 
sleep 1; 
echo -n 0 > /sys/class/net/eth1/device/power/state;

Brings it back to life, without taking the interface down.
[Assuming eth1 is the ipw2200 interface]
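
The same three steps can be wrapped in a small function so the interface name
is a parameter and a missing or read-only state file is reported instead of
failing silently. This is a sketch under assumptions: the function name
power_cycle and the SYSFS_NET override are made up for illustration (the
override only exists so the function can be tried against a scratch
directory), and it relies on the same power/state file as the commands above.

```shell
#!/bin/sh
# Power-cycle a network device by writing 3, then 0, to its sysfs power/state
# file, exactly as in the workaround above.
# SYSFS_NET is a hypothetical override for trying this out of tree.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

power_cycle() {
    iface="$1"
    state="$SYSFS_NET/$iface/device/power/state"
    if [ ! -w "$state" ]; then
        echo "power_cycle: cannot write $state (wrong interface, not root, or no such file on this kernel)" >&2
        return 1
    fi
    printf 3 > "$state" || return 1   # put the device to sleep
    sleep 1
    printf 0 > "$state" || return 1   # wake it up again
}

# Usage (as root, assuming eth1 is the ipw2200 interface):
#   power_cycle eth1
```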
------- Comment #11 From 2006-03-11 19:58:14 -------
Just tried ipw2200-1.1.1/ieee80211-1.1.12/ipw-fw-3.0.

Same behaviour.

I have discovered an SQL query which when made to another machine on the 
network seems to trigger a lockup for every few rows of data retrieved. Which 
is *unbelievably* annoying when I'm trying to get some work done.
------- Comment #12 From 2006-03-13 07:45:38 -------
Sorry, this one slipped by me for a while!

The syslog that you attached in comment #6 shows that you're having a 
difficult time with roaming/scanning, to the point that the driver's scan 
watchdog timer kicks off and (intentionally) causes the firmware restart.

Try running with roaming turned off (load with module param roaming=0), and see 
if that makes things better.  This *is* the sort of thing for which changing 
channels can make a difference, as you reported.
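
For reference, a reload with that parameter might look like the sketch below.
The run and reload_ipw2200 helpers and the DRY_RUN switch are illustrative
names, not part of the driver. By default the script only prints the commands
so they can be checked first; set DRY_RUN=0 and run it as root to execute them.

```shell
#!/bin/sh
# Reload ipw2200 with extra module parameters.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_ipw2200() {
    run rmmod ipw2200 || return 1
    run modprobe ipw2200 "$@"    # parameters are passed straight through
}

reload_ipw2200 roaming=0
```

Since rmmod/modprobe alone has not cleared the lockup in this report, the
reload is probably best done from a freshly reset card.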

-- Ben --
------- Comment #13 From 2006-03-13 07:46:48 -------
p.s. the "roaming" module param first appeared in 1.0.9, so it won't help with 
1.0.8, but you've upgraded already to 1.1.1.

-- Ben --
------- Comment #14 From 2006-03-15 08:48:51 -------
No change. I'll capture another set of debugging info later and add it. Same
debug flags?
------- Comment #15 From 2006-03-15 09:49:28 -------
Try debug=0x01043fff to add RX debug messages.  Thanks.
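
As a quick check of what changes between the two masks requested in this
thread, XOR-ing them shows the single bit that the new value adds. The
variable names are only for illustration, and the modprobe line in the comment
assumes the mask is passed as the debug load parameter, as the debug= notation
here implies.

```shell
#!/bin/sh
# Compare the mask from comment #5 with the one suggested above.
old_mask=0x43fff
new_mask=0x01043fff

# XOR leaves only the bits that differ between the two masks.
added=$(( $new_mask ^ $old_mask ))
printf 'bits added: 0x%08x\n' "$added"

# Assumed way to apply it when loading the module (as root):
#   modprobe ipw2200 debug=0x01043fff
```

The only difference is bit 24 (0x01000000), which is the bit that turns on the
RX debug messages mentioned above.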

-- Ben --
------- Comment #16 From 2006-04-09 08:09:32 -------
Created an attachment (id=755)
data from syslog from before lockup to after card acpi sleep/wake

data from syslog from before lockup to after card acpi sleep/wake
with debug=0x01043fff 
------- Comment #17 From 2006-05-14 03:25:10 -------
Any further info?
------- Comment #18 From 2006-05-18 07:33:14 -------
Have you turned hwcrypto on (driver load param, default is off)?

hwcrypto seems to have problems under high traffic or noisy conditions.

Also, separately, you might try playing with bluetooth coexistence on/off.

Also see bug #1029; it looks like a similar situation.

-- Ben --
------- Comment #19 From 2006-05-25 05:57:06 -------
No, I haven't turned hwcrypto on. The SSIDs of the neighbouring APs are
different, and I'm on a different channel (they're all on 11, I'm on 6)
What does bluetooth coexistence do?
------- Comment #20 From 2007-02-02 12:34:31 -------
Can the driver be enhanced to power-cycle the card (through the PCI PM
control) when the card needs a strong reset?

The /power sysfs interface is going away, so drivers will have to pick up the
slack.  If ipw2200 can be locked into a state where you need to power-cycle it
through the PCI PM layer, then the driver should be able to do it.

The fact that there is a bug somewhere causing the lock-up is orthogonal.  Error
handling needs to be improved *anyway*.
------- Comment #21 From 2007-02-02 14:07:52 -------
A patch would be welcome.

-- Ben --