Bugzilla – Bug 875
firmware error under load - requires reboot or ACPI sleep/wake to fix (not rmmod)
Last modified: 2007-02-02 14:07:52
I don't think this is the same as the other firmware error messages in the bug db, as this one requires a reboot to fix, not just a module reload.

From time to time (often when copying large chunks of data over nfs, or viewing video, e.g. via google video) the ipw2200 driver (1.0.6+2.3, 1.0.8+2.4) will emit a "firmware error detected. restarting" message. At this point, all network connectivity stops. rmmod/modprobe of the module does not fix this. Once it has happened, firmware error messages keep cropping up at irregular intervals, and the card can scan only very unreliably. (iwlist scan will show a random selection of networks. Interestingly, the card becomes unable to pick up the network it was part of when the error occurred: weaker neighbouring networks are seen, but the original associated network is not even visible.) So far nothing short of a reboot has ever brought the network back.

Heavy nfs/video-data style load is not necessary to trigger the bug: it will happen once or twice a week regardless of load, but intensive nfs/video traffic will almost always trigger it.

Kernel version: 2.6.11.12
modinfo ipw2200:
  version: 1.0.8 (also 1.0.6)
  vermagic: 2.6.11.12 preempt PENTIUMM 4KSTACKS gcc-3.3
modinfo ieee80211:
  version: 1.1.6

[apologies if the module won't work with the above kernel options: I don't recall seeing any warnings to that effect]
Have you tried the various stack corruption and similar patches that have been made for the similar bugs to the one you're posting? [Search the bug DB for slab corruption and you'll find the first relevant patch I'm thinking of]
I will look into this, thanks for the tip. I'll post the results either way.
Assuming you meant attachment #615 from bug #821, that made no difference. I don't think I'm seeing the "firmware error detected. restarting" message, but the behaviour is otherwise identical.
In fact, I am still seeing firmware restarts. What debug mask should I set to get some useful data to attach to this bug? [ If anything, I'd say 1.0.8 fails ~5x as often as 1.0.6 - 1.0.6 locked up once or twice a week (typically), so far 1.0.8 has locked up every day ]
Could you attach a log with debug=0x43fff? Thanks! -- Ben --
Created an attachment (id=645): data from syslog just before lock up, until module unload
Syslog just before the bug was triggered, up to the point (after the failure) when the module was rmmod'ed.
Scratch that comment about 1.0.8 failing more often. I've been trying both today and they've been failing at least once an hour. Whatever is triggering it is clearly happening more often now :(
I don't know if this is significant, but since changing the channel on my AP the lockups haven't happened again (yet). It's too soon to tell if the spike in lockups was an unrelated blip though. There was one other network nearby on the default channel but it was significantly weaker than my AP in signal strength.
Ok, the firmware just locked up again. This time I could see my AP in an iwlist scan, but was unable to reestablish contact with it. (DHCP failed, no response) I could see one neighbouring AP on channel 11, with a hidden ESSID and encryption turned on, and my own AP on channel 6 with encryption off. It does seem to happen a lot less often now I'm not on the default channel though.
Workaround: putting the chipset to sleep and waking it up resets it properly, where unloading/reloading the module has no effect.

    echo -n 3 > /sys/class/net/eth1/device/power/state; sleep 1; echo -n 0 > /sys/class/net/eth1/device/power/state

This brings it back to life, without taking the interface down. [Assuming eth1 is the ipw2200 interface]
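The same sleep/wake workaround can be wrapped in a small function so it is easier to rerun after each lockup. This is only a sketch: the function name power_cycle and its arguments are made up for illustration, the default path assumes eth1 is the ipw2200 interface, and it relies on the legacy sysfs power/state file being present on the running kernel.

```shell
#!/bin/sh
# Sketch of the sleep/wake workaround as a reusable function.
# power_cycle is a hypothetical helper name, not part of the driver.
# Arg 1: path to the device's power/state file (default assumes eth1).
# Arg 2: seconds to wait between sleep and wake (default 1).

power_cycle() {
    state_file=${1:-/sys/class/net/eth1/device/power/state}
    delay=${2:-1}

    # Refuse to continue if the file is missing or not writable.
    if [ ! -w "$state_file" ]; then
        echo "cannot write $state_file" >&2
        return 1
    fi

    printf 3 > "$state_file" || return 1   # value used above to put the card to sleep
    sleep "$delay"
    printf 0 > "$state_file"               # value used above to wake it up again
}
```

Run as root with no arguments, power_cycle repeats the sequence above; pass a different state file path if the card is not eth1.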
Just tried ipw2200-1.1.1/ieee80211-1.1.12/ipw-fw-3.0. Same behaviour. I have discovered an SQL query which when made to another machine on the network seems to trigger a lockup for every few rows of data retrieved. Which is *unbelievably* annoying when I'm trying to get some work done.
Sorry, this one slipped by me for a while! The syslog that you attached in comment #6 shows that you're having a difficult time with roaming/scanning, to the point that the driver's scan watchdog timer kicks off and (intentionally) causes the firmware restart. Try running with roaming turned off (load with module param roaming=0), and see if that makes things better. This *is* the sort of thing for which changing channels can make a difference, as you reported. -- Ben --
p.s. the "roaming" module param first appeared in 1.0.9, so it won't help with 1.0.8, but you've upgraded already to 1.1.1. -- Ben --
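For reference, reloading the driver with roaming=0 might look like the sketch below. The helper name reload_ipw2200 and the RUN variable are invented for illustration. By default the commands are only printed (a dry run), so nothing is unloaded until the script is rerun as root with RUN set to an empty string.

```shell
#!/bin/sh
# Sketch: reload ipw2200 with extra module parameters.
# RUN defaults to "echo", so the commands are printed rather than executed.
# Set RUN="" (as root) to really unload and reload the module.
RUN=${RUN-echo}

reload_ipw2200() {
    # "$@" holds module parameters such as roaming=0 or debug=0x43fff
    $RUN modprobe -r ipw2200 || return 1
    $RUN modprobe ipw2200 "$@"
}

# Example (dry run): reload_ipw2200 roaming=0
```

The same helper accepts a debug mask as well, e.g. reload_ipw2200 roaming=0 debug=0x43fff.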
No change. I'll capture another set of debugging info later and add it. Same debug flags?
Try debug=0x01043fff to add RX debug messages. Thanks. -- Ben --
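To see what the new mask changes relative to the earlier 0x43fff, the two values can be compared bit by bit. The sketch below (mask_diff is a made-up helper name) prints the bits that differ; going by the comment above, that single extra bit is the one that turns on the RX debug messages.

```shell
#!/bin/sh
# Show which bits differ between the two debug masks requested in this bug.
old_mask=0x43fff
new_mask=0x01043fff

mask_diff() {
    # XOR leaves only the bits that are set in exactly one of the two masks
    printf '0x%08x\n' $(( $1 ^ $2 ))
}

mask_diff "$old_mask" "$new_mask"   # prints 0x01000000
```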
Created an attachment (id=755): data from syslog from before lockup to after card acpi sleep/wake
Data from syslog from before lockup to after card acpi sleep/wake, with debug=0x01043fff.
Any further info?
Have you turned hwcrypto on (driver load param, default is off)? hwcrypto seems to have problems under high traffic or noisy conditions. Also, separately, you might try playing with bluetooth coexistence on/off. Also see bug #1029 ... it looks like a similar situation. -- Ben --
No, I haven't turned hwcrypto on. The SSIDs of the neighbouring APs are different, and I'm on a different channel (they're all on 11, I'm on 6). What does bluetooth coexistence do?
Can we have the driver enhanced to be able to power-cycle (through the PCI PM control) the card for when the card needs a strong reset? The /power sysfs interface is going away, so drivers will have to pick up the slack. If ipw2200 can be locked into a state where you need to power-cycle it through the PCI PM layer, then the driver should be able to do it. The fact that there is a bug somewhere causing the lock-up is orthogonal. Error handling needs to be improved *anyway*.
A patch would be welcome. -- Ben --