Bug 1260 - Microcode SW error (SYSASSERT (#5)) under high load
Status: VERIFIED WONTFIX
Product: IPW3945
Component: firmware error
Version: 1.2.0
Platform: IBM Debian
Importance: P2 normal
Reported: 2007-04-13 12:27 by
Modified: 2008-12-08 22:14 (History)


Attachments
Kernel log with debug=0x43fff (105.30 KB, text/plain)
2007-04-13 12:29, Sebastian Schmidt
Kernel configuration (75.03 KB, text/plain)
2007-04-13 12:31, Sebastian Schmidt
Kernel log with debug=0x43fff and led=0 (67.52 KB, text/plain)
2007-04-20 11:42, Sebastian Schmidt



Description From 2007-04-13 12:27:26
Moin!

My problem is that the driver crashes with "ipw3945: Microcode SW error
detected.  Restarting.". Every time, just before the crash, the message "Error
sending LEDS_CMD: time out after 500ms." appears in the kernel log.
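
A minimal sketch for checking this correlation in a saved kernel log. It only relies on the two message strings quoted above; the function name, the `window` parameter, and the sample lines are made up for illustration and are not part of the driver or its tools.

```python
# Substrings taken from the two kernel messages quoted in this report.
LEDS_TIMEOUT = "Error sending LEDS_CMD: time out"
UCODE_ERROR = "Microcode SW error detected"


def restarts_preceded_by_led_timeout(log_lines, window=20):
    """Return (line_index, preceded) for every microcode restart.

    `preceded` is True when a LEDS_CMD timeout was logged at most
    `window` lines before the restart. `window` is an arbitrary,
    hypothetical threshold, not something the driver defines.
    """
    results = []
    last_timeout_at = None
    for index, line in enumerate(log_lines):
        if LEDS_TIMEOUT in line:
            last_timeout_at = index
        elif UCODE_ERROR in line:
            preceded = (last_timeout_at is not None
                        and index - last_timeout_at <= window)
            results.append((index, preceded))
    return results


# Invented sample lines; a real log would be read from a file instead.
sample_log = [
    "ipw3945: Error sending LEDS_CMD: time out after 500ms.",
    "ipw3945: Microcode SW error detected.  Restarting.",
    "some unrelated kernel message",
    "ipw3945: Microcode SW error detected.  Restarting.",
]
report = restarts_preceded_by_led_timeout(sample_log, window=1)
```

With a real log, the lines could come from something like `open("kern.log").read().splitlines()`, where the file name is again only a placeholder.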

It *seems* to me that this problem occurs when I transmit a large amount of data
(so the LED can't flash fast enough? ;)).

I really cannot tell whether this problem also occurred with older firmware
versions. I did see some Microcode SW errors back then, but I didn't check
whether a LEDS_CMD timeout preceded them (and I couldn't even tell you the
firmware version: it was shipped with Debian without any version information :().

The problem occurs regardless of the distance to the Access Point (sitting right
next to it or one floor below doesn't change anything).

Additional information below.

Greetings,
  Sebastian

System Information
==================
Machine: Lenovo/IBM Thinkpad X60s with an IPW3945
Firmware version: 1.14.2, freshly downloaded from http://bughost.org/ipw3945/
Linux Kernel version: 2.6.20.1 (almost vanilla, except bootsplash. With Debian
standard configuration)
ipw3945d version: 1.7.22 (out of Debian package with version 1.7.22-4)
Access Point: Thomson Speedtouch 585i v6 with Firmware version 5.4.0.14
Security: WPA2-PSK
------- Comment #1 From 2007-04-13 12:29:26 -------
Created an attachment (id=1027) [details]
Kernel log with debug=0x43fff
------- Comment #2 From 2007-04-13 12:31:01 -------
Created an attachment (id=1028) [details]
Kernel configuration

Just in case that matters :-)
------- Comment #3 From 2007-04-20 11:34:57 -------
Moin!

I investigated a bit further:
- The problem is probably *not* related to a LEDS_CMD timeout, as the problem
still exists with ipw3945 being loaded with led=0 (although "ipw_queue_tx_hcmd
Sending command LEDS_CMD (#48), seq: ..." shows up in the debug log...)
- The problem is probably not related to WPA either; it also occurs with WPA or
WEP completely disabled.
- I also tried the firmware attached to bug #1085; the problem remains the same.
- The problem is indeed the high load. I can reliably reproduce it by scp-ing a
large file. (Large, because the problem disappears when the network throughput
is reduced after a few seconds.)
- Maybe the bug is related to bug #1201? The debug output seems quite the same
to me (SYSASSERT #5), although my system is not unresponsive at all.

Is there any more information I could provide you?

Sebastian
------- Comment #4 From 2007-04-20 11:42:18 -------
Created an attachment (id=1035) [details]
Kernel log with debug=0x43fff and led=0
------- Comment #5 From 2007-06-17 09:46:28 -------
Hmm,

I found out something curious (at least for me):
when I set CONFIG_PREEMPT=y, the bug disappears and the connection works as
expected. (This was the case in 2.6.21.5, I didn't test it with .21.)

I would have expected this the other way round.. but - well, I'm not a kernel
hacker. Maybe this is some kind of lock or whatever being held?
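
A small sketch for checking which preemption model a kernel was built with, given the text of its configuration file. The option names CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT are the ones used by 2.6-series kernels; the function name and the sample text are only illustrative.

```python
def preemption_model(config_text):
    """Classify a kernel .config as 'full', 'voluntary', 'none' or 'unknown'."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    if "CONFIG_PREEMPT" in enabled:
        return "full"
    if "CONFIG_PREEMPT_VOLUNTARY" in enabled:
        return "voluntary"
    if "CONFIG_PREEMPT_NONE" in enabled:
        return "none"
    return "unknown"


# Invented sample resembling a configuration with voluntary preemption.
sample_config = """\
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
"""
model = preemption_model(sample_config)
```

On Debian the configuration of an installed kernel is usually available as a `config-<release>` file under `/boot`; reading that file and passing its contents to the function is left out here because the path is an assumption about the system.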

Sebastian
------- Comment #6 From 2007-07-11 14:15:29 -------
I have a very similar problem here. Specs:

Sony VAIO SZ-350BP (Brazilian version)
Kernel: 2.6.20-gentoo-r8 (gentoo-sources)
ipw3945: 1.2.0
ipw3945-ucode: 1.14.2
ipw3945d: 1.7.22-r4

dmesg output is pretty much the same, so I'll not be posting it here.

It seems that switching between X and VTs aggravates the problem, since some
switches (3 or 4) are enough to bring down the connection.

Well, I've just tried activating full kernel preemption (it was set to partial
kernel preemption) and the connection seems much more stable. On the other hand,
switching from the text console to X now causes the ipw to restart *every time*
(the problem now happens only when switching from VT to X, not the other way).

I'll gladly provide more information whenever it's needed; this bug has been
freaking me out for some time now.



Greetings,
Cesar Kawakami
------- Comment #7 From 2007-10-12 08:27:08 -------
(In reply to comment #5)
> Hmm,
> 
> I found out something curious (at least for me):
> when I set CONFIG_PREEMPT=y, the bug disappears and the connection works as
> expected. (This was the case in 2.6.21.5, I didn't test it with .21.)
> 
> I would have expected this the other way round.. but - well, I'm not a kernel
> hacker. Maybe this is some kind of lock or whatever being held?
> 
> Sebastian

I have the same option set on kernel 2.6.23 (also happened on 2.6.22) and I get
the same message about LEDS_CMD time out.  It's periodic, but when it happens I
have a lot of trouble getting things back.  Sometimes I have to force a reboot
because I can't get the interface to recover.

I have pretty much the same outputs as already posted so I won't spam here :)
------- Comment #8 From 2007-12-06 22:15:05 -------
Enabling full preemption made the problem much better here, too. I'm not sure if
I still see it occasionally or if it's completely gone. It might also still
depend on system load, with preemption just hiding it to a large extent.
------- Comment #9 From 2007-12-07 06:41:40 -------
Thanks for the reports.

Zhu Yi has been working on/with some patches sent in by a community member 
just in the past day or two ... these look promising.

It turns out that we've been unnecessarily using heavy-handed
spin_lock_irqsave(), which turns off interrupts (not just our own!), thus the
system lock-ups.

Also, we were not replenishing the Rx queues as often as we could, so Rx 
buffers could pile up in heavy traffic and/or heavy CPU usage, while we were 
processing earlier Rx and command (e.g. Tx) response buffers.  Thus the 
firmware errors in heavy traffic, due to running out of buffer space to put 
things.
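
The Rx starvation described above can be illustrated with a toy model. This is not the driver's code: the function, its parameters, and the numbers are all invented, and a "tick" is just an abstract time step. It only shows why handing processed buffers back to the device late can leave the device without a free buffer even though the driver keeps up with the traffic on average.

```python
def simulate_rx(ticks, ring_size, arrivals_per_tick,
                processed_per_tick, replenish_every):
    """Count frames that arrive while no free Rx buffer is available.

    Toy model with hypothetical parameters:
    - the device fills one free buffer per arriving frame,
    - the driver processes up to `processed_per_tick` filled buffers,
    - processed buffers are returned to the device only every
      `replenish_every` ticks.
    """
    free = ring_size   # buffers the device may fill
    filled = 0         # filled buffers waiting for the driver
    recycled = 0       # processed buffers not yet handed back
    overruns = 0
    for tick in range(1, ticks + 1):
        # Device side: each arriving frame needs one free buffer.
        for _ in range(arrivals_per_tick):
            if free > 0:
                free -= 1
                filled += 1
            else:
                overruns += 1
        # Driver side: process what is pending.
        done = min(filled, processed_per_tick)
        filled -= done
        recycled += done
        # Replenish: return processed buffers to the device.
        if tick % replenish_every == 0:
            free += recycled
            recycled = 0
    return overruns


# Same traffic and same processing speed; only the replenish interval differs.
eager = simulate_rx(100, ring_size=8, arrivals_per_tick=4,
                    processed_per_tick=4, replenish_every=1)
lazy = simulate_rx(100, ring_size=8, arrivals_per_tick=4,
                   processed_per_tick=4, replenish_every=4)
```

In this model `eager` stays at zero overruns while `lazy` does not, which mirrors the point that replenishing more often leaves the firmware room to put incoming data.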

New patches should fix these issues.  I'll let Yi take it from here.

-- Ben --
------- Comment #10 From 2007-12-07 07:36:48 -------
Oooops, Yi's work is with 2200, not 3945!

(My morning stupor at work there).

But there sure are a lot of the spin_lock_irqsave()s in 3945/4965, too.

-- Ben --
------- Comment #11 From 2008-01-20 07:48:40 -------
Hi, I just wanted to report that this same bug has also been filed in Launchpad:

https://bugs.launchpad.net/ipw3945/+bug/109887

Thanks
------- Comment #12 From 2008-12-08 21:51:55 -------
The ipw3945 driver was replaced by iwl3945 in the official kernel a long time
ago. We suggest using the iwl3945 driver instead of the obsolete ipw3945 driver.
If you find a bug, please report it with product=iwlwifi and platform="Intel(R)
Wifi Link 3945". Thanks so much!