Bug 2015 - iwlwifi hits BUG_ON when forced to roam ("iwconfig ap any")
: iwlwifi hits BUG_ON when forced to roam ("iwconfig ap any")
Status: VERIFIED FIXED
: iwlwifi
others
: iwlwifi-2.6 development tree
: 5100/5300 (Intel(R) WiFi Link 5100/5300) Gentoo
: P2 normal
Assigned To:
:
:
:
:
:
  Show dependency treegraph
 
Reported: 2009-06-06 05:58 by
Modified: 2010-02-09 18:39 (History)


Attachments
dmesg output (49.67 KB, text/plain)
2009-06-06 05:58, pee
Details
dmesg from a similar failure later on (58.44 KB, text/plain)
2009-06-06 13:56, pee
Details
dmesg output with debug enabled (51.40 KB, text/plain)
2009-06-09 06:03, pee
Details
dmesg output with patch + debug (85.71 KB, text/plain)
2009-06-09 12:16, pee
Details
syslog (212.42 KB, text/plain)
2009-06-09 14:14, pee
Details
patch to fix the problem (3.74 KB, patch)
2009-06-09 14:38, Johannes Berg
Details | Diff
patch to fix the problem (4.54 KB, patch)
2009-06-09 14:53, Johannes Berg
Details | Diff
screenshot of kernel panic (307.55 KB, image/jpeg)
2009-06-10 12:40, Michael Kaminsky
Details
messages from the last 24hrs or so with patch (123.88 KB, application/x-gzip)
2009-06-10 15:22, pee
Details
debugging level is 0x4000000 (5.60 KB, text/plain)
2009-06-10 19:51, ximin luo
Details
syslog for first warn with new patch (21.34 KB, application/x-gzip)
2009-06-11 11:55, pee
Details
further syslog with data failure (35.45 KB, application/x-gzip)
2009-06-11 13:10, pee
Details
debugging level 0x4000003 and "ap any" log (33.97 KB, text/plain)
2009-06-11 19:57, ximin luo
Details
kernel crash with extra debugging (24.03 KB, application/gzip)
2009-06-12 12:15, Michael Kaminsky
Details
new debug message without kernel crash (6.97 KB, application/gzip)
2009-06-15 11:02, Michael Kaminsky
Details
debug log without patch from comment #36 (7.97 KB, application/gzip)
2009-06-15 13:20, Michael Kaminsky
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2009-06-06 05:58:46
Created an attachment (id=2022) [details]
dmesg output

After running for a while i start seeing "queue full" messages from pings to my
default gateway. I've attached dmesg including the warning.

If i then do "iwconfig wlan0 ap auto" it causes my machine to hard lock.

If i ifconfig down wlan0, rmmod, and modprobe to restart it seems ok.
------- Comment #1 From 2009-06-06 13:56:45 -------
Created an attachment (id=2023) [details]
dmesg from a similar failure later on

This is the dmesg from the same boot but later in the day after another
failure.
------- Comment #2 From 2009-06-08 10:35:52 -------
Wey-Yi,

This looks like the problem you are working on.
------- Comment #3 From 2009-06-08 12:46:27 -------
(In reply to comment #1)
> Created an attachment (id=2023) [details] [details]
> dmesg from a similar failure later on
> This is the dmesg from the same boot but later in the day after another
> failure.

Looks like mac80211 issue IEEE80211_AMPDU_TX_STOP to the same tid more than
once, could you re-run the test with debug50=0x4000000.

thanks
------- Comment #4 From 2009-06-09 06:03:54 -------
Created an attachment (id=2026) [details]
dmesg output with debug enabled

Here is the dmesg with debug output included. It includes my ifconfig down of
the interface and rmmod/modprobe at the end.
------- Comment #5 From 2009-06-09 09:03:55 -------
Can you please try this with this patch, and debug50=0x4001003

The patch will cause a lot of dump_stack() but I need that info.

--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-sta.c    2009-06-09
18:02:22.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-sta.c    2009-06-09
18:02:34.000000000 +0200
@@ -480,6 +480,9 @@ void iwl_clear_stations_table(struct iwl
     unsigned long flags;
     int i;

+    printk(KERN_DEBUG "iwl_clear_stations_table\n");
+    dump_stack();
+
     spin_lock_irqsave(&priv->sta_lock, flags);

     if (iwl_is_alive(priv) &&
------- Comment #6 From 2009-06-09 12:16:11 -------
Created an attachment (id=2027) [details]
dmesg output with patch + debug

Here is the output you requested
------- Comment #7 From 2009-06-09 12:43:31 -------
Thanks, that helps a lot. Can you, leaving all this in place, check if you can
reproduce the problem with this small change:

--- wireless-testing.orig/net/mac80211/util.c    2009-06-09 21:43:02.000000000
+0200
+++ wireless-testing/net/mac80211/util.c    2009-06-09 21:43:07.000000000 +0200
@@ -773,8 +773,7 @@ int ieee80211_set_freq(struct ieee80211_
         if (local->sw_scanning || local->hw_scanning)
             ret = 0;
         else
-            ret = ieee80211_hw_config(
-                local, IEEE80211_CONF_CHANGE_CHANNEL);
+            ret = ieee80211_hw_config(local, 0);
     }

     return ret;


or were you manually changing the channel while associated?
------- Comment #8 From 2009-06-09 12:47:41 -------
(In reply to comment #7)
> Thanks, that helps a lot. Can you, leaving all this in place, check if you can
> reproduce the problem with this small change:
> --- wireless-testing.orig/net/mac80211/util.c   2009-06-09 21:43:02.000000000
> +0200
> +++ wireless-testing/net/mac80211/util.c        2009-06-09 21:43:07.000000000
> +0200
> @@ -773,8 +773,7 @@ int ieee80211_set_freq(struct ieee80211_
>                 if (local->sw_scanning || local->hw_scanning)
>                         ret = 0;
>                 else
> -                       ret = ieee80211_hw_config(
> -                               local, IEEE80211_CONF_CHANGE_CHANNEL);
> +                       ret = ieee80211_hw_config(local, 0);
>         }
>         return ret;
> or were you manually changing the channel while associated?

looks like something changed in hw_config and cause the station table get
reset. is this possible caused by "iwconfig wlan0 ap auto"?
------- Comment #9 From 2009-06-09 12:50:46 -------
(In reply to comment #8)

> looks like something changed in hw_config and cause the station table get
> reset. is this possible caused by "iwconfig wlan0 ap auto"?

The stack trace quite clearly says:
iwl_clear_stations_table
Pid: 4502, comm: wpa_supplicant Not tainted 2.6.30-rc8-wl-41696-g7a35720-dirty
#5
Call Trace:
 [<ffffffff804d9386>] iwl_clear_stations_table+0x2d/0xef
 [<ffffffffa000868e>] iwl_commit_rxon+0x2a6/0x460 [iwlagn]
 [<ffffffff804c7ef7>] iwl_mac_config+0x48b/0x573
 [<ffffffff806be93e>] ieee80211_hw_config+0x7a/0x80
 [<ffffffff806d45d4>] ieee80211_set_freq+0x67/0x77
 [<ffffffff806c081b>] ieee80211_ioctl_siwfreq+0xc7/0xcc

so it must be setting the frequency... which the patch above should catch
unless you're actually moving to a different channel...which would be weird.
------- Comment #10 From 2009-06-09 13:17:14 -------
Actually, right after we get a new SSID too, so it seems wpa_supplicant or
maybe NM is just trying to connect to another AP.
------- Comment #11 From 2009-06-09 14:14:56 -------
Created an attachment (id=2028) [details]
syslog

I'm not sure if you want me to run with that patch or not =). Please let me
know

Attached syslog with NetworkMangler logs in it from the same time
------- Comment #12 From 2009-06-09 14:22:04 -------
Sorry, no, that patch won't really help. I'm developing another one, give me a
few minutes to test it at least once.
------- Comment #13 From 2009-06-09 14:38:37 -------
Created an attachment (id=2029) [details]
patch to fix the problem

Can you please try this patch? It has fixed the problem for me.
------- Comment #14 From 2009-06-09 14:53:23 -------
Created an attachment (id=2030) [details]
patch to fix the problem

sorry, that has a bug that could make it crash, this is better
------- Comment #15 From 2009-06-10 07:05:42 -------
I just confirmed that "iwconfig wlan0 ap auto" also causes my machine to hard
lock-up.  This seems to happen even if I haven't seen the kernel "WARNING".

Johannes--I just tested your latest patch (comment #14), and it still results
in a hard lock-up when running the above command.  Anything else to try?

Michael
------- Comment #16 From 2009-06-10 08:40:04 -------
(In reply to comment #15)
> I just confirmed that "iwconfig wlan0 ap auto" also causes my machine to hard
> lock-up.  This seems to happen even if I haven't seen the kernel "WARNING".

Can you explain in a little more detail how to trigger this? Is that with first
connecting with wpa_supplicant? Or just on an existing connection brought up by
iwconfig? Or without any connection at all?
------- Comment #17 From 2009-06-10 09:06:32 -------
Also, how hard is the hard lockup? If you switch out of X first, do you see any
bug message on it?
------- Comment #18 From 2009-06-10 12:40:35 -------
Created an attachment (id=2032) [details]
screenshot of kernel panic

I dropped to console first was able to take a digital photo of the kernel
panic.

Here's what I did to produce the crash (very repeatable):
  - unload iwlagn kernel module
  - turn off network-manager, etc.
  - load kernel module
  - ifconfig wlan0 up
  - iwconfig wlan0 essid CMU
  - dhclient wlan0
  - start an scp
    [works fine; kill after a few second]
    [note: once transfer starts, the kernel debug messages show that
aggregation gets turned on]
  - iwconfig wlan0 ap auto
  - start another scp
    [immediate kernel panic--see screenshot]
------- Comment #19 From 2009-06-10 15:22:29 -------
Created an attachment (id=2033) [details]
messages from the last 24hrs or so with patch

Here is my /var/log/messages since boot with the patch. It ran fine until this
afternoon sometime when I stopped moving packets. Unfortunately I wasn't at the
box when it happened. Hopefully there is something useful in there.

I tried the iwconfig wlan0 ap auto , no crash.

I ended up rmmod/modprobe to restore connectivity.
------- Comment #20 From 2009-06-10 19:51:42 -------
Created an attachment (id=2034) [details]
debugging level is 0x4000000

I try to reproduce it on our testing environment. It's easy to reproduce on
5100/5300.

Testing Environment
========================================
Platform        :       Intel SDV M3M41 
Wireless Card   :       Intel(R) WiFi Link 5100/5300
OS              :       Redhat Fedora release 10 (Cambridge) 32/64bit
AP              :       Cisco 1250
AP Channel Width:       20/Below 40/Above40 MHz
uCode           :       iwlwifi-5000-2.ucode 8.24.2.12
Source          :       commit 7a35720d1ba0dae1e7994cc388014c9364a2b382
                     (2.6.30-rc8-wl)

Steps to Reproduce
========================================
1. modprobe -r iwlagn
2. modprobe iwlagn debug50=0x4000000
(Note: if load module with "11n_disable50=1", didn't see kernel panic)
3. ifconfig wlan0 up
4. iwconfig wlan0 channel <> essid <>
5. ifconfig wlan0 IP_ADDR
6. scp file to server (300Mfile)
7. iwconfig wlan0 ap auto (this step is necessary)
8. scp again leading to kernel panic

Call Trace message
========================================
kernel BUG at drivers/net/wireless/iwlwifi/iwl-5000.c:1247!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/class/power_supply/BAT0/energy_full
Modules linked in: iwlagn iwlcore mac80211 cfg80211 rfkill netconsole sunrpc
ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables
ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter
ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash
dm_log dm_multipath dm_mod uinput arc4 ecb e1000 i2c_i801 i2c_core ac pcspkr
video output battery button sr_mod cdrom sg ata_generic ata_piix libata sd_mod
scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: rfkill]

Pid: 0, comm: swapper Not tainted (2.6.30-rc8-wl #3) Montevina platform
EIP: 0060:[<fa6363a9>] EFLAGS: 00010097 CPU: 0
EIP is at iwl5000_rx_reply_tx+0x797/0x966 [iwlagn]
EAX: 00000082 EBX: f591e9b4 ECX: 00000000 EDX: f5324008
ESI: f6af09a0 EDI: f591e9f8 EBP: c1fdbf2c ESP: c1fdbebc
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c1fdb000 task=c07402fc task.ti=c077f000)
Stack:
 c07d8460 f6af09a0 00000100 c041d120 f5324000 00000000 0000000a f591e9d4
 f5324008 00000201 00000000 00000000 00000002 00000000 c1fe1460 00000082
 f71a3fb0 c1fe4ae4 00000001 c1fdbf10 c0420a0d c1fdbf24 c0435766 00000000
Call Trace:
 [<c041d120>] ? check_preempt_wakeup+0xd3/0xec
 [<c0420a0d>] ? default_wake_function+0xb/0xd
 [<c0435766>] ? autoremove_wake_function+0xf/0x33
 [<fa6207cf>] ? iwl_rx_handle+0x1d6/0x322 [iwlagn]
 [<c043281c>] ? insert_work+0x78/0x81
 [<fa621860>] ? iwl_irq_tasklet+0x67b/0x862 [iwlagn]
 [<c042cad8>] ? run_timer_softirq+0x32/0x169
 [<c042926a>] ? tasklet_action+0x62/0x9f
 [<c0429968>] ? __do_softirq+0x88/0x12c
 [<c04298e0>] ? __do_softirq+0x0/0x12c
 <IRQ> <0> [<c042970b>] ? irq_exit+0x29/0x57
 [<c0403d11>] ? do_IRQ+0x6d/0x83
 [<c0402fc9>] ? common_interrupt+0x29/0x30
 [<c05525c3>] ? acpi_idle_enter_simple+0x132/0x15d
 [<c05c3472>] ? cpuidle_idle_call+0x60/0x92
 [<c0401c0d>] ? cpu_idle+0x44/0x5f
 [<c064792f>] ? rest_init+0x53/0x55
 [<c0786793>] ? start_kernel+0x298/0x29d
 [<c078606a>] ? __init_begin+0x6a/0x6f
Code: 94 88 1c b6 00 00 83 c8 ff f0 0f c1 02 48 85 c0 0f 8f 55 01 00 00 89 ca
8b 4d 94 8b 01 e9 44 01 00 00 0f b6 40 17 39 45 a4 74 04 <0f> 0b eb fe 8b 55 b0
8b 4d ac 8a 42 03 40 88 41 09 0f b6 45 b4
EIP: [<fa6363a9>] iwl5000_rx_reply_tx+0x797/0x966 [iwlagn] SS:ESP 0068:c1fdbebc
---[ end trace dda8d7c20ae36f0a ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    2.6.30-rc8-wl #3
Call Trace:
 [<c065689c>] ? printk+0xf/0x13
 [<c06567f2>] panic+0x39/0xd4
 [<c06595b3>] oops_end+0x8c/0x9b
 [<c040503e>] die+0x54/0x5a
 [<c0658f47>] do_trap+0x89/0xa2
 [<c04037bb>] ? do_invalid_op+0x0/0x7b
 [<c040382c>] do_invalid_op+0x71/0x7b
 [<fa6363a9>] ? iwl5000_rx_reply_tx+0x797/0x966 [iwlagn]
 [<c041c2af>] ? enqueue_task+0x68/0x74
 [<c0504f89>] ? cpumask_next_and+0x26/0x37
 [<c041fca8>] ? find_busiest_group+0x282/0x876
 [<c0420a0d>] ? default_wake_function+0xb/0xd
 [<c0658d05>] error_code+0x6d/0x74
 [<c041007b>] ? mp_config_acpi_gsi+0x81/0x89
 [<c04037bb>] ? do_invalid_op+0x0/0x7b
 [<fa6363a9>] ? iwl5000_rx_reply_tx+0x797/0x966 [iwlagn]
 [<c041d120>] ? check_preempt_wakeup+0xd3/0xec
 [<c0420a0d>] ? default_wake_function+0xb/0xd
 [<c0435766>] ? autoremove_wake_function+0xf/0x33
 [<fa6207cf>] iwl_rx_handle+0x1d6/0x322 [iwlagn]
 [<c043281c>] ? insert_work+0x78/0x81
 [<fa621860>] iwl_irq_tasklet+0x67b/0x862 [iwlagn]
 [<c042cad8>] ? run_timer_softirq+0x32/0x169
 [<c042926a>] tasklet_action+0x62/0x9f
 [<c0429968>] __do_softirq+0x88/0x12c
 [<c04298e0>] ? __do_softirq+0x0/0x12c
 <IRQ>  [<c042970b>] ? irq_exit+0x29/0x57
 [<c0403d11>] ? do_IRQ+0x6d/0x83
 [<c0402fc9>] ? common_interrupt+0x29/0x30
 [<c05525c3>] ? acpi_idle_enter_simple+0x132/0x15d
 [<c05c3472>] ? cpuidle_idle_call+0x60/0x92
 [<c0401c0d>] ? cpu_idle+0x44/0x5f
 [<c064792f>] ? rest_init+0x53/0x55
 [<c0786793>] ? start_kernel+0x298/0x29d
 [<c078606a>] ? __init_begin+0x6a/0x6f
------- Comment #21 From 2009-06-11 04:20:08 -------
Alright so I looked through this, and it seems to be two or three different
issues:

 1) the issue that apparently I fixed, where before disassoc we
    cleared the station table already in iwlwifi, because mac80211
    switched channel
 2) the issue that traffic stops after a while, which I can't
    really get any information about from what I have here, unless
    you can point me to the correct date/time in the log, the log
    from comment #19 seems to have a module reload at 22:13 which
    doesn't seem to match the "afternoon" you said? And anyway I
    don't see anything wrong before the reload.
 3) the fact that it crashes with this BUG_ON(), which seems to be
    related to a cleared station table too, since it's
    BUG_ON(txq_id != txq->swq_id) which probably means one of those
    was cleared to 0 again...

I'll look into 3 in this bug since it's probably related to 1, but can we
please have a separate bug report for the hang issue?
------- Comment #22 From 2009-06-11 04:33:43 -------
Can you try with this patch? It seems like something is getting confused about
the current aggregation status. Ximin, some more context in the log would also
be useful, i.e. what exactly happened as a result of the 'ap any',
debug50=0x4000003 would probably help too.

--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-5000.c    2009-06-11
13:28:25.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-5000.c    2009-06-11
13:31:13.000000000 +0200
@@ -1244,7 +1244,12 @@ static void iwl5000_rx_reply_tx(struct i
             }
         }
     } else {
-        BUG_ON(txq_id != txq->swq_id);
+        if (unlikely(txq_id != txq->swq_id)) {
+            printk(KERN_DEBUG "txq = %p\n", txq);
+            printk(KERN_DEBUG "txq_id = %d\n", txq_id);
+            printk(KERN_DEBUG "txq->swq_id = %d\n", txq->swq_id);
+            BUG_ON(1);
+        }

         info->status.rates[0].count = tx_resp->failure_frame + 1;
         info->flags |= iwl_is_tx_success(status) ?
--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-tx.c    2009-06-11
13:30:13.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-tx.c    2009-06-11
13:32:15.000000000 +0200
@@ -755,6 +755,7 @@ int iwl_tx_skb(struct iwl_priv *priv, st

     txq = &priv->txq[txq_id];
     q = &txq->q;
+    printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, swq_id);
     txq->swq_id = swq_id;

     spin_lock_irqsave(&priv->lock, flags);
------- Comment #23 From 2009-06-11 04:48:33 -------
The 22:13 timestamp from comment #19 is the correct one. Sorry for the
confusion, I run GMT, but live EDT =).

At the time of the reload I was unable to associate with any AP. It did however
return scan results.

I'm hoping to grab a more thorough debug next time it stops working.
------- Comment #24 From 2009-06-11 05:49:28 -------
(In reply to comment #23)
> The 22:13 timestamp from comment #19 is the correct one. Sorry for the
> confusion, I run GMT, but live EDT =).

Heh, ok.

> At the time of the reload I was unable to associate with any AP. It did however
> return scan results.
> 
> I'm hoping to grab a more thorough debug next time it stops working.

Not sure what would be helpful, it seems like transmit got stuck, somehow, so
scanning might still work because it's a firmware command.
------- Comment #25 From 2009-06-11 11:55:17 -------
Created an attachment (id=2035) [details]
syslog for first warn with new patch

I applied the last patch to the iwlwifi git. Attached is logs from the first
warning I received.
------- Comment #26 From 2009-06-11 13:10:23 -------
Created an attachment (id=2036) [details]
further syslog with data failure

data stopped work right after
Jun 11 19:59:30 t500pee kernel: Last message 'setting (2)0x0000000' repeated 84
times, supressed by syslog-ng on t500pee

there is a vsprintf warning from doing a cat on the iwlagn/data/* files in
debugfs

after a bit i did a rmmod/modprobe to reload the device
------- Comment #27 From 2009-06-11 13:18:50 -------
That should say i noticed the data stop right before, not after that syslog
entry. Or at least right around it =)
------- Comment #28 From 2009-06-11 13:39:51 -------
Ahrg, there's a bug in the patch, the last added line shouldn't be

+       printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, swq_id);

but rather

+       printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, txq,
swq_id);


in any case that was more meant for the problem 3) not 2).

I can't really see why tx would get stuck there.
------- Comment #29 From 2009-06-11 19:57:02 -------
Created an attachment (id=2037) [details]
debugging level 0x4000003 and "ap any" log

Hi Johannes,

Since our testing is based on iwlwifi-2.6 repo, I manual apply your patch on
the latest commit b126c042225506d2899fe1133d4e1b5474f21719 (2.6.30-wl). Still
see kernel panic after running ap any or auto then scp file. Please see
attached file.
------- Comment #30 From 2009-06-12 05:40:41 -------
(In reply to comment #29)

> Since our testing is based on iwlwifi-2.6 repo, I manual apply your patch on
> the latest commit b126c042225506d2899fe1133d4e1b5474f21719 (2.6.30-wl). Still
> see kernel panic after running ap any or auto then scp file. Please see
> attached file.

Hmm, thanks, but I don't see any of the messages the patch from comment #22 and
comment #28 added? I'll try to reproduce this next week.
------- Comment #31 From 2009-06-12 06:07:44 -------
Log with the fixed patch. Unfortunately i put my damp (from cold bottle of
water) palm on the fingerprint sensor and it got angry, so there are some USB
errors in there as well.

http://pee.erkkila.org/iwlagn.14.txt.gz
------- Comment #32 From 2009-06-12 07:14:30 -------
(In reply to comment #31)
> Log with the fixed patch. Unfortunately i put my damp (from cold bottle of
> water) palm on the fingerprint sensor and it got angry, so there are some USB
> errors in there as well.

Thanks, not much more useful than the log before though -- this patch was
mostly for debugging the crash, not the TX hang. I'll clone this bug for the TX
hang.
------- Comment #33 From 2009-06-12 07:16:48 -------
(In reply to comment #32)
> I'll clone this bug for the TX hang.

Actually, could you please do that? If I do, it'll be me as the reporter, which
is quite useless.
------- Comment #34 From 2009-06-12 12:15:30 -------
Created an attachment (id=2039) [details]
kernel crash with extra debugging

(In reply to comment #30)
> Hmm, thanks, but I don't see any of the messages the patch from comment #22 and
> comment #28 added? I'll try to reproduce this next week.

Johannes,

I reproduced the kernel crash on my end, with your extra debugging patches from
comment #22 and comment #28.  This kernel also includes your patch in comment
#14.

To give some context, there are a bunch of consecutive lines:

  setting (10)0xffff88012e18af58 swq_id to [2|170]

These were generated during the first scp (the one before running the "iwconfig
wlan0 ap auto" command; this scp is the one that works).
------- Comment #35 From 2009-06-12 14:21:25 -------
(In reply to comment #34)

Michael,

> I reproduced the kernel crash on my end, with your extra debugging patches from
> comment #22 and comment #28.  This kernel also includes your patch in comment
> #14.

Thanks.

> To give some context, there are a bunch of consecutive lines:
> 
>   setting (10)0xffff88012e18af58 swq_id to [2|170]
> 
> These were generated during the first scp (the one before running the "iwconfig
> wlan0 ap auto" command; this scp is the one that works).

Yes, that's what I wanted. The relevant txq is ffff88012e185cd8 though, not
ffff88012e18af58. This is what I see:
[ 1137.944459] setting (0)0xffff88012e185cd8 swq_id to 0
...
[ 1201.020135] setting (0)0xffff88012e185cd8 swq_id to 130
[ 1201.021838] txq = ffff88012e185cd8
[ 1201.023175] txq_id = 0
[ 1201.024447] txq->swq_id = 130
[ 1201.025744] ------------[ cut here ]------------
[ 1201.025828] kernel BUG at drivers/net/wireless/iwlwifi/iwl-5000.c:1251!


The "to 130" is very strange, this is 0x82 in hex which means we set it to AC 2
(best effort), aggregation, but hw queue 0, which really shouldn't be happening
since before it was 0, meaning AC 0.

I'll need to look into the code to understand what might be going on.
------- Comment #36 From 2009-06-13 06:44:27 -------
Hmm. The way we're setting txq->swq_id is a little strange and probably causes
the bug, that seems to be my fault. The patch I pasting should move it from the
TX path to the setup paths, but I haven't even had a chance to run it yet, only
compile tested.

--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-5000.c    2009-06-11
13:28:25.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-5000.c    2009-06-11
13:31:13.000000000 +0200
@@ -1244,7 +1244,12 @@ static void iwl5000_rx_reply_tx(struct i
             }
         }
     } else {
-        BUG_ON(txq_id != txq->swq_id);
+        if (unlikely(txq_id != txq->swq_id)) {
+            printk(KERN_DEBUG "txq = %p\n", txq);
+            printk(KERN_DEBUG "txq_id = %d\n", txq_id);
+            printk(KERN_DEBUG "txq->swq_id = %d\n", txq->swq_id);
+            BUG_ON(1);
+        }

         info->status.rates[0].count = tx_resp->failure_frame + 1;
         info->flags |= iwl_is_tx_success(status) ?
--- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-tx.c    2009-06-11
13:30:13.000000000 +0200
+++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-tx.c    2009-06-13
15:42:44.000000000 +0200
@@ -348,6 +348,13 @@ int iwl_tx_queue_init(struct iwl_priv *p

     txq->need_update = 0;

+    /* aggregation TX queues will get their ID when aggregation begins */
+    if (txq_id <= IWL_TX_FIFO_AC3) {
+        printk(KERN_DEBUG "setting 0x%p (%d) swq_id to %d\n",
+               txq, txq_id, txq_id);
+        txq->swq_id = txq_id;
+    }
+
     /* TFD_QUEUE_SIZE_MAX must be power-of-two size, otherwise
      * iwl_queue_inc_wrap and iwl_queue_dec_wrap are broken. */
     BUILD_BUG_ON(TFD_QUEUE_SIZE_MAX & (TFD_QUEUE_SIZE_MAX - 1));
@@ -734,8 +741,7 @@ int iwl_tx_skb(struct iwl_priv *priv, st

     IWL_DEBUG_TX(priv, "station Id %d\n", sta_id);

-    swq_id = skb_get_queue_mapping(skb);
-    txq_id = swq_id;
+    txq_id = skb_get_queue_mapping(skb);
     if (ieee80211_is_data_qos(fc)) {
         qc = ieee80211_get_qos_ctl(hdr);
         tid = qc[0] & IEEE80211_QOS_CTL_TID_MASK;
@@ -746,16 +752,14 @@ int iwl_tx_skb(struct iwl_priv *priv, st
         hdr->seq_ctrl |= cpu_to_le16(seq_number);
         seq_number += 0x10;
         /* aggregation is on for this <sta,tid> */
-        if (info->flags & IEEE80211_TX_CTL_AMPDU) {
+        if (info->flags & IEEE80211_TX_CTL_AMPDU)
             txq_id = priv->stations[sta_id].tid[tid].agg.txq_id;
-            swq_id = iwl_virtual_agg_queue_num(swq_id, txq_id);
-        }
         priv->stations[sta_id].tid[tid].tfds_in_queue++;
     }

     txq = &priv->txq[txq_id];
+    swq_id = txq->swq_id;
     q = &txq->q;
-    txq->swq_id = swq_id;

     spin_lock_irqsave(&priv->lock, flags);

@@ -1186,6 +1190,9 @@ int iwl_tx_agg_start(struct iwl_priv *pr
     tid_data = &priv->stations[sta_id].tid[tid];
     *ssn = SEQ_TO_SN(tid_data->seq_number);
     tid_data->agg.txq_id = txq_id;
+    priv->txq[txq_id].swq_id = iwl_virtual_agg_queue_num(tx_fifo, txq_id);
+    printk(KERN_DEBUG "setting 0x%p (%d) swq_id to %d\n",
+           &priv->txq[txq_id], txq_id, priv->txq[txq_id].swq_id);
     spin_unlock_irqrestore(&priv->sta_lock, flags);

     ret = priv->cfg->ops->lib->txq_agg_enable(priv, txq_id, tx_fifo,
------- Comment #37 From 2009-06-15 09:06:55 -------
Unfortunately, I cannot reproduce the bug. My 5350 card doesn't work at all,
and with 4965 I don't run into the problem, and I don't have a 5100/5300 to
test with.

I have, however, found a problem with "iwconfig wlan0 ap any", which I fixed --
please try this patch:
http://johannes.sipsolutions.net/patches/kernel/all/2009-06-15-16%3a03/012-mac80211-reassoc.patch
------- Comment #38 From 2009-06-15 09:15:19 -------
Sorry, please use this version without the WARN(), that situation can happen
legitimately.

http://johannes.sipsolutions.net/patches/kernel/all/2009-06-15-16%3a14/012-mac80211-reassoc.patch
------- Comment #39 From 2009-06-15 11:02:27 -------
Created an attachment (id=2047) [details]
new debug message without kernel crash

Johannes,

I just tested again with the new patch(es).  Specifically, I'm running with the
patches in comments #14, #36, and #38.  There was no kernel crash this time,
but the card/driver did disassociate from the AP as a result of running
'iwconfig wlan0 ap auto'.  I had to run 'iwconfig wlan0 essid <>' (a few times)
to re-associate.  In the log, I started running the essid command to
re-associate around time [465.915877].
------- Comment #40 From 2009-06-15 11:21:03 -------
Michael,

Thanks for testing. If you have a bit more time, I would appreciate a test run
without the patch from comment #36, I think that was a red herring. I'll send
that patch to Reinette anyway because it makes things more efficient.

I have also observed that it just disassociates when you run 'ap any', but in
my case it was due to the driver rejecting the scan, which doesn't seem to be
the case for you, at least I don't see log messages indicating that in
timestamp 376.* (which is where you issued the "ap any" command). Might be due
to you fixing the channel -- I didn't do that, will look into it.

Thanks!
------- Comment #41 From 2009-06-15 13:20:27 -------
Created an attachment (id=2048) [details]
debug log without patch from comment #36

> Thanks for testing. If you have a bit more time, I would appreciate a
> test run without the patch from comment #36, I think that was a red
> herring. I'll send that patch to Reinette anyway because it makes
> things more efficient.

Johannes,

I'm attaching a new debug log.  The result is similar to the previous
test (which had the patch from comment #36) in that there is no crash.

It still seems, though, that something isn't working quite right.  After
the "ap any" command, I need to manually re-associate.  Even then, I
get terrible performance (I even tried re-associating a few times to
different APs).  In each case, the ping times to the gateway router can
reach 1000ms, and scp won't progress at all.  Eventually, as you can
see towards the end of the log, I started getting "ERROR: No TX rate
available" errors and had to unload the driver.

I actually saw these "ERROR: No TX rate available" messages on the
previous run (with patch from #36), but had truncated the log before
that (thinking/hoping it was an unrelated and/or isolated incident).
Now that it's happened twice, I'm including it.

> I have also observed that it just disassociates when you run 'ap any',
> but in my case it was due to the driver rejecting the scan, which
> doesn't seem to be the case for you, at least I don't see log messages
> indicating that in timestamp 376.* (which is where you issued the "ap
> any" command). Might be due to you fixing the channel -- I didn't do
> that, will look into it.

I don't think I was doing anything to fix the channel.  I was fixing the
ESSID, but the ESSID I'm using is served by several APs on different
frequencies.  The card/driver usually connects to one of 3-4 different
APs (on different channels) when I associate.
------- Comment #42 From 2009-06-15 13:39:19 -------
(In reply to comment #41)
> I'm attaching a new debug log.  The result is similar to the previous
> test (which had the patch from comment #36) in that there is no crash.

Good.

> It still seems, though, that something isn't working quite right.  After
> the "ap any" command, I need to manually re-associate.  Even then, I
> get terrible performance (I even tried re-associating a few times to
> different APs).  In each case, the ping times to the gateway router can
> reach 1000ms, and scp won't progress at all.  Eventually, as you can
> see towards the end of the log, I started getting "ERROR: No TX rate
> available" errors and had to unload the driver.

This is odd, I'm not seeing that. What tree are you on, and does it have the
"revamp bss_info_changed" patch?

> I actually saw these "ERROR: No TX rate available" messages on the
> previous run (with patch from #36), but had truncated the log before
> that (thinking/hoping it was an unrelated and/or isolated incident).
> Now that it's happened twice, I'm including it.

Hmm. I have no idea what could be causing that.

> > I have also observed that it just disassociates when you run 'ap any',
> > but in my case it was due to the driver rejecting the scan, which
> > doesn't seem to be the case for you, at least I don't see log messages
> > indicating that in timestamp 376.* (which is where you issued the "ap
> > any" command). Might be due to you fixing the channel -- I didn't do
> > that, will look into it.
> 
> I don't think I was doing anything to fix the channel.  I was fixing the
> ESSID, but the ESSID I'm using is served by several APs on different
> frequencies.  The card/driver usually connects to one of 3-4 different
> APs (on different channels) when I associate.

Yes, sorry, Ximin was doing that, I got confused. Maybe the problem is that it
chooses a different AP and I have only one to choose from in my test. But the
entire failure mode is very strange.

In any case I think mac80211 is now behaving correctly so we probably should
look for the bug in the driver.

Also, I cannot reproduce the issue you have where it doesn't automatically
associated to an AP again, for me it always does unless scanning fails in the
driver for some reason.
------- Comment #43 From 2009-06-15 14:26:02 -------
(In reply to comment #42)
> This is odd, I'm not seeing that. What tree are you on, and does it have the
> "revamp bss_info_changed" patch?

I'm using the iwlwifi-2.6 git tree, fresh checkout from this morning (plus your
patches).  It should have "revamp bss_info_changed" patch.

> Hmm. I have no idea what could be causing that.

I'm not sure.  It always seems to pop up when I'm re-associating with an AP
running in the 5GHz band, but that could just be coincidence.

> In any case I think mac80211 is now behaving correctly so we probably should
> look for the bug in the driver.
> 
> Also, I cannot reproduce the issue you have where it doesn't automatically
> associated to an AP again, for me it always does unless scanning fails in the
> driver for some reason.

In general the entire driver state seems to be a bit unstable after the 'ap
any' command.   With the recent debugging, though, I haven't tried running for
an extended period of time without issuing that command.  I can see how that
goes.
------- Comment #44 From 2009-08-10 15:23:15 -------
The "No TX rate available" is a bogus sanity check in the driver (see also bug
2064).

Try removing that entire "if" statement calling ieee80211_get_tx_rate() in
iwl-tx.c.  No harm will be done.  A colleague is preparing a patch to do this,
but it's easy enough just to try on your own.

Also, just to be sure (unrelated to "No TX rate", but also easy to try), try
re-defining IWL_MAX_SCAN_SIZE from 1024 to 1280 ... depending on how many
channels you scan on 5 GHZ, the command may be overflowing and wreaking havoc.

-- Ben --
------- Comment #45 From 2009-08-13 09:08:19 -------
(In reply to comment #44)
> The "No TX rate available" is a bogus sanity check in the driver (see also bug
> 2064).
> 
> Try removing that entire "if" statement calling ieee80211_get_tx_rate() in
> iwl-tx.c.  No harm will be done.  A colleague is preparing a patch to do this,
> but it's easy enough just to try on your own.
> 
> Also, just to be sure (unrelated to "No TX rate", but also easy to try), try
> re-defining IWL_MAX_SCAN_SIZE from 1024 to 1280 ... depending on how many
> channels you scan on 5 GHZ, the command may be overflowing and wreaking havoc.

I made these changes, and I haven't seen any problems on this front so far.  I
also upgraded to a newer version of the git tree at the same time (from 2 days
ago: d229b682...2996a4dbe).

In general, I think that maybe we can close this bug.  The original problem, as
stated in the subject line (BUG_ON triggered by 'iwconfig ap any') seems to be
fixed now as far as I can tell  (at least in the version of iwlwifi-2.6.git
mentioned above).  And, it seems like the "No TX rate" issue is being tracked
by Bug #2064.
------- Comment #46 From 2010-02-09 09:55:18 -------
(In reply to comment #45)

> In general, I think that maybe we can close this bug.  The original problem, as
> stated in the subject line (BUG_ON triggered by 'iwconfig ap any') seems to be
> fixed now as far as I can tell  (at least in the version of iwlwifi-2.6.git
> mentioned above).  And, it seems like the "No TX rate" issue is being tracked
> by Bug #2064.

Closing based on above comment.