Bugzilla – Bug 2015
iwlwifi hits BUG_ON when forced to roam ("iwconfig ap any")
Last modified: 2010-02-09 18:39:44
You need to log in before you can comment on or make changes to this bug.
Created an attachment (id=2022) [details] dmesg output After running for a while i start seeing "queue full" messages from pings to my default gateway. I've attached dmesg including the warning. If i then do "iwconfig wlan0 ap auto" it causes my machine to hard lock. If i ifconfig down wlan0, rmmod, and modprobe to restart it seems ok.
Created an attachment (id=2023) [details] dmesg from a similar failure later on This is the dmesg from the same boot but later in the day after another failure.
Wey-Yi, This looks like the problem you are working on.
(In reply to comment #1) > Created an attachment (id=2023) [details] [details] > dmesg from a similar failure later on > This is the dmesg from the same boot but later in the day after another > failure. Looks like mac80211 issue IEEE80211_AMPDU_TX_STOP to the same tid more than once, could you re-run the test with debug50=0x4000000. thanks
Created an attachment (id=2026) [details] dmesg output with debug enabled Here is the dmesg with debug output included. It includes my ifconfig down of the interface and rmmod/modprobe at the end.
Can you please try this with this patch, and debug50=0x4001003 The patch will cause a lot of dump_stack() but I need that info. --- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-sta.c 2009-06-09 18:02:22.000000000 +0200 +++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-sta.c 2009-06-09 18:02:34.000000000 +0200 @@ -480,6 +480,9 @@ void iwl_clear_stations_table(struct iwl unsigned long flags; int i; + printk(KERN_DEBUG "iwl_clear_stations_table\n"); + dump_stack(); + spin_lock_irqsave(&priv->sta_lock, flags); if (iwl_is_alive(priv) &&
Created an attachment (id=2027) [details] dmesg output with patch + debug Here is the output you requested
Thanks, that helps a lot. Can you, leaving all this in place, check if you can reproduce the problem with this small change: --- wireless-testing.orig/net/mac80211/util.c 2009-06-09 21:43:02.000000000 +0200 +++ wireless-testing/net/mac80211/util.c 2009-06-09 21:43:07.000000000 +0200 @@ -773,8 +773,7 @@ int ieee80211_set_freq(struct ieee80211_ if (local->sw_scanning || local->hw_scanning) ret = 0; else - ret = ieee80211_hw_config( - local, IEEE80211_CONF_CHANGE_CHANNEL); + ret = ieee80211_hw_config(local, 0); } return ret; or were you manually changing the channel while associated?
(In reply to comment #7) > Thanks, that helps a lot. Can you, leaving all this in place, check if you can > reproduce the problem with this small change: > --- wireless-testing.orig/net/mac80211/util.c 2009-06-09 21:43:02.000000000 > +0200 > +++ wireless-testing/net/mac80211/util.c 2009-06-09 21:43:07.000000000 > +0200 > @@ -773,8 +773,7 @@ int ieee80211_set_freq(struct ieee80211_ > if (local->sw_scanning || local->hw_scanning) > ret = 0; > else > - ret = ieee80211_hw_config( > - local, IEEE80211_CONF_CHANGE_CHANNEL); > + ret = ieee80211_hw_config(local, 0); > } > return ret; > or were you manually changing the channel while associated? looks like something changed in hw_config and cause the station table get reset. is this possible caused by "iwconfig wlan0 ap auto"?
(In reply to comment #8) > looks like something changed in hw_config and cause the station table get > reset. is this possible caused by "iwconfig wlan0 ap auto"? The stack trace quite clearly says: iwl_clear_stations_table Pid: 4502, comm: wpa_supplicant Not tainted 2.6.30-rc8-wl-41696-g7a35720-dirty #5 Call Trace: [<ffffffff804d9386>] iwl_clear_stations_table+0x2d/0xef [<ffffffffa000868e>] iwl_commit_rxon+0x2a6/0x460 [iwlagn] [<ffffffff804c7ef7>] iwl_mac_config+0x48b/0x573 [<ffffffff806be93e>] ieee80211_hw_config+0x7a/0x80 [<ffffffff806d45d4>] ieee80211_set_freq+0x67/0x77 [<ffffffff806c081b>] ieee80211_ioctl_siwfreq+0xc7/0xcc so it must be setting the frequency... which the patch above should catch unless you're actually moving to a different channel...which would be weird.
Actually, right after we get a new SSID too, so it seems wpa_supplicant or maybe NM is just trying to connect to another AP.
Created an attachment (id=2028) [details] syslog I'm not sure if you want me to run with that patch or not =). Please let me know Attached syslog with NetworkMangler logs in it from the same time
Sorry, no, that patch won't really help. I'm developing another one, give me a few minutes to test it at least once.
Created an attachment (id=2029) [details] patch to fix the problem Can you please try this patch? It has fixed the problem for me.
Created an attachment (id=2030) [details] patch to fix the problem sorry, that has a bug that could make it crash, this is better
I just confirmed that "iwconfig wlan0 ap auto" also causes my machine to hard lock-up. This seems to happen even if I haven't seen the kernel "WARNING". Johannes--I just tested your latest patch (comment #14), and it still results in a hard lock-up when running the above command. Anything else to try? Michael
(In reply to comment #15) > I just confirmed that "iwconfig wlan0 ap auto" also causes my machine to hard > lock-up. This seems to happen even if I haven't seen the kernel "WARNING". Can you explain in a little more detail how to trigger this? Is that with first connecting with wpa_supplicant? Or just on an existing connection brought up by iwconfig? Or without any connection at all?
Also, how hard is the hard lockup? If you switch out of X first, do you see any bug message on it?
Created an attachment (id=2032) [details] screenshot of kernel panic I dropped to console first was able to take a digital photo of the kernel panic. Here's what I did to produce the crash (very repeatable): - unload iwlagn kernel module - turn off network-manager, etc. - load kernel module - ifconfig wlan0 up - iwconfig wlan0 essid CMU - dhclient wlan0 - start an scp [works fine; kill after a few second] [note: once transfer starts, the kernel debug messages show that aggregation gets turned on] - iwconfig wlan0 ap auto - start another scp [immediate kernel panic--see screenshot]
Created an attachment (id=2033) [details] messages from the last 24hrs or so with patch Here is my /var/log/messages since boot with the patch. It ran fine until this afternoon sometime when I stopped moving packets. Unfortunately I wasn't at the box when it happened. Hopefully there is something useful in there. I tried the iwconfig wlan0 ap auto , no crash. I ended up rmmod/modprobe to restore connectivity.
Created an attachment (id=2034) [details] debugging level is 0x4000000 I try to reproduce it on our testing environment. It's easy to reproduce on 5100/5300. Testing Environment ======================================== Platform : Intel SDV M3M41 Wireless Card : Intel(R) WiFi Link 5100/5300 OS : Redhat Fedora release 10 (Cambridge) 32/64bit AP : Cisco 1250 AP Channel Width: 20/Below 40/Above40 MHz uCode : iwlwifi-5000-2.ucode 8.24.2.12 Source : commit 7a35720d1ba0dae1e7994cc388014c9364a2b382 (2.6.30-rc8-wl) Steps to Reproduce ======================================== 1. modprobe -r iwlagn 2. modprobe iwlagn debug50=0x4000000 (Note: if load module with "11n_disable50=1", didn't see kernel panic) 3. ifconfig wlan0 up 4. iwconfig wlan0 channel <> essid <> 5. ifconfig wlan0 IP_ADDR 6. scp file to server (300Mfile) 7. iwconfig wlan0 ap auto (this step is necessary) 8. scp again leading to kernel panic Call Trace message ======================================== kernel BUG at drivers/net/wireless/iwlwifi/iwl-5000.c:1247! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/class/power_supply/BAT0/energy_full Modules linked in: iwlagn iwlcore mac80211 cfg80211 rfkill netconsole sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_region_hash dm_log dm_multipath dm_mod uinput arc4 ecb e1000 i2c_i801 i2c_core ac pcspkr video output battery button sr_mod cdrom sg ata_generic ata_piix libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: rfkill] Pid: 0, comm: swapper Not tainted (2.6.30-rc8-wl #3) Montevina platform EIP: 0060:[<fa6363a9>] EFLAGS: 00010097 CPU: 0 EIP is at iwl5000_rx_reply_tx+0x797/0x966 [iwlagn] EAX: 00000082 EBX: f591e9b4 ECX: 00000000 EDX: f5324008 ESI: f6af09a0 EDI: f591e9f8 EBP: c1fdbf2c ESP: c1fdbebc DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c1fdb000 task=c07402fc task.ti=c077f000) Stack: c07d8460 f6af09a0 00000100 c041d120 f5324000 00000000 0000000a f591e9d4 f5324008 00000201 00000000 00000000 00000002 00000000 c1fe1460 00000082 f71a3fb0 c1fe4ae4 00000001 c1fdbf10 c0420a0d c1fdbf24 c0435766 00000000 Call Trace: [<c041d120>] ? check_preempt_wakeup+0xd3/0xec [<c0420a0d>] ? default_wake_function+0xb/0xd [<c0435766>] ? autoremove_wake_function+0xf/0x33 [<fa6207cf>] ? iwl_rx_handle+0x1d6/0x322 [iwlagn] [<c043281c>] ? insert_work+0x78/0x81 [<fa621860>] ? iwl_irq_tasklet+0x67b/0x862 [iwlagn] [<c042cad8>] ? run_timer_softirq+0x32/0x169 [<c042926a>] ? tasklet_action+0x62/0x9f [<c0429968>] ? __do_softirq+0x88/0x12c [<c04298e0>] ? __do_softirq+0x0/0x12c <IRQ> <0> [<c042970b>] ? irq_exit+0x29/0x57 [<c0403d11>] ? do_IRQ+0x6d/0x83 [<c0402fc9>] ? common_interrupt+0x29/0x30 [<c05525c3>] ? acpi_idle_enter_simple+0x132/0x15d [<c05c3472>] ? cpuidle_idle_call+0x60/0x92 [<c0401c0d>] ? cpu_idle+0x44/0x5f [<c064792f>] ? rest_init+0x53/0x55 [<c0786793>] ? start_kernel+0x298/0x29d [<c078606a>] ? __init_begin+0x6a/0x6f Code: 94 88 1c b6 00 00 83 c8 ff f0 0f c1 02 48 85 c0 0f 8f 55 01 00 00 89 ca 8b 4d 94 8b 01 e9 44 01 00 00 0f b6 40 17 39 45 a4 74 04 <0f> 0b eb fe 8b 55 b0 8b 4d ac 8a 42 03 40 88 41 09 0f b6 45 b4 EIP: [<fa6363a9>] iwl5000_rx_reply_tx+0x797/0x966 [iwlagn] SS:ESP 0068:c1fdbebc ---[ end trace dda8d7c20ae36f0a ]--- Kernel panic - not syncing: Fatal exception in interrupt Pid: 0, comm: swapper Tainted: G D 2.6.30-rc8-wl #3 Call Trace: [<c065689c>] ? printk+0xf/0x13 [<c06567f2>] panic+0x39/0xd4 [<c06595b3>] oops_end+0x8c/0x9b [<c040503e>] die+0x54/0x5a [<c0658f47>] do_trap+0x89/0xa2 [<c04037bb>] ? do_invalid_op+0x0/0x7b [<c040382c>] do_invalid_op+0x71/0x7b [<fa6363a9>] ? iwl5000_rx_reply_tx+0x797/0x966 [iwlagn] [<c041c2af>] ? enqueue_task+0x68/0x74 [<c0504f89>] ? cpumask_next_and+0x26/0x37 [<c041fca8>] ? find_busiest_group+0x282/0x876 [<c0420a0d>] ? default_wake_function+0xb/0xd [<c0658d05>] error_code+0x6d/0x74 [<c041007b>] ? mp_config_acpi_gsi+0x81/0x89 [<c04037bb>] ? do_invalid_op+0x0/0x7b [<fa6363a9>] ? iwl5000_rx_reply_tx+0x797/0x966 [iwlagn] [<c041d120>] ? check_preempt_wakeup+0xd3/0xec [<c0420a0d>] ? default_wake_function+0xb/0xd [<c0435766>] ? autoremove_wake_function+0xf/0x33 [<fa6207cf>] iwl_rx_handle+0x1d6/0x322 [iwlagn] [<c043281c>] ? insert_work+0x78/0x81 [<fa621860>] iwl_irq_tasklet+0x67b/0x862 [iwlagn] [<c042cad8>] ? run_timer_softirq+0x32/0x169 [<c042926a>] tasklet_action+0x62/0x9f [<c0429968>] __do_softirq+0x88/0x12c [<c04298e0>] ? __do_softirq+0x0/0x12c <IRQ> [<c042970b>] ? irq_exit+0x29/0x57 [<c0403d11>] ? do_IRQ+0x6d/0x83 [<c0402fc9>] ? common_interrupt+0x29/0x30 [<c05525c3>] ? acpi_idle_enter_simple+0x132/0x15d [<c05c3472>] ? cpuidle_idle_call+0x60/0x92 [<c0401c0d>] ? cpu_idle+0x44/0x5f [<c064792f>] ? rest_init+0x53/0x55 [<c0786793>] ? start_kernel+0x298/0x29d [<c078606a>] ? __init_begin+0x6a/0x6f
Alright so I looked through this, and it seems to be two or three different issues: 1) the issue that apparently I fixed, where before disassoc we cleared the station table already in iwlwifi, because mac80211 switched channel 2) the issue that traffic stops after a while, which I can't really get any information about from what I have here, unless you can point me to the correct date/time in the log, the log from comment #19 seems to have a module reload at 22:13 which doesn't seem to match the "afternoon" you said? And anyway I don't see anything wrong before the reload. 3) the fact that it crashes with this BUG_ON(), which seems to be related to a cleared station table too, since it's BUG_ON(txq_id != txq->swq_id) which probably means one of those was cleared to 0 again... I'll look into 3 in this bug since it's probably related to 1, but can we please have a separate bug report for the hang issue?
Can you try with this patch? It seems like something is getting confused about the current aggregation status. Ximin, some more context in the log would also be useful, i.e. what exactly happened as a result of the 'ap any', debug50=0x4000003 would probably help too. --- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-5000.c 2009-06-11 13:28:25.000000000 +0200 +++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-5000.c 2009-06-11 13:31:13.000000000 +0200 @@ -1244,7 +1244,12 @@ static void iwl5000_rx_reply_tx(struct i } } } else { - BUG_ON(txq_id != txq->swq_id); + if (unlikely(txq_id != txq->swq_id)) { + printk(KERN_DEBUG "txq = %p\n", txq); + printk(KERN_DEBUG "txq_id = %d\n", txq_id); + printk(KERN_DEBUG "txq->swq_id = %d\n", txq->swq_id); + BUG_ON(1); + } info->status.rates[0].count = tx_resp->failure_frame + 1; info->flags |= iwl_is_tx_success(status) ? --- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-tx.c 2009-06-11 13:30:13.000000000 +0200 +++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-tx.c 2009-06-11 13:32:15.000000000 +0200 @@ -755,6 +755,7 @@ int iwl_tx_skb(struct iwl_priv *priv, st txq = &priv->txq[txq_id]; q = &txq->q; + printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, swq_id); txq->swq_id = swq_id; spin_lock_irqsave(&priv->lock, flags);
The 22:13 timestamp from comment #19 is the correct one. Sorry for the confusion, I run GMT, but live EDT =). At the time of the reload I was unable to associate with any AP. It did however return scan results. I'm hoping to grab a more thorough debug next time it stops working.
(In reply to comment #23) > The 22:13 timestamp from comment #19 is the correct one. Sorry for the > confusion, I run GMT, but live EDT =). Heh, ok. > At the time of the reload I was unable to associate with any AP. It did however > return scan results. > > I'm hoping to grab a more thorough debug next time it stops working. Not sure what would be helpful, it seems like transmit got stuck, somehow, so scanning might still work because it's a firmware command.
Created an attachment (id=2035) [details] syslog for first warn with new patch I applied the last patch to the iwlwifi git. Attached is logs from the first warning I received.
Created an attachment (id=2036) [details] further syslog with data failure data stopped work right after Jun 11 19:59:30 t500pee kernel: Last message 'setting (2)0x0000000' repeated 84 times, supressed by syslog-ng on t500pee there is a vsprintf warning from doing a cat on the iwlagn/data/* files in debugfs after a bit i did a rmmod/modprobe to reload the device
That should say i noticed the data stop right before, not after that syslog entry. Or at least right around it =)
Ahrg, there's a bug in the patch, the last added line shouldn't be + printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, swq_id); but rather + printk(KERN_DEBUG "setting (%d)0x%p swq_id to %d\n", txq_id, txq, swq_id); in any case that was more meant for the problem 3) not 2). I can't really see why tx would get stuck there.
Created an attachment (id=2037) [details] debugging level 0x4000003 and "ap any" log Hi Johannes, Since our testing is based on iwlwifi-2.6 repo, I manual apply your patch on the latest commit b126c042225506d2899fe1133d4e1b5474f21719 (2.6.30-wl). Still see kernel panic after running ap any or auto then scp file. Please see attached file.
(In reply to comment #29) > Since our testing is based on iwlwifi-2.6 repo, I manual apply your patch on > the latest commit b126c042225506d2899fe1133d4e1b5474f21719 (2.6.30-wl). Still > see kernel panic after running ap any or auto then scp file. Please see > attached file. Hmm, thanks, but I don't see any of the messages the patch from comment #22 and comment #28 added? I'll try to reproduce this next week.
Log with the fixed patch. Unfortunately i put my damp (from cold bottle of water) palm on the fingerprint sensor and it got angry, so there are some USB errors in there as well. http://pee.erkkila.org/iwlagn.14.txt.gz
(In reply to comment #31) > Log with the fixed patch. Unfortunately i put my damp (from cold bottle of > water) palm on the fingerprint sensor and it got angry, so there are some USB > errors in there as well. Thanks, not much more useful than the log before though -- this patch was mostly for debugging the crash, not the TX hang. I'll clone this bug for the TX hang.
(In reply to comment #32) > I'll clone this bug for the TX hang. Actually, could you please do that? If I do, it'll be me as the reporter, which is quite useless.
Created an attachment (id=2039) [details] kernel crash with extra debugging (In reply to comment #30) > Hmm, thanks, but I don't see any of the messages the patch from comment #22 and > comment #28 added? I'll try to reproduce this next week. Johannes, I reproduced the kernel crash on my end, with your extra debugging patches from comment #22 and comment #28. This kernel also includes your patch in comment #14. To give some context, there are a bunch of consecutive lines: setting (10)0xffff88012e18af58 swq_id to [2|170] These were generated during the first scp (the one before running the "iwconfig wlan0 ap auto" command; this scp is the one that works).
(In reply to comment #34) Michael, > I reproduced the kernel crash on my end, with your extra debugging patches from > comment #22 and comment #28. This kernel also includes your patch in comment > #14. Thanks. > To give some context, there are a bunch of consecutive lines: > > setting (10)0xffff88012e18af58 swq_id to [2|170] > > These were generated during the first scp (the one before running the "iwconfig > wlan0 ap auto" command; this scp is the one that works). Yes, that's what I wanted. The relevant txq is ffff88012e185cd8 though, not ffff88012e18af58. This is what I see: [ 1137.944459] setting (0)0xffff88012e185cd8 swq_id to 0 ... [ 1201.020135] setting (0)0xffff88012e185cd8 swq_id to 130 [ 1201.021838] txq = ffff88012e185cd8 [ 1201.023175] txq_id = 0 [ 1201.024447] txq->swq_id = 130 [ 1201.025744] ------------[ cut here ]------------ [ 1201.025828] kernel BUG at drivers/net/wireless/iwlwifi/iwl-5000.c:1251! The "to 130" is very strange, this is 0x82 in hex which means we set it to AC 2 (best effort), aggregation, but hw queue 0, which really shouldn't be happening since before it was 0, meaning AC 0. I'll need to look into the code to understand what might be going on.
Hmm. The way we're setting txq->swq_id is a little strange and probably causes the bug, that seems to be my fault. The patch I pasting should move it from the TX path to the setup paths, but I haven't even had a chance to run it yet, only compile tested. --- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-5000.c 2009-06-11 13:28:25.000000000 +0200 +++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-5000.c 2009-06-11 13:31:13.000000000 +0200 @@ -1244,7 +1244,12 @@ static void iwl5000_rx_reply_tx(struct i } } } else { - BUG_ON(txq_id != txq->swq_id); + if (unlikely(txq_id != txq->swq_id)) { + printk(KERN_DEBUG "txq = %p\n", txq); + printk(KERN_DEBUG "txq_id = %d\n", txq_id); + printk(KERN_DEBUG "txq->swq_id = %d\n", txq->swq_id); + BUG_ON(1); + } info->status.rates[0].count = tx_resp->failure_frame + 1; info->flags |= iwl_is_tx_success(status) ? --- wireless-testing.orig/drivers/net/wireless/iwlwifi/iwl-tx.c 2009-06-11 13:30:13.000000000 +0200 +++ wireless-testing/drivers/net/wireless/iwlwifi/iwl-tx.c 2009-06-13 15:42:44.000000000 +0200 @@ -348,6 +348,13 @@ int iwl_tx_queue_init(struct iwl_priv *p txq->need_update = 0; + /* aggregation TX queues will get their ID when aggregation begins */ + if (txq_id <= IWL_TX_FIFO_AC3) { + printk(KERN_DEBUG "setting 0x%p (%d) swq_id to %d\n", + txq, txq_id, txq_id); + txq->swq_id = txq_id; + } + /* TFD_QUEUE_SIZE_MAX must be power-of-two size, otherwise * iwl_queue_inc_wrap and iwl_queue_dec_wrap are broken. */ BUILD_BUG_ON(TFD_QUEUE_SIZE_MAX & (TFD_QUEUE_SIZE_MAX - 1)); @@ -734,8 +741,7 @@ int iwl_tx_skb(struct iwl_priv *priv, st IWL_DEBUG_TX(priv, "station Id %d\n", sta_id); - swq_id = skb_get_queue_mapping(skb); - txq_id = swq_id; + txq_id = skb_get_queue_mapping(skb); if (ieee80211_is_data_qos(fc)) { qc = ieee80211_get_qos_ctl(hdr); tid = qc[0] & IEEE80211_QOS_CTL_TID_MASK; @@ -746,16 +752,14 @@ int iwl_tx_skb(struct iwl_priv *priv, st hdr->seq_ctrl |= cpu_to_le16(seq_number); seq_number += 0x10; /* aggregation is on for this <sta,tid> */ - if (info->flags & IEEE80211_TX_CTL_AMPDU) { + if (info->flags & IEEE80211_TX_CTL_AMPDU) txq_id = priv->stations[sta_id].tid[tid].agg.txq_id; - swq_id = iwl_virtual_agg_queue_num(swq_id, txq_id); - } priv->stations[sta_id].tid[tid].tfds_in_queue++; } txq = &priv->txq[txq_id]; + swq_id = txq->swq_id; q = &txq->q; - txq->swq_id = swq_id; spin_lock_irqsave(&priv->lock, flags); @@ -1186,6 +1190,9 @@ int iwl_tx_agg_start(struct iwl_priv *pr tid_data = &priv->stations[sta_id].tid[tid]; *ssn = SEQ_TO_SN(tid_data->seq_number); tid_data->agg.txq_id = txq_id; + priv->txq[txq_id].swq_id = iwl_virtual_agg_queue_num(tx_fifo, txq_id); + printk(KERN_DEBUG "setting 0x%p (%d) swq_id to %d\n", + &priv->txq[txq_id], txq_id, priv->txq[txq_id].swq_id); spin_unlock_irqrestore(&priv->sta_lock, flags); ret = priv->cfg->ops->lib->txq_agg_enable(priv, txq_id, tx_fifo,
Unfortunately, I cannot reproduce the bug. My 5350 card doesn't work at all, and with 4965 I don't run into the problem, and I don't have a 5100/5300 to test with. I have, however, found a problem with "iwconfig wlan0 ap any", which I fixed -- please try this patch: http://johannes.sipsolutions.net/patches/kernel/all/2009-06-15-16%3a03/012-mac80211-reassoc.patch
Sorry, please use this version without the WARN(), that situation can happen legitimately. http://johannes.sipsolutions.net/patches/kernel/all/2009-06-15-16%3a14/012-mac80211-reassoc.patch
Created an attachment (id=2047) [details] new debug message without kernel crash Johannes, I just tested again with the new patch(es). Specifically, I'm running with the patches in comments #14, #36, and #38. There was no kernel crash this time, but the card/driver did disassociate from the AP as a result of running 'iwconfig wlan0 ap auto'. I had to run 'iwconfig wlan0 essid <>' (a few times) to re-associate. In the log, I started running the essid command to re-associate around time [465.915877].
Michael, Thanks for testing. If you have a bit more time, I would appreciate a test run without the patch from comment #36, I think that was a red herring. I'll send that patch to Reinette anyway because it makes things more efficient. I have also observed that it just disassociates when you run 'ap any', but in my case it was due to the driver rejecting the scan, which doesn't seem to be the case for you, at least I don't see log messages indicating that in timestamp 376.* (which is where you issued the "ap any" command). Might be due to you fixing the channel -- I didn't do that, will look into it. Thanks!
Created an attachment (id=2048) [details] debug log without patch from comment #36 > Thanks for testing. If you have a bit more time, I would appreciate a > test run without the patch from comment #36, I think that was a red > herring. I'll send that patch to Reinette anyway because it makes > things more efficient. Johannes, I'm attaching a new debug log. The result is similar to the previous test (which had the patch from comment #36) in that there is no crash. It still seems, though, that something isn't working quite right. After the "ap any" command, I need to manually re-associate. Even then, I get terrible performance (I even tried re-associating a few times to different APs). In each case, the ping times to the gateway router can reach 1000ms, and scp won't progress at all. Eventually, as you can see towards the end of the log, I started getting "ERROR: No TX rate available" errors and had to unload the driver. I actually saw these "ERROR: No TX rate available" messages on the previous run (with patch from #36), but had truncated the log before that (thinking/hoping it was an unrelated and/or isolated incident). Now that it's happened twice, I'm including it. > I have also observed that it just disassociates when you run 'ap any', > but in my case it was due to the driver rejecting the scan, which > doesn't seem to be the case for you, at least I don't see log messages > indicating that in timestamp 376.* (which is where you issued the "ap > any" command). Might be due to you fixing the channel -- I didn't do > that, will look into it. I don't think I was doing anything to fix the channel. I was fixing the ESSID, but the ESSID I'm using is served by several APs on different frequencies. The card/driver usually connects to one of 3-4 different APs (on different channels) when I associate.
(In reply to comment #41) > I'm attaching a new debug log. The result is similar to the previous > test (which had the patch from comment #36) in that there is no crash. Good. > It still seems, though, that something isn't working quite right. After > the "ap any" command, I need to manually re-associate. Even then, I > get terrible performance (I even tried re-associating a few times to > different APs). In each case, the ping times to the gateway router can > reach 1000ms, and scp won't progress at all. Eventually, as you can > see towards the end of the log, I started getting "ERROR: No TX rate > available" errors and had to unload the driver. This is odd, I'm not seeing that. What tree are you on, and does it have the "revamp bss_info_changed" patch? > I actually saw these "ERROR: No TX rate available" messages on the > previous run (with patch from #36), but had truncated the log before > that (thinking/hoping it was an unrelated and/or isolated incident). > Now that it's happened twice, I'm including it. Hmm. I have no idea what could be causing that. > > I have also observed that it just disassociates when you run 'ap any', > > but in my case it was due to the driver rejecting the scan, which > > doesn't seem to be the case for you, at least I don't see log messages > > indicating that in timestamp 376.* (which is where you issued the "ap > > any" command). Might be due to you fixing the channel -- I didn't do > > that, will look into it. > > I don't think I was doing anything to fix the channel. I was fixing the > ESSID, but the ESSID I'm using is served by several APs on different > frequencies. The card/driver usually connects to one of 3-4 different > APs (on different channels) when I associate. Yes, sorry, Ximin was doing that, I got confused. Maybe the problem is that it chooses a different AP and I have only one to choose from in my test. But the entire failure mode is very strange. In any case I think mac80211 is now behaving correctly so we probably should look for the bug in the driver. Also, I cannot reproduce the issue you have where it doesn't automatically associated to an AP again, for me it always does unless scanning fails in the driver for some reason.
(In reply to comment #42) > This is odd, I'm not seeing that. What tree are you on, and does it have the > "revamp bss_info_changed" patch? I'm using the iwlwifi-2.6 git tree, fresh checkout from this morning (plus your patches). It should have "revamp bss_info_changed" patch. > Hmm. I have no idea what could be causing that. I'm not sure. It always seems to pop up when I'm re-associating with an AP running in the 5GHz band, but that could just be coincidence. > In any case I think mac80211 is now behaving correctly so we probably should > look for the bug in the driver. > > Also, I cannot reproduce the issue you have where it doesn't automatically > associated to an AP again, for me it always does unless scanning fails in the > driver for some reason. In general the entire driver state seems to be a bit unstable after the 'ap any' command. With the recent debugging, though, I haven't tried running for an extended period of time without issuing that command. I can see how that goes.
The "No TX rate available" is a bogus sanity check in the driver (see also bug 2064). Try removing that entire "if" statement calling ieee80211_get_tx_rate() in iwl-tx.c. No harm will be done. A colleague is preparing a patch to do this, but it's easy enough just to try on your own. Also, just to be sure (unrelated to "No TX rate", but also easy to try), try re-defining IWL_MAX_SCAN_SIZE from 1024 to 1280 ... depending on how many channels you scan on 5 GHZ, the command may be overflowing and wreaking havoc. -- Ben --
(In reply to comment #44) > The "No TX rate available" is a bogus sanity check in the driver (see also bug > 2064). > > Try removing that entire "if" statement calling ieee80211_get_tx_rate() in > iwl-tx.c. No harm will be done. A colleague is preparing a patch to do this, > but it's easy enough just to try on your own. > > Also, just to be sure (unrelated to "No TX rate", but also easy to try), try > re-defining IWL_MAX_SCAN_SIZE from 1024 to 1280 ... depending on how many > channels you scan on 5 GHZ, the command may be overflowing and wreaking havoc. I made these changes, and I haven't seen any problems on this front so far. I also upgraded to a newer version of the git tree at the same time (from 2 days ago: d229b682...2996a4dbe). In general, I think that maybe we can close this bug. The original problem, as stated in the subject line (BUG_ON triggered by 'iwconfig ap any') seems to be fixed now as far as I can tell (at least in the version of iwlwifi-2.6.git mentioned above). And, it seems like the "No TX rate" issue is being tracked by Bug #2064.
(In reply to comment #45) > In general, I think that maybe we can close this bug. The original problem, as > stated in the subject line (BUG_ON triggered by 'iwconfig ap any') seems to be > fixed now as far as I can tell (at least in the version of iwlwifi-2.6.git > mentioned above). And, it seems like the "No TX rate" issue is being tracked > by Bug #2064. Closing based on above comment.