fix: Optimize voice interaction pipeline

1. register_speaker_node: Enable AEC to match main node for better SV accuracy.
2. tts/dashscope: Fix ffmpeg argument order (input option thread_queue_size).
3. asr/dashscope: Keep WebSocket connection alive to reduce latency.
4. speaker_verifier: Force single-thread inference to avoid CPU contention.
merge develop features
2026-01-19 16:17:27 +08:00 · 2026-01-19 14:32:57 +08:00 · 2026-01-19 14:21:06 +08:00 · 2026-01-19 13:31:49 +08:00 · 2026-01-19 11:35:01 +08:00 · 2026-01-19 09:58:40 +08:00
39 changed files with 4665 additions and 65 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,6 @@
+build/
+install/
+log/
+__pycache__/
+*.pyc
+*.egg-info/
--- a/README.md
+++ b/README.md
@@ -1,2 +1,104 @@
-# hivecore_robot_voice
+# ROS Voice Package (robot_speaker)
+
+## Register with Alibaba Cloud Bailian to get an api_key
+https://bailian.console.aliyun.com/?tab=model#/api-key
+-> Key management
+Put the key in config/voice.yaml
+
+## Install dependencies
+1. System dependencies
+```bash
+sudo apt-get update
+sudo apt-get install -y python3-pyaudio portaudio19-dev alsa-utils ffmpeg swig meson ninja-build build-essential pkg-config libwebrtc-audio-processing-dev
+```
+
+2. Python dependencies
+```bash
+cd ~/ros_learn/hivecore_robot_voice
+# On Python 3.10, install aec-audio-processing separately to bypass the version check
+pip3 install aec-audio-processing --no-binary :all: --ignore-requires-python --break-system-packages
+pip3 install -r requirements.txt --break-system-packages
+```
+
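+## Build and launch
+1. Register a voiceprint
+  - After starting the node, you can say: 二狗今天天气真好开始注册声纹 ("Er Gou, the weather is really nice today, start voiceprint registration")
+  - Recommended ways to register:
+    Method A (recommended): pause briefly after the wake word, then say one long sentence.
+    User: "二狗" ("Er Gou")
+    Robot: (the log reports that it is waiting for voiceprint audio)
+    User: "我现在正在注册声纹，这是一段很长的测试语音，请把我的声音录进去。" ("I am registering my voiceprint now; this is a long test utterance, please record my voice.") (keep speaking for 3-5 seconds)
+    Method B (continuous): say one very long sentence in a single breath.
+    User: "二狗你好，我是你的主人，请记住我的声音，这是一段用来注册的长语音。" ("Hello Er Gou, I am your owner, please remember my voice; this is a long utterance for registration.")
+  - Note: include the wake word, do not pause mid-sentence, and try to speak for more than 1.5 seconds
+```bash
+cd ~/ros_learn/hivecore_robot_voice
+colcon build
+source install/setup.bash
+ros2 run robot_speaker register_speaker_node
+```
+
+2. Main node
+  - After the node starts, every utterance must contain the wake word, with no pause between the wake word and the rest of the sentence
+  - Say 二狗拍照看看 ("Er Gou, take a photo and have a look") to start image-and-text interaction
+  - Users with a registered voiceprint can interrupt playback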
+```bash
+cd ~/ros_learn/hivecore_robot_voice
+colcon build
+source install/setup.bash
+ros2 launch robot_speaker voice.launch.py
+```
+
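+## Architecture
+[Recording thread] - the only real-time thread
+  ├─ Capture PCM from the microphone
+  ├─ VAD + energy detection
+  ├─ Voice detected → interrupt TTS immediately
+  ├─ Speech PCM → ASR audio queue
+  └─ Speech PCM → voiceprint audio queue (bypass, non-blocking)
+
+[ASR inference thread] - audio → text only
+  └─ Take audio from the ASR audio queue → real-time / streaming ASR → text → text queue
+
+[Voiceprint recognition thread] - non-real-time, low frequency (CAM++)
+  ├─ Receives audio chunks through a callback, writes them to a buffer, and waits for the speech_end event to trigger processing
+  ├─ Accumulates 1-2 seconds of valid speech (after VAD)
+  ├─ CAM++ extracts the speaker embedding
+  ├─ Voiceprint matching / registration
+  └─ Updates current_speaker_id (shared state; writes only, no control)
+Voiceprint thread requirements: it must not affect recording or ASR and must not control TTS; it only updates who the current speaker is
+
+[Main thread / processing thread] - business logic
+  ├─ Take ASR text from the text queue
+  ├─ Read current_speaker_id (read-only)
+  ├─ Wake-word handling (combined with speaker_id)
+  ├─ Permission / identity check (whether to continue)
+  ├─ VLM processing (text / multimodal)
+  └─ TTS playback (start the TTS thread without waiting)
+
+[TTS playback thread] - playback only (interruptible)
+  ├─ Receive the TTS audio stream
+  ├─ Play to the output device
+  └─ Respond to the interrupt flag (set by the recording thread)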
+
+
+## Useful commands
+1. Audio devices
+```bash
+# 1. List all audio devices
+cat /proc/asound/cards
+# 2. Show stream info (device parameters) for card 1
+cat /proc/asound/card1/stream0
+```
+
+2. Camera devices
+```bash
+# 1. Show basic camera information (model, firmware version, serial number, etc.)
+rs-enumerate-devices -c
+```
+
+3. Model download
+```bash
+modelscope download --model iic/speech_campplus_sv_zh-cn_16k-common --local_dir [target path]
+```
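Review note on the README's architecture section: the recording thread's fan-out (interrupt TTS, feed ASR, feed the voiceprint side channel without blocking) might look roughly like the sketch below. All names (`asr_queue`, `sv_queue`, `tts_interrupt`, `on_audio_chunk`) and the queue size are hypothetical and are not taken from this diff; it only illustrates the "bypass, non-blocking" idea.

```python
import queue
import threading

# Hypothetical shared objects; the real node may organize these differently.
asr_queue = queue.Queue()            # main path: recording thread -> ASR thread
sv_queue = queue.Queue(maxsize=8)    # side channel to the voiceprint thread (size is an assumption)
tts_interrupt = threading.Event()    # set by the recording thread, polled by TTS playback


def on_audio_chunk(pcm: bytes, is_speech: bool) -> None:
    """Recording-thread handler that never blocks on downstream consumers."""
    if not is_speech:
        return
    tts_interrupt.set()              # voice detected: ask TTS playback to stop
    asr_queue.put(pcm)               # unbounded queue, so put() does not block
    try:
        sv_queue.put_nowait(pcm)     # bypass path: drop the chunk if the queue is full
    except queue.Full:
        pass
```

Dropping chunks on the side channel keeps the real-time thread unaffected by slow embedding extraction, at the cost of occasionally shorter voiceprint audio.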

--- a/config/knowledge.json
+++ b/config/knowledge.json
@@ -0,0 +1,18 @@
+{
+  "entries": [
+    {
+      "id": "robot_identity",
+      "patterns": [
+        "ni shi shei"
+      ],
+      "answer": "我叫二狗，是蜂核科技的机器人，很高兴为你服务"
+    },
+    {
+      "id": "wake_word",
+      "patterns": [
+        "ni de ming zi"
+      ],
+      "answer": "我的名字是二狗"
+    }
+  ]
+}
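Review note on config/knowledge.json: the lookup code for this file is not part of the visible diff, so the following is only a guess at how the `patterns` could be used. It assumes the query has already been converted to space-separated pinyin (as `IntentRouter.to_pinyin` does) and uses plain substring matching; `find_answer` is a hypothetical helper.

```python
import json

# Same structure as config/knowledge.json, inlined so the sketch is self-contained.
KNOWLEDGE_JSON = """
{
  "entries": [
    {"id": "robot_identity", "patterns": ["ni shi shei"],
     "answer": "我叫二狗，是蜂核科技的机器人，很高兴为你服务"},
    {"id": "wake_word", "patterns": ["ni de ming zi"],
     "answer": "我的名字是二狗"}
  ]
}
"""


def find_answer(text_pinyin, knowledge):
    """Return the first answer whose pinyin pattern occurs in the query, else None."""
    for entry in knowledge.get("entries", []):
        if any(pattern in text_pinyin for pattern in entry.get("patterns", [])):
            return entry["answer"]
    return None


knowledge = json.loads(KNOWLEDGE_JSON)
```

For example, `find_answer("er gou ni shi shei", knowledge)` would return the `robot_identity` answer, while an unrelated query returns `None`.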
--- a/config/speakers.json
+++ b/config/speakers.json
@@ -0,0 +1,599 @@
+{
+  "user_1768311644": {
+    "embedding": [
+      0.017083248123526573,
+      -0.01032772846519947,
+      0.0058503481559455395,
+      0.11945011466741562,
+      0.03864186629652977,
+      -0.16047827899456024,
+      0.008000967092812061,
+      0.10669729858636856,
+      0.13221754133701324,
+      0.06365424394607544,
+      -0.06943577527999878,
+      0.08401959389448166,
+      0.09903465211391449,
+      0.0407508946955204,
+      -0.07486417144536972,
+      0.0010617832886055112,
+      0.12097838521003723,
+      -0.013734623789787292,
+      -0.020789025351405144,
+      -0.02113250270485878,
+      0.008510188199579716,
+      -0.05490498244762421,
+      -0.17027714848518372,
+      0.09569162130355835,
+      -0.07379947602748871,
+      0.05932804197072983,
+      0.0839226171374321,
+      0.004776939284056425,
+      0.050190482288599014,
+      -0.19962339103221893,
+      -0.13987377285957336,
+      0.041607797145843506,
+      0.10067984461784363,
+      0.0684289038181305,
+      0.08163066953420639,
+      -0.029243428260087967,
+      -0.10118222236633301,
+      -0.11619988083839417,
+      -0.10121472179889679,
+      -0.04290663078427315,
+      -0.08373524248600006,
+      0.03493887186050415,
+      0.055566269904375076,
+      -0.11284282803535461,
+      -0.10970190167427063,
+      0.03457016497850418,
+      0.11647575348615646,
+      -0.014930102974176407,
+      -0.04663793370127678,
+      0.0752566009759903,
+      -0.06746217608451843,
+      -0.07642832398414612,
+      0.06518206000328064,
+      0.07191824167966843,
+      0.13557033240795135,
+      0.04906972125172615,
+      0.03679114207625389,
+      0.07466751337051392,
+      0.01071798987686634,
+      -0.07979520410299301,
+      -0.10039637982845306,
+      0.004846179857850075,
+      -0.07325125485658646,
+      -0.08750395476818085,
+      0.05332862585783005,
+      0.10648373514413834,
+      -0.035643525421619415,
+      0.21233271062374115,
+      0.011915713548660278,
+      0.13632774353027344,
+      0.10383394360542297,
+      -0.053550489246845245,
+      0.05719169229269028,
+      0.04600509628653526,
+      0.043678827583789825,
+      -0.03646669536828995,
+      0.08175459504127502,
+      0.042513635009527206,
+      -0.09215544164180756,
+      -0.06402364373207092,
+      -0.10830589383840561,
+      0.03379691392183304,
+      0.07699205726385117,
+      -0.11046901345252991,
+      -0.016612332314252853,
+      -0.02984754927456379,
+      0.00998819898813963,
+      -0.05820641294121742,
+      0.007753593847155571,
+      -0.016712933778762817,
+      0.0014505418948829174,
+      -0.04807407408952713,
+      -0.048170242458581924,
+      -0.0531715452671051,
+      0.019113507121801376,
+      0.08439801633358002,
+      0.010585008189082146,
+      -0.07400234043598175,
+      0.10156761854887009,
+      -0.018891986459493637,
+      -0.052156757563352585,
+      0.1302887201309204,
+      0.08590760082006454,
+      0.13382190465927124,
+      -0.1498136967420578,
+      -0.030552342534065247,
+      -0.09281301498413086,
+      0.10279291868209839,
+      0.015315898694097996,
+      -0.014133274555206299,
+      -0.01298056822270155,
+      0.06241781264543533,
+      0.017693962901830673,
+      0.0007682808791287243,
+      0.029756756499409676,
+      0.12711282074451447,
+      -0.0695323497056961,
+      0.01649993099272251,
+      0.08811338990926743,
+      -0.06976141035556793,
+      -0.0763985738158226,
+      -0.10730905085802078,
+      0.0256052203476429,
+      0.05183263123035431,
+      0.0947495624423027,
+      0.007070058956742287,
+      -0.0505177341401577,
+      -0.009485805407166481,
+      0.003954170271754265,
+      0.014901814050972462,
+      -0.08098141849040985,
+      0.03615008667111397,
+      -0.09673020988702774,
+      0.06970252841711044,
+      0.009914563037455082,
+      -0.012040670961141586,
+      -0.0008170561632141471,
+      -0.06880783289670944,
+      -0.053053151816129684,
+      0.05272500216960907,
+      0.021709589287638664,
+      -0.09712725877761841,
+      0.06947346031665802,
+      -0.07973745465278625,
+      -0.036861639469861984,
+      -0.08714801073074341,
+      0.05473816394805908,
+      -0.006384482141584158,
+      -0.03656519949436188,
+      0.0605260394513607,
+      0.0407724604010582,
+      -0.1314084380865097,
+      -0.05484895780682564,
+      0.014381998218595982,
+      -0.07414797693490982,
+      -0.013259666971862316,
+      -0.1076463982462883,
+      -0.04896606504917145,
+      0.050690483301877975,
+      0.0719417929649353,
+      0.04990950971841812,
+      -0.049923382699489594,
+      0.08706197887659073,
+      -0.06278207153081894,
+      -0.029196983203291893,
+      -0.07312408834695816,
+      0.01651231199502945,
+      0.025062547996640205,
+      -0.023919139057397842,
+      0.05597180873155594,
+      0.08446669578552246,
+      -0.06616690754890442,
+      0.011679486371576786,
+      0.008357426151633263,
+      -0.07388673722743988,
+      0.03612314909696579,
+      -0.055705588310956955,
+      -0.008656222373247147,
+      -0.06408344209194183,
+      -0.05341912433505058,
+      0.01561578270047903,
+      0.002446901286020875,
+      0.042539432644844055,
+      0.12226217240095139,
+      -0.03700198978185654,
+      0.02393815666437149,
+      -0.021217981353402138,
+      0.04431416094303131,
+      -0.09150857478380203,
+      -0.004766684491187334,
+      -0.06133556738495827,
+      0.07721113413572311
+    ],
+    "env": "near",
+    "threshold": 0.4,
+    "registered_at": 1768311644.5742264
+  },
+  "user_1768529827": {
+    "embedding": [
+      0.0077949948608875275,
+      -0.012852567248046398,
+      0.0014490776229649782,
+      0.088177390396595,
+      -0.052150458097457886,
+      -0.1070166826248169,
+      -0.051932964473962784,
+      0.040730226784944534,
+      0.09491471946239471,
+      -0.10504328459501266,
+      -0.17986123263835907,
+      0.06056514009833336,
+      0.0002809118013828993,
+      -0.05353177338838577,
+      -0.08724740147590637,
+      -0.01057526096701622,
+      -0.10766296088695526,
+      0.024376090615987778,
+      -0.11535818874835968,
+      0.12653452157974243,
+      -0.0063497889786958694,
+      -0.02372283861041069,
+      -0.049704890698194504,
+      0.01079346239566803,
+      -0.10683158040046692,
+      0.00932641327381134,
+      0.043871842324733734,
+      0.04073511064052582,
+      0.005968529265373945,
+      0.05397576093673706,
+      0.07122175395488739,
+      0.06804963946342468,
+      -0.058389563113451004,
+      -0.03463176265358925,
+      -0.06834574788808823,
+      -0.09127284586429596,
+      -0.09805246442556381,
+      -0.015370666980743408,
+      -0.07054834067821503,
+      -0.07520422339439392,
+      -0.0502505861222744,
+      0.01580144092440605,
+      0.04316972196102142,
+      -0.010298517532646656,
+      -0.09042523056268692,
+      -0.03399325907230377,
+      0.03738871216773987,
+      0.09461583197116852,
+      0.07643604278564453,
+      -0.04089711233973503,
+      0.14397914707660675,
+      -0.03218085318803787,
+      -0.03981873393058777,
+      -0.05353623256087303,
+      -0.06475386023521423,
+      0.047925639897584915,
+      0.008481102995574474,
+      0.09522885829210281,
+      0.05679373815655708,
+      0.021448519080877304,
+      0.04586802423000336,
+      0.007880095392465591,
+      -0.08111433684825897,
+      -0.030093876644968987,
+      0.18197935819625854,
+      0.049670975655317307,
+      -0.029350068420171738,
+      0.1003178134560585,
+      0.05890532210469246,
+      -0.0418926365673542,
+      -0.015124992467463017,
+      -0.0016869385726749897,
+      0.029022999107837677,
+      0.10370466858148575,
+      -0.07392475008964539,
+      -0.041242245584726334,
+      0.0948185846209526,
+      0.0766805037856102,
+      0.12104924768209457,
+      0.07941737771034241,
+      -0.024586958810687065,
+      -0.005290709435939789,
+      0.08198735862970352,
+      -0.15709130465984344,
+      0.11847008019685745,
+      0.01280289888381958,
+      0.09401026368141174,
+      0.10199982672929764,
+      0.00811630580574274,
+      0.09336159378290176,
+      -0.1219155564904213,
+      0.00885648000985384,
+      0.08536995947360992,
+      -0.031735390424728394,
+      -0.02445235848426819,
+      0.17981232702732086,
+      0.05046188458800316,
+      -0.012413986958563328,
+      -0.16514025628566742,
+      -0.09369593858718872,
+      0.03961285203695297,
+      -0.024150250479578972,
+      0.024869512766599655,
+      0.009099201299250126,
+      0.0023227918427437544,
+      0.005291149020195007,
+      -0.08285452425479889,
+      0.02174258604645729,
+      -0.00018321558309253305,
+      -0.01761690340936184,
+      -0.13327360153198242,
+      0.07804469764232635,
+      -0.03172646835446358,
+      0.05993621423840523,
+      -0.0034280805848538876,
+      0.09203101694583893,
+      0.04720155894756317,
+      -0.12012632191181183,
+      -0.028879230841994286,
+      -0.04471825063228607,
+      -0.08928379416465759,
+      -0.055793069303035736,
+      -0.0230169165879488,
+      0.04459748789668083,
+      -0.08481008559465408,
+      0.09873232245445251,
+      -0.057500336319208145,
+      -0.05438977852463722,
+      0.06309207528829575,
+      -0.045493170619010925,
+      -0.0636027380824089,
+      -0.03580763190984726,
+      -0.043026816099882126,
+      0.04125182330608368,
+      -0.06327074766159058,
+      0.02830875851213932,
+      -0.0697140172123909,
+      -0.11324217170476913,
+      -0.02744743973016739,
+      -0.09659717977046967,
+      -0.036915868520736694,
+      0.06836548447608948,
+      -0.19481360912322998,
+      -0.08151774108409882,
+      0.013570327311754227,
+      -0.013908851891756058,
+      -0.02302597463130951,
+      -0.14017312228679657,
+      -0.0654999315738678,
+      0.0582318976521492,
+      -0.023702487349510193,
+      -0.046911414712667465,
+      -0.02062028832733631,
+      0.09885907918214798,
+      -0.010111358016729355,
+      -0.009303858503699303,
+      -0.07802718877792358,
+      0.09181840717792511,
+      -0.00822418462485075,
+      -0.024477459490299225,
+      0.04909557104110718,
+      0.024657243862748146,
+      0.08074013143777847,
+      0.10684694349765778,
+      -0.009657780639827251,
+      0.04053448513150215,
+      -0.054968591779470444,
+      0.09773849695920944,
+      -0.019937219098210335,
+      -0.11860335618257523,
+      -0.12553851306438446,
+      0.0016870739636942744,
+      0.07446407526731491,
+      -0.12183381617069244,
+      -0.07524612545967102,
+      0.06794209778308868,
+      -0.04324038699269295,
+      -0.018201345577836037,
+      -0.08356837183237076,
+      0.08218713104724884,
+      -0.1253940612077713,
+      -0.05880133807659149,
+      0.11516888439655304,
+      -0.007864559069275856,
+      0.06438153237104416,
+      -0.06551646441221237,
+      0.11812424659729004,
+      -0.07544125616550446,
+      0.033888354897499084,
+      0.02552076056599617,
+      0.019394448027014732,
+      -0.009682931937277317
+    ],
+    "env": "near",
+    "threshold": 0.55,
+    "registered_at": 1768529827.4784193
+  },
+  "user_1768530001": {
+    "embedding": [
+      -0.02827363647520542,
+      0.04181317239999771,
+      -0.07721243053674698,
+      0.031220311298966408,
+      -0.006549456622451544,
+      -0.045262161642313004,
+      -0.06796529144048691,
+      0.10546170920133591,
+      -0.054266564548015594,
+      -0.04982651397585869,
+      0.008982052095234394,
+      0.0887555256485939,
+      -0.03736695274710655,
+      -0.027568811550736427,
+      -0.01881324127316475,
+      -0.030173255130648613,
+      -0.03817622363567352,
+      -0.027703644707798958,
+      -0.020354237407445908,
+      0.08958664536476135,
+      0.027346525341272354,
+      -0.007979321293532848,
+      -0.01638970896601677,
+      0.14815205335617065,
+      -0.029478076845407486,
+      0.0968138799071312,
+      0.011266525834798813,
+      0.10481037944555283,
+      0.006314543075859547,
+      -0.07480890303850174,
+      -0.126618891954422,
+      0.054260920733213425,
+      -0.054261378943920135,
+      0.02066616155207157,
+      0.056972429156303406,
+      -0.02620418183505535,
+      -0.08435375243425369,
+      -0.06768523901700974,
+      -0.001804384752176702,
+      -0.03350691497325897,
+      -0.06783927977085114,
+      0.09583555907011032,
+      0.042077258229255676,
+      -0.03811662644147873,
+      -0.09298640489578247,
+      0.11314687132835388,
+      0.06972789764404297,
+      -0.10421980172395706,
+      0.02739877998828888,
+      -0.06242597475647926,
+      0.06683704257011414,
+      0.030034003779292107,
+      -0.04094783961772919,
+      0.08657337725162506,
+      0.02882716991007328,
+      0.07672230899333954,
+      -0.0162385031580925,
+      0.12335177510976791,
+      -0.07505486160516739,
+      0.05924128741025925,
+      0.02278822474181652,
+      0.051575034856796265,
+      -0.07616295665502548,
+      -0.049982234835624695,
+      -0.021159915253520012,
+      0.023469945415854454,
+      -0.008445728570222855,
+      0.18868982791900635,
+      0.10217619687318802,
+      0.0029947187285870314,
+      0.003596147522330284,
+      -0.010885344818234444,
+      0.002336243400350213,
+      -0.06228164955973625,
+      -0.09452632069587708,
+      0.06288570165634155,
+      0.09799493104219437,
+      0.05772380530834198,
+      -0.012649190612137318,
+      0.037833958864212036,
+      -0.07815677672624588,
+      0.11595622450113297,
+      -0.006132716778665781,
+      -0.047689273953437805,
+      0.10451581329107285,
+      0.12618094682693481,
+      -0.012135603465139866,
+      -0.14452683925628662,
+      -0.011882219463586807,
+      0.05687599256634712,
+      -0.10221579670906067,
+      0.09555421024560928,
+      0.050166770815849304,
+      0.026791365817189217,
+      0.0343380831182003,
+      0.0643647089600563,
+      -0.09814899414777756,
+      -0.01735001988708973,
+      0.0002968672488350421,
+      -0.16691210865974426,
+      -0.044747937470674515,
+      0.10229559987783432,
+      0.01551489345729351,
+      0.0614253506064415,
+      -0.012457458302378654,
+      -0.059297215193510056,
+      -0.0662546306848526,
+      0.06900843977928162,
+      -0.15012530982494354,
+      0.14357514679431915,
+      -0.08563537150621414,
+      0.1512402445077896,
+      -0.05548126623034477,
+      -0.13191379606723785,
+      0.02588576264679432,
+      -0.007292638067156076,
+      -0.033004030585289,
+      -0.08764250576496124,
+      -0.04006534814834595,
+      0.001069005811586976,
+      0.0708790197968483,
+      -0.11471016705036163,
+      -0.08249906450510025,
+      -0.07923658937215805,
+      -0.029890256002545357,
+      0.027568599209189415,
+      -0.00042784016113728285,
+      0.01911524124443531,
+      0.002947323489934206,
+      -0.058468904346227646,
+      0.0006662740488536656,
+      -0.09472604095935822,
+      -0.07827164232730865,
+      0.05823435261845589,
+      -0.022661248221993446,
+      0.007729553151875734,
+      0.044511985033750534,
+      -0.17424426972866058,
+      -0.054321326315402985,
+      -0.010871038772165775,
+      -0.04280569776892662,
+      0.01373684499412775,
+      -0.03464324399828911,
+      0.0012510031228885055,
+      -0.13786448538303375,
+      0.13943427801132202,
+      0.07161138951778412,
+      -0.0017689999658614397,
+      -0.0330035537481308,
+      0.01767006888985634,
+      -0.06832484155893326,
+      -0.16906532645225525,
+      -0.08673631399869919,
+      0.016205811873078346,
+      -0.040736377239227295,
+      -0.053034041076898575,
+      -0.057571377605199814,
+      -0.018383856862783432,
+      0.029812879860401154,
+      -0.005708644632250071,
+      0.07977750152349472,
+      0.03715944290161133,
+      0.029830463230609894,
+      -0.15909501910209656,
+      0.10081987082958221,
+      0.07019384205341339,
+      0.05683498457074165,
+      0.008955223485827446,
+      -0.06697771698236465,
+      0.044268134981393814,
+      0.08812808990478516,
+      -0.17523430287837982,
+      0.05148027464747429,
+      -0.11579684168100357,
+      -0.06281758099794388,
+      -0.08106749504804611,
+      -0.07915353775024414,
+      0.03760797902941704,
+      -0.059639666229486465,
+      0.012170189991593361,
+      -0.028386766090989113,
+      -0.043592486530542374,
+      0.029122747480869293,
+      0.052276406437158585,
+      0.06929390132427216,
+      -0.10774848610162735,
+      0.06797030568122864,
+      -0.017512541264295578,
+      0.07446594536304474,
+      -0.07573172450065613,
+      -0.15186654031276703,
+      -0.03710319101810455
+    ],
+    "env": "near",
+    "threshold": 0.55,
+    "registered_at": 1768530001.2158406
+  }
+}
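Review note on config/speakers.json: each record stores an `embedding` and its own `threshold`. A minimal matching sketch is shown below, assuming cosine similarity as the score (common for CAM++ embeddings, but not confirmed by this diff). `cosine_similarity` and `match_speaker` are hypothetical helpers, and the 3-dimensional vectors in the usage note are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, speaker_db, default_threshold=0.55):
    """Return (speaker_id, score) for the best record that passes its own threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, record in speaker_db.items():
        score = cosine_similarity(embedding, record["embedding"])
        threshold = record.get("threshold", default_threshold)
        if score >= threshold and score > best_score:
            best_id, best_score = speaker_id, score
    return best_id, best_score
```

With a toy database `{"user_a": {"embedding": [1, 0, 0], "threshold": 0.55}}`, the query `[0.9, 0.1, 0.0]` matches `user_a`, and an orthogonal query returns `(None, -1.0)`.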
--- a/config/voice.yaml
+++ b/config/voice.yaml
@@ -0,0 +1,70 @@
+# ROS voice package configuration file
+
+dashscope:
+  api_key: "sk-7215a5ab7a00469db4072e1672a0661e"
+  asr:
+    model: "qwen3-asr-flash-realtime"
+    url: "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"
+  llm:
+    model: "qwen3-vl-flash"
+    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
+    temperature: 0.7
+    max_tokens: 4096
+    max_history: 10
+    summary_trigger: 3
+  tts:
+    model: "cosyvoice-v3-flash"
+    voice: "longanyang"
+
+audio:
+  microphone:
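+    device_index: 3  # Points to iFLYTEK-M2 (hw:1,0)
+    sample_rate: 48000  # Try the hardware's native 48 kHz rate to avoid problems that resampling may cause
+    channels: 1  # Input channels: mono (suited to speech capture)
+    chunk: 1024
+    heartbeat_interval: 2.0  # Heartbeat interval (seconds) for periodically logging the recording status
+  soundcard:
+    card_index: 1  # USB Audio Device (card 1)
+    device_index: 0  # USB Audio [USB Audio] (device 0)
+    # card_index: -1  # Use the default sound card
+    # device_index: -1  # Use the default output device
+    sample_rate: 48000  # Output sample rate: 48 kHz (iFLYTEK supports 48000)
+    channels: 2  # Output channels: stereo (2 channels, FL+FR)
+    volume: 1.0  # Volume ratio (0.0-1.0; 0.2 means 20% volume)
+  echo_cancellation:
+    enabled: false  # Whether to enable echo cancellation (true/false)
+    max_duration_ms: 500  # Maximum length of the reference-signal buffer (milliseconds)
+  tts:
+    source_sample_rate: 22050  # Fixed output sample rate of the TTS service (fixed by DashScope, cannot be changed)
+    source_channels: 1  # Fixed output channel count of the TTS service (fixed by DashScope, cannot be changed)
+    ffmpeg_thread_queue_size: 4096  # ffmpeg input thread queue size (increase to reduce stuttering)
+
+vad:
+  vad_mode: 3  # VAD mode: 0-3, 3 is the strictest
+  silence_duration_ms: 1000  # Silence duration (milliseconds)
+  min_energy_threshold: 300  # Minimum energy threshold
+
+system:
+  use_llm: true  # Whether to use the LLM
+  use_wake_word: true  # Whether to enable wake-word detection
+  wake_word: "er gou"  # Wake word (pinyin)
+  session_timeout: 3.0  # Session timeout (seconds)
+  shutup_keywords: "bi zui"  # Keywords for the "shut up" command (pinyin, comma-separated)
+  interrupt_command_queue_depth: 10  # Queue depth (QoS) of the interrupt-command subscription
+  sv_enabled: true  # Whether to enable voiceprint recognition
+  sv_model_path: "~/hivecore_robot_os1/voice_model"  # Voiceprint model path
+  sv_threshold: 0.55  # Voiceprint recognition threshold (0.0-1.0; lower is looser, higher is stricter)
+  sv_speaker_db_path: "~/hivecore_robot_os1/config/speakers.json"  # Where the voiceprint database is saved (JSON; relative to the ROS 2 package share directory)
+  sv_buffer_size: 240000  # Voiceprint verification recording buffer size (samples; 5 s at 48 kHz = 240000)
+  sv_registration_silence_threshold_ms: 500  # Silence threshold while registering a voiceprint (milliseconds)
+
+camera:
+  serial_number: "405622075404"  # Camera serial number (Intel RealSense D435)
+  rgb:
+    width: 640   # Image width
+    height: 480  # Image height
+    fps: 30      # Frame rate (supported: 6, 10, 15, 30, 60)
+    format: "RGB8"  # Image format: RGB8, BGR8
+  image:
+    jpeg_quality: 85  # JPEG quality (0-100; 85 balances quality and size)
+    max_size: "1280x720"  # Maximum size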
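Review note on config/voice.yaml: the timing-related values are easier to sanity-check when converted between samples and time. The arithmetic below uses only numbers from this file. Counting silence in whole chunks is an assumption about how the VAD loop might use `silence_duration_ms`, not something this diff shows.

```python
import math

SAMPLE_RATE_HZ = 48000        # audio.microphone.sample_rate
CHUNK_SAMPLES = 1024          # audio.microphone.chunk
SILENCE_DURATION_MS = 1000    # vad.silence_duration_ms
SV_BUFFER_SAMPLES = 240000    # system.sv_buffer_size

# Duration of one microphone chunk in milliseconds (about 21.3 ms).
chunk_ms = CHUNK_SAMPLES / SAMPLE_RATE_HZ * 1000

# Length of the voiceprint buffer in seconds; matches the "5 s at 48 kHz" comment.
sv_buffer_seconds = SV_BUFFER_SAMPLES / SAMPLE_RATE_HZ

# Assumption: silence is counted in whole chunks, so round up.
silence_chunks = math.ceil(SILENCE_DURATION_MS / chunk_ms)

print(f"chunk = {chunk_ms:.2f} ms, sv buffer = {sv_buffer_seconds:.1f} s, "
      f"silence = {silence_chunks} chunks")
```

If `sample_rate` changes, `sv_buffer_size` has to be rescaled by hand to keep the same 5-second window, because the buffer is configured in samples rather than seconds.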
--- a/launch/voice.launch.py
+++ b/launch/voice.launch.py
@@ -0,0 +1,17 @@
+from launch import LaunchDescription
+from launch_ros.actions import Node
+
+
+def generate_launch_description():
+    """Launch the voice interaction node; all parameters are read from voice.yaml"""
+    return LaunchDescription([
+        Node(
+            package='robot_speaker',
+            executable='robot_speaker_node',
+            name='robot_speaker_node',
+            output='screen'
+        ),
+    ])
+
+
+
--- a/package.xml
+++ b/package.xml
@@ -2,13 +2,22 @@
 <?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
 <package format="3">
  <name>robot_speaker</name>
-  <version>0.0.0</version>
-  <description>TODO: Package description</description>
+  <version>0.0.1</version>
+  <description>ROS 2 package for speech recognition and synthesis</description>
  <maintainer email="mzebra@foxmail.com">mzebra</maintainer>
  <license>Apache-2.0</license>

  <depend>rclpy</depend>
-  <depend>example_interfaces</depend>
+  <depend>std_msgs</depend>
+  <depend>ament_index_python</depend>
+  <depend>interfaces</depend>
+
+  <exec_depend>python3-pyaudio</exec_depend>
+  <exec_depend>python3-requests</exec_depend>
+  <exec_depend>python3-edge-tts</exec_depend>
+  <exec_depend>python3-webrtcvad</exec_depend>
+  <exec_depend>python3-yaml</exec_depend>
+  <exec_depend>python3-pypinyin</exec_depend>

  <test_depend>ament_copyright</test_depend>
  <test_depend>ament_flake8</test_depend>
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,17 @@
+dashscope>=1.20.0
+openai>=1.0.0
+pyaudio>=0.2.11
+webrtcvad>=2.0.10
+pypinyin>=0.49.0
+rclpy>=3.0.0
+pyrealsense2>=2.54.0
+Pillow>=10.0.0
+numpy>=1.24.0
+PyYAML>=6.0
+aec-audio-processing
+modelscope>=1.33.0
+funasr>=1.0.0
+datasets==3.6.0
+
+
+
--- a/robot_speaker/__init__.py
+++ b/robot_speaker/__init__.py
@@ -0,0 +1,6 @@
+# robot_speaker package
+
+
+
+
+
--- a/robot_speaker/bridge/__init__.py
+++ b/robot_speaker/bridge/__init__.py
@@ -0,0 +1,2 @@
+# Bridge package for connecting LLM outputs to brain execution.
+
--- a/robot_speaker/bridge/skill_bridge_node.py
+++ b/robot_speaker/bridge/skill_bridge_node.py
@@ -0,0 +1,136 @@
+#!/usr/bin/env python3
+"""
+Bridges LLM skill sequences to the cerebellum's ExecuteBtAction and forwards feedback/results.
+"""
+import json
+import os
+import re
+
+import rclpy
+from rclpy.node import Node
+from rclpy.action import ActionClient
+from std_msgs.msg import String
+from ament_index_python.packages import get_package_share_directory
+
+from interfaces.action import ExecuteBtAction
+
+
+class SkillBridgeNode(Node):
+    def __init__(self):
+        super().__init__('skill_bridge_node')
+        self._action_client = ActionClient(self, ExecuteBtAction, '/execute_bt_action')
+        self._current_epoch = 1
+        self._allowed_skills = self._load_allowed_skills()
+
+        self.skill_seq_sub = self.create_subscription(
+            String, '/llm_skill_sequence', self._on_skill_sequence_received, 10
+        )
+        self.feedback_pub = self.create_publisher(String, '/skill_execution_feedback', 10)
+        self.result_pub = self.create_publisher(String, '/skill_execution_result', 10)
+
+        self.get_logger().info('SkillBridgeNode started')
+
+    def _on_skill_sequence_received(self, msg: String):
+        raw = (msg.data or "").strip()
+        if not raw:
+            return
+        if not self._allowed_skills:
+            self.get_logger().warning("No skill whitelist loaded; reject all sequences")
+            return
+        sequence, invalid = self._extract_skill_sequence(raw)
+        if invalid:
+            self.get_logger().warning(f"Rejected sequence with invalid skills: {invalid}")
+            return
+        if not sequence:
+            self.get_logger().warning(f"Invalid skill sequence: {raw}")
+            return
+        self._send_skill_sequence(sequence)
+
+    def _load_allowed_skills(self) -> set[str]:
+        try:
+            brain_share = get_package_share_directory("brain")
+            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
+            if not os.path.exists(skill_path):
+                return set()
+            import yaml
+            with open(skill_path, "r", encoding="utf-8") as f:
+                data = yaml.safe_load(f) or []
+            return {str(entry["name"]) for entry in data if isinstance(entry, dict) and entry.get("name")}
+        except Exception as e:
+            self.get_logger().warning(f"Load skills failed: {e}")
+            return set()
+
+    def _extract_skill_sequence(self, text: str) -> tuple[str, list[str]]:
+        # Accept CSV/space/semicolon and filter by CamelCase tokens
+        tokens = re.split(r'[,\s;]+', text.strip())
+        skills = [t for t in tokens if re.match(r'^[A-Z][A-Za-z0-9]*$', t)]
+        if not skills:
+            return "", []
+        invalid = [s for s in skills if s not in self._allowed_skills]
+        return ",".join(skills), invalid
+
+    def _send_skill_sequence(self, skill_sequence: str):
+        if not self._action_client.wait_for_server(timeout_sec=2.0):
+            self.get_logger().error('ExecuteBtAction server unavailable')
+            return
+        goal = ExecuteBtAction.Goal()
+        goal.epoch = self._current_epoch
+        self._current_epoch += 1
+        goal.action_name = skill_sequence
+        goal.calls = []
+
+        self.get_logger().info(f"Dispatch skill sequence: {skill_sequence}")
+        send_future = self._action_client.send_goal_async(goal, feedback_callback=self._feedback_callback)
+        # Chain callbacks instead of spinning again inside a subscription callback
+        send_future.add_done_callback(self._on_goal_response)
+
+    def _on_goal_response(self, future):
+        goal_handle = future.result()
+        if not goal_handle or not goal_handle.accepted:
+            self.get_logger().error("Goal rejected")
+            return
+        result_future = goal_handle.get_result_async()
+        result_future.add_done_callback(
+            lambda done: self._handle_result(done.result())
+        )
+
+    def _feedback_callback(self, feedback_msg):
+        fb = feedback_msg.feedback
+        payload = {
+            "stage": fb.stage,
+            "current_skill": fb.current_skill,
+            "progress": float(fb.progress),
+            "detail": fb.detail,
+            "epoch": int(fb.epoch),
+        }
+        msg = String()
+        msg.data = json.dumps(payload, ensure_ascii=True)
+        self.feedback_pub.publish(msg)
+
+    def _handle_result(self, result_wrapper):
+        result = result_wrapper.result
+        if not result:
+            return
+        payload = {
+            "success": bool(result.success),
+            "message": result.message,
+            "total_skills": int(result.total_skills),
+            "succeeded_skills": int(result.succeeded_skills),
+        }
+        msg = String()
+        msg.data = json.dumps(payload, ensure_ascii=True)
+        self.result_pub.publish(msg)
+
+
+
+def main(args=None):
+    rclpy.init(args=args)
+    node = SkillBridgeNode()
+    rclpy.spin(node)
+    node.destroy_node()
+    rclpy.shutdown()
+
+
+if __name__ == '__main__':
+    main()
+
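Review note on `_extract_skill_sequence`: the function's behavior is easier to see outside the node. The sketch below copies its logic into a standalone function that takes the whitelist as an argument; the skill names in the usage note are hypothetical and do not come from `robot_skills.yaml`.

```python
import re


def extract_skill_sequence(text, allowed_skills):
    """Standalone copy of SkillBridgeNode._extract_skill_sequence for illustration."""
    # Split on commas, whitespace, and semicolons.
    tokens = re.split(r'[,\s;]+', text.strip())
    # Keep only CamelCase-looking tokens; anything else is silently dropped.
    skills = [t for t in tokens if re.match(r'^[A-Z][A-Za-z0-9]*$', t)]
    if not skills:
        return "", []
    # Tokens that look like skills but are not whitelisted are reported as invalid.
    invalid = [s for s in skills if s not in allowed_skills]
    return ",".join(skills), invalid
```

With `allowed = {"MoveToBox", "GraspBox"}`, the input `"MoveToBox, GraspBox; please"` yields `("MoveToBox,GraspBox", [])` because the lowercase word is filtered out before the whitelist check, while `"MoveToBox Fly"` yields `("MoveToBox,Fly", ["Fly"])` and the caller rejects the whole sequence.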
--- a/robot_speaker/core/__init__.py
+++ b/robot_speaker/core/__init__.py
@@ -0,0 +1,5 @@
+"""Core modules"""
+
+
+
+
--- a/robot_speaker/core/conversation_state.py
+++ b/robot_speaker/core/conversation_state.py
@@ -0,0 +1,10 @@
+from enum import Enum
+
+
+class ConversationState(Enum):
+    """Conversation state machine"""
+    IDLE = "idle"                    # Waiting for the wake word or for sound
+    CHECK_VOICE = "check_voice"      # User is speaking → check the voiceprint
+    AUTHORIZED = "authorized"        # Registered user
+
+
--- a/robot_speaker/core/intent_router.py
+++ b/robot_speaker/core/intent_router.py
@@ -0,0 +1,158 @@
+from dataclasses import dataclass
+from typing import Optional
+import os
+import yaml
+from ament_index_python.packages import get_package_share_directory
+
+from pypinyin import pinyin, Style
+
+
+@dataclass
+class IntentResult:
+    intent: str  # "skill_sequence" | "kb_qa" | "chat_text" | "chat_camera"
+    text: str
+    need_camera: bool
+    camera_mode: Optional[str]  # "head" | "left_hand" | "right_hand" | None
+    system_prompt: Optional[str]
+
+
+class IntentRouter:
+    def __init__(self):
+        self.camera_capture_keywords = [
+            "pai zhao", "pai ge zhao", "pai zhang zhao"
+        ]
+        self.skill_keywords = [
+            "ban xiang zi"
+        ]
+        self.kb_keywords = [
+            "ni shi shei", "ni de ming zi"
+        ]
+        self._cached_skill_names: list[str] | None = None
+
+    def _load_brain_skill_names(self) -> list[str]:
+        if self._cached_skill_names is not None:
+            return self._cached_skill_names
+        skill_names: list[str] = []
+        try:
+            brain_share = get_package_share_directory("brain")
+            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
+            with open(skill_path, "r", encoding="utf-8") as f:
+                data = yaml.safe_load(f) or []
+            for entry in data:
+                if isinstance(entry, dict) and entry.get("name"):
+                    skill_names.append(str(entry["name"]))
+        except Exception:
+            skill_names = []
+        self._cached_skill_names = skill_names
+        return skill_names
+
+    def to_pinyin(self, text: str) -> str:
+        chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
+        if not chars:
+            return ""
+        py_list = pinyin(''.join(chars), style=Style.NORMAL)
+        return ' '.join([item[0] for item in py_list]).lower().strip()
+
+    def is_skill_sequence_intent(self, text: str) -> bool:
+        text_pinyin = self.to_pinyin(text)
+        return any(k in text_pinyin for k in self.skill_keywords)
+
+
+    def check_camera_command(self, text: str) -> tuple[bool, Optional[str]]:
+        if not text:
+            return False, None
+        text_pinyin = self.to_pinyin(text)
+        for keyword in self.camera_capture_keywords:
+            if keyword in text_pinyin:
+                return True, self.detect_camera_mode(text)
+        return False, None
+
+    def detect_camera_mode(self, text: str) -> str:
+        text_pinyin = self.to_pinyin(text)
+        left_keys = ["zuo shou", "zuo bi", "zuo bian"]
+        right_keys = ["you shou", "you bi", "you bian"]
+        head_keys = ["tou", "nao dai"]
+        for kw in left_keys:
+            if kw in text_pinyin:
+                return "left_hand"
+        for kw in right_keys:
+            if kw in text_pinyin:
+                return "right_hand"
+        for kw in head_keys:
+            if kw in text_pinyin:
+                return "head"
+        return "head"
+
+    def build_skill_prompt(self) -> str:
+        skills = self._load_brain_skill_names()
+        skills_text = ", ".join(skills) if skills else ""
+        skill_guard = (
+            "【技能限制】只能使用以下技能名称：" + skills_text
+            if skills_text
+            else "【技能限制】技能列表不可用，请不要输出任何技能名称。"
+        )
+        return (
+            "你是机器人任务规划器。\n"
+            "本任务必须拍照。请根据用户请求选择使用哪个相机拍照（默认头部相机），并结合当前环境信息生成简洁、可执行的技能序列。\n"
+            "【重要】如果对话历史中包含【执行结果】或【执行状态】，请参考上一轮技能序列的执行情况，根据成功/失败信息调整本次技能序列。\n"
+            "【输出格式要求】只输出逗号分隔的技能名称，不要任何解释说明。\n"
+            + skill_guard
+        )
+
+    def build_chat_prompt(self, need_camera: bool) -> str:
+        if need_camera:
+            return (
+                "你是一个智能语音助手。\n"
+                "请结合图片内容简短回答。"
+            )
+        return (
+            "你是一个智能语音助手。\n"
+            "请自然、简短地与用户对话。"
+        )
+
+    def build_kb_prompt(self) -> str:
+        return (
+            "你是蜂核科技的员工。\n"
+            "请基于知识库信息回答用户问题，回答要准确简洁。"
+        )
+
+    def build_default_system_prompt(self) -> str:
+        return (
+            "你是一个智能语音助手。\n"
+            "- 当用户发送图片时，请仔细观察图片内容，结合用户的问题或描述，提供简短、专业的回答。\n"
+            "- 当用户没有发送图片时，请自然、友好地与用户对话。\n"
+            "请根据对话模式调整你的回答风格。"
+        )
+
+    def route(self, text: str) -> IntentResult:
+        need_camera, camera_mode = self.check_camera_command(text)
+        text_pinyin = self.to_pinyin(text)
+
+        if self.is_skill_sequence_intent(text):
+            if camera_mode is None:
+                camera_mode = "head"
+            return IntentResult(
+                intent="skill_sequence",
+                text=text,
+                need_camera=True,
+                camera_mode=camera_mode,
+                system_prompt=self.build_skill_prompt()
+            )
+
+        if any(k in text_pinyin for k in self.kb_keywords):
+            return IntentResult(
+                intent="kb_qa",
+                text=text,
+                need_camera=False,
+                camera_mode=None,
+                system_prompt=self.build_kb_prompt()
+            )
+
+        return IntentResult(
+            intent="chat_camera" if need_camera else "chat_text",
+            text=text,
+            need_camera=need_camera,
+            camera_mode=camera_mode,
+            system_prompt=self.build_chat_prompt(need_camera)
+        )
+
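`IntentRouter` matches keywords with a plain substring test on the space-joined pinyin string returned by `to_pinyin`. A substring test can also match across a syllable boundary: the keyword "zuo bi" is contained in "zuo bing". The sketch below shows one possible whole-syllable alternative. `contains_syllables` is a hypothetical helper, not part of this commit, and it assumes its input is already in the lowercase, space-separated form that `to_pinyin` produces.

```python
def contains_syllables(text_pinyin: str, keyword: str) -> bool:
    """Return True if the keyword's syllables appear in text_pinyin as a
    contiguous run of whole syllables (hypothetical helper)."""
    tokens = text_pinyin.split()
    key = keyword.split()
    if not key or len(key) > len(tokens):
        return False
    # Slide a window of len(key) syllables over the text.
    return any(
        tokens[i:i + len(key)] == key
        for i in range(len(tokens) - len(key) + 1)
    )


if __name__ == "__main__":
    # Plain substring matching fires on a partial syllable ...
    print("zuo bi" in "wo yao zuo bing")
    # ... while whole-syllable matching does not.
    print(contains_syllables("wo yao zuo bing", "zuo bi"))
    print(contains_syllables("bang wo pai zhang zhao", "pai zhang zhao"))
```

Whether the stricter match is preferable depends on how noisy the ASR output is; the looser substring test is more forgiving of extra characters between syllables being dropped.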
--- a/robot_speaker/core/node_callbacks.py
+++ b/robot_speaker/core/node_callbacks.py
@@ -0,0 +1,246 @@
+import threading
+import numpy as np
+
+from robot_speaker.core.conversation_state import ConversationState
+from robot_speaker.perception.speaker_verifier import SpeakerState
+
+
+class NodeCallbacks:
+    # ==================== Initialization and internal helpers ====================
+    def __init__(self, node):
+        self.node = node
+
+    def _mark_utterance_processed(self) -> bool:
+        node = self.node
+        with node.utterance_lock:
+            if node.current_utterance_id == node.last_processed_utterance_id:
+                return False
+            node.last_processed_utterance_id = node.current_utterance_id
+            return True
+
+    def _trigger_sv_for_check_voice(self, source: str):
+        node = self.node
+        if not (node.sv_enabled and node.sv_client):
+            return
+        if not self._mark_utterance_processed():
+            return
+        if node._handle_empty_speaker_db():
+            node.get_logger().info(f"[声纹] CHECK_VOICE状态，数据库为空，跳过声纹验证（来源: {source}）")
+            return
+        if not node.sv_speech_end_event.is_set():
+            with node.sv_lock:
+                node.sv_recording = False
+                buffer_size = len(node.sv_audio_buffer)
+            node.get_logger().info(f"[声纹] {source}触发验证，缓冲区大小: {buffer_size} 样本（{buffer_size/node.sample_rate:.2f}秒）")
+            if buffer_size > 0:
+                node.sv_speech_end_event.set()
+        else:
+            node.get_logger().debug(f"[声纹] 声纹验证已触发，跳过（来源: {source}）")
+    
+    # ==================== Business-logic proxies ====================
+    def handle_interrupt_command(self, msg):
+        return self.node._handle_interrupt_command(msg)
+
+    def check_interrupt_and_cancel_turn(self) -> bool:
+        return self.node._check_interrupt_and_cancel_turn()
+
+    def handle_wake_word(self, text: str) -> str:
+        return self.node._handle_wake_word(text)
+
+    def check_shutup_command(self, text: str) -> bool:
+        return self.node._check_shutup_command(text)
+
+    def check_camera_command(self, text: str):
+        return self.node.intent_router.check_camera_command(text)
+
+    def llm_process_stream_with_camera(self, user_text: str, need_camera: bool) -> str:
+        return self.node._llm_process_stream_with_camera(user_text, need_camera)
+
+    def put_tts_text(self, text: str):
+        return self.node._put_tts_text(text)
+
+    def force_stop_tts(self):
+        return self.node._force_stop_tts()
+
+    def drain_queue(self, q):
+        return self.node._drain_queue(q)
+
+    # ==================== Recording / VAD callbacks ====================
+    def get_silence_threshold(self) -> int:
+        """Return the current silence threshold in milliseconds."""
+        node = self.node
+        return node.silence_duration_ms
+    
+    def should_put_audio_to_queue(self) -> bool:
+        """
+        Decide, based on the conversation state, whether audio should be queued for ASR.
+        """
+        node = self.node
+        state = node._get_state()
+        if state in [ConversationState.IDLE, ConversationState.CHECK_VOICE,
+                     ConversationState.AUTHORIZED]:
+            return True
+        return False
+    
+    def on_speech_start(self):
+        """Called by the recording thread when speech is detected."""
+        node = self.node
+        node.get_logger().info("[录音线程] 检测到人声，开始录音")
+
+        with node.utterance_lock:
+            node.current_utterance_id += 1
+        
+        state = node._get_state()
+        
+        if state == ConversationState.IDLE:
+            # Idle -> CheckVoice
+            if node.sv_enabled and node.sv_client:
+                # Start buffering audio for speaker verification
+                with node.sv_lock:
+                    node.sv_recording = True
+                    node.sv_audio_buffer.clear()
+                    node.get_logger().debug("[声纹] 开始录音用于声纹验证")
+                node._change_state(ConversationState.CHECK_VOICE, "检测到语音，开始检查声纹")
+            else:
+                node._change_state(ConversationState.AUTHORIZED, "未启用声纹，直接授权")
+        
+        elif state == ConversationState.CHECK_VOICE:
+            # Still in CHECK_VOICE: keep recording for speaker verification (the buffer is reset for the new segment)
+            if node.sv_enabled:
+                with node.sv_lock:
+                    node.sv_recording = True
+                    node.sv_audio_buffer.clear()
+                    node.get_logger().debug("[声纹] 继续录音用于声纹验证")
+        
+        elif state == ConversationState.AUTHORIZED:
+            # AUTHORIZED: start recording for speaker verification (to verify the current user)
+            if node.sv_enabled:
+                with node.sv_lock:
+                    node.sv_recording = True
+                    node.sv_audio_buffer.clear()
+                    node.get_logger().debug("[声纹] 开始录音用于声纹验证")
+    
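+    def on_audio_chunk_for_sv(self, audio_chunk: bytes):
+        """Per-chunk callback from the recording thread; buffers audio for speaker verification only when needed."""
+        node = self.node
+        
+        # Buffer audio for speaker verification only while sv_recording is set
+        # (CHECK_VOICE and AUTHORIZED states)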
+        if node.sv_enabled and node.sv_recording:
+            try:
+                audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
+                with node.sv_lock:
+                    node.sv_audio_buffer.extend(audio_array)
+            except Exception as e:
+                node.get_logger().debug(f"[声纹] 录音失败: {e}")
+        
+    def on_speech_end(self):
+        """Called by the recording thread when speech ends (after a period of silence)."""
+        node = self.node
+        node.get_logger().info("[录音线程] 检测到说话结束")
+        
+        state = node._get_state()
+        node.get_logger().info(f"[录音线程] 说话结束时的状态: {state}")
+        
+        if state == ConversationState.CHECK_VOICE:
+            if node.asr_client and node.asr_client.running:
+                node.asr_client.stop_current_recognition()
+            self._trigger_sv_for_check_voice("VAD")
+            return
+        
+        elif state == ConversationState.AUTHORIZED:
+            if node.asr_client and node.asr_client.running:
+                node.asr_client.stop_current_recognition()
+            if node.sv_enabled:
+                with node.sv_lock:
+                    node.sv_recording = False
+                    buffer_size = len(node.sv_audio_buffer)
+                node.get_logger().debug(f"[声纹] 停止录音，缓冲区大小: {buffer_size}")
+                with node.sv_result_cv:
+                    current_seq = node.sv_result_seq  # Snapshot before waking the SV thread so a fast result is not missed
+                node.sv_speech_end_event.set()
+                
+                # If TTS is playing, wait asynchronously for the verification result and interrupt TTS only if it passes.
+                # A separate thread is used so the recording thread is not blocked, which would affect TTS playback.
+                if node.tts_playing_event.is_set():
+                    node.get_logger().info("[打断] TTS播放中，用户说话结束，异步等待声纹验证结果...")
+                    def _check_sv_and_interrupt():
+                        # Wait for the verification result (at most 2 seconds)
+                        with node.sv_result_cv:
+                            if node.sv_result_cv.wait_for(
+                                lambda: node.sv_result_seq > current_seq,
+                                timeout=2.0
+                            ):
+                                # Verification finished; check the result
+                                with node.sv_lock:
+                                    speaker_id = node.current_speaker_id
+                                    speaker_state = node.current_speaker_state
+                                
+                                if speaker_id and speaker_state == SpeakerState.VERIFIED:
+                                    node.get_logger().info(f"[打断] 声纹验证通过({speaker_id})，中断TTS播放")
+                                    node._interrupt_tts("检测到人声（已授权用户，说话结束）")
+                                else:
+                                    node.get_logger().debug(f"[打断] 声纹验证未通过，不中断TTS（状态: {speaker_state.value}）")
+                            else:
+                                node.get_logger().warning("[打断] 声纹验证超时，不中断TTS")
+                    threading.Thread(target=_check_sv_and_interrupt, daemon=True, name="SVInterruptCheck").start()
+            return
+    
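+    def on_new_segment(self):
+        """Called by the recording thread when a new speech segment starts; audio is buffered for speaker verification instead of interrupting right away."""
+        node = self.node
+        state = node._get_state()
+        if state == ConversationState.AUTHORIZED:
+            # While TTS is playing, detected speech does not interrupt immediately; it is recorded for speaker verification.
+            # TTS is interrupted only after the user finishes speaking (speech_end) and verification passes.
+            # This avoids false triggers from TTS echo while still allowing genuine user interruptions.
+            if node.tts_playing_event.is_set():
+                node.get_logger().debug("[打断] TTS播放中，检测到人声，开始录音用于声纹验证（等待说话结束后验证）")
+                # Recording was already started in on_speech_start, so nothing else is needed here
+            else:
+                # TTS is not playing: check the latest verification result and interrupt right away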
+                if node.sv_enabled and node.sv_client:
+                    with node.sv_lock:
+                        current_speaker_id = node.current_speaker_id
+                        speaker_state = node.current_speaker_state
+                    if speaker_state == SpeakerState.VERIFIED and current_speaker_id:
+                        node._interrupt_tts("检测到人声（已授权用户）")
+                        node.get_logger().info(f"[打断] 已授权用户({current_speaker_id})发言，中断TTS播放")
+                    else:
+                        node.get_logger().debug(f"[打断] 检测到人声，但声纹未验证或未匹配，不中断TTS（当前状态: {speaker_state.value}）")
+                else:
+                    # Speaker verification disabled: interrupt directly (preserves the original behavior)
+                    node._interrupt_tts("检测到人声（未启用声纹）")
+                    node.get_logger().info("[打断] 检测到人声，中断TTS播放")
+        else:
+            node.get_logger().debug(f"[打断] 检测到人声，但当前状态为 {state.value}，非已授权用户，不允许打断TTS")
+    
+    def on_heartbeat(self):
+        """Heartbeat callback from the recording thread during silence."""
+        self.node.get_logger().info("[录音线程] 静音中")
+
+    # ==================== ASR callbacks ====================
+    def on_asr_sentence_end(self, text: str):
+        """ASR sentence_end callback: puts the recognized text on the queue."""
+        node = self.node
+        if not text or not text.strip():
+            return
+        text_clean = text.strip()
+        node.get_logger().info(f"[ASR] 识别完成: {text_clean}")
+        
+        state = node._get_state()
+        
+        # Rule 2: in CHECK_VOICE, if ASR finishes before VAD reports speech_end, trigger speaker verification proactively
+        if state == ConversationState.CHECK_VOICE:
+            if node.sv_enabled and node.sv_client:
+                node.get_logger().info("[ASR] CHECK_VOICE状态，ASR识别完成，主动触发声纹验证")
+            self._trigger_sv_for_check_voice("ASR")
+        
+        # Queue the recognized text for the processing thread (done in every state)
+        node.text_queue.put(text_clean, timeout=1.0)
+    
+    def on_asr_text_update(self, text: str):
+        """Callback for real-time ASR text updates (used for multi-turn hints)."""
+        if not text or not text.strip():
+            return
+        self.node.get_logger().debug(f"[ASR] 识别中: {text.strip()}")
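Both `on_speech_end` above and `process_worker` in node_workers.py wait on `sv_result_cv` until `sv_result_seq` advances past a value they read earlier. The handshake can be sketched in isolation as below. `ResultSignal` and its method names are hypothetical and exist only for this illustration. The sketch shows that a result published before the wait begins is still observed, provided the counter was snapshotted before the worker was triggered.

```python
import threading


class ResultSignal:
    """Sequence counter guarded by a condition variable (illustrative only)."""

    def __init__(self) -> None:
        self.cv = threading.Condition()
        self.seq = 0

    def snapshot(self) -> int:
        """Read the counter; call this BEFORE triggering the worker."""
        with self.cv:
            return self.seq

    def publish(self) -> None:
        """Called by the worker when a new result is ready."""
        with self.cv:
            self.seq += 1
            self.cv.notify_all()

    def wait_newer(self, seen: int, timeout: float) -> bool:
        """Block until the counter exceeds `seen`, or until the timeout."""
        with self.cv:
            return self.cv.wait_for(lambda: self.seq > seen, timeout=timeout)


if __name__ == "__main__":
    signal = ResultSignal()

    seen = signal.snapshot()                 # 1. snapshot first
    worker = threading.Thread(target=signal.publish)
    worker.start()                           # 2. then trigger the worker
    worker.join()                            # result is published before we wait
    print(signal.wait_newer(seen, timeout=0.5))    # the early result is still seen

    late = signal.snapshot()                 # snapshot taken AFTER the result
    print(signal.wait_newer(late, timeout=0.1))    # nothing newer arrives: timeout
```

The second wait illustrates the failure mode: a snapshot taken after the result was published can only end in a timeout.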
--- a/robot_speaker/core/node_workers.py
+++ b/robot_speaker/core/node_workers.py
@@ -0,0 +1,188 @@
+import queue
+import time
+import numpy as np
+
+from robot_speaker.core.conversation_state import ConversationState
+from robot_speaker.perception.speaker_verifier import SpeakerState
+
+
+class NodeWorkers:
+    def __init__(self, node):
+        self.node = node
+    
+    def recording_worker(self):
+        """Thread 1: recording thread, the only real-time thread."""
+        node = self.node
+        node.get_logger().info("[录音线程] 启动")
+        node.audio_recorder.record_with_vad()
+    
+    def asr_worker(self):
+        """Thread 2: ASR inference thread, audio -> text only."""
+        node = self.node
+        node.get_logger().info("[ASR推理线程] 启动")
+        while not node.stop_event.is_set():
+            try:
+                audio_chunk = node.audio_queue.get(timeout=0.1)
+            except queue.Empty:
+                continue
+            if node.interrupt_event.is_set():
+                continue
+            if node.callbacks.should_put_audio_to_queue() and node.asr_client and node.asr_client.running:
+                node.asr_client.send_audio(audio_chunk)
+    
+    def process_worker(self):
+        """Thread 3: main processing thread, handles the business logic."""
+        node = self.node
+        node.get_logger().info("[主线程] 启动")
+        while not node.stop_event.is_set():
+            try:
+                text = node.text_queue.get(timeout=0.1)
+            except queue.Empty:
+                continue
+            
+            node.get_logger().info(f"[主线程] 收到识别文本: {text}")
+            current_state = node._get_state()
+            
+            if current_state == ConversationState.CHECK_VOICE:
+                if node.use_wake_word:
+                    node.get_logger().info(f"[主线程] CHECK_VOICE状态，检查唤醒词，文本: {text}")
+                    processed_text = node.callbacks.handle_wake_word(text)
+                    if not processed_text:
+                        node.get_logger().info(f"[主线程] 未检测到唤醒词（唤醒词配置: '{node.wake_word}'），回到Idle状态")
+                        node._change_state(ConversationState.IDLE, "未检测到唤醒词")
+                        continue
+                    node.get_logger().info(f"[主线程] 检测到唤醒词，处理后的文本: {processed_text}")
+                    text = processed_text
+                
+                if node.sv_enabled and node.sv_client:
+                    node.get_logger().info("[主线程] CHECK_VOICE状态：等待声纹验证结果...")
+                    with node.sv_result_cv:
+                        current_seq = node.sv_result_seq
+                        if not node.sv_result_cv.wait_for(
+                            lambda: node.sv_result_seq > current_seq,
+                            timeout=15.0
+                        ):
+                            node.get_logger().warning("[主线程] CHECK_VOICE状态：声纹结果未ready（超时15秒），拒绝本轮")
+                            with node.sv_lock:
+                                node.sv_audio_buffer.clear()
+                            node._change_state(ConversationState.IDLE, "声纹结果未ready")
+                            continue
+                    node.get_logger().info("[主线程] CHECK_VOICE状态：声纹结果ready，继续处理")
+                    
+                    with node.sv_lock:
+                        speaker_id = node.current_speaker_id
+                        speaker_state = node.current_speaker_state
+                        score = node.current_speaker_score
+                    
+                    if speaker_id and speaker_state == SpeakerState.VERIFIED:
+                        node.get_logger().info(f"[主线程] 声纹验证成功: {speaker_id}, 得分: {score:.4f}")
+                        node._change_state(ConversationState.AUTHORIZED, "声纹验证成功")
+                    else:
+                        node.get_logger().info(f"[主线程] 声纹验证失败，得分: {score:.4f}")
+                        node.callbacks.put_tts_text("声纹验证失败")
+                        node._change_state(ConversationState.IDLE, "声纹验证失败")
+                        continue
+                else:
+                    node._change_state(ConversationState.AUTHORIZED, "未启用声纹")
+            
+            elif current_state == ConversationState.AUTHORIZED:
+                if node.tts_playing_event.is_set():
+                    node.get_logger().debug("[主线程] AUTHORIZED状态，TTS播放中，忽略ASR识别结果（只有VAD检测到已授权用户人声才能中断）")
+                    continue
+            
+            elif current_state == ConversationState.IDLE:
+                node.get_logger().warning("[主线程] Idle状态收到文本，忽略")
+                continue
+
+            if node.use_wake_word and current_state == ConversationState.AUTHORIZED:
+                processed_text = node.callbacks.handle_wake_word(text)
+                if not processed_text:
+                    node._change_state(ConversationState.IDLE, "未检测到唤醒词")
+                    continue
+                text = processed_text
+
+            if node.callbacks.check_shutup_command(text):
+                node.get_logger().info("[主线程] 检测到闭嘴指令")
+                node.interrupt_event.set()
+                node.callbacks.force_stop_tts()
+                node._change_state(ConversationState.IDLE, "用户闭嘴指令")
+                continue
+
+            intent_payload = node.intent_router.route(text)
+            node._handle_intent(intent_payload)
+            
+            if current_state == ConversationState.AUTHORIZED:
+                node.session_start_time = time.time()
+    
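+    def sv_worker(self):
+        """Thread 5: speaker-verification thread, non-real-time and low-frequency (CAM++)."""
+        node = self.node
+        node.get_logger().info("[声纹识别线程] 启动")
+        
+        # Compute the minimum sample count dynamically so that audio downsampled to 16 kHz lasts at least 0.5 s
+        target_sr = 16000  # Target sample rate of the CAM++ model
+        min_duration_seconds = 0.5
+        min_samples_at_target_sr = int(target_sr * min_duration_seconds)  # 8000 samples at 16 kHz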
+        
+        if node.sample_rate >= target_sr:
+            downsample_step = int(node.sample_rate / target_sr)
+            min_audio_samples = min_samples_at_target_sr * downsample_step
+        else:
+            min_audio_samples = int(node.sample_rate * min_duration_seconds)
+        
+        while not node.stop_event.is_set():
+            try:
+                if node.sv_speech_end_event.wait(timeout=0.1):
+                    node.sv_speech_end_event.clear()
+                    with node.sv_lock:
+                        audio_list = list(node.sv_audio_buffer)
+                        buffer_size = len(audio_list)
+                        node.sv_audio_buffer.clear()
+                    
+                    node.get_logger().info(f"[声纹识别] 收到speech_end事件，录音长度: {buffer_size} 样本（{buffer_size/node.sample_rate:.2f}秒）")
+
+                    if node._handle_empty_speaker_db():
+                        node.get_logger().info("[声纹识别] 数据库为空，跳过验证，直接设置UNKNOWN状态")
+                        continue
+                    
+                    if buffer_size >= min_audio_samples:
+                        audio_array = np.array(audio_list, dtype=np.int16)
+                        embedding, success = node.sv_client.extract_embedding(
+                            audio_array, 
+                            sample_rate=node.sample_rate
+                        )
+                        
+                        if not success or embedding is None:
+                            node.get_logger().debug("[声纹识别] 提取embedding失败")
+                            with node.sv_lock:
+                                node.current_speaker_id = None
+                                node.current_speaker_state = SpeakerState.ERROR
+                                node.current_speaker_score = 0.0
+                        else:
+                            speaker_id, match_state, score, _ = node.sv_client.match_speaker(embedding)
+                            with node.sv_lock:
+                                node.current_speaker_id = speaker_id
+                                node.current_speaker_state = match_state
+                                node.current_speaker_score = score
+                            
+                            if match_state == SpeakerState.VERIFIED:
+                                node.get_logger().info(f"[声纹识别] 识别到说话人: {speaker_id}, 相似度: {score:.4f}")
+                            elif match_state == SpeakerState.REJECTED:
+                                node.get_logger().info(f"[声纹识别] 未匹配到已知说话人（相似度不足）, 相似度: {score:.4f}")
+                            else:
+                                node.get_logger().info(f"[声纹识别] 状态: {match_state.value}, 相似度: {score:.4f}")
+                    else:
+                        node.get_logger().debug(f"[声纹识别] 录音太短: {buffer_size} < {min_audio_samples}，跳过处理")
+                        with node.sv_lock:
+                            node.current_speaker_id = None
+                            node.current_speaker_state = SpeakerState.UNKNOWN
+                            node.current_speaker_score = 0.0
+                    
+                    with node.sv_result_cv:
+                        node.sv_result_seq += 1
+                        node.sv_result_cv.notify_all()
+                    
+            except Exception as e:
+                node.get_logger().error(f"[声纹识别线程] 错误: {e}")
+                time.sleep(0.1)
+
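`sv_worker` derives its minimum buffer length from an integer downsampling step. The sketch below reproduces that arithmetic next to a simpler "0.5 s at the native rate" threshold. Both function names are hypothetical, and the resampling done inside `extract_embedding` is not shown in this diff, so this only compares the two thresholds. They agree when the microphone rate is an integer multiple of 16 kHz and differ otherwise.

```python
import math


def min_samples_integer_step(sample_rate: int, target_sr: int = 16000,
                             min_duration_s: float = 0.5) -> int:
    """Mirror of the threshold arithmetic in sv_worker (illustrative copy)."""
    min_at_target = int(target_sr * min_duration_s)
    if sample_rate >= target_sr:
        step = int(sample_rate / target_sr)   # truncates non-integer ratios
        return min_at_target * step
    return int(sample_rate * min_duration_s)


def min_samples_native(sample_rate: int, min_duration_s: float = 0.5) -> int:
    """Alternative: require min_duration_s of audio at the native rate."""
    return math.ceil(sample_rate * min_duration_s)


if __name__ == "__main__":
    for sr in (8000, 16000, 44100, 48000):
        step_based = min_samples_integer_step(sr)
        native = min_samples_native(sr)
        # Duration actually demanded by the step-based threshold, in seconds.
        print(sr, step_based, native, round(step_based / sr, 3))
```

At 44.1 kHz the step-based threshold is 16000 samples, which is about 0.36 s of audio rather than 0.5 s; at 16 kHz and 48 kHz the two thresholds coincide.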
--- a/robot_speaker/core/register_speaker_node.py
+++ b/robot_speaker/core/register_speaker_node.py
@@ -0,0 +1,463 @@
+"""
+Standalone speaker (voiceprint) registration node: exits once registration completes.
+"""
+import collections
+import os
+import queue
+import threading
+import time
+import yaml
+
+import numpy as np
+import rclpy
+from rclpy.node import Node
+from ament_index_python.packages import get_package_share_directory
+
+from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
+from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient
+from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
+from robot_speaker.models.asr.dashscope import DashScopeASR
+from robot_speaker.models.tts.dashscope import DashScopeTTSClient
+from robot_speaker.core.types import TTSRequest
+from pypinyin import pinyin, Style
+
+
+class RegisterSpeakerNode(Node):
+    def __init__(self):
+        super().__init__('register_speaker_node')
+        self._load_config()
+
+        self.stop_event = threading.Event()
+        self.processing = False
+        self.buffer_lock = threading.Lock()
+        self.audio_buffer = collections.deque(maxlen=self.sv_buffer_size)
+        
+        # State flow: waiting for the wake word -> waiting for voiceprint speech
+        self.waiting_for_wake_word = True
+        self.waiting_for_voiceprint = False
+
+        # Audio queue and text queue (used by ASR)
+        self.audio_queue = queue.Queue()
+        self.text_queue = queue.Queue()
+
+        self.vad_detector = VADDetector(
+            mode=self.vad_mode,
+            sample_rate=self.sample_rate
+        )
+
+        # Create the reference-signal buffer (used for echo cancellation)
+        self.reference_signal_buffer = ReferenceSignalBuffer(
+            max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
+            sample_rate=self.sample_rate,
+            channels=self.output_channels
+        ) if self.audio_echo_cancellation_enabled else None
+
+        self.audio_recorder = AudioRecorder(
+            device_index=self.input_device_index,
+            sample_rate=self.sample_rate,
+            channels=self.channels,
+            chunk=self.chunk,
+            vad_detector=self.vad_detector,
+            audio_queue=self.audio_queue,  # Sent to ASR for wake-word detection
+            silence_duration_ms=self.silence_duration_ms,
+            min_energy_threshold=self.min_energy_threshold,
+            heartbeat_interval=self.audio_microphone_heartbeat_interval,
+            on_heartbeat=self._on_heartbeat,
+            is_playing=lambda: False,
+            on_new_segment=None,
+            on_speech_start=self._on_speech_start,
+            on_speech_end=self._on_speech_end,
+            stop_flag=self.stop_event.is_set,
+            on_audio_chunk=self._on_audio_chunk,
+            should_put_to_queue=self._should_put_to_queue,
+            get_silence_threshold=lambda: self.silence_duration_ms,
+            enable_echo_cancellation=self.audio_echo_cancellation_enabled,  # Enable echo cancellation, consistent with the main node
+            reference_signal_buffer=self.reference_signal_buffer,
+            logger=self.get_logger()
+        )
+        
+        # ASR client, used for wake-word detection
+        self.asr_client = DashScopeASR(
+            api_key=self.dashscope_api_key,
+            sample_rate=self.sample_rate,
+            model=self.asr_model,
+            url=self.asr_url,
+            logger=self.get_logger()
+        )
+        self.asr_client.on_sentence_end = self._on_asr_sentence_end
+        self.asr_client.start()
+        
+        # ASR worker thread
+        self.asr_thread = threading.Thread(
+            target=self._asr_worker,
+            name="RegisterASRThread",
+            daemon=True
+        )
+        self.asr_thread.start()
+        
+        # Text-processing thread
+        self.text_thread = threading.Thread(
+            target=self._text_worker,
+            name="RegisterTextThread",
+            daemon=True
+        )
+        self.text_thread.start()
+
+        self.sv_client = SpeakerVerificationClient(
+            model_path=self.sv_model_path,
+            threshold=self.sv_threshold,
+            speaker_db_path=self.sv_speaker_db_path,
+            logger=self.get_logger()
+        )
+
+        self.tts_client = DashScopeTTSClient(
+            api_key=self.dashscope_api_key,
+            model=self.tts_model,
+            voice=self.tts_voice,
+            card_index=self.output_card_index,
+            device_index=self.output_device_index,
+            output_sample_rate=self.output_sample_rate,
+            output_channels=self.output_channels,
+            output_volume=self.output_volume,
+            tts_source_sample_rate=self.audio_tts_source_sample_rate,
+            tts_source_channels=self.audio_tts_source_channels,
+            tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
+            reference_signal_buffer=self.reference_signal_buffer,
+            logger=self.get_logger()
+        )
+
+        self.get_logger().info("声纹注册节点启动，请说'er gou......'唤醒注册")
+        self.recording_thread = threading.Thread(
+            target=self.audio_recorder.record_with_vad,
+            name="RegisterRecordingThread",
+            daemon=True
+        )
+        self.recording_thread.start()
+
+        self.timer = self.create_timer(0.2, self._check_done)
+
+    def _load_config(self):
+        config_file = os.path.join(
+            get_package_share_directory('robot_speaker'),
+            'config',
+            'voice.yaml'
+        )
+        with open(config_file, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)
+
+        dashscope = config['dashscope']
+        audio = config['audio']
+        mic = audio['microphone']
+        soundcard = audio['soundcard']
+        vad = config['vad']
+        system = config['system']
+
+        self.dashscope_api_key = dashscope['api_key']
+        self.asr_model = dashscope['asr']['model']
+        self.asr_url = dashscope['asr']['url']
+        self.tts_model = dashscope['tts']['model']
+        self.tts_voice = dashscope['tts']['voice']
+        
+        self.input_device_index = mic['device_index']
+        self.sample_rate = mic['sample_rate']
+        self.channels = mic['channels']
+        self.chunk = mic['chunk']
+        self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']
+
+        self.output_card_index = soundcard['card_index']
+        self.output_device_index = soundcard['device_index']
+        self.output_sample_rate = soundcard['sample_rate']
+        self.output_channels = soundcard['channels']
+        self.output_volume = soundcard['volume']
+
+        echo = audio.get('echo_cancellation', {})
+        self.audio_echo_cancellation_enabled = echo.get('enabled', True)  # Enabled by default
+        self.audio_echo_cancellation_max_duration_ms = echo.get('max_duration_ms', 200)
+
+        tts_audio = audio.get('tts', {})
+        self.audio_tts_source_sample_rate = tts_audio.get('source_sample_rate', 22050)
+        self.audio_tts_source_channels = tts_audio.get('source_channels', 1)
+        self.audio_tts_ffmpeg_thread_queue_size = tts_audio.get('ffmpeg_thread_queue_size', 5)
+
+        self.vad_mode = vad['vad_mode']
+        self.silence_duration_ms = vad['silence_duration_ms']
+        self.min_energy_threshold = vad['min_energy_threshold']
+
+        self.sv_model_path = os.path.expanduser(system['sv_model_path'])
+        self.sv_threshold = system['sv_threshold']
+        self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path'])
+        self.sv_buffer_size = system['sv_buffer_size']
+        self.wake_word = system['wake_word']
+
+    def _should_put_to_queue(self) -> bool:
+        """Return True if audio should go to the ASR queue (only while waiting for the wake word)."""
+        return self.waiting_for_wake_word
+    
+    def _on_heartbeat(self):
+        if self.waiting_for_wake_word:
+            self.get_logger().info("[注册录音] 等待唤醒词'er gou'...")
+        elif self.waiting_for_voiceprint:
+            self.get_logger().info("[注册录音] 等待声纹语音...")
+
+    def _on_speech_start(self):
+        if self.waiting_for_wake_word:
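+            # While waiting for the wake word, start recording (the audio may contain the wake word)
+            self.get_logger().info("[注册录音] 检测到人声，开始录音")
+        elif self.waiting_for_voiceprint:
+            self.get_logger().info("[注册录音] 检测到人声，继续录音（用于声纹注册）")
+        # Note: the buffer is not cleared, so audio containing the wake word is kept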
+
+    def _on_audio_chunk(self, audio_chunk: bytes):
+        # Record all audio (wake word included) for voiceprint enrollment
+        try:
+            audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
+            with self.buffer_lock:
+                self.audio_buffer.extend(audio_array)
+        except Exception as e:
+            self.get_logger().debug(f"[注册录音] 录音失败: {e}")
+
+    def _on_speech_end(self):
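+        # Still waiting for the wake word: nothing to process
+        if self.waiting_for_wake_word:
+            return
+        # Already processing: do not process again
+        if self.processing:
+            return
+        
+        # While waiting for voiceprint speech, the user has finished speaking: use the current audio (even if shorter than 3 seconds)
+        if self.waiting_for_voiceprint:
+            self._process_voiceprint_audio(use_current_audio_if_short=True)
+            return  # return right after processing to avoid a repeated call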
+
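+    def _process_voiceprint_audio(self, use_current_audio_if_short: bool = False):
+        """Process voiceprint audio: register the speaker from the buffered utterance.
+        
+        Args:
+            use_current_audio_if_short: If the audio is shorter than 3 seconds, whether to use it anyway (for when the user has already finished speaking)
+        """
+        # Check and set the flag under the lock: this method can be reached from
+        # both _check_wake_word and _on_speech_end, which may run concurrently.
+        with self.buffer_lock:
+            if self.processing:
+                return
+            self.processing = True
+            audio_list = list(self.audio_buffer)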
+
+        buffer_size = len(audio_list)
+        buffer_sec = buffer_size / self.sample_rate
+        self.get_logger().info(f"[注册录音] 当前音频长度: {buffer_sec:.2f}秒")
+
+        required_samples = int(self.sample_rate * 3)
+        
+        # The audio is shorter than 3 seconds
+        if buffer_size < required_samples:
+            if use_current_audio_if_short:
+                # The user has finished speaking: use the current audio (even if shorter than 3 seconds)
+                self.get_logger().info(f"[注册录音] 音频不足3秒（当前{buffer_sec:.2f}秒），但用户已说完，使用当前音频进行注册")
+                audio_to_use = audio_list
+            else:
+                # Wait for more audio
+                self.get_logger().info(f"[注册录音] 音频不足3秒（当前{buffer_sec:.2f}秒），等待继续录音...")
+                self.processing = False
+                return
+        else:
+            # Keep only the most recent 3 seconds. Wake-word detection lags,
+            # so "er gou" tends to sit in the middle-to-late part of the buffer,
+            # and the tail window is the most likely to contain it together
+            # with the speech that follows.
+            audio_to_use = audio_list[-required_samples:]
+            
+            duration = len(audio_to_use) / self.sample_rate
+            self.get_logger().info(f"[注册录音] 使用最近 {duration:.2f} 秒音频用于注册（覆盖唤醒词）")
+        
+        # Clear the buffer
+        with self.buffer_lock:
+            self.audio_buffer.clear()
+
+        try:
+            audio_array = np.array(audio_to_use, dtype=np.int16)
+            embedding, success = self.sv_client.extract_embedding(
+                audio_array,
+                sample_rate=self.sample_rate
+            )
+            if not success or embedding is None:
+                self.get_logger().error("[注册录音] 提取embedding失败")
+                self.processing = False
+                return
+
+            speaker_id = f"user_{int(time.time())}"
+            if self.sv_client.register_speaker(speaker_id, embedding):
+                self.get_logger().info(f"[注册录音] 注册成功，用户ID: {speaker_id}，准备退出")
+                
+                # Play the success prompt
+                try:
+                    self.get_logger().info("[注册录音] 播放注册成功提示")
+                    request = TTSRequest(text="声纹注册成功", voice=self.tts_voice)
+                    self.tts_client.synthesize(request)
+                    time.sleep(5) 
+                except Exception as e:
+                    self.get_logger().error(f"[注册录音] 播放提示失败: {e}")
+                
+                self.stop_event.set()
+            else:
+                self.get_logger().error("[注册录音] 注册失败")
+                self.processing = False
+        except Exception as e:
+            self.get_logger().error(f"[注册录音] 注册异常: {e}")
+            self.processing = False
+    
+    def _extract_speech_segments(self, audio_array: np.ndarray, frame_size: int = 1024) -> list:
+        """Extract speech segments by frame energy (drops silence)."""
+        speech_segments = []
+        frame_samples = frame_size
+        total_frames = 0
+        speech_frames = 0
+        
+        # Use half of the configured threshold so quiet speech is not mistaken for silence.
+        # The value is constant, so it is computed once outside the loop.
+        threshold = self.min_energy_threshold * 0.50
+        
+        for i in range(0, len(audio_array), frame_samples):
+            frame = audio_array[i:i + frame_samples]
+            if len(frame) < frame_samples:
+                break
+            
+            total_frames += 1
+            # Frame energy as RMS (int16 is cast to float32 first to avoid overflow when squaring)
+            frame_float = frame.astype(np.float32)
+            energy = np.sqrt(np.mean(frame_float ** 2))
+            
+            # Energy at or above the threshold counts as speech
+            if energy >= threshold:
+                speech_segments.append((i, i + frame_samples))
+                speech_frames += 1
+        
+        # Debug info (logs the effective threshold, not the configured one)
+        if total_frames > 0:
+            speech_ratio = speech_frames / total_frames
+            self.get_logger().debug(f"[注册录音] 能量检测: 总帧数={total_frames}, 人声帧数={speech_frames}, 人声比例={speech_ratio:.2%}, 阈值={threshold}")
+        
+        return speech_segments
+    
+    def _merge_speech_segments(self, audio_array: np.ndarray, segments: list, min_samples: int) -> np.ndarray:
+        """Merge speech segments and return contiguous speech audio."""
+        if not segments:
+            return np.array([], dtype=np.int16)
+        
+        # Merge adjacent segments
+        merged_segments = []
+        current_start, current_end = segments[0]
+        
+        for start, end in segments[1:]:
+            if start <= current_end + 1024:  # allow a small gap (one frame at the default 1024-sample frame size)
+                current_end = end
+            else:
+                merged_segments.append((current_start, current_end))
+                current_start, current_end = start, end
+        merged_segments.append((current_start, current_end))
+        
+        # Select segments from the end backwards until min_samples is reached
+        selected_audio = []
+        total_samples = 0
+        
+        for start, end in reversed(merged_segments):
+            segment_audio = audio_array[start:end]
+            selected_audio.insert(0, segment_audio)
+            total_samples += len(segment_audio)
+            if total_samples >= min_samples:
+                break
+        
+        if not selected_audio:
+            return np.array([], dtype=np.int16)
+        
+        return np.concatenate(selected_audio)
+    
+    def _asr_worker(self):
+        """ASR worker thread."""
+        while not self.stop_event.is_set():
+            try:
+                audio_chunk = self.audio_queue.get(timeout=0.1)
+                if self.asr_client and self.asr_client.running:
+                    self.asr_client.send_audio(audio_chunk)
+            except queue.Empty:
+                continue
+            except Exception as e:
+                self.get_logger().error(f"[注册ASR] 处理异常: {e}")
+    
+    def _on_asr_sentence_end(self, text: str):
+        """Callback invoked when ASR finishes recognizing a sentence."""
+        if text and text.strip():
+            self.text_queue.put(text.strip())
+    
+    def _text_worker(self):
+        """Text worker thread: detects the wake word."""
+        while not self.stop_event.is_set():
+            try:
+                text = self.text_queue.get(timeout=0.1)
+                if self.waiting_for_wake_word:
+                    self._check_wake_word(text)
+            except queue.Empty:
+                continue
+            except Exception as e:
+                self.get_logger().error(f"[注册文本] 处理异常: {e}")
+    
+    def _to_pinyin(self, text: str) -> str:
+        """Convert Chinese text to pinyin."""
+        chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
+        if not chars:
+            return ""
+        py_list = pinyin(chars, style=Style.NORMAL)
+        return ' '.join([item[0] for item in py_list]).lower().strip()
+    
+    def _check_wake_word(self, text: str):
+        """Check whether the text contains the wake word."""
+        text_pinyin = self._to_pinyin(text)
+        wake_word_pinyin = self.wake_word.lower().strip()
+        self.get_logger().info(f"[注册唤醒词] 原始文本: {text}, 文本拼音: {text_pinyin}, 唤醒词拼音: {wake_word_pinyin}")
+        
+        if not wake_word_pinyin:
+            return
+        
+        text_pinyin_parts = text_pinyin.split() if text_pinyin else []
+        wake_word_parts = wake_word_pinyin.split()
+        
+        # Look for the wake-word syllables as a contiguous run of pinyin tokens
+        for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
+            if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
+                self.get_logger().info(f"[注册唤醒词] 检测到唤醒词 '{self.wake_word}'")
+                self.get_logger().info("=" * 50)
+                self.get_logger().info("[声纹注册] 开始注册声纹，将截取3秒音频用于注册")
+                self.get_logger().info("=" * 50)
+                self.waiting_for_wake_word = False
+                self.waiting_for_voiceprint = True
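+                # Stop ASR; recognition is no longer needed
+                if self.asr_client:
+                    self.asr_client.stop_current_recognition()
+                # Process the audio already in the buffer right away:
+                # the user may have finished speaking (one utterance containing the wake word)
+                self._process_voiceprint_audio()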
+                return
+
+    def _check_done(self):
+        if self.stop_event.is_set():
+            self.get_logger().info("注册完成，节点退出")
+            # Release resources
+            if self.asr_client:
+                self.asr_client.stop()
+            self.destroy_node()
+            rclpy.shutdown()
+
+
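+def main(args=None):
+    rclpy.init(args=args)
+    node = RegisterSpeakerNode()
+    try:
+        rclpy.spin(node)
+    except KeyboardInterrupt:
+        pass
+    finally:
+        # _check_done may already have destroyed the node and shut rclpy down
+        if rclpy.ok():
+            node.destroy_node()
+            rclpy.shutdown()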
+
+
+if __name__ == '__main__':
+    main()
+
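The wake-word check in `_check_wake_word` above compares pinyin syllables as whole tokens rather than as a raw substring. The sketch below isolates that sliding-window comparison so it can be read and tested on its own. The function name `contains_wake_word` is hypothetical and not part of the package, and the input is assumed to be pinyin that has already been converted and space-separated (the node does this with `pypinyin`).

```python
def contains_wake_word(text_pinyin: str, wake_word: str) -> bool:
    """Return True if the wake-word syllables appear as a contiguous run of tokens.

    Hypothetical helper for illustration: both arguments are space-separated
    pinyin strings, for example "er gou ni hao" and "er gou".
    """
    text_parts = text_pinyin.lower().split()
    wake_parts = wake_word.lower().split()
    if not wake_parts:
        # An empty wake word never matches
        return False

    window = len(wake_parts)
    # Slide a window of len(wake_parts) tokens over the text;
    # the range is empty when the text is shorter than the wake word.
    return any(
        text_parts[i:i + window] == wake_parts
        for i in range(len(text_parts) - window + 1)
    )


if __name__ == "__main__":
    print(contains_wake_word("er gou ni hao", "er gou"))
    print(contains_wake_word("la yi", "a yi"))
```

Matching on tokens means a wake word such as `a yi` does not fire on `la yi`, which a plain substring test would accept.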
+
--- a/robot_speaker/core/robot_speaker_node.py
+++ b/robot_speaker/core/robot_speaker_node.py
@@ -0,0 +1,858 @@
+"""
+Voice interaction node
+"""
+import rclpy
+from rclpy.node import Node
+from std_msgs.msg import String
+import threading
+import queue
+import time
+import re
+import base64
+import io
+import numpy as np
+from PIL import Image
+import subprocess
+import collections
+import os
+import yaml
+import json
+from ament_index_python.packages import get_package_share_directory
+from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
+from robot_speaker.models.asr.dashscope import DashScopeASR
+from robot_speaker.models.tts.dashscope import DashScopeTTSClient
+from robot_speaker.models.llm.dashscope import DashScopeLLM
+from robot_speaker.understanding.context_manager import ConversationHistory
+from robot_speaker.core.types import LLMMessage, TTSRequest
+from robot_speaker.perception.camera_client import CameraClient
+from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient, SpeakerState
+from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
+from robot_speaker.core.conversation_state import ConversationState
+from robot_speaker.core.node_workers import NodeWorkers
+from robot_speaker.core.node_callbacks import NodeCallbacks
+from robot_speaker.core.intent_router import IntentRouter, IntentResult
+
+
+class RobotSpeakerNode(Node):
+    # ==================== Initialization ====================
+    def __init__(self):
+        super().__init__('robot_speaker_node')
+        
+        # Load parameters directly from the config file
+        self._load_config()
+
+        # Queues for inter-thread communication
+        self.audio_queue = queue.Queue()  # recording thread → ASR thread
+        self.text_queue = queue.Queue()   # ASR thread → main thread
+        self.tts_queue = queue.Queue()    # main thread → TTS thread
+        
+        # Thread synchronization events
+        self.interrupt_event = threading.Event()  # interrupt flag
+        self.stop_event = threading.Event()       # stop flag
+        self.tts_playing_event = threading.Event()  # TTS playback state
+
+        # Session management
+        self.session_active = False
+        self.session_start_time = 0.0
+        self.session_lock = threading.Lock()
+        
+        # State machine
+        self.conversation_state = ConversationState.IDLE  # current conversation state
+        self.state_lock = threading.Lock()  # guards the state machine state
+        
+        # Shared speaker verification (SV) state
+        self.current_speaker_id = None  # current speaker ID (shared state, read-only)
+        self.current_speaker_state = SpeakerState.UNKNOWN  # current speaker state
+        self.current_speaker_score = 0.0  # similarity score of the current speaker
+        self.sv_lock = threading.Lock()  # guards the shared SV state
+        self.sv_speech_end_event = threading.Event()  # wakes the SV thread (set on speech_end)
+        self.sv_result_ready_event = threading.Event()  # kept for compatibility (no longer used for synchronization)
+        self.sv_result_lock = threading.Lock()  # lock for the SV result sequence number
+        self.sv_result_cv = threading.Condition(self.sv_result_lock)
+        self.sv_result_seq = 0
+        # The SV buffer size is set in _init_components (the parameters must be read first)
+        self.sv_audio_buffer = None  # SV recording buffer (created in _init_components)
+        self.sv_recording = False  # whether audio is being recorded for speaker verification
+        
+        # Utterance tracking
+        self.utterance_lock = threading.Lock()
+        self.current_utterance_id = 0
+        self.last_processed_utterance_id = 0
+
+        self.intent_router = IntentRouter()
+        self.callbacks = NodeCallbacks(self)
+
+        # Initialize components (VAD, recorder, ASR, LLM, TTS)
+        self._init_components()
+        self.workers = NodeWorkers(self)
+        
+        # Prompt for enrollment if the voiceprint database is empty
+        if self.sv_enabled and self.sv_client:
+            speaker_count = self.sv_client.get_speaker_count()
+            if speaker_count == 0:
+                self.get_logger().info("声纹数据库为空，请注册声纹")
+
+        # ROS subscriptions and publishers
+        self.interrupt_sub = self.create_subscription(
+            String, 'interrupt_command', self.callbacks.handle_interrupt_command, self.system_interrupt_command_queue_depth
+        )
+        self.skill_sequence_pub = self.create_publisher(String, '/llm_skill_sequence', 10)
+        self.skill_feedback_sub = self.create_subscription(
+            String, '/skill_execution_feedback', self._on_skill_feedback, 10
+        )
+        self.skill_result_sub = self.create_subscription(
+            String, '/skill_execution_result', self._on_skill_result, 10
+        )
+
+        self.latest_skill_feedback = None
+        self.latest_skill_result = None
+
+        # Start the worker threads
+        self._start_threads()
+        self.get_logger().info("语音节点已启动")
+    
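+    # ==================== Config loading ====================
+    def _load_config(self):
+        """Load parameters directly from the voice.yaml config file."""
+        config_file = os.path.join(
+            get_package_share_directory('robot_speaker'),
+            'config',
+            'voice.yaml'
+        )
+        # Read as UTF-8 explicitly so non-ASCII values do not depend on the system locale
+        with open(config_file, 'r', encoding='utf-8') as f:
+            config = yaml.safe_load(f)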
+        
+        # Audio parameters
+        audio = config['audio']
+        mic = audio['microphone']
+        soundcard = audio['soundcard']
+        echo = audio['echo_cancellation']
+        tts_audio = audio['tts']
+        
+        self.input_device_index = mic['device_index']
+        self.output_card_index = soundcard['card_index']
+        self.output_device_index = soundcard['device_index']
+        self.sample_rate = mic['sample_rate']
+        self.channels = mic['channels']
+        self.chunk = mic['chunk']
+        self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']
+        self.output_sample_rate = soundcard['sample_rate']
+        self.output_channels = soundcard['channels']
+        self.output_volume = soundcard['volume']
+        self.audio_echo_cancellation_enabled = echo.get('enabled', True)  # enabled by default
+        self.audio_echo_cancellation_max_duration_ms = echo['max_duration_ms']
+        self.audio_tts_source_sample_rate = tts_audio['source_sample_rate']
+        self.audio_tts_source_channels = tts_audio['source_channels']
+        self.audio_tts_ffmpeg_thread_queue_size = tts_audio['ffmpeg_thread_queue_size']
+        
+        # VAD parameters
+        vad = config['vad']
+        self.vad_mode = vad['vad_mode']
+        self.silence_duration_ms = vad['silence_duration_ms']
+        self.min_energy_threshold = vad['min_energy_threshold']
+        
+        # DashScope parameters
+        dashscope = config['dashscope']
+        self.dashscope_api_key = dashscope['api_key']
+        self.asr_model = dashscope['asr']['model']
+        self.asr_url = dashscope['asr']['url']
+        self.llm_model = dashscope['llm']['model']
+        self.llm_base_url = dashscope['llm']['base_url']
+        self.llm_temperature = dashscope['llm']['temperature']
+        self.llm_max_tokens = dashscope['llm']['max_tokens']
+        self.llm_max_history = dashscope['llm']['max_history']
+        self.llm_summary_trigger = dashscope['llm']['summary_trigger']
+        self.tts_model = dashscope['tts']['model']
+        self.tts_voice = dashscope['tts']['voice']
+        
+        # System parameters
+        system = config['system']
+        self.use_llm = system['use_llm']
+        self.use_wake_word = system['use_wake_word']
+        self.wake_word = system['wake_word']
+        self.session_timeout = system['session_timeout']
+        self.system_shutup_keywords = system['shutup_keywords']
+        self.system_interrupt_command_queue_depth = system['interrupt_command_queue_depth']
+        self.sv_enabled = system['sv_enabled']
+        self.sv_model_path = os.path.expanduser(system['sv_model_path'])
+        self.sv_threshold = system['sv_threshold']
+        self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path'])  # expand the user directory (~)
+        self.sv_buffer_size = system['sv_buffer_size']
+        
+        # Camera parameters
+        camera = config['camera']
+        self.camera_serial_number = camera['serial_number']
+        self.camera_rgb_width = camera['rgb']['width']
+        self.camera_rgb_height = camera['rgb']['height']
+        self.camera_rgb_fps = camera['rgb']['fps']
+        self.camera_rgb_format = camera['rgb']['format']
+        self.camera_image_jpeg_quality = camera['image']['jpeg_quality']
+        self.camera_image_max_size = camera['image']['max_size']
+
+        self.knowledge_file = os.path.join(
+            get_package_share_directory('robot_speaker'),
+            'config',
+            'knowledge.json'
+        )
+   
+    # ==================== Component initialization ====================
+    def _init_components(self):
+        """Initialize all components."""
+        self.shutup_keywords = [k.strip() for k in self.system_shutup_keywords.split(',') if k.strip()]
+        
+        self.kb_answers_map = {}
+        if self.knowledge_file and os.path.exists(self.knowledge_file):
+            try:
+                with open(self.knowledge_file, 'r', encoding='utf-8') as f:
+                    kb_data = json.load(f)
+                entries = kb_data["entries"]
+                for entry in entries:
+                    patterns = entry["patterns"]
+                    answer = entry["answer"]
+                    if not answer.strip():
+                        continue
+                    for pattern in patterns:
+                        key = pattern.strip().lower()
+                        if key:
+                            self.kb_answers_map[key] = answer.strip()
+                self.get_logger().info(f"知识库已加载: {len(self.kb_answers_map)} 条")
+            except Exception as e:
+                self.get_logger().warning(f"知识库加载失败: {e}")
+
+        self.sv_audio_buffer = collections.deque(maxlen=self.sv_buffer_size)
+        
+        self.vad_detector = VADDetector(
+            mode=self.vad_mode,
+            sample_rate=self.sample_rate
+        )
+        
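+        # Reference signal buffer for echo cancellation. Playback runs at 44100 Hz, but the microphone input is 16 kHz, so the buffer uses the microphone sample rate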
+        self.reference_signal_buffer = ReferenceSignalBuffer(
+            max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
+            sample_rate=self.sample_rate,
+            channels=self.output_channels
+        ) if self.audio_echo_cancellation_enabled else None
+        
+        # Recorder: pushes audio chunks straight into the queue
+        self.audio_recorder = AudioRecorder(
+            device_index=self.input_device_index,
+            sample_rate=self.sample_rate,
+            channels=self.channels,
+            chunk=self.chunk,
+            vad_detector=self.vad_detector,
+            audio_queue=self.audio_queue,
+            silence_duration_ms=self.silence_duration_ms,
+            min_energy_threshold=self.min_energy_threshold,
+            heartbeat_interval=self.audio_microphone_heartbeat_interval,
+            on_heartbeat=self.callbacks.on_heartbeat,
+            is_playing=self.tts_playing_event.is_set,
+            on_new_segment=self.callbacks.on_new_segment,
+            on_speech_start=self.callbacks.on_speech_start,
+            on_speech_end=self.callbacks.on_speech_end,
+            stop_flag=self.stop_event.is_set,
+            on_audio_chunk=self.callbacks.on_audio_chunk_for_sv if self.sv_enabled else None,  # recording callback for speaker verification
+            should_put_to_queue=self.callbacks.should_put_audio_to_queue,  # decides whether audio should be queued
+            get_silence_threshold=self.callbacks.get_silence_threshold,  # dynamic silence threshold callback
+            enable_echo_cancellation=self.audio_echo_cancellation_enabled,  # read from the config file
+            reference_signal_buffer=self.reference_signal_buffer,  # pass the reference signal buffer
+            logger=self.get_logger()
+        )
+        
+        # ASR client: streaming recognition
+        self.asr_client = DashScopeASR(
+            api_key=self.dashscope_api_key,
+            sample_rate=self.sample_rate,
+            model=self.asr_model,
+            url=self.asr_url,
+            logger=self.get_logger()
+        )
+        self.asr_client.on_sentence_end = self.callbacks.on_asr_sentence_end
+        self.asr_client.on_text_update = self.callbacks.on_asr_text_update
+        self.asr_client.start()
+        
+        # LLM client
+        if self.use_llm:
+            self.llm_client = DashScopeLLM(
+                api_key=self.dashscope_api_key,
+                model=self.llm_model,
+                base_url=self.llm_base_url,
+                temperature=self.llm_temperature,
+                max_tokens=self.llm_max_tokens,
+                name="LLM-chat",
+                logger=self.get_logger()
+            )
+            self.history = ConversationHistory(
+                max_history=self.llm_max_history,
+                summary_trigger=self.llm_summary_trigger
+            )
+        else:
+            self.llm_client = None
+            self.history = None
+        
+        # TTS client
+        self.get_logger().info(f"TTS配置: model={self.tts_model}, voice={self.tts_voice}")
+        self.get_logger().info(f"音频输出配置: sample_rate={self.output_sample_rate}, channels={self.output_channels}")
+        self.tts_client = DashScopeTTSClient(
+            api_key=self.dashscope_api_key,
+            model=self.tts_model,
+            voice=self.tts_voice,
+            card_index=self.output_card_index,
+            device_index=self.output_device_index,
+            output_sample_rate=self.output_sample_rate,
+            output_channels=self.output_channels,
+            output_volume=self.output_volume,
+            tts_source_sample_rate=self.audio_tts_source_sample_rate,
+            tts_source_channels=self.audio_tts_source_channels,
+            tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
+            reference_signal_buffer=self.reference_signal_buffer,  # pass the reference signal buffer
+            logger=self.get_logger()
+        )
+        
+        # Camera client (kept running by default)
+        try:
+            self.camera_client = CameraClient(
+                serial_number=self.camera_serial_number,
+                width=self.camera_rgb_width,
+                height=self.camera_rgb_height,
+                fps=self.camera_rgb_fps,
+                format=self.camera_rgb_format,
+                logger=self.get_logger()
+            )
+            self.camera_client.initialize()
+        except Exception as e:
+            self.get_logger().warning(f"相机初始化失败: {e}，相机功能将不可用")
+            self.camera_client = None
+        
+        # Speaker verification client
+        if self.sv_enabled and self.sv_model_path:
+            try:
+                self.sv_client = SpeakerVerificationClient(
+                    model_path=self.sv_model_path,
+                    threshold=self.sv_threshold,
+                    speaker_db_path=self.sv_speaker_db_path,
+                    logger=self.get_logger()
+                )
+            except Exception as e:
+                self.get_logger().warning(f"声纹识别初始化失败: {e}，声纹功能将不可用")
+                self.sv_client = None
+                self.sv_enabled = False
+        else:
+            self.sv_client = None
+    
+    # ==================== Thread startup ====================
+    def _start_threads(self):
+        """Start the worker threads."""
+        # Thread 1: recording
+        self.recording_thread = threading.Thread(
+            target=self.workers.recording_worker, 
+            name="RecordingThread",
+            daemon=True
+        )
+        self.recording_thread.start()
+
+        # Thread 2: ASR inference
+        self.asr_thread = threading.Thread(
+            target=self.workers.asr_worker,
+            name="ASRThread",
+            daemon=True
+        )
+        self.asr_thread.start()
+
+        # Thread 3: main processing thread (business logic)
+        self.process_thread = threading.Thread(
+            target=self.workers.process_worker,
+            name="ProcessThread",
+            daemon=True
+        )
+        self.process_thread.start()
+
+        # Thread 4: TTS playback
+        self.tts_thread = threading.Thread(
+            target=self._tts_worker,
+            name="TTSThread",
+            daemon=True
+        )
+        self.tts_thread.start()
+        
+        # Thread 5: speaker verification (if enabled)
+        if self.sv_enabled and self.sv_client:
+            self.sv_thread = threading.Thread(
+                target=self.workers.sv_worker,
+                name="SVThread",
+                daemon=True
+            )
+            self.sv_thread.start()
+        else:
+            self.sv_thread = None
+    
+    # ==================== TTS playback thread ====================
+    def _tts_worker(self):
+        """
+        Thread 4: TTS playback thread (playback only).
+        """
+        self.get_logger().info("[TTS播放线程] 启动")
+        while not self.stop_event.is_set():
+            try:
+                text = self.tts_queue.get(timeout=1.0)
+            except queue.Empty:
+                if self.interrupt_event.is_set():
+                    self.get_logger().debug("[TTS播放线程] 检测到中断事件")
+                continue
+
+            if self.interrupt_event.is_set():
+                self.get_logger().info("[TTS播放线程] 中断播放，跳过文本")
+                continue
+
+            if not text or not str(text).strip():
+                continue
+            
+            text_str = str(text).strip()
+            text_len = len(text_str)
+            self.get_logger().info(f"[TTS播放线程] 开始播放: {text_str[:100]}... (总长度: {text_len}字符)")
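+            self.tts_playing_event.set()
+
+            try:
+                request = TTSRequest(text=text_str, voice=None)
+                success = self.tts_client.synthesize(
+                    request,
+                    interrupt_check=lambda: self.interrupt_event.is_set()
+                )
+                if success:
+                    self.get_logger().info("[TTS播放线程] 播放完成")
+                else:
+                    self.get_logger().info("[TTS播放线程] 播放被中断")
+            except Exception as e:
+                self.get_logger().error(f"[TTS播放线程] 播放异常: {e}")
+            finally:
+                # Always clear the playing flag, even if synthesis raised,
+                # so the recorder is not left in the "playing" state.
+                self.tts_playing_event.clear()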
+
+            if self.interrupt_event.is_set():
+                self.get_logger().info("[TTS播放线程] 播放完成后检测到中断，清空队列")
+                self._drain_queue(self.tts_queue)
+                self.interrupt_event.clear()
+    
+    # ==================== State machine methods ====================
+    def _change_state(self, new_state: ConversationState, reason: str | None = None):
+        """Change the state machine state."""
+        with self.state_lock:
+            old_state = self.conversation_state
+            self.conversation_state = new_state
+            if reason:
+                self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}: {reason}")
+            else:
+                self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}")
+    
+    def _get_state(self) -> ConversationState:
+        """Return the current state."""
+        with self.state_lock:
+            return self.conversation_state
+    
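+    # ==================== LLM processing (with camera capture) ====================
+    def _encode_image_to_base64(self, image_data: np.ndarray, quality: int = 85) -> str:
+        """Encode a numpy image array as a base64 JPEG string."""
+        try:
+            # Let Pillow infer the mode; this also accepts 2-D grayscale arrays
+            pil_image = Image.fromarray(image_data)
+            if pil_image.mode not in ('RGB', 'L'):
+                # JPEG cannot store an alpha channel, so convert e.g. RGBA to RGB
+                pil_image = pil_image.convert('RGB')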
+            
+            buffer = io.BytesIO()
+            pil_image.save(buffer, format='JPEG', quality=quality)
+            image_bytes = buffer.getvalue()
+            base64_str = base64.b64encode(image_bytes).decode('utf-8')
+            return base64_str
+        except Exception as e:
+            self.get_logger().error(f"图像编码失败: {e}")
+            return ""
+    
+    def _llm_process_stream_with_camera(
+        self,
+        user_text: str,
+        need_camera: bool,
+        system_prompt: str | None = None,
+        suppress_tts: bool = False
+    ) -> str:
+        """Streaming LLM processing with multimodal support (text + image)."""
+        if not self.llm_client or not self.history:
+            return ""
+        
+        messages = list(self.history.get_messages())
+        
+        has_system_msg = any(msg.role == "system" for msg in messages)
+        if not has_system_msg:
+            if not system_prompt:
+                system_prompt = self.intent_router.build_default_system_prompt()
+            messages.insert(0, LLMMessage(role="system", content=system_prompt))
+        
+        full_reply = ""
+        tts_text_buffer = ""
+        image_base64_list = []
+        
+        def on_token(token: str):
+            nonlocal full_reply, tts_text_buffer
+            if self.interrupt_event.is_set():
+                self.get_logger().info("[LLM流式处理] on_token回调中检测到中断，停止处理")
+                return
+            
+            full_reply += token
+            tts_text_buffer += token
+            
+        if need_camera and self.camera_client:
+            with self.camera_client.capture_context() as image_data:
+                if image_data is not None:
+                    image_base64 = self._encode_image_to_base64(
+                        image_data,
+                        quality=self.camera_image_jpeg_quality
+                    )
+                    if image_base64:
+                        image_base64_list.append(image_base64)
+                        self.get_logger().info("[相机] 已拍照")
+            
+        if image_base64_list:
+            self.get_logger().info(
+                f"[多模态] 准备发送给LLM: {len(image_base64_list)}张图片，用户文本: {user_text[:50]}"
+            )
+            for idx, img_b64 in enumerate(image_base64_list):
+                self.get_logger().debug(f"[多模态] 图片#{idx+1} base64长度: {len(img_b64)}")
+        
+        reply = self.llm_client.chat_stream(
+            messages, 
+            on_token=on_token,
+            images=image_base64_list if image_base64_list else None,
+            interrupt_check=lambda: self.interrupt_event.is_set()
+        )
+        
+        if self.interrupt_event.is_set() or (reply is None):
+            if self.interrupt_event.is_set():
+                self.get_logger().info("[LLM流式处理] 处理被中断")
+            return ""
+        
+        if image_base64_list:
+            # Drop the references so the encoded images can be garbage collected
+            # (a `del` on the loop variable would not remove anything from the list)
+            image_base64_list.clear()
+            self.get_logger().info("[相机] 已删除照片")
+        
+        if reply and reply.strip():
+            tts_text_to_send = reply.strip()
+            tts_buffer_len = len(tts_text_buffer.strip()) if tts_text_buffer else 0
+            reply_len = len(tts_text_to_send)
+            if tts_buffer_len != reply_len:
+                self.get_logger().info(
+                    f"[流式TTS] tts_text_buffer({tts_buffer_len}字符)和reply({reply_len}字符)长度不一致，使用reply作为TTS文本"
+                )
+        elif tts_text_buffer and tts_text_buffer.strip():
+            tts_text_to_send = tts_text_buffer.strip()
+            self.get_logger().warning(
+                f"[流式TTS] reply为空，使用tts_text_buffer({len(tts_text_to_send)}字符)作为TTS文本"
+            )
+        else:
+            tts_text_to_send = ""
+            self.get_logger().warning("[流式TTS] reply和tts_text_buffer都为空，无法发送TTS文本")
+        
+        if not self.interrupt_event.is_set() and tts_text_to_send and not suppress_tts:
+            text_len = len(tts_text_to_send)
+            self.get_logger().info(
+                f"[流式TTS] 发送完整文本到TTS队列: {tts_text_to_send[:100]}... (总长度: {text_len}字符)"
+            )
+            if text_len > 100:
+                self.get_logger().debug(f"[流式TTS] 完整文本内容: {tts_text_to_send}")
+            self._put_tts_text(tts_text_to_send)
+        elif suppress_tts:
+            self.get_logger().info("[流式TTS] suppress_tts开启，跳过TTS输出")
+        
+        return reply.strip() if reply else ""
+    
+    # ==================== Interrupt and TTS utilities ====================
+    def _force_stop_tts(self):
+        """Force-stop TTS playback by killing the recorded ffmpeg process (by PID)."""
+        self._drain_queue(self.tts_queue)
+        self.interrupt_event.set()
+
+        if self.tts_client and self.tts_client.current_ffmpeg_pid:
+            try:
+                pid = self.tts_client.current_ffmpeg_pid
+                os.kill(pid, 9)  # SIGKILL
+                self.get_logger().info(f"[强制停止TTS] 已终止ffmpeg进程，PID={pid}")
+                self.tts_client.current_ffmpeg_pid = None
+            except ProcessLookupError:
+                self.get_logger().debug(f"[强制停止TTS] ffmpeg进程已不存在，PID={pid}")
+                self.tts_client.current_ffmpeg_pid = None
+            except Exception as e:
+                self.get_logger().warning(f"[强制停止TTS] 终止ffmpeg进程失败: {e}")
+    
+    def _check_interrupt(self, auto_clear: bool = False) -> bool:
+        """
+        Return True if the interrupt flag is set; when auto_clear is True, clear the flag before returning.
+        """
+        if self.interrupt_event.is_set():
+            if auto_clear:
+                self.interrupt_event.clear()
+            return True
+        return False
+    
+    def _check_interrupt_and_cancel_turn(self) -> bool:
+        """Check for an interrupt and cancel the current turn (shared post-interrupt cleanup)."""
+        if self._check_interrupt(auto_clear=True):
+            if self.use_llm and self.history:
+                self.history.cancel_turn()
+            return True
+        return False
+    
+    # ==================== Enrollment / session / wake word ====================
+    def _handle_empty_speaker_db(self) -> bool:
+        """Handle the case where the speaker database is empty (shared handling)."""
+        if not (self.sv_enabled and self.sv_client):
+            return False
+        
+        speaker_count = self.sv_client.get_speaker_count()
+        if speaker_count == 0:
+            with self.sv_lock:
+                self.current_speaker_id = None
+                self.current_speaker_state = SpeakerState.UNKNOWN
+                self.current_speaker_score = 0.0
+            self.sv_result_ready_event.set()
+            return True
+        return False
+    
+    def _put_tts_text(self, text: str):
+        """Put text on the TTS queue (with exception handling)."""
+        try:
+            self.tts_queue.put(text, timeout=0.2)
+            self.get_logger().debug(f"[TTS队列] 文本已成功放入队列: {text[:50]}... (队列大小: {self.tts_queue.qsize()})")
+        except Exception as e:
+            self.get_logger().error(f"[TTS队列] 放入队列失败: {e}, 文本: {text[:50]}")
+    
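+    def _interrupt_tts(self, reason: str):
+        """
+        Interrupt TTS playback. Only sets the interrupt event and leaves the queue intact, so the TTS thread can detect the flag and stop playback itself.
+        """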
+        self.get_logger().info(f"[中断] {reason}")
+        self.interrupt_event.set()
+    
+    @staticmethod
+    def _drain_queue(q: queue.Queue):
+        """Remove every pending item from the queue without blocking."""
+        while True:
+            try:
+                q.get_nowait()
+            except queue.Empty:
+                break
+    
+    def _start_session(self):
+        """Start a session and record its start time."""
+        with self.session_lock:
+            self.session_active = True
+            self.session_start_time = time.time()
+    
+    def _reset_session(self):
+        """Refresh the session timer so the timeout counts from now."""
+        with self.session_lock:
+            self.session_start_time = time.time()
+    
+    def _is_session_active(self) -> bool:
+        """Return whether the session is active, expiring it once session_timeout has elapsed."""
+        with self.session_lock:
+            if not self.session_active:
+                return False
+            if time.time() - self.session_start_time >= self.session_timeout:
+                self.session_active = False
+                return False
+            return True
+    
+    # ==================== Intent handling ====================
+    def _handle_wake_word(self, text: str) -> str:
+        """Handle the wake word: convert the ASR text to pinyin and check whether it contains the wake word's pinyin."""
+        if not self.use_wake_word:
+            return text.strip()
+        
+        if self._is_session_active():
+            self._reset_session()
+            return text.strip()
+        
+        text_pinyin = self.intent_router.to_pinyin(text)
+        wake_word_pinyin = self.wake_word.lower().strip()
+        self.get_logger().info(f"[唤醒词] 原始文本: {text}, 文本拼音: {text_pinyin}, 唤醒词拼音: {wake_word_pinyin}")
+        if not wake_word_pinyin:
+            self.get_logger().info("[唤醒词] 唤醒词为空，过滤文本")
+            return ""
+        
+        text_pinyin_parts = text_pinyin.split() if text_pinyin else []
+        wake_word_parts = wake_word_pinyin.split()
+        
+        start_idx = -1
+        for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
+            if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
+                start_idx = i
+                break
+        
+        if start_idx == -1:
+            self.get_logger().info(f"[唤醒词] 未检测到唤醒词 '{self.wake_word}'，过滤文本")
+            return ""
+        
+        removed = 0  # Chinese characters seen so far; assumes one pinyin token per character
+        new_text = ""
+        for c in text:
+            if '\u4e00' <= c <= '\u9fa5':
+                if removed < start_idx or removed >= start_idx + len(wake_word_parts):
+                    new_text += c
+                removed += 1
+            else:
+                new_text += c
+        
+        self._start_session()
+        return new_text.strip()
+    
+    def _check_shutup_command(self, text: str) -> bool:
+        """Check whether the text contains a "stop talking" command keyword (raw text or pinyin)."""
+        if not text:
+            return False
+        text_lower = text.lower()
+        text_pinyin = self.intent_router.to_pinyin(text)
+        for keyword in self.shutup_keywords:
+            kw = keyword.lower().strip()
+            if not kw:
+                continue
+            if kw in text_lower or (text_pinyin and kw in text_pinyin):
+                return True
+        return False
+
+    def _handle_intent(self, intent_payload: IntentResult):
+        """Route the request to the handler that matches its intent."""
+        intent = intent_payload.intent
+        text = intent_payload.text
+        need_camera = intent_payload.need_camera
+        system_prompt = intent_payload.system_prompt
+
+        if intent == "kb_qa":
+            answer = None
+            text_pinyin = self.intent_router.to_pinyin(text)
+            if text_pinyin:
+                answer = self.kb_answers_map.get(text_pinyin)
+            if answer:
+                if "{wake_word}" in answer:
+                    answer = answer.replace("{wake_word}", self.wake_word or "")
+                self._put_tts_text(answer)
+            else:
+                pass  # no knowledge-base answer matched this utterance; stay silent
+            return
+
+        if self.use_llm and self.llm_client:
+            if self.history:
+                self.history.start_turn(text)
+
+            reply = self._llm_process_stream_with_camera(
+                text,
+                need_camera=need_camera,
+                system_prompt=system_prompt,
+                suppress_tts=(intent == "skill_sequence")
+            )
+            if reply:
+                if self.history:
+                    self.history.commit_turn(reply)
+                if intent == "skill_sequence":
+                    skill_msg = String()
+                    skill_msg.data = reply.strip()
+                    self.skill_sequence_pub.publish(skill_msg)
+                    self.get_logger().info(f"[技能序列] 已发布: {skill_msg.data}")
+            else:
+                if self.history:
+                    self.history.cancel_turn()
+        else:
+            self.get_logger().warning("[主线程] 未启用LLM，无法处理文本")
+    
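+    # ==================== Resource cleanup ====================
+    def destroy_node(self):
+        """Destroy the node: stop the worker threads and release audio, camera, ASR, and speaker-verification resources."""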
+        self.get_logger().info("语音节点正在关闭...")
+        self.stop_event.set()
+        self.interrupt_event.set()
+        self.get_logger().info("强制停止TTS播放...")
+        self._force_stop_tts()
+        
+        self._drain_queue(self.tts_queue)
+        
+        threads_to_join = [self.recording_thread, self.asr_thread, self.process_thread, self.tts_thread]
+        if self.sv_thread:
+            threads_to_join.append(self.sv_thread)
+        for thread in threads_to_join:
+            if thread and thread.is_alive():
+                thread.join(timeout=1.0)
+        
+        self._force_stop_tts()
+        
+        if hasattr(self, 'asr_client') and self.asr_client:
+            self.asr_client.stop()
+        
+        if hasattr(self, 'audio_recorder') and self.audio_recorder:
+            self.audio_recorder.cleanup()
+        
+        if hasattr(self, 'camera_client') and self.camera_client:
+            self.camera_client.cleanup()
+        
+        if hasattr(self, 'sv_client') and self.sv_client:
+            try:
+                self.sv_client.save_speakers()
+                self.sv_client.cleanup()
+            except Exception as e:
+                self.get_logger().warning(f"清理声纹识别资源时出错: {e}")
+        
+        super().destroy_node()
+
+    def _on_skill_feedback(self, msg: String):
+        try:
+            feedback = json.loads(msg.data)
+            self.latest_skill_feedback = feedback
+            feedback_text = (
+                f"【执行状态】阶段:{feedback.get('stage','')}, "
+                f"技能:{feedback.get('current_skill','')}, "
+                f"进度:{feedback.get('progress', 0):.1%}, "
+                f"详情:{feedback.get('detail','')}"
+            )
+            if self.history:
+                self.history.add_message("system", feedback_text)
+        except Exception as e:
+            self.get_logger().warning(f"[技能反馈] 解析失败: {e}")
+
+    def _on_skill_result(self, msg: String):
+        try:
+            result = json.loads(msg.data)
+            self.latest_skill_result = result
+            result_text = (
+                f"【执行结果】{'成功' if result.get('success') else '失败'}, "
+                f"总技能数:{result.get('total_skills', 0)}, "
+                f"成功数:{result.get('succeeded_skills', 0)}, "
+                f"消息:{result.get('message','')}"
+            )
+            if self.history:
+                self.history.add_message("system", result_text)
+        except Exception as e:
+            self.get_logger().warning(f"[技能结果] 解析失败: {e}")
+
+
+def _init_ros(args):
+    rclpy.init(args=args)
+
+def _create_node():
+    return RobotSpeakerNode()
+
+def _run_node(node):
+    rclpy.spin(node)
+
+def _cleanup_node(node):
+    if node:
+        node.destroy_node()
+
+def _shutdown_ros():
+    if rclpy.ok():
+        rclpy.shutdown()
+
+# ==================== Entry point ====================
+def main(args=None):
+    node = None
+    _init_ros(args)
+    try:
+        node = _create_node()
+        _run_node(node)
+    finally:  # always clean up, even if node creation or spin raises (e.g. Ctrl-C)
+        _cleanup_node(node)
+        _shutdown_ros()
+
+
+if __name__ == '__main__':
+    main()
--- a/robot_speaker/core/types.py
+++ b/robot_speaker/core/types.py
@@ -0,0 +1,36 @@
+"""
+Shared data structure definitions
+"""
+from dataclasses import dataclass
+
+
+@dataclass
+class ASRResult:
+    """ASR recognition result"""
+    text: str
+    confidence: float | None = None
+    language: str | None = None
+
+
+@dataclass
+class LLMMessage:
+    """LLM message"""
+    role: str  # "user", "assistant", "system"
+    content: str
+
+
+@dataclass
+class TTSRequest:
+    """TTS request"""
+    text: str
+    voice: str | None = None  # if None, use the default voice configured in the console
+    speed: float | None = None
+    pitch: float | None = None
+
+
+@dataclass
+class ImageMessage:
+    """Image message, used for multimodal LLM input"""
+    image_data: bytes  # base64-encoded image data
+    image_format: str = "jpeg"
+
--- a/robot_speaker/models/__init__.py
+++ b/robot_speaker/models/__init__.py
@@ -0,0 +1,5 @@
+"""Model layer"""
+
+
+
+
--- a/robot_speaker/models/asr/__init__.py
+++ b/robot_speaker/models/asr/__init__.py
@@ -0,0 +1,5 @@
+"""ASR models"""
+
+
+
+
--- a/robot_speaker/models/asr/base.py
+++ b/robot_speaker/models/asr/base.py
@@ -0,0 +1,13 @@
+class ASRClient:
+    def start(self) -> bool:
+        raise NotImplementedError
+    
+    def stop(self) -> bool:
+        raise NotImplementedError
+    
+    def send_audio(self, audio_data: bytes) -> bool:
+        raise NotImplementedError
+
+
+
+
--- a/robot_speaker/models/asr/dashscope.py
+++ b/robot_speaker/models/asr/dashscope.py
@@ -0,0 +1,218 @@
+"""
+ASR (automatic speech recognition) module
+"""
+import base64
+import time
+import threading
+import dashscope
+from dashscope.audio.qwen_omni import OmniRealtimeConversation, OmniRealtimeCallback
+from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams, MultiModality
+from robot_speaker.models.asr.base import ASRClient
+
+
+class DashScopeASR(ASRClient):
+    """Wrapper around the DashScope real-time ASR recognizer"""
+    
+    def __init__(self, api_key: str, 
+                 sample_rate: int,
+                 model: str,
+                 url: str,
+                 logger=None):
+        dashscope.api_key = api_key
+        self.sample_rate = sample_rate
+        self.model = model
+        self.url = url
+        self.logger = logger
+        
+        self.conversation = None
+        self.running = False
+        self.on_sentence_end = None
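+        self.on_text_update = None  # callback for real-time (partial) text updates
+        
+        # Thread synchronization
+        self._stop_lock = threading.Lock()  # prevents concurrent calls to stop_current_recognition
+        self._final_result_event = threading.Event()  # set once the final-result callback has fired
+        self._pending_commit = False  # True while a commit is waiting for its final result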
+    
+    def _log(self, level: str, msg: str):
+        """Log a message by calling the ROS 2 logger method that matches the level."""
+        if self.logger:
+            # An rclpy logger call site cannot change severity between calls, so dispatch to the matching method explicitly
+            if level == "debug":
+                self.logger.debug(msg)
+            elif level == "info":
+                self.logger.info(msg)
+            elif level == "warning":
+                self.logger.warning(msg)
+            elif level == "error":
+                self.logger.error(msg)
+            else:
+                self.logger.info(msg)  # fall back to info
+        else:
+            print(f"[ASR] {msg}")
+    
+    def start(self):
+        """Start the ASR recognizer. Returns False if it is already running or fails to start."""
+        if self.running:
+            return False
+        
+        try:
+            callback = _ASRCallback(self)
+            self.conversation = OmniRealtimeConversation(
+                model=self.model,
+                url=self.url,
+                callback=callback
+            )
+            callback.conversation = self.conversation
+            
+            self.conversation.connect()
+            
+            transcription_params = TranscriptionParams(
+                language='zh',
+                sample_rate=self.sample_rate,
+                input_audio_format="pcm",
+            )
+
+            # Local VAD -> only controls TTS barge-in
+            # Server-side turn detection -> only controls ASR output and LLM turn boundaries
+            
+            self.conversation.update_session(
+                output_modalities=[MultiModality.TEXT],
+                enable_input_audio_transcription=True,
+                transcription_params=transcription_params,
+                enable_turn_detection=True,
+                # Keep server-side turn detection enabled
+                turn_detection_type='server_vad',  # server-side VAD
+                turn_detection_threshold=0.2,      # tunable
+                turn_detection_silence_duration_ms=800
+            )
+            
+            self.running = True
+            self._log("info", "ASR已启动")
+            return True
+        except Exception as e:
+            self.running = False
+            self._log("error", f"ASR启动失败: {e}")
+            if self.conversation:
+                try:
+                    self.conversation.close()
+                except Exception:
+                    pass
+                self.conversation = None
+            return False
+    
+    def send_audio(self, audio_chunk: bytes):
+        """Send one audio chunk (raw PCM bytes) to the ASR service, base64-encoded."""
+        if not self.running or not self.conversation:
+            return False
+        try:
+            audio_b64 = base64.b64encode(audio_chunk).decode('ascii')
+            self.conversation.append_audio(audio_b64)
+            return True
+        except Exception:
+            # The connection was closed or another error occurred; fail silently to avoid flooding the log.
+            # The running flag is set correctly in stop_current_recognition.
+            return False
+    
+    def stop_current_recognition(self):
+        """
+        Commit the buffered audio to obtain the current recognition result, without closing the connection.
+        """
+        if not self.running or not self.conversation:
+            return False
+
+        # Use a lock to prevent concurrent calls
+        if not self._stop_lock.acquire(blocking=False):
+            self._log("warning", "stop_current_recognition 正在执行，跳过本次调用")
+            return False
+
+        try:
+            # Reset the event before waiting for the final-result callback
+            self._final_result_event.clear()
+            self._pending_commit = True
+
+            # Trigger a commit and wait for the final result
+            self.conversation.commit()
+
+            # Wait for the final-result callback (at most 1 second)
+            if self._final_result_event.wait(timeout=1.0):
+                self._log("debug", "已收到 final 回调")
+            else:
+                self._log("warning", "等待 final 回调超时，继续执行")
+
+            return True
+
+        except Exception as e:
+            self._log("error", f"提交当前识别结果失败: {e}")
+            # On error, try to restart the connection
+            self.running = False
+            try:
+                if self.conversation:
+                    self.conversation.close()
+            except Exception:
+                pass
+            self.conversation = None
+            time.sleep(0.1)
+            return self.start()
+
+        finally:
+            self._pending_commit = False
+            self._stop_lock.release()
+    
+    def stop(self):
+        """Stop the ASR recognizer and close the connection."""
+        self.running = False
+        # Wake any waiter before taking the lock: it holds _stop_lock while waiting on this event.
+        self._final_result_event.set()
+        with self._stop_lock:
+            if self.conversation:
+                try:
+                    self.conversation.close()
+                except Exception as e:
+                    self._log("warning", f"停止时关闭连接出错: {e}")
+                self.conversation = None
+            self._log("info", "ASR已停止")
+
+
+class _ASRCallback(OmniRealtimeCallback):
+    """ASR callback handler"""
+    
+    def __init__(self, asr_client: DashScopeASR):
+        self.asr_client = asr_client
+        self.conversation = None
+    
+    def on_open(self):
+        self.asr_client._log("info", "ASR WebSocket已连接")
+    
+    def on_close(self, code, msg):
+        self.asr_client._log("info", f"ASR WebSocket已关闭: code={code}, msg={msg}")
+    
+    def on_event(self, response):
+        event_type = response.get('type', '')
+        
+        if event_type == 'session.created':
+            session_id = response.get('session', {}).get('id', '')
+            self.asr_client._log("info", f"ASR会话已创建: {session_id}")
+        
+        elif event_type == 'conversation.item.input_audio_transcription.completed':
+            # Final recognition result
+            transcript = response.get('transcript', '')
+            if transcript and transcript.strip() and self.asr_client.on_sentence_end:
+                self.asr_client.on_sentence_end(transcript.strip())
+            
+            # If a commit is pending, notify the waiting thread
+            if self.asr_client._pending_commit:
+                self.asr_client._final_result_event.set()
+        
+        elif event_type == 'conversation.item.input_audio_transcription.text':
+            # Real-time partial transcript update (interim hypothesis)
+            transcript = response.get('transcript', '') or response.get('text', '')
+            if transcript and transcript.strip() and self.asr_client.on_text_update:
+                self.asr_client.on_text_update(transcript.strip())
+        
+        elif event_type == 'input_audio_buffer.speech_started':
+            self.asr_client._log("info", "ASR检测到说话开始")
+        
+        elif event_type == 'input_audio_buffer.speech_stopped':
+            self.asr_client._log("info", "ASR检测到说话结束")
+
--- a/robot_speaker/models/llm/__init__.py
+++ b/robot_speaker/models/llm/__init__.py
@@ -0,0 +1,5 @@
+"""LLM models"""
+
+
+
+
--- a/robot_speaker/models/llm/base.py
+++ b/robot_speaker/models/llm/base.py
@@ -0,0 +1,15 @@
+from robot_speaker.core.types import LLMMessage
+
+
+class LLMClient:
+    def chat(self, messages: list[LLMMessage]) -> str | None:
+        raise NotImplementedError
+    
+    def chat_stream(self, messages: list[LLMMessage],
+                    on_token=None, images: list[str] | None = None,
+                    interrupt_check=None) -> str | None:
+        raise NotImplementedError
+
+
+
+
--- a/robot_speaker/models/llm/dashscope.py
+++ b/robot_speaker/models/llm/dashscope.py
@@ -0,0 +1,149 @@
+"""
+LLM (large language model) module
+Supports multimodal input (text + images)
+"""
+from openai import OpenAI
+from typing import Optional, List
+from robot_speaker.core.types import LLMMessage
+from robot_speaker.models.llm.base import LLMClient
+
+
+class DashScopeLLM(LLMClient):
+    """Wrapper around the DashScope LLM client (OpenAI-compatible API)"""
+    
+    def __init__(self, api_key: str, 
+                 model: str,
+                 base_url: str,
+                 temperature: float,
+                 max_tokens: int,
+                 name: str = "LLM",
+                 logger=None):
+        self.client = OpenAI(api_key=api_key, base_url=base_url)
+        self.model = model
+        self.temperature = temperature
+        self.max_tokens = max_tokens
+        self.name = name
+        self.logger = logger
+    
+    def _log(self, level: str, msg: str):
+        """Log a message by calling the ROS 2 logger method that matches the level."""
+        msg = f"[{self.name}] {msg}"
+        if self.logger:
+            # An rclpy logger call site cannot change severity between calls, so dispatch to the matching method explicitly
+            if level == "debug":
+                self.logger.debug(msg)
+            elif level == "info":
+                self.logger.info(msg)
+            elif level == "warning":
+                self.logger.warning(msg)
+            elif level == "error":
+                self.logger.error(msg)
+            else:
+                self.logger.info(msg)  # fall back to info
+    
+    def chat(self, messages: list[LLMMessage]) -> str | None:
+        """Non-streaming chat, used for task planning."""
+        payload_messages = [{"role": msg.role, "content": msg.content} for msg in messages]
+        response = self.client.chat.completions.create(
+            model=self.model,
+            messages=payload_messages,
+            temperature=self.temperature,
+            max_tokens=self.max_tokens,
+            stream=False
+        )
+        reply = (response.choices[0].message.content or "").strip()  # content may be None
+        return reply if reply else None
+    
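+    def chat_stream(self, messages: list[LLMMessage],
+                   on_token=None,
+                   images: Optional[List[str]] = None,
+                   interrupt_check=None) -> str | None:
+        """
+        Streaming chat, used by the voice pipeline.
+        Supports multimodal input (text + images).
+        Supports interruption (interrupt_check returns True when streaming should stop).
+        """
+        # Convert messages to the request payload format, with multimodal support.
+        # Images are attached only to the last user message.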
+        payload_messages = []
+        last_user_idx = -1
+        for i, msg in enumerate(messages):
+            if msg.role == "user":
+                last_user_idx = i
+        
+        has_images_in_message = False
+        for i, msg in enumerate(messages):
+            msg_dict = {"role": msg.role}
+            
+            # If this is the last user message and images were supplied, build a multimodal content list
+            if i == last_user_idx and msg.role == "user" and images and len(images) > 0:
+                content_list = [{"type": "text", "text": msg.content}]
+                # Attach every image
+                for img_idx, img_base64 in enumerate(images):
+                    # Each image is sent inline as a base64 data URL (the payload is assumed to be JPEG).
+                    content_list.append({
+                        "type": "image_url",
+                        "image_url": {
+                            "url": f"data:image/jpeg;base64,{img_base64}"
+                        }
+                    })
+                    self._log("info", f"[多模态] 添加图像 #{img_idx+1} 到user消息，base64长度: {len(img_base64)}")
+                msg_dict["content"] = content_list
+                has_images_in_message = True
+            else:
+                msg_dict["content"] = msg.content
+            
+            payload_messages.append(msg_dict)
+        
+        # Log multimodal details
+        if images and len(images) > 0:
+            if has_images_in_message:
+                # Look up the last user message and log the structure of its content
+                last_user_msg = payload_messages[last_user_idx] if last_user_idx >= 0 else None
+                if last_user_msg and isinstance(last_user_msg.get("content"), list):
+                    content_items = last_user_msg["content"]
+                    text_items = [item for item in content_items if item.get("type") == "text"]
+                    image_items = [item for item in content_items if item.get("type") == "image_url"]
+                    self._log("info", f"[多模态] 已发送多模态请求: {len(text_items)}个文本 + {len(image_items)}张图片")
+                    self._log("debug", f"[多模态] 用户文本: {text_items[0].get('text', '')[:50] if text_items else 'N/A'}")
+                else:
+                    self._log("warning", "[多模态] 消息格式异常，无法确认图片是否添加")
+            else:
+                self._log("warning", f"[多模态] 有{len(images)}张图片，但未找到user消息，图片未被添加")
+        else:
+            self._log("debug", "[多模态] 纯文本请求（无图片）")
+        
+        full_reply = ""
+        interrupted = False
+        
+        stream = self.client.chat.completions.create(
+            model=self.model,
+            messages=payload_messages,
+            temperature=self.temperature,
+            max_tokens=self.max_tokens,
+            stream=True
+        )
+        
+        for chunk in stream:
+            # Check the interrupt flag
+            if interrupt_check and interrupt_check():
+                self._log("info", "LLM流式处理被中断")
+                interrupted = True
+                break
+            
+            if chunk.choices and chunk.choices[0].delta.content:
+                content = chunk.choices[0].delta.content
+                full_reply += content
+                if on_token:
+                    on_token(content)
+                    # Check again after on_token, which may itself set the interrupt flag
+                    if interrupt_check and interrupt_check():
+                        self._log("info", "LLM流式处理在on_token回调后被中断")
+                        interrupted = True
+                        break
+        
+        if interrupted:
+            return None  # interrupted: None signals that the reply is incomplete
+        
+        return full_reply.strip() if full_reply else None
+
--- a/robot_speaker/models/tts/__init__.py
+++ b/robot_speaker/models/tts/__init__.py
@@ -0,0 +1,5 @@
+"""TTS models"""
+
+
+
+
--- a/robot_speaker/models/tts/base.py
+++ b/robot_speaker/models/tts/base.py
@@ -0,0 +1,14 @@
+from robot_speaker.core.types import TTSRequest
+
+
+class TTSClient:
+    """Abstract base class for TTS clients"""
+    
+    def synthesize(self, request: TTSRequest,
+                   on_chunk=None,
+                   interrupt_check=None) -> bool:
+        raise NotImplementedError
+
+
+
+
--- a/robot_speaker/models/tts/dashscope.py
+++ b/robot_speaker/models/tts/dashscope.py
@@ -0,0 +1,244 @@
+"""
+TTS (text-to-speech) synthesis module
+"""
+import subprocess
+import dashscope
+from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat
+from robot_speaker.core.types import TTSRequest
+from robot_speaker.models.tts.base import TTSClient
+
+
+class DashScopeTTSClient(TTSClient):
+    """Wrapper around the DashScope streaming TTS client"""
+    
+    def __init__(self, api_key: str, 
+                 model: str,
+                 voice: str,
+                 card_index: int, 
+                 device_index: int,
+                 output_sample_rate: int = 44100,
+                 output_channels: int = 2,
+                 output_volume: float = 1.0,
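+                 tts_source_sample_rate: int = 22050,  # fixed output sample rate of the TTS service
+                 tts_source_channels: int = 1,  # fixed output channel count of the TTS service
+                 tts_ffmpeg_thread_queue_size: int = 1024,  # ffmpeg input thread queue size
+                 reference_signal_buffer=None,  # reference-signal buffer (for echo cancellation)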
+                 logger=None):
+        dashscope.api_key = api_key
+        self.model = model
+        self.voice = voice
+        self.card_index = card_index
+        self.device_index = device_index
+        self.output_sample_rate = output_sample_rate
+        self.output_channels = output_channels
+        self.output_volume = output_volume
+        self.tts_source_sample_rate = tts_source_sample_rate
+        self.tts_source_channels = tts_source_channels
+        self.tts_ffmpeg_thread_queue_size = tts_ffmpeg_thread_queue_size
+        self.reference_signal_buffer = reference_signal_buffer  # reference-signal buffer
+        self.logger = logger
+        self.current_ffmpeg_pid = None  # PID of the current ffmpeg process
+        
+        # Build the ALSA device name (plughw allows automatic resampling / channel conversion)
+        self.alsa_device = f"plughw:{card_index},{device_index}" if (
+            card_index >= 0 and device_index >= 0
+        ) else "default"
+    
+    def _log(self, level: str, msg: str):
+        """Log a message by calling the ROS 2 logger method that matches the level."""
+        if self.logger:
+            # An rclpy logger call site cannot change severity between calls, so dispatch to the matching method explicitly
+            if level == "debug":
+                self.logger.debug(msg)
+            elif level == "info":
+                self.logger.info(msg)
+            elif level == "warning":
+                self.logger.warning(msg)
+            elif level == "error":
+                self.logger.error(msg)
+            else:
+                self.logger.info(msg)  # fall back to info
+        else:
+            print(f"[TTS] {msg}")
+    
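+    def synthesize(self, request: TTSRequest,
+                   on_chunk=None,
+                   interrupt_check=None) -> bool:
+        """Main flow: stream the synthesis and play the audio. Returns False if playback was interrupted."""
+        callback = _TTSCallback(self, interrupt_check, on_chunk, self.reference_signal_buffer)
+        # Use request.voice when set; fall back to the configured self.voice when it is None or empty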
+        voice_to_use = request.voice if request.voice and request.voice.strip() else self.voice
+        
+        if not voice_to_use or not voice_to_use.strip():
+            self._log("error", f"Voice参数无效: '{voice_to_use}'")
+            return False
+        
+        self._log("info", f"TTS开始: 文本='{request.text[:50]}...', voice='{voice_to_use}'")
+        synthesizer = SpeechSynthesizer(
+            model=self.model,
+            voice=voice_to_use,
+            format=AudioFormat.PCM_22050HZ_MONO_16BIT,
+            callback=callback,
+        )
+        
+        try:
+            synthesizer.streaming_call(request.text)
+            synthesizer.streaming_complete()
+        finally:
+            callback.cleanup()
+        
+        return not callback._interrupted
+
+
+class _TTSCallback(ResultCallback):
+    """TTS callback handler: plays audio through ffmpeg, which handles sample-rate conversion"""
+    
+    def __init__(self, tts_client: DashScopeTTSClient,
+                 interrupt_check=None,
+                 on_chunk=None,
+                 reference_signal_buffer=None):
+        self.tts_client = tts_client
+        self.interrupt_check = interrupt_check
+        self.on_chunk = on_chunk
+        self.reference_signal_buffer = reference_signal_buffer  # reference-signal buffer
+        self._proc = None
+        self._interrupted = False
+        self._cleaned_up = False
+    
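+    def on_open(self):
+        # Play through ffmpeg, which converts from the TTS source sample rate to the device sample rate.
+        # The TTS service emits a fixed sample rate and channel count; ffmpeg converts both for the output device.
+        ffmpeg_cmd = [
+            'ffmpeg',
+            '-f', 's16le',            # raw PCM
+            '-ar', str(self.tts_client.tts_source_sample_rate),  # TTS output sample rate (read from the config file)
+            '-ac', str(self.tts_client.tts_source_channels),  # TTS output channel count (read from the config file)
+            '-i', 'pipe:0',            # stdin
+            '-f', 'alsa',              # output to ALSA
+            '-ar', str(self.tts_client.output_sample_rate),  # output device sample rate (read from the config file)
+            '-ac', str(self.tts_client.output_channels),  # output device channel count (read from the config file)
+            '-acodec', 'pcm_s16le',    # output codec
+            '-fflags', 'nobuffer',     # reduce buffering
+            '-flags', 'low_delay',     # low latency
+            '-avioflags', 'direct',    # try writing straight through to ALSA to reduce latency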
+            self.tts_client.alsa_device
+        ]
+        
+        # -thread_queue_size is an input option, so it must be placed before the -i it applies to
+        insert_pos = ffmpeg_cmd.index('-i')
+        ffmpeg_cmd.insert(insert_pos, str(self.tts_client.tts_ffmpeg_thread_queue_size))
+        ffmpeg_cmd.insert(insert_pos, '-thread_queue_size')
+        
+        # Add a volume filter when the volume is not 1.0
+        if self.tts_client.output_volume != 1.0:
+            # The filter goes after the input and before the output codec
+            acodec_idx = ffmpeg_cmd.index('-acodec')
+            ffmpeg_cmd.insert(acodec_idx, f'volume={self.tts_client.output_volume}')
+            ffmpeg_cmd.insert(acodec_idx, '-af')
+
+        self.tts_client._log("info", f"Starting ffmpeg playback: ALSA device={self.tts_client.alsa_device}, "
+                                     f"output sample rate={self.tts_client.output_sample_rate}Hz, "
+                                     f"output channels={self.tts_client.output_channels}, "
+                                     f"volume={self.tts_client.output_volume * 100:.0f}%")
+        self._proc = subprocess.Popen(
+            ffmpeg_cmd,
+            stdin=subprocess.PIPE,
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.PIPE  # PIPE so that ffmpeg errors can be read back
+        )
+        # Record the ffmpeg process PID
+        self.tts_client.current_ffmpeg_pid = self._proc.pid
+        self.tts_client._log("debug", f"ffmpeg process started, PID={self._proc.pid}")
+    
+    def on_complete(self):
+        pass
+    
+    def on_error(self, message: str):
+        self.tts_client._log("error", f"TTS error: {message}")
+    
+    def on_close(self):
+        self.cleanup()
+    
+    def on_event(self, message):
+        pass
+    
+    def on_data(self, data: bytes) -> None:
+        """Receive audio data and play it"""
+        if self._interrupted:
+            return
+        
+        if self.interrupt_check and self.interrupt_check():
+            # Stop playback only; the TTS stream itself is not cancelled
+            self._interrupted = True
+            if self._proc:
+                self._proc.terminate()
+            return
+        
+        # Write to ffmpeg first so playback is never held up
+        if self._proc and self._proc.stdin and not self._interrupted:
+            try:
+                self._proc.stdin.write(data)
+                self._proc.stdin.flush()
+            except BrokenPipeError:
+                # ffmpeg has probably exited; read its stderr for the cause
+                if self._proc.stderr:
+                    error_msg = self._proc.stderr.read().decode('utf-8', errors='ignore')
+                    self.tts_client._log("error", f"ffmpeg error: {error_msg}")
+                self._interrupted = True
+        
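+        # Feed the audio into the reference-signal buffer (used for echo cancellation).
+        # This happens after the ffmpeg write so that it cannot delay playback.
+        if self.reference_signal_buffer and data:
+            try:
+                self.reference_signal_buffer.add_reference(
+                    data, 
+                    source_sample_rate=self.tts_client.tts_source_sample_rate,
+                    source_channels=self.tts_client.tts_source_channels
+                )
+            except Exception as e:
+                # A reference-signal failure must not affect playback
+                self.tts_client._log("warning", f"Reference signal processing failed: {e}")
+        
+        # Notify the chunk callback whether or not a reference buffer is configured
+        if self.on_chunk:
+            self.on_chunk(data)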
+    
+    def cleanup(self):
+        """Release resources"""
+        if self._cleaned_up or not self._proc:
+            return
+        self._cleaned_up = True
+        
+        # Close stdin so ffmpeg can drain the remaining data
+        if self._proc.stdin and not self._proc.stdin.closed:
+            try:
+                self._proc.stdin.close()
+            except Exception:
+                pass
+        
+        # Wait for the process to exit on its own, with a fixed 30-second cap
+        if self._proc.poll() is None:
+            try:
+                # Give ffmpeg time to finish playing; long texts need longer
+                self._proc.wait(timeout=30.0)
+            except subprocess.TimeoutExpired:
+                # Still running after the timeout: assume it is stuck and force it to stop
+                if self._proc.poll() is None:
+                    self.tts_client._log("warning", "ffmpeg playback timed out, terminating")
+                    try:
+                        self._proc.terminate()
+                        self._proc.wait(timeout=1.0)
+                    except Exception:
+                        try:
+                            self._proc.kill()
+                            self._proc.wait(timeout=0.1)
+                        except Exception:
+                            pass
+        
+        # Clear the recorded PID
+        if self.tts_client.current_ffmpeg_pid == self._proc.pid:
+            self.tts_client.current_ffmpeg_pid = None
+
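Review note: the ordering rule behind the `-thread_queue_size` fix (ffmpeg applies each option to the next input or output that follows it) can be checked in isolation. The sketch below is illustrative only; `build_ffmpeg_cmd`, its parameter names and its default values are hypothetical and are not part of this change. It builds the list in final order instead of inserting into it afterwards.

```python
def build_ffmpeg_cmd(src_rate, src_channels, out_rate, out_channels,
                     alsa_device, thread_queue_size=1024, volume=1.0):
    """Hypothetical helper: assemble the playback command in its final order."""
    # Input options: everything up to and including -i applies to the stdin input.
    cmd = [
        'ffmpeg',
        '-f', 's16le',
        '-ar', str(src_rate),
        '-ac', str(src_channels),
        '-thread_queue_size', str(thread_queue_size),
        '-i', 'pipe:0',
    ]
    # Output options: everything after -i applies to the ALSA output.
    if volume != 1.0:
        cmd += ['-af', f'volume={volume}']
    cmd += [
        '-f', 'alsa',
        '-ar', str(out_rate),
        '-ac', str(out_channels),
        '-acodec', 'pcm_s16le',
        alsa_device,
    ]
    return cmd


cmd = build_ffmpeg_cmd(22050, 1, 48000, 2, 'default', volume=0.5)
```

Building the list in order avoids the `index()`/`insert()` bookkeeping, at the cost of one extra conditional for the volume filter.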
--- a/robot_speaker/perception/init.py
+++ b/robot_speaker/perception/init.py
@@ -0,0 +1,5 @@
+"""Perception layer"""
+
+
+
+
--- a/robot_speaker/perception/audio_pipeline.py
+++ b/robot_speaker/perception/audio_pipeline.py
@@ -0,0 +1,304 @@
+"""
+Audio processing module: recording + VAD + echo cancellation
+"""
+import time
+import pyaudio
+import webrtcvad
+import struct
+import queue
+from .echo_cancellation import EchoCanceller, ReferenceSignalBuffer
+
+
+class VADDetector:
+    """VAD speech detector"""
+    
+    def __init__(self, mode: int, sample_rate: int):
+        self.vad = webrtcvad.Vad(mode)
+        self.sample_rate = sample_rate
+
+
+class AudioRecorder:
+    """Audio recorder; runs on the recording thread"""
+    
+    def __init__(self, device_index: int, sample_rate: int, channels: int, 
+                 chunk: int, vad_detector: VADDetector,
+                 audio_queue: queue.Queue,  # audio queue: recording thread -> ASR thread
+                 silence_duration_ms: int = 1000,
+                 min_energy_threshold: int = 300, # RMS energy threshold for speech
+                 heartbeat_interval: float = 2.0,
+                 on_heartbeat=None,
+                 is_playing=None,
+                 on_new_segment=None,  # a new speech segment was detected
+                 on_speech_start=None,  # speech started
+                 on_speech_end=None,  # silence timeout reached (speech ended)
+                 stop_flag=None,
+                 on_audio_chunk=None,  # optional per-chunk callback (e.g. voiceprint recording)
+                 should_put_to_queue=None,  # optional gate deciding whether audio is queued (used to block ASR)
+                 get_silence_threshold=None,  # optional callback returning a dynamic silence threshold in ms
+                 enable_echo_cancellation: bool = True,  # whether to enable echo cancellation
+                 reference_signal_buffer: ReferenceSignalBuffer = None,  # optional reference-signal buffer
+                 logger=None):
+        self.device_index = device_index
+        self.sample_rate = sample_rate
+        self.channels = channels
+        self.chunk = chunk
+        self.vad_detector = vad_detector
+        self.audio_queue = audio_queue
+        self.silence_duration_ms = int(silence_duration_ms)
+        self.min_energy_threshold = int(min_energy_threshold)
+        self.heartbeat_interval = heartbeat_interval
+        
+        self.on_heartbeat = on_heartbeat
+        self.is_playing = is_playing or (lambda: False)
+        self.on_new_segment = on_new_segment
+        self.on_speech_start = on_speech_start
+        self.on_speech_end = on_speech_end
+        self.stop_flag = stop_flag or (lambda: False)
+        self.on_audio_chunk = on_audio_chunk  # per-chunk callback (e.g. voiceprint recording)
+        self.should_put_to_queue = should_put_to_queue or (lambda: True)  # queueing is allowed by default
+        self.get_silence_threshold = get_silence_threshold  # dynamic silence-threshold callback
+        self.logger = logger
+        self.audio = pyaudio.PyAudio()
+
+        # Auto-detect the iFLYTEK microphone device
+        try:
+            count = self.audio.get_device_count()
+            found_index = -1
+            if self.logger:
+                self.logger.info(f"Scanning audio devices (total: {count})...")
+
+            for i in range(count):
+                device_info = self.audio.get_device_info_by_index(i)
+                device_name = device_info.get('name', '')
+                max_input_channels = device_info.get('maxInputChannels', 0)
+                
+                if self.logger:
+                    try:
+                        self.logger.info(f"Device [{i}]: Name='{device_name}', MaxInput={max_input_channels}, Rate={int(device_info.get('defaultSampleRate'))}")
+                    except Exception:
+                        pass
+
+                # Match devices whose name contains iFLYTEK and that can record (input channels > 0)
+                if 'iFLYTEK' in device_name and max_input_channels > 0:
+                    found_index = i
+                    if self.logger:
+                        self.logger.info(f"Auto-detected microphone device: {device_name} (Index: {i})")
+                    break
+            
+            if found_index != -1:
+                self.device_index = found_index
+            else:
+                if self.logger:
+                    self.logger.warning(f"No iFLYTEK device detected; falling back to the configured index: {self.device_index}")
+
+        except Exception as e:
+            if self.logger:
+                self.logger.error(f"Device auto-detection failed: {e}")
+
+        self.format = pyaudio.paInt16
+        self._debug_counter = 0
+        
+        # Echo cancellation
+        self.enable_echo_cancellation = enable_echo_cancellation
+        self.reference_signal_buffer = reference_signal_buffer
+        if enable_echo_cancellation:
+            # The echo canceller runs synchronously on the recording thread, not on a thread of its own.
+            # frame_size equals the chunk size so that each call processes exactly one chunk.
+            frame_size = chunk
+            try:
+                # Take the reference channel count from reference_signal_buffer, which is created from the playback channel count
+                ref_channels = self.reference_signal_buffer.channels if self.reference_signal_buffer else 1
+                self.echo_canceller = EchoCanceller(
+                    sample_rate=sample_rate,
+                    frame_size=frame_size,
+                    channels=self.channels,  # microphone input channels (typically 1)
+                    ref_channels=ref_channels,  # reference channels = playback channels (typically 2)
+                    logger=logger
+                )
+                if self.echo_canceller.aec is not None:
+                    if logger:
+                        logger.info(f"Echo canceller enabled: sample_rate={sample_rate}, frame_size={frame_size}")
+                else:
+                    if logger:
+                        logger.warning("Echo canceller failed to initialize; echo cancellation is disabled")
+                    self.enable_echo_cancellation = False
+                    self.echo_canceller = None
+            except Exception as e:
+                if logger:
+                    logger.warning(f"Echo canceller failed to initialize: {e}; echo cancellation is disabled")
+                self.enable_echo_cancellation = False
+                self.echo_canceller = None
+        else:
+            self.echo_canceller = None
+    
+    def record_with_vad(self):
+        """Recording thread: VAD plus energy detection"""
+        if self.on_heartbeat:
+            self.on_heartbeat()
+
+        try:
+            stream = self.audio.open(
+                format=self.format,
+                channels=self.channels,
+                rate=self.sample_rate,
+                input=True,
+                input_device_index=self.device_index if self.device_index >= 0 else None,
+                frames_per_buffer=self.chunk
+            )
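+        except Exception as e:
+            raise RuntimeError(f"Failed to open audio input device: {e}") from e
+
+        # VAD window: speech is detected within 0.5 s at the earliest
+        window_sec = 0.5
+        # No speech for silence_duration_ms (default 1 s, minimum 0.1 s) counts as silence
+        no_speech_threshold = max(self.silence_duration_ms / 1000.0, 0.1) 
+
+        last_heartbeat_time = time.time()
+
+        audio_buffer = [] # VAD window buffer
+        last_active_time = time.time() # reference point for the silence timer
+        in_speech_segment = False # inside a speech segment (from first detected speech until the silence timeout)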
+
+        try:
+            while not self.stop_flag():
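+                # exception_on_overflow=False: drop frames rather than block
+                data = stream.read(self.chunk, exception_on_overflow=False)
+                
+                # Echo cancellation
+                processed_data = data
+                if self.enable_echo_cancellation and self.echo_canceller and self.reference_signal_buffer:
+                    try:
+                        # Fetch a reference signal whose length matches the microphone chunk
+                        ref_signal = self.reference_signal_buffer.get_reference(num_samples=self.chunk)
+                        # Run echo cancellation
+                        processed_data = self.echo_canceller.process(data, ref_signal)
+                    except Exception as e:
+                        if self.logger:
+                            self.logger.warning(f"Echo cancellation failed: {e}; using raw audio")
+                        processed_data = data
+                
+                # Gate for the queue (used to block ASR, e.g. when no voiceprint file exists and registration is required)
+                if self.should_put_to_queue():
+                    # When the queue is full, drop the oldest chunk so the system keeps listening even if ASR falls behind
+                    if self.audio_queue.full():
+                        try:
+                            self.audio_queue.get_nowait()
+                        except queue.Empty:
+                            pass  # the consumer drained the queue in the meantime
+                    # Queue the processed (echo-cancelled) audio
+                    try:
+                        self.audio_queue.put_nowait(processed_data)
+                    except queue.Full:
+                        pass  # lost the race for the free slot; drop this chunk
+                
+                # Optional per-chunk callback (e.g. voiceprint recording)
+                if self.on_audio_chunk:
+                    # The callback receives the processed audio
+                    self.on_audio_chunk(processed_data)
+                
+                # VAD also runs on the processed (echo-cancelled) audio
+                audio_buffer.append(processed_data) # used for VAD only, not for ASR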
+
+                # VAD window
+                now = time.time()
+                if len(audio_buffer) * self.chunk / self.sample_rate >= window_sec:
+                    raw_audio = b''.join(audio_buffer)
+                    energy = self._calculate_energy(raw_audio)
+                    vad_result = self._check_activity(raw_audio)
+
+                    self._debug_counter += 1
+                    if self._debug_counter >= 10:
+                        if self.logger:
+                            self.logger.info(f"[VAD debug] energy={energy:.1f}, threshold={self.min_energy_threshold}, VAD result={vad_result}")
+                        self._debug_counter = 0
+
+                    if vad_result:
+                        last_active_time = now
+                        
+                        if not in_speech_segment: # silent in the previous window, speaking in this one
+                            in_speech_segment = True
+                            if self.on_speech_start:
+                                self.on_speech_start()
+                            
+                            # Check whether TTS is currently playing
+                            if self.is_playing() and self.on_new_segment:
+                                self.on_new_segment() # callback that interrupts TTS
+                    else:
+                        if in_speech_segment:
+                            # In a speech segment but the current window is silent: check how long the silence has lasted
+                            silence_duration = now - last_active_time
+                            
+                            # Fetch the silence threshold dynamically when a callback is provided
+                            if self.get_silence_threshold:
+                                current_silence_ms = self.get_silence_threshold()
+                                current_no_speech_threshold = max(current_silence_ms / 1000.0, 0.1)
+                            else:
+                                current_no_speech_threshold = no_speech_threshold
+                            
+                            # Debug logging
+                            if self.logger and silence_duration < current_no_speech_threshold:
+                                self.logger.debug(f"[VAD] Silence: {silence_duration:.3f}s < threshold {current_no_speech_threshold:.3f}s")
+                            
+                            if silence_duration >= current_no_speech_threshold:
+                                if self.on_speech_end:
+                                    if self.logger:
+                                        self.logger.debug(f"[VAD] speech_end triggered: silence {silence_duration:.3f}s >= threshold {current_no_speech_threshold:.3f}s")
+                                    self.on_speech_end() # notify the system that the user stopped speaking
+                                in_speech_segment = False
+                        
+                        if self.on_heartbeat and now - last_heartbeat_time >= self.heartbeat_interval:
+                            self.on_heartbeat()
+                            last_heartbeat_time = now
+
+                    audio_buffer = []
+        finally:
+            if stream.is_active():
+                stream.stop_stream()
+            stream.close()
+
+    @staticmethod
+    def _calculate_energy(audio_chunk: bytes) -> float:
+        """Compute the audio energy (RMS)"""
+        if not audio_chunk:
+            return 0.0
+        # Sample count = byte count // 2 (16-bit PCM, 2 bytes per sample)
+        n = len(audio_chunk) // 2
+        if n <= 0:
+            return 0.0
+        # Unpack the bytes as little-endian signed 16-bit integers
+        samples = struct.unpack(f'<{n}h', audio_chunk[: n * 2])
+        if not samples:
+            return 0.0
+        return (sum(s * s for s in samples) / len(samples)) ** 0.5
+
+    def _check_activity(self, audio_data: bytes) -> bool:
+        """VAD plus energy check: VAD decides first, energy acts as a secondary gate"""
+        energy = self._calculate_energy(audio_data)
+        
+        rate = 0.4 # empirical fraction of voiced frames for continuous speech
+        num = 0
+
+        # Example at 16000 Hz: a 20 ms frame holds 16000 × 0.02 = 320 samples,
+        # i.e. 320 × 2 = 640 bytes per frame
+        bytes_per_sample = 2 # paInt16
+        frame_samples = int(self.sample_rate * 0.02)
+        frame_bytes = frame_samples * bytes_per_sample
+
+        if frame_bytes <= 0 or len(audio_data) < frame_bytes:
+            return False
+        
+        total_frames = len(audio_data) // frame_bytes
+        required = max(1, int(total_frames * rate))
+
+        for i in range(0, len(audio_data), frame_bytes):
+            chunk = audio_data[i:i + frame_bytes]
+            if len(chunk) == frame_bytes:
+                if self.vad_detector.vad.is_speech(chunk, sample_rate=self.sample_rate):
+                    num += 1
+
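+        # Energy is high at the start of an utterance and drops in the middle and tail
+        # (drawn-out or trailing sounds), so the gate uses half of the configured threshold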
+        vad_result = num >= required
+        if vad_result and energy < self.min_energy_threshold * 0.5:
+            return False
+        
+        return vad_result
+    
+    def cleanup(self):
+        """Release resources"""
+        if hasattr(self, 'audio') and self.audio:
+            self.audio.terminate()
+
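Review note: the window decision made by `_check_activity` (count voiced 20 ms frames, require a minimum fraction, then gate on half the RMS threshold) can be exercised without audio hardware. The sketch below is a simplified stand-in, not the code from this change: `rms` and `window_is_speech` are hypothetical names, and the per-frame detector is passed in as a plain callable instead of `webrtcvad`.

```python
import struct


def rms(pcm: bytes) -> float:
    """Root-mean-square of little-endian 16-bit PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f'<{n}h', pcm[:n * 2])
    return (sum(s * s for s in samples) / n) ** 0.5


def window_is_speech(pcm, is_speech_frame, sample_rate=16000, frame_ms=20,
                     ratio=0.4, min_energy=300):
    """Vote over fixed-size frames, then apply the energy gate."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # mono, 2 bytes per sample
    total = len(pcm) // frame_bytes
    if total == 0:
        return False
    voiced = sum(
        1 for i in range(total)
        if is_speech_frame(pcm[i * frame_bytes:(i + 1) * frame_bytes])
    )
    if voiced < max(1, int(total * ratio)):
        return False
    # Half the threshold, so quieter mid-utterance audio still passes.
    return rms(pcm) >= min_energy * 0.5


# Two synthetic 0.5 s windows at 16 kHz: 25 frames of 320 samples each.
loud = struct.pack('<320h', *([1000] * 320)) * 25
quiet = struct.pack('<320h', *([10] * 320)) * 25
```

With a detector that always answers "speech", the loud window passes and the quiet one is rejected by the energy gate alone, which is the case the half-threshold check is meant to catch.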
--- a/robot_speaker/perception/camera_client.py
+++ b/robot_speaker/perception/camera_client.py
@@ -0,0 +1,131 @@
+"""
+Camera module: RealSense camera wrapper
+"""
+import numpy as np
+import contextlib
+
+
+class CameraClient:
+    def __init__(self, 
+                 serial_number: str | None,
+                 width: int,
+                 height: int,
+                 fps: int,
+                 format: str,
+                 logger=None):
+        self.serial_number = serial_number
+        self.width = width
+        self.height = height
+        self.fps = fps
+        self.format = format
+        self.logger = logger
+        
+        self.pipeline = None
+        self.config = None
+        self._is_initialized = False
+        self._rs = None
+    
+    def _log(self, level: str, msg: str):
+        if self.logger:
+            getattr(self.logger, level, self.logger.info)(msg)
+        else:
+            print(f"[camera] {msg}")
+    
+    def initialize(self) -> bool:
+        """
+        Initialize and start the camera pipeline
+        """
+        if self._is_initialized:
+            return True
+        
+        try:
+            import pyrealsense2 as rs
+            self._rs = rs
+            
+            self.pipeline = rs.pipeline()
+            self.config = rs.config()
+            
+            if self.serial_number:
+                self.config.enable_device(self.serial_number)
+            
+            self.config.enable_stream(
+                rs.stream.color, 
+                self.width, 
+                self.height, 
+                rs.format.rgb8 if self.format == 'RGB8' else rs.format.bgr8,
+                self.fps
+            )
+            
+            self.pipeline.start(self.config)
+            self._is_initialized = True
+            self._log("info", f"Camera started and kept running: {self.width}x{self.height}@{self.fps}fps")
+            return True
+        except Exception as e:
+            self._log("error", f"Camera initialization failed: {e}")
+            self.cleanup()
+            return False
+    
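+    def cleanup(self):
+        """Stop the camera pipeline and release resources"""
+        if self.pipeline:
+            try:
+                self.pipeline.stop()
+                self._log("info", "Camera stopped")
+            except Exception as e:
+                # stop() raises if the pipeline was never started (e.g. initialize() failed early)
+                self._log("warning", f"Failed to stop camera pipeline: {e}")
+        self.pipeline = None
+        self.config = None
+        self._is_initialized = False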
+    
+    def capture_rgb(self) -> np.ndarray | None:
+        """
+        Capture one RGB frame from the running camera pipeline
+        """
+        if not self._is_initialized:
+            self._log("error", "Camera not initialized; cannot capture an image")
+            return None
+        
+        try:
+            frames = self.pipeline.wait_for_frames()
+            color_frame = frames.get_color_frame()
+            if not color_frame:
+                self._log("warning", "No color frame in the captured frameset")
+                return None
+            
+            return np.asanyarray(color_frame.get_data())
+        except Exception as e:
+            self._log("error", f"Image capture failed: {e}")
+            return None
+    
+    @contextlib.contextmanager
+    def capture_context(self):
+        """
+        Context manager: capture an image and release it automatically
+        """
+        image_data = self.capture_rgb()
+        try:
+            yield image_data
+        finally:
+            if image_data is not None:
+                del image_data
+    
+    def capture_multiple(self, count: int = 1) -> list[np.ndarray]:
+        """
+        Capture several images (reserved for future extension)
+        """
+        images = []
+        for i in range(count):
+            img = self.capture_rgb()
+            if img is not None:
+                images.append(img)
+            else:
+                self._log("warning", f"Failed to capture image {i+1}")
+        return images
+    
+    @contextlib.contextmanager
+    def capture_multiple_context(self, count: int = 1):
+        """
+        Context manager: capture several images and release them automatically
+        """
+        images = self.capture_multiple(count)
+        try:
+            yield images
+        finally:
+            for img in images:
+                del img
+            images.clear()
+
--- a/robot_speaker/perception/echo_cancellation.py
+++ b/robot_speaker/perception/echo_cancellation.py
@@ -0,0 +1,98 @@
+import collections
+import numpy as np
+
+
+class ReferenceSignalBuffer:
+    """Reference-signal buffer"""
+    
+    def __init__(self, sample_rate: int, channels: int, max_duration_ms: int | None = None,
+                 buffer_seconds: float = 5.0):
+        self.sample_rate = int(sample_rate)
+        self.channels = int(channels)
+        if max_duration_ms is not None:
+            buffer_seconds = max(float(max_duration_ms) / 1000.0, 0.1)
+        self.max_samples = int(self.sample_rate * buffer_seconds)
+        self._buffer = collections.deque(maxlen=self.max_samples * self.channels)
+    
+    def add_reference(self, data: bytes, source_sample_rate: int, source_channels: int):
+        # No resampling or remixing is done here: data whose format differs from the buffer's is dropped
+        if source_sample_rate != self.sample_rate or source_channels != self.channels:
+            return
+        samples = np.frombuffer(data, dtype=np.int16)
+        self._buffer.extend(samples.tolist())
+    
+    def get_reference(self, num_samples: int) -> bytes:
+        needed = int(num_samples) * self.channels
+        if needed <= 0:
+            return b""
+        if len(self._buffer) < needed:
+            data = list(self._buffer) + [0] * (needed - len(self._buffer))
+        else:
+            data = list(self._buffer)[-needed:]
+        return np.array(data, dtype=np.int16).tobytes()
+
+
+class EchoCanceller:
+    """Echo canceller (based on aec-audio-processing)"""
+    
+    def __init__(self, sample_rate: int, frame_size: int, channels: int, ref_channels: int, logger=None):
+        self.sample_rate = int(sample_rate)
+        self.frame_size = int(frame_size)
+        self.channels = int(channels)
+        self.ref_channels = int(ref_channels)
+        self.logger = logger
+        self.aec = None
+        self._process_reverse = None
+        self._frame_bytes = int(self.sample_rate / 100) * self.channels * 2  # 10ms, int16
+        self._ref_frame_bytes = int(self.sample_rate / 100) * self.ref_channels * 2
+        try:
+            from aec_audio_processing import AudioProcessor
+            self.aec = AudioProcessor(enable_aec=True, enable_ns=False, enable_agc=False)
+            self.aec.set_stream_format(self.sample_rate, self.channels)
+            if hasattr(self.aec, "set_reverse_stream_format"):
+                self.aec.set_reverse_stream_format(self.sample_rate, self.ref_channels)
+            if hasattr(self.aec, "set_stream_delay"):
+                self.aec.set_stream_delay(0)
+            if hasattr(self.aec, "process_reverse_stream"):
+                self._process_reverse = self.aec.process_reverse_stream
+            elif hasattr(self.aec, "process_reverse"):
+                self._process_reverse = self.aec.process_reverse
+        except Exception as e:
+            if self.logger:
+                self.logger.warning(f"Failed to initialize the AEC backend: {e}")
+            self.aec = None
+    
+    def process(self, mic_data: bytes, ref_data: bytes) -> bytes:
+        if not self.aec:
+            return mic_data
+        if not mic_data:
+            return mic_data
+
+        try:
+            out_chunks = []
+            total_len = len(mic_data)
+            frame_bytes = self._frame_bytes
+            ref_frame_bytes = self._ref_frame_bytes
+
+            frame_count = (total_len + frame_bytes - 1) // frame_bytes
+            for i in range(frame_count):
+                m_start = i * frame_bytes
+                m_end = m_start + frame_bytes
+                mic_frame = mic_data[m_start:m_end]
+                if len(mic_frame) < frame_bytes:
+                    mic_frame = mic_frame + b"\x00" * (frame_bytes - len(mic_frame))
+
+                if ref_data:
+                    r_start = i * ref_frame_bytes
+                    r_end = r_start + ref_frame_bytes
+                    ref_frame = ref_data[r_start:r_end]
+                    if len(ref_frame) < ref_frame_bytes:
+                        ref_frame = ref_frame + b"\x00" * (ref_frame_bytes - len(ref_frame))
+                    if self._process_reverse:
+                        self._process_reverse(ref_frame)
+
+                processed = self.aec.process_stream(mic_frame)
+                out_chunks.append(processed if processed is not None else mic_frame)
+
+            return b"".join(out_chunks)[:total_len]
+        except Exception as e:
+            if self.logger:
+                self.logger.warning(f"Echo cancellation failed: {e}; using raw audio")
+            return mic_data
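Review note: `EchoCanceller.process` cuts each chunk into 10 ms frames, zero-pads the last one, and trims the joined output back to the input length. A minimal sketch of just that framing step, with no AEC backend involved, is shown here; `split_frames` is a hypothetical helper name and the 16 kHz mono input is an assumed example.

```python
def split_frames(pcm: bytes, sample_rate: int, channels: int = 1,
                 frame_ms: int = 10) -> list:
    """Split int16 PCM into fixed-size frames, zero-padding the final frame."""
    frame_bytes = sample_rate * frame_ms // 1000 * channels * 2
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        frames.append(frame)
    return frames


# 400 mono samples at 16 kHz = 25 ms, i.e. two full frames plus half a frame.
chunk = b"\x01\x00" * 400
frames = split_frames(chunk, 16000)

# Joining the frames and trimming to the input length restores the original bytes.
restored = b"".join(frames)[:len(chunk)]
```

Trimming after the join is what keeps the padded tail from leaking into the audio handed to VAD and ASR.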
--- a/robot_speaker/perception/speaker_verifier.py
+++ b/robot_speaker/perception/speaker_verifier.py
@@ -0,0 +1,304 @@
+"""
+Speaker verification (voiceprint) module
+"""
+import numpy as np
+import threading
+import tempfile
+import os
+import wave
+import time
+import json
+from enum import Enum
+
+
+class SpeakerState(Enum):
+    """Speaker verification state"""
+    UNKNOWN = "unknown"
+    VERIFIED = "verified"
+    REJECTED = "rejected"
+    ERROR = "error"
+
+
+class SpeakerVerificationClient:
+    """Speaker verification client: non-real-time, low-frequency processing"""
+    
+    def __init__(self, model_path: str, threshold: float, speaker_db_path: str = None, logger=None):
+        self.model_path = model_path
+        self.threshold = threshold
+        self.speaker_db_path = speaker_db_path
+        self.logger = logger
+        self.speaker_db = {}  # {speaker_id: {"embedding": np.ndarray, "env": str, "threshold": float, "registered_at": float}}
+        self._lock = threading.Lock()
+        
+        # CPU optimization: limit Torch to a single thread to avoid a sharp slowdown from thread contention
+        import torch
+        torch.set_num_threads(1)
+
+        from funasr import AutoModel
+        model_path = os.path.expanduser(self.model_path)
+        # Disable the update check so initialization does not hit the network every time
+        self.model = AutoModel(model=model_path, device="cpu", disable_update=True)
+        if self.logger:
+            self.logger.info(f"Speaker model loaded: {model_path}, threshold: {self.threshold}")
+        
+        if self.speaker_db_path:
+            self.load_speakers()
+    
+    def _log(self, level: str, msg: str):
+        """Log a message, working around ROS2 logger issues in multi-threaded code"""
+        if self.logger:
+            try:
+                log_methods = {
+                    "debug": self.logger.debug,
+                    "info": self.logger.info,
+                    "warning": self.logger.warning,
+                    "error": self.logger.error,
+                    "fatal": self.logger.fatal
+                }
+                log_method = log_methods.get(level.lower(), self.logger.info)
+                log_method(msg)
+            except ValueError as e:
+                if "severity cannot be changed" in str(e):
+                    try:
+                        self.logger.info(f"[speaker-{level.upper()}] {msg}")
+                    except Exception:
+                        print(f"[speaker-{level.upper()}] {msg}")
+                else:
+                    raise
+        else:
+            print(f"[speaker] {msg}")
+    
+    def _write_temp_wav(self, audio_data: np.ndarray, sample_rate: int = 16000):
+        """Write a numpy audio array to a temporary wav file"""
+        audio_int16 = audio_data.astype(np.int16)
+        
+        fd, temp_path = tempfile.mkstemp(suffix='.wav', prefix='sv_')
+        os.close(fd)
+        
+        with wave.open(temp_path, 'wb') as wav_file:
+            wav_file.setnchannels(1)
+            wav_file.setsampwidth(2)
+            wav_file.setframerate(sample_rate)
+            wav_file.writeframes(audio_int16.tobytes())
+        
+        return temp_path
+    
+    def extract_embedding(self, audio_data: np.ndarray, sample_rate: int = 16000):
+        """
+        Extract a speaker embedding (low frequency: called once per utterance)
+        """
+        # Downsample towards 16000 Hz when needed.
+        # Models such as CAM++ generally expect 16 kHz input; passing 48 kHz makes
+        # internal resampling very slow or inflates the compute cost.
+        target_sr = 16000
+        if sample_rate > target_sr:
+            # Plain decimation (no anti-aliasing filter): keep every step-th sample
+            step = max(1, sample_rate // target_sr)
+            audio_data = audio_data[::step]
+            # For integer ratios (e.g. 48k -> 16k) this equals target_sr. Otherwise record the
+            # rate actually produced so the WAV header stays correct, leaving any remaining
+            # resampling to funasr.
+            sample_rate = int(round(sample_rate / step))
+        
+        if len(audio_data) < int(sample_rate * 0.5):
+            return None, False
+        
+        temp_wav_path = None
+        try:
+            # Keep Torch single-threaded during inference to avoid extreme CPU contention and
+            # context-switch overhead under multitasking (especially recording while verifying)
+            import torch
+            with torch.inference_mode():
+                # set_num_threads is global and was already set in __init__; re-check before the call
+                if torch.get_num_threads() != 1:
+                    torch.set_num_threads(1)
+                
+                temp_wav_path = self._write_temp_wav(audio_data, sample_rate)
+                result = self.model.generate(input=temp_wav_path)
+                
+                embedding = result[0]['spk_embedding'].detach().cpu().numpy()[0]  # shape [1, 192] -> [192]
+           
+            embedding_dim = len(embedding)
+            if embedding_dim == 0:
+                return None, False
+            
+            return embedding, True
+        except Exception as e:
+            self._log("error", f"Failed to extract embedding: {e}")
+            return None, False
+        finally:
+            if temp_wav_path and os.path.exists(temp_wav_path):
+                try:
+                    os.unlink(temp_wav_path)
+                except OSError:
+                    pass
+    
+    def register_speaker(self, speaker_id: str, embedding: np.ndarray, 
+                        env: str = "near", threshold: float = None) -> bool:
+        """
+        Register a speaker
+        """
+        embedding_dim = len(embedding)
+        if embedding_dim == 0:
+            return False
+        embedding_norm = np.linalg.norm(embedding)
+        if embedding_norm == 0:
+            self._log("error", "Registration failed: embedding norm is 0")
+            return False
+        embedding_normalized = embedding / embedding_norm
+        
+        speaker_threshold = threshold if threshold is not None else self.threshold
+        
+        with self._lock:
+            self.speaker_db[speaker_id] = {
+                "embedding": embedding_normalized,
+                "env": env,  # recording environment label
+                "threshold": speaker_threshold,
+                "registered_at": time.time()
+            }
+            self._log("info", f"Speaker registered: {speaker_id}, threshold: {speaker_threshold:.3f}, dim: {embedding_dim}")
+        save_result = self.save_speakers()
+        if not save_result:
+            self._log("warning", f"Failed to save the speaker database; speaker is registered in memory only: {speaker_id}")
+        return True
+    
+    def match_speaker(self, embedding: np.ndarray):
+        """
+        Match a speaker (call only once per utterance).
+        """
+        if not self.speaker_db:
+            return None, SpeakerState.UNKNOWN, 0.0, self.threshold
+        
+        embedding_dim = len(embedding)
+        if embedding_dim == 0:
+            return None, SpeakerState.ERROR, 0.0, self.threshold
+    
+        embedding_norm = np.linalg.norm(embedding)
+        if embedding_norm == 0:
+            return None, SpeakerState.ERROR, 0.0, self.threshold
+        embedding_normalized = embedding / embedding_norm
+        
+        best_match = None
+        best_score = -1.0
+        best_threshold = self.threshold
+        
+        with self._lock:
+            for speaker_id, speaker_data in self.speaker_db.items():
+                ref_embedding = speaker_data["embedding"]
+                score = np.dot(embedding_normalized, ref_embedding)
+                
+                if score > best_score:
+                    best_score = score
+                    best_match = speaker_id
+                    best_threshold = speaker_data["threshold"]
+        
+        state = SpeakerState.VERIFIED if best_score >= best_threshold else SpeakerState.REJECTED
+        return (best_match, state, best_score, best_threshold)
+    
+    def is_available(self) -> bool:
+        return self.model is not None
+    
+    def cleanup(self):
+        """Release resources."""
+        pass
+    
+    def get_speaker_count(self) -> int:
+        with self._lock:
+            return len(self.speaker_db)
+    
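+    def remove_speaker(self, speaker_id: str) -> bool:
+        with self._lock:
+            if speaker_id not in self.speaker_db:
+                return False
+            del self.speaker_db[speaker_id]
+        # Save after releasing the lock: save_speakers() acquires self._lock itself,
+        # which would deadlock here if the lock is non-reentrant.
+        self.save_speakers()
+        return True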
+    
+    def load_speakers(self) -> bool:
+        """
+        Load registered voiceprints from file.
+        """
+        if not self.speaker_db_path:
+            return False
+        
+        if not os.path.exists(self.speaker_db_path):
+            self._log("info", f"Speaker database file not found: {self.speaker_db_path}; a new database will be created")
+            return False
+        
+        try:
+            with open(self.speaker_db_path, 'r', encoding='utf-8') as f:
+                data = json.load(f)
+            
+            with self._lock:
+                for speaker_id, speaker_data in data.items():
+                    embedding_list = speaker_data["embedding"]
+                    embedding_array = np.array(embedding_list, dtype=np.float32)
+                    
+                    embedding_dim = len(embedding_array)
+                    if embedding_dim == 0:
+                        self._log("warning", f"Skipping invalid voiceprint: {speaker_id} (dimension is 0)")
+                        continue
+                    embedding_norm = np.linalg.norm(embedding_array)
+                    if embedding_norm > 0:
+                        embedding_array = embedding_array / embedding_norm
+                    
+                    self.speaker_db[speaker_id] = {
+                        "embedding": embedding_array,
+                        "env": speaker_data.get("env", "near"),  # tolerate legacy entries without env, as save_speakers does
+                        "threshold": speaker_data["threshold"],
+                        "registered_at": speaker_data["registered_at"]
+                    }
+                
+                count = len(self.speaker_db)
+                self._log("info", f"Loaded {count} registered speaker(s)")
+            return True
+        except Exception as e:
+            self._log("error", f"Failed to load the speaker database: {e}")
+            return False
+    
+    def save_speakers(self) -> bool:
+        """
+        Save registered voiceprints to file.
+        """
+        if not self.speaker_db_path:
+            self._log("warning", "Speaker database path is not configured; cannot save to file (speakers remain registered in memory)")
+            return False
+        
+        try:
+            db_dir = os.path.dirname(self.speaker_db_path)
+            if db_dir and not os.path.exists(db_dir):
+                os.makedirs(db_dir, exist_ok=True)
+            json_data = {}
+            with self._lock:
+                for speaker_id, speaker_data in self.speaker_db.items():
+                    json_data[speaker_id] = {
+                        "embedding": speaker_data["embedding"].tolist(),  # numpy array -> list
+                        "env": speaker_data.get("env", "near"),  # backward compatible with old data; defaults to "near"
+                        "threshold": speaker_data["threshold"],
+                        "registered_at": speaker_data["registered_at"]
+                    }
+            
+            temp_path = self.speaker_db_path + ".tmp"
+            with open(temp_path, 'w', encoding='utf-8') as f:
+                json.dump(json_data, f, indent=2, ensure_ascii=False)
+
+            os.replace(temp_path, self.speaker_db_path)
+            
+            self._log("info", f"Saved {len(json_data)} speaker(s) to: {self.speaker_db_path}")
+            return True
+        except Exception as e:
+            import traceback
+            self._log("error", f"Failed to save the speaker database: {e}")
+            self._log("error", f"Save path: {self.speaker_db_path}")
+            self._log("error", f"Error details: {traceback.format_exc()}")
+            temp_path = self.speaker_db_path + ".tmp"
+            if os.path.exists(temp_path):
+                try:
+                    os.unlink(temp_path)
+                except OSError:
+                    pass
+            return False
+
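Reviewer note: `match_speaker` above scores a query against each stored voiceprint by taking the dot product of L2-normalized vectors, which equals cosine similarity, and then compares the best score with that speaker's own threshold. The following is a minimal standalone sketch of that idea, not the code in this PR. The names `normalize` and `best_match`, the 0.8 thresholds, and the 3-dimensional toy vectors are assumptions made for illustration (the diff's embeddings are 192-dimensional).

```python
import numpy as np


def normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; raise on empty or zero-norm input."""
    norm = np.linalg.norm(vec)
    if vec.size == 0 or norm == 0:
        raise ValueError("embedding is empty or has zero norm")
    return vec / norm


def best_match(query: np.ndarray, db: dict, default_threshold: float = 0.5):
    """Return (speaker_id, verified, score, threshold) for the closest entry.

    `db` maps speaker_id -> {"embedding": unit vector, "threshold": float}.
    """
    if not db:
        return None, False, 0.0, default_threshold
    q = normalize(query)
    best_id, best_score, best_thr = None, -1.0, default_threshold
    for speaker_id, entry in db.items():
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = float(np.dot(q, entry["embedding"]))
        if score > best_score:
            best_id, best_score, best_thr = speaker_id, score, entry["threshold"]
    return best_id, best_score >= best_thr, best_score, best_thr


# Toy database with hypothetical speakers and thresholds.
db = {
    "alice": {"embedding": normalize(np.array([1.0, 0.0, 0.0])), "threshold": 0.8},
    "bob": {"embedding": normalize(np.array([0.0, 1.0, 0.0])), "threshold": 0.8},
}
close_to_alice = best_match(np.array([0.9, 0.1, 0.0]), db)
ambiguous = best_match(np.array([1.0, 1.0, 0.0]), db)
```

In this sketch `close_to_alice` is accepted, while `ambiguous` sits halfway between both speakers and falls below the threshold, so it is rejected even though a best candidate is still reported.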
--- a/robot_speaker/robot_speaker_node.py
+++ b/robot_speaker/robot_speaker_node.py
@@ -1,55 +0,0 @@
-import rclpy
-from rclpy.node import Node
-from example_interfaces.msg import String
-import threading
-from queue import Queue
-import time
-import espeakng
-import pyttsx3
-
-
-class RobotSpeakerNode(Node):
-    def __init__(self, node_name):
-        super().__init__(node_name)
-        self.novels_queue_ = Queue()
-        self.novel_subscriber_ = self.create_subscription(
-            String, 'robot_msg', self.novel_callback, 10)
-        self.speech_thread_ = threading.Thread(target=self.speak_thread)
-        self.speech_thread_.start()
-
-    def novel_callback(self, msg):
-        self.novels_queue_.put(msg.data)
-
-    def speak_thread(self):
-        # Initialize the engine
-        engine = pyttsx3.init()
-        # Tune parameters
-        engine.setProperty('rate', 150)  # speech rate (150 sounds more natural)
-        engine.setProperty('volume', 1.0)  # volume (0.0-1.0)
-        
-        # Select a Chinese voice (fix: use the languages attribute, which is a list)
-        voices = engine.getProperty('voices')
-        for voice in voices:
-            # Check whether the voice's language list includes Chinese ('zh', 'zh-CN', etc.)
-            if any('zh' in lang for lang in voice.languages):
-                engine.setProperty('voice', voice.id)
-                self.get_logger().info(f'Selected Chinese voice: {voice.id}')
-                break
-        else:
-            self.get_logger().warning('No Chinese voice found; falling back to the default voice')
-        
-        while rclpy.ok():
-            if self.novels_queue_.qsize() > 0:
-                text = self.novels_queue_.get()
-                engine.say(text)
-                engine.runAndWait()  # wait for speech playback to finish
-            else:
-                time.sleep(0.5)
-
-
-
-def main(args=None):
-    rclpy.init(args=args)
-    node = RobotSpeakerNode("robot_speaker_node")
-    rclpy.spin(node)
-    rclpy.shutdown()
--- a/robot_speaker/understanding/__init__.py
+++ b/robot_speaker/understanding/__init__.py
@@ -0,0 +1,5 @@
+"""Understanding layer."""
+
+
+
+
--- a/robot_speaker/understanding/context_manager.py
+++ b/robot_speaker/understanding/context_manager.py
@@ -0,0 +1,111 @@
+"""
+Conversation history management module
+"""
+from robot_speaker.core.types import LLMMessage
+import threading
+
+
+class ConversationHistory:
+    """Conversation history manager for real-time voice interaction."""
+    
+    def __init__(self, max_history: int, summary_trigger: int):
+        self.max_history = max_history
+        self.summary_trigger = summary_trigger
+        self.conversation_history: list[LLMMessage] = []
+        self.summary: str | None = None
+        
+        # Pending-confirmation mechanism
+        self._pending_user_message: LLMMessage | None = None  # user message awaiting confirmation
+        self._lock = threading.Lock()  # lock for thread safety
+    
+    def start_turn(self, user_content: str):
+        """Start a new turn: hold the user message until the LLM finishes, then commit it to history."""
+        with self._lock:
+            self._pending_user_message = LLMMessage(role="user", content=user_content)
+    
+    def commit_turn(self, assistant_content: str) -> bool:
+        """Confirm that the current turn is complete and write the user and assistant messages to history."""
+        with self._lock:
+            if self._pending_user_message is None:
+                return False
+            
+            if not assistant_content or not assistant_content.strip():
+                self._pending_user_message = None
+                return False
+            
+            self.conversation_history.append(self._pending_user_message)
+            self.conversation_history.append(
+                LLMMessage(role="assistant", content=assistant_content.strip())
+            )
+         
+            self._pending_user_message = None
+         
+            self._maybe_compress()
+            return True
+    
+    def cancel_turn(self):
+        """Cancel the pending turn and discard its user message. Used on interruption so incomplete content does not pollute the history."""
+        with self._lock:
+            if self._pending_user_message is not None:
+                self._pending_user_message = None
+    
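+    def add_message(self, role: str, content: str):
+        """Append a message directly."""
+        with self._lock:
+            # Drop any pending turn first. Clear the field inline: calling
+            # cancel_turn() here would re-acquire the non-reentrant lock and deadlock.
+            self._pending_user_message = None
+            self.conversation_history.append(LLMMessage(role=role, content=content))
+            self._maybe_compress()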
+    
+    def get_messages(self) -> list[LLMMessage]:
+        """Return the message list: summary (if any), recent history, then the pending user message."""
+        with self._lock:
+            messages = []
+
+            if self.summary:
+                messages.append(LLMMessage(role="system", content=self.summary))
+            
+            if self.max_history > 0:
+                messages.extend(self.conversation_history[-self.max_history * 2:])
+            
+            if self._pending_user_message is not None:
+                messages.append(self._pending_user_message)
+            
+            return messages
+    
+    def has_pending_turn(self) -> bool:
+        """Return whether a turn is awaiting confirmation."""
+        with self._lock:
+            return self._pending_user_message is not None
+    
+    def _maybe_compress(self):
+        """Compress the conversation history. Callers must already hold self._lock."""
+        if self.max_history <= 0:
+            self.conversation_history.clear()
+            return
+        
+        max_len = self.summary_trigger * 2
+        if len(self.conversation_history) <= max_len:
+            return
+        
+        old = self.conversation_history[:-max_len]
+        self.conversation_history = self.conversation_history[-max_len:]
+        
+        summary_text = []
+        for msg in old:
+            summary_text.append(f"{msg.role}: {msg.content}")
+        
+        compressed = "对话摘要：\n" + "\n".join(summary_text[-10:])
+        
+        if self.summary:
+            self.summary += "\n" + compressed
+        else:
+            self.summary = compressed
+    
+    def clear(self):
+        """Clear the history, the summary, and any pending message."""
+        with self._lock:
+            self.conversation_history.clear()
+            self.summary = None
+            self._pending_user_message = None
+
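Reviewer note: `ConversationHistory` above holds the user message as pending and writes it to history only once the assistant reply is complete, so an interrupted turn leaves no trace. The sketch below illustrates that pattern in isolation. `Message` and `TurnBuffer` are hypothetical stand-ins (not `LLMMessage` or the class in this PR), and summary compression is left out. Because `threading.Lock` is non-reentrant, no method in the sketch calls another locking method while holding the lock.

```python
import threading
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for LLMMessage from robot_speaker.core.types."""
    role: str
    content: str


class TurnBuffer:
    """Simplified sketch of the pending-turn pattern (no summary compression)."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant: never re-acquire while held
        self._history = []
        self._pending = None

    def start_turn(self, user_text: str) -> None:
        with self._lock:
            self._pending = Message("user", user_text)

    def commit_turn(self, assistant_text: str) -> bool:
        with self._lock:
            pending, self._pending = self._pending, None
            if pending is None or not (assistant_text or "").strip():
                return False  # nothing pending, or an empty reply: drop the turn
            self._history += [pending, Message("assistant", assistant_text.strip())]
            return True

    def cancel_turn(self) -> None:
        with self._lock:
            self._pending = None

    def messages(self) -> list:
        with self._lock:
            tail = [self._pending] if self._pending else []
            return self._history + tail


buf = TurnBuffer()
buf.start_turn("take a photo")
buf.cancel_turn()  # interrupted: nothing reaches the history
buf.start_turn("what time is it")
committed = buf.commit_turn("It is noon.")
roles = [m.role for m in buf.messages()]
```

After this sequence only the second turn is stored, as one user message followed by one assistant message.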
--- a/setup.py
+++ b/setup.py
@@ -1,26 +1,36 @@
-from setuptools import find_packages, setup
+from setuptools import setup, find_packages
+import os
+from glob import glob

 package_name = 'robot_speaker'

 setup(
    name=package_name,
-    version='0.0.0',
-    packages=[package_name],
+    version='0.0.1',
+    packages=find_packages(where='.'),
+    package_dir={'': '.'},
    data_files=[
        ('share/ament_index/resource_index/packages',
            ['resource/' + package_name]),
        ('share/' + package_name, ['package.xml']),
+        (os.path.join('share', package_name, 'launch'), glob('launch/*.launch.py')),
+        (os.path.join('share', package_name, 'config'), glob('config/*.yaml') + glob('config/*.json')),
+    ],
+    install_requires=[
+        'setuptools',
+        'pypinyin',
    ],
-    install_requires=['setuptools'],
    zip_safe=True,
    maintainer='mzebra',
    maintainer_email='mzebra@foxmail.com',
-    description='TODO: Package description',
+    description='ROS 2 package for speech recognition and synthesis',
    license='Apache-2.0',
    tests_require=['pytest'],
    entry_points={
        'console_scripts': [
-            'robot_speaker_node=robot_speaker.robot_speaker_node:main'
+            'robot_speaker_node = robot_speaker.core.robot_speaker_node:main',
+            'register_speaker_node = robot_speaker.core.register_speaker_node:main',
+            'skill_bridge_node = robot_speaker.bridge.skill_bridge_node:main',
        ],
    },
 )
--- a/view_camera.py
+++ b/view_camera.py
@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+"""
+Simple script for viewing the camera feed.
+Press space to save the current frame; press 'q' to quit.
+"""
+import sys
+import cv2
+import numpy as np
+try:
+    import pyrealsense2 as rs
+except ImportError:
+    print("Error: pyrealsense2 is not installed. Run: pip install pyrealsense2")
+    sys.exit(1)
+
+def main():
+    # Configure the camera
+    pipeline = rs.pipeline()
+    config = rs.config()
+    
+    # Enable the color stream
+    config.enable_stream(rs.stream.color, 640, 480, rs.format.rgb8, 30)
+    
+    # Start the pipeline
+    pipeline.start(config)
+    print("Camera started. Press space to save an image, 'q' to quit")
+    
+    frame_count = 0
+    try:
+        while True:
+            # Wait for a frame
+            frames = pipeline.wait_for_frames()
+            color_frame = frames.get_color_frame()
+            
+            if not color_frame:
+                continue
+            
+            # Convert to a numpy array (RGB format)
+            color_image = np.asanyarray(color_frame.get_data())
+            
+            # OpenCV uses BGR, so convert
+            bgr_image = cv2.cvtColor(color_image, cv2.COLOR_RGB2BGR)
+            
+            # Show the image
+            cv2.imshow('Camera View', bgr_image)
+            
+            # Poll for a key press
+            key = cv2.waitKey(1) & 0xFF
+            
+            if key == ord('q'):
+                print("Exiting...")
+                break
+            elif key == ord(' '):  # space saves the frame
+                frame_count += 1
+                filename = f'camera_frame_{frame_count:04d}.jpg'
+                cv2.imwrite(filename, bgr_image)
+                print(f"Saved: {filename}")
+    
+    except KeyboardInterrupt:
+        print("\nInterrupted...")
+    finally:
+        pipeline.stop()
+        cv2.destroyAllWindows()
+        print("Camera stopped")
+
+if __name__ == '__main__':
+    main()
+
+
Author	SHA1	Message	Date
NuoDaJia02	e5714e3a8b	fix: Optimize voice interaction pipeline 1. register_speaker_node: Enable AEC to match main node for better SV accuracy. 2. tts/dashscope: Fix ffmpeg argument order (input option thread_queue_size). 3. asr/dashscope: Keep WebSocket connection alive to reduce latency. 4. speaker_verifier: Force single-thread inference to avoid CPU contention.	2026-01-19 16:17:27 +08:00
NuoDaJia02	293e69e9f2	merge develop features	2026-01-19 14:32:57 +08:00
lxy	0409ce0de4	Fix audio length calculation for speaker verification	2026-01-19 14:21:06 +08:00
NuoDaJia02	ce0d581770	fix torch issue	2026-01-19 13:31:49 +08:00
NuoDaJia02	a1b91ed52f	disable echo cancellation	2026-01-19 11:35:01 +08:00
lxy	6d101b9d9e	Add bridge node to the behavior tree	2026-01-19 09:58:40 +08:00
NuoDaJia02	c282f9b4de	fix deploy issues	2026-01-19 09:09:28 +08:00
lxy	9fd658990c	datasets==3.6.0	2026-01-16 10:49:16 +08:00
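lxy	0c118412ec	Refactor code: separate speaker registration from the main node	2026-01-16 10:40:40 +08:00
lxy	eb91e2f139	Add AEC	2026-01-13 22:14:46 +08:00
lxy	838a4a357c	Add speaker verification	2026-01-12 20:39:47 +08:00
lxy	9c775cff5c	Add interrupt words	2026-01-12 17:40:08 +08:00
lxy	63a21999bb	Add camera invocation; fix conversation history management; fix ASR stop-recognition logic	2026-01-08 20:59:58 +08:00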
lxy	8fffd4ab42	chore: add .gitignore and stop tracking build/install/log outputs	2026-01-07 14:30:16 +08:00
xyliu	b90d84c325	feat(robot_speaker): create the voice package, including wake word, ASR, LLM, TTS, etc.	2026-01-07 14:14:29 +08:00
				`@@ -0,0 +1,2 @@`
				`# Bridge package for connecting LLM outputs to brain execution.`