15 Commits

Author SHA1 Message Date
NuoDaJia02
e5714e3a8b fix: Optimize voice interaction pipeline
1. register_speaker_node: Enable AEC to match main node for better SV accuracy.
2. tts/dashscope: Fix ffmpeg argument order (input option thread_queue_size).
3. asr/dashscope: Keep WebSocket connection alive to reduce latency.
4. speaker_verifier: Force single-thread inference to avoid CPU contention.
2026-01-19 16:17:27 +08:00
NuoDaJia02
293e69e9f2 merge develop features 2026-01-19 14:32:57 +08:00
lxy
0409ce0de4 Fix audio length calculation for voiceprint verification 2026-01-19 14:21:06 +08:00
NuoDaJia02
ce0d581770 fix torch issue 2026-01-19 13:31:49 +08:00
NuoDaJia02
a1b91ed52f disable echo cancellation 2026-01-19 11:35:01 +08:00
lxy
6d101b9d9e Add a bridge node to the behavior tree 2026-01-19 09:58:40 +08:00
NuoDaJia02
c282f9b4de fix deploy issues 2026-01-19 09:09:28 +08:00
lxy
9fd658990c datasets==3.6.0 2026-01-16 10:49:16 +08:00
lxy
0c118412ec Refactor code: separate voiceprint registration from the main node 2026-01-16 10:40:40 +08:00
lxy
eb91e2f139 Add AEC 2026-01-13 22:14:46 +08:00
lxy
838a4a357c Add voiceprint verification 2026-01-12 20:39:47 +08:00
lxy
9c775cff5c Add interrupt keywords 2026-01-12 17:40:08 +08:00
lxy
63a21999bb Add camera invocation; fix conversation history management; fix ASR stop-recognition logic 2026-01-08 20:59:58 +08:00
lxy
8fffd4ab42 chore: add .gitignore and stop tracking build/install/log outputs 2026-01-07 14:30:16 +08:00
b90d84c325 feat(robot_speaker): create the voice package
Includes wake word, ASR, LLM, TTS, and more.
2026-01-07 14:14:29 +08:00
39 changed files with 4665 additions and 65 deletions

.gitignore

@@ -0,0 +1,6 @@
build/
install/
log/
__pycache__/
*.pyc
*.egg-info/

README.md

@@ -1,2 +1,104 @@
# hivecore_robot_voice
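# ROS Voice Package (robot_speaker)
## Register with Alibaba Cloud Bailian to get an api_key
https://bailian.console.aliyun.com/?tab=model#/api-key
-> Key Management
Put the key in config/voice.yaml
## Install dependencies
1. System dependencies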
```bash
sudo apt-get update
sudo apt-get install -y python3-pyaudio portaudio19-dev alsa-utils ffmpeg swig meson ninja-build build-essential pkg-config libwebrtc-audio-processing-dev
```
2. Python dependencies
```bash
cd ~/ros_learn/hivecore_robot_voice
# On Python 3.10, aec-audio-processing must be installed separately to bypass the version check
pip3 install aec-audio-processing --no-binary :all: --ignore-requires-python --break-system-packages
pip3 install -r requirements.txt --break-system-packages
```
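## Build and launch
1. Register a voiceprint
- After starting the node, you can say "二狗今天天气真好" ("Ergou, the weather is really nice today") to start voiceprint registration.
- Recommended ways to register:
Method A (recommended): say the wake word, pause briefly, then say a long sentence.
User: "二狗" (the wake word, "Ergou")
Robot: (the log reports that it is waiting for voiceprint speech)
User: "我现在正在注册声纹,这是一段很长的测试语音,请把我的声音录进去。" (keep speaking for 3-5 seconds)
Method B (continuous): say one long sentence in a single breath.
User: "二狗你好,我是你的主人,请记住我的声音,这是一段用来注册的长语音。"
- Note: the utterance must include the wake word, should be spoken without pauses, and should preferably last longer than 1.5 seconds.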
```bash
cd ~/ros_learn/hivecore_robot_voice
colcon build
source install/setup.bash
ros2 run robot_speaker register_speaker_node
```
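2. Main node
- After the node starts, every utterance must include the wake word, with no pause between the wake word and the rest of the sentence.
- Say "二狗拍照看看" ("Ergou, take a photo and have a look") to start image-and-text interaction.
- Users with a registered voiceprint can interrupt playback.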
```bash
cd ~/ros_learn/hivecore_robot_voice
colcon build
source install/setup.bash
ros2 launch robot_speaker voice.launch.py
```
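## Architecture
[Recording thread] - the only real-time thread
├─ Capture PCM from the microphone
├─ VAD + energy detection
├─ Speech detected → interrupt TTS immediately
├─ Speech PCM → ASR audio queue
└─ Speech PCM → voiceprint audio queue (side path, non-blocking)
[ASR inference thread] - audio → text only
└─ Take audio from the ASR audio queue → real-time / streaming ASR → text → text queue
[Voiceprint thread] - non-real-time, low frequency (CAM++)
├─ Receive audio chunks through a callback, write them to a buffer, and wait for the speech_end event to trigger processing
├─ Accumulate 1-2 seconds of valid speech (after VAD)
├─ Extract the speaker embedding with CAM++
├─ Voiceprint matching / registration
└─ Update current_speaker_id (shared state; write only, no control)
Requirements for the voiceprint thread: it must not affect recording or ASR and must not control TTS; it only updates who the current speaker is.
[Main thread / processing thread] - business logic
├─ Take ASR text from the text queue
├─ Read current_speaker_id (read only)
├─ Wake-word handling (combined with speaker_id)
├─ Permission / identity check (whether to continue)
├─ VLM processing (text / multimodal)
└─ TTS playback (start the TTS thread without waiting)
[TTS playback thread] - playback only (interruptible)
├─ Receive the TTS audio stream
├─ Play to the output device
└─ React to the interrupt flag (set by the recording thread)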
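
The queue layout above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `recording_thread`, `asr_thread`, `run_pipeline`, and the `recognize` callable are hypothetical names, and the bounded voiceprint queue is an assumption used to show the "side path, non-blocking" idea.

```python
import queue
import threading


def recording_thread(chunks, asr_q, sv_q, interrupt_tts):
    """Stand-in for the real-time loop: fan speech chunks out to both queues."""
    for chunk in chunks:
        interrupt_tts.set()         # speech detected: ask the TTS thread to stop
        asr_q.put(chunk)            # main path: ASR audio queue
        try:
            sv_q.put_nowait(chunk)  # side path: must never block recording
        except queue.Full:
            pass                    # drop voiceprint audio rather than stall
    asr_q.put(None)                 # sentinel: no more audio


def asr_thread(asr_q, text_q, recognize):
    """Audio in, text out; nothing else happens on this thread."""
    while True:
        chunk = asr_q.get()
        if chunk is None:
            text_q.put(None)
            break
        text_q.put(recognize(chunk))


def run_pipeline(chunks, recognize):
    asr_q, text_q = queue.Queue(), queue.Queue()
    sv_q = queue.Queue(maxsize=2)   # assumed bound, only for illustration
    interrupt_tts = threading.Event()
    workers = [
        threading.Thread(target=recording_thread,
                         args=(chunks, asr_q, sv_q, interrupt_tts)),
        threading.Thread(target=asr_thread, args=(asr_q, text_q, recognize)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    texts = []
    while True:
        text = text_q.get()
        if text is None:
            break
        texts.append(text)
    return texts, interrupt_tts.is_set(), sv_q.qsize()
```

Because the voiceprint queue is filled with `put_nowait`, a slow voiceprint consumer only loses audio on the side path; the ASR path and the recording loop keep their timing.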
## Useful commands
1. Audio devices
```bash
# 1. List all audio devices
cat /proc/asound/cards
# 2. Show stream info (device parameters) for card 1
cat /proc/asound/card1/stream0
```
2. Camera devices
```bash
# 1. Show basic camera information (model, firmware version, serial number, etc.)
rs-enumerate-devices -c
```
3. Model download
```bash
modelscope download --model iic/speech_campplus_sv_zh-cn_16k-common --local_dir [target path]
```

config/knowledge.json

@@ -0,0 +1,18 @@
{
"entries": [
{
"id": "robot_identity",
"patterns": [
"ni shi shei"
],
"answer": "我叫二狗,是蜂核科技的机器人,很高兴为你服务"
},
{
"id": "wake_word",
"patterns": [
"ni de ming zi"
],
"answer": "我的名字是二狗"
}
]
}

config/speakers.json

@@ -0,0 +1,599 @@
{
"user_1768311644": {
"embedding": [
0.017083248123526573,
-0.01032772846519947,
0.0058503481559455395,
0.11945011466741562,
0.03864186629652977,
-0.16047827899456024,
0.008000967092812061,
0.10669729858636856,
0.13221754133701324,
0.06365424394607544,
-0.06943577527999878,
0.08401959389448166,
0.09903465211391449,
0.0407508946955204,
-0.07486417144536972,
0.0010617832886055112,
0.12097838521003723,
-0.013734623789787292,
-0.020789025351405144,
-0.02113250270485878,
0.008510188199579716,
-0.05490498244762421,
-0.17027714848518372,
0.09569162130355835,
-0.07379947602748871,
0.05932804197072983,
0.0839226171374321,
0.004776939284056425,
0.050190482288599014,
-0.19962339103221893,
-0.13987377285957336,
0.041607797145843506,
0.10067984461784363,
0.0684289038181305,
0.08163066953420639,
-0.029243428260087967,
-0.10118222236633301,
-0.11619988083839417,
-0.10121472179889679,
-0.04290663078427315,
-0.08373524248600006,
0.03493887186050415,
0.055566269904375076,
-0.11284282803535461,
-0.10970190167427063,
0.03457016497850418,
0.11647575348615646,
-0.014930102974176407,
-0.04663793370127678,
0.0752566009759903,
-0.06746217608451843,
-0.07642832398414612,
0.06518206000328064,
0.07191824167966843,
0.13557033240795135,
0.04906972125172615,
0.03679114207625389,
0.07466751337051392,
0.01071798987686634,
-0.07979520410299301,
-0.10039637982845306,
0.004846179857850075,
-0.07325125485658646,
-0.08750395476818085,
0.05332862585783005,
0.10648373514413834,
-0.035643525421619415,
0.21233271062374115,
0.011915713548660278,
0.13632774353027344,
0.10383394360542297,
-0.053550489246845245,
0.05719169229269028,
0.04600509628653526,
0.043678827583789825,
-0.03646669536828995,
0.08175459504127502,
0.042513635009527206,
-0.09215544164180756,
-0.06402364373207092,
-0.10830589383840561,
0.03379691392183304,
0.07699205726385117,
-0.11046901345252991,
-0.016612332314252853,
-0.02984754927456379,
0.00998819898813963,
-0.05820641294121742,
0.007753593847155571,
-0.016712933778762817,
0.0014505418948829174,
-0.04807407408952713,
-0.048170242458581924,
-0.0531715452671051,
0.019113507121801376,
0.08439801633358002,
0.010585008189082146,
-0.07400234043598175,
0.10156761854887009,
-0.018891986459493637,
-0.052156757563352585,
0.1302887201309204,
0.08590760082006454,
0.13382190465927124,
-0.1498136967420578,
-0.030552342534065247,
-0.09281301498413086,
0.10279291868209839,
0.015315898694097996,
-0.014133274555206299,
-0.01298056822270155,
0.06241781264543533,
0.017693962901830673,
0.0007682808791287243,
0.029756756499409676,
0.12711282074451447,
-0.0695323497056961,
0.01649993099272251,
0.08811338990926743,
-0.06976141035556793,
-0.0763985738158226,
-0.10730905085802078,
0.0256052203476429,
0.05183263123035431,
0.0947495624423027,
0.007070058956742287,
-0.0505177341401577,
-0.009485805407166481,
0.003954170271754265,
0.014901814050972462,
-0.08098141849040985,
0.03615008667111397,
-0.09673020988702774,
0.06970252841711044,
0.009914563037455082,
-0.012040670961141586,
-0.0008170561632141471,
-0.06880783289670944,
-0.053053151816129684,
0.05272500216960907,
0.021709589287638664,
-0.09712725877761841,
0.06947346031665802,
-0.07973745465278625,
-0.036861639469861984,
-0.08714801073074341,
0.05473816394805908,
-0.006384482141584158,
-0.03656519949436188,
0.0605260394513607,
0.0407724604010582,
-0.1314084380865097,
-0.05484895780682564,
0.014381998218595982,
-0.07414797693490982,
-0.013259666971862316,
-0.1076463982462883,
-0.04896606504917145,
0.050690483301877975,
0.0719417929649353,
0.04990950971841812,
-0.049923382699489594,
0.08706197887659073,
-0.06278207153081894,
-0.029196983203291893,
-0.07312408834695816,
0.01651231199502945,
0.025062547996640205,
-0.023919139057397842,
0.05597180873155594,
0.08446669578552246,
-0.06616690754890442,
0.011679486371576786,
0.008357426151633263,
-0.07388673722743988,
0.03612314909696579,
-0.055705588310956955,
-0.008656222373247147,
-0.06408344209194183,
-0.05341912433505058,
0.01561578270047903,
0.002446901286020875,
0.042539432644844055,
0.12226217240095139,
-0.03700198978185654,
0.02393815666437149,
-0.021217981353402138,
0.04431416094303131,
-0.09150857478380203,
-0.004766684491187334,
-0.06133556738495827,
0.07721113413572311
],
"env": "near",
"threshold": 0.4,
"registered_at": 1768311644.5742264
},
"user_1768529827": {
"embedding": [
0.0077949948608875275,
-0.012852567248046398,
0.0014490776229649782,
0.088177390396595,
-0.052150458097457886,
-0.1070166826248169,
-0.051932964473962784,
0.040730226784944534,
0.09491471946239471,
-0.10504328459501266,
-0.17986123263835907,
0.06056514009833336,
0.0002809118013828993,
-0.05353177338838577,
-0.08724740147590637,
-0.01057526096701622,
-0.10766296088695526,
0.024376090615987778,
-0.11535818874835968,
0.12653452157974243,
-0.0063497889786958694,
-0.02372283861041069,
-0.049704890698194504,
0.01079346239566803,
-0.10683158040046692,
0.00932641327381134,
0.043871842324733734,
0.04073511064052582,
0.005968529265373945,
0.05397576093673706,
0.07122175395488739,
0.06804963946342468,
-0.058389563113451004,
-0.03463176265358925,
-0.06834574788808823,
-0.09127284586429596,
-0.09805246442556381,
-0.015370666980743408,
-0.07054834067821503,
-0.07520422339439392,
-0.0502505861222744,
0.01580144092440605,
0.04316972196102142,
-0.010298517532646656,
-0.09042523056268692,
-0.03399325907230377,
0.03738871216773987,
0.09461583197116852,
0.07643604278564453,
-0.04089711233973503,
0.14397914707660675,
-0.03218085318803787,
-0.03981873393058777,
-0.05353623256087303,
-0.06475386023521423,
0.047925639897584915,
0.008481102995574474,
0.09522885829210281,
0.05679373815655708,
0.021448519080877304,
0.04586802423000336,
0.007880095392465591,
-0.08111433684825897,
-0.030093876644968987,
0.18197935819625854,
0.049670975655317307,
-0.029350068420171738,
0.1003178134560585,
0.05890532210469246,
-0.0418926365673542,
-0.015124992467463017,
-0.0016869385726749897,
0.029022999107837677,
0.10370466858148575,
-0.07392475008964539,
-0.041242245584726334,
0.0948185846209526,
0.0766805037856102,
0.12104924768209457,
0.07941737771034241,
-0.024586958810687065,
-0.005290709435939789,
0.08198735862970352,
-0.15709130465984344,
0.11847008019685745,
0.01280289888381958,
0.09401026368141174,
0.10199982672929764,
0.00811630580574274,
0.09336159378290176,
-0.1219155564904213,
0.00885648000985384,
0.08536995947360992,
-0.031735390424728394,
-0.02445235848426819,
0.17981232702732086,
0.05046188458800316,
-0.012413986958563328,
-0.16514025628566742,
-0.09369593858718872,
0.03961285203695297,
-0.024150250479578972,
0.024869512766599655,
0.009099201299250126,
0.0023227918427437544,
0.005291149020195007,
-0.08285452425479889,
0.02174258604645729,
-0.00018321558309253305,
-0.01761690340936184,
-0.13327360153198242,
0.07804469764232635,
-0.03172646835446358,
0.05993621423840523,
-0.0034280805848538876,
0.09203101694583893,
0.04720155894756317,
-0.12012632191181183,
-0.028879230841994286,
-0.04471825063228607,
-0.08928379416465759,
-0.055793069303035736,
-0.0230169165879488,
0.04459748789668083,
-0.08481008559465408,
0.09873232245445251,
-0.057500336319208145,
-0.05438977852463722,
0.06309207528829575,
-0.045493170619010925,
-0.0636027380824089,
-0.03580763190984726,
-0.043026816099882126,
0.04125182330608368,
-0.06327074766159058,
0.02830875851213932,
-0.0697140172123909,
-0.11324217170476913,
-0.02744743973016739,
-0.09659717977046967,
-0.036915868520736694,
0.06836548447608948,
-0.19481360912322998,
-0.08151774108409882,
0.013570327311754227,
-0.013908851891756058,
-0.02302597463130951,
-0.14017312228679657,
-0.0654999315738678,
0.0582318976521492,
-0.023702487349510193,
-0.046911414712667465,
-0.02062028832733631,
0.09885907918214798,
-0.010111358016729355,
-0.009303858503699303,
-0.07802718877792358,
0.09181840717792511,
-0.00822418462485075,
-0.024477459490299225,
0.04909557104110718,
0.024657243862748146,
0.08074013143777847,
0.10684694349765778,
-0.009657780639827251,
0.04053448513150215,
-0.054968591779470444,
0.09773849695920944,
-0.019937219098210335,
-0.11860335618257523,
-0.12553851306438446,
0.0016870739636942744,
0.07446407526731491,
-0.12183381617069244,
-0.07524612545967102,
0.06794209778308868,
-0.04324038699269295,
-0.018201345577836037,
-0.08356837183237076,
0.08218713104724884,
-0.1253940612077713,
-0.05880133807659149,
0.11516888439655304,
-0.007864559069275856,
0.06438153237104416,
-0.06551646441221237,
0.11812424659729004,
-0.07544125616550446,
0.033888354897499084,
0.02552076056599617,
0.019394448027014732,
-0.009682931937277317
],
"env": "near",
"threshold": 0.55,
"registered_at": 1768529827.4784193
},
"user_1768530001": {
"embedding": [
-0.02827363647520542,
0.04181317239999771,
-0.07721243053674698,
0.031220311298966408,
-0.006549456622451544,
-0.045262161642313004,
-0.06796529144048691,
0.10546170920133591,
-0.054266564548015594,
-0.04982651397585869,
0.008982052095234394,
0.0887555256485939,
-0.03736695274710655,
-0.027568811550736427,
-0.01881324127316475,
-0.030173255130648613,
-0.03817622363567352,
-0.027703644707798958,
-0.020354237407445908,
0.08958664536476135,
0.027346525341272354,
-0.007979321293532848,
-0.01638970896601677,
0.14815205335617065,
-0.029478076845407486,
0.0968138799071312,
0.011266525834798813,
0.10481037944555283,
0.006314543075859547,
-0.07480890303850174,
-0.126618891954422,
0.054260920733213425,
-0.054261378943920135,
0.02066616155207157,
0.056972429156303406,
-0.02620418183505535,
-0.08435375243425369,
-0.06768523901700974,
-0.001804384752176702,
-0.03350691497325897,
-0.06783927977085114,
0.09583555907011032,
0.042077258229255676,
-0.03811662644147873,
-0.09298640489578247,
0.11314687132835388,
0.06972789764404297,
-0.10421980172395706,
0.02739877998828888,
-0.06242597475647926,
0.06683704257011414,
0.030034003779292107,
-0.04094783961772919,
0.08657337725162506,
0.02882716991007328,
0.07672230899333954,
-0.0162385031580925,
0.12335177510976791,
-0.07505486160516739,
0.05924128741025925,
0.02278822474181652,
0.051575034856796265,
-0.07616295665502548,
-0.049982234835624695,
-0.021159915253520012,
0.023469945415854454,
-0.008445728570222855,
0.18868982791900635,
0.10217619687318802,
0.0029947187285870314,
0.003596147522330284,
-0.010885344818234444,
0.002336243400350213,
-0.06228164955973625,
-0.09452632069587708,
0.06288570165634155,
0.09799493104219437,
0.05772380530834198,
-0.012649190612137318,
0.037833958864212036,
-0.07815677672624588,
0.11595622450113297,
-0.006132716778665781,
-0.047689273953437805,
0.10451581329107285,
0.12618094682693481,
-0.012135603465139866,
-0.14452683925628662,
-0.011882219463586807,
0.05687599256634712,
-0.10221579670906067,
0.09555421024560928,
0.050166770815849304,
0.026791365817189217,
0.0343380831182003,
0.0643647089600563,
-0.09814899414777756,
-0.01735001988708973,
0.0002968672488350421,
-0.16691210865974426,
-0.044747937470674515,
0.10229559987783432,
0.01551489345729351,
0.0614253506064415,
-0.012457458302378654,
-0.059297215193510056,
-0.0662546306848526,
0.06900843977928162,
-0.15012530982494354,
0.14357514679431915,
-0.08563537150621414,
0.1512402445077896,
-0.05548126623034477,
-0.13191379606723785,
0.02588576264679432,
-0.007292638067156076,
-0.033004030585289,
-0.08764250576496124,
-0.04006534814834595,
0.001069005811586976,
0.0708790197968483,
-0.11471016705036163,
-0.08249906450510025,
-0.07923658937215805,
-0.029890256002545357,
0.027568599209189415,
-0.00042784016113728285,
0.01911524124443531,
0.002947323489934206,
-0.058468904346227646,
0.0006662740488536656,
-0.09472604095935822,
-0.07827164232730865,
0.05823435261845589,
-0.022661248221993446,
0.007729553151875734,
0.044511985033750534,
-0.17424426972866058,
-0.054321326315402985,
-0.010871038772165775,
-0.04280569776892662,
0.01373684499412775,
-0.03464324399828911,
0.0012510031228885055,
-0.13786448538303375,
0.13943427801132202,
0.07161138951778412,
-0.0017689999658614397,
-0.0330035537481308,
0.01767006888985634,
-0.06832484155893326,
-0.16906532645225525,
-0.08673631399869919,
0.016205811873078346,
-0.040736377239227295,
-0.053034041076898575,
-0.057571377605199814,
-0.018383856862783432,
0.029812879860401154,
-0.005708644632250071,
0.07977750152349472,
0.03715944290161133,
0.029830463230609894,
-0.15909501910209656,
0.10081987082958221,
0.07019384205341339,
0.05683498457074165,
0.008955223485827446,
-0.06697771698236465,
0.044268134981393814,
0.08812808990478516,
-0.17523430287837982,
0.05148027464747429,
-0.11579684168100357,
-0.06281758099794388,
-0.08106749504804611,
-0.07915353775024414,
0.03760797902941704,
-0.059639666229486465,
0.012170189991593361,
-0.028386766090989113,
-0.043592486530542374,
0.029122747480869293,
0.052276406437158585,
0.06929390132427216,
-0.10774848610162735,
0.06797030568122864,
-0.017512541264295578,
0.07446594536304474,
-0.07573172450065613,
-0.15186654031276703,
-0.03710319101810455
],
"env": "near",
"threshold": 0.55,
"registered_at": 1768530001.2158406
}
}
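
Each record in this file stores a 192-value embedding together with a per-speaker `threshold`. The sketch below shows one way such records could be matched against a new embedding. It assumes cosine similarity as the score, which this diff does not confirm, and `cosine_similarity` and `match_speaker` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, speaker_db, default_threshold=0.55):
    """Return (speaker_id, score) for the best record that clears its threshold.

    speaker_db has the same shape as speakers.json:
    {"user_x": {"embedding": [...], "threshold": 0.55, ...}, ...}
    Returns (None, None) when no record passes.
    """
    best_id, best_score = None, None
    for speaker_id, record in speaker_db.items():
        score = cosine_similarity(embedding, record["embedding"])
        threshold = record.get("threshold", default_threshold)
        if score >= threshold and (best_score is None or score > best_score):
            best_id, best_score = speaker_id, score
    return best_id, best_score
```

Reading the threshold from each record lets an entry registered with `0.4` stay more lenient than the ones registered with `0.55`.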

config/voice.yaml

@@ -0,0 +1,70 @@
# ROS voice package configuration file
dashscope:
api_key: "sk-7215a5ab7a00469db4072e1672a0661e"
asr:
model: "qwen3-asr-flash-realtime"
url: "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"
llm:
model: "qwen3-vl-flash"
base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
temperature: 0.7
max_tokens: 4096
max_history: 10
summary_trigger: 3
tts:
model: "cosyvoice-v3-flash"
voice: "longanyang"
audio:
microphone:
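device_index: 3 # points to iFLYTEK-M2 (hw:1,0)
sample_rate: 48000 # use the hardware's native 48 kHz sample rate to avoid problems that resampling might cause
channels: 1 # input channels: mono, suitable for voice capture
chunk: 1024
heartbeat_interval: 2.0 # heartbeat interval (seconds) for periodically logging the recording status
soundcard:
card_index: 1 # USB Audio Device (card 1)
device_index: 0 # USB Audio [USB Audio] (device 0)
# card_index: -1 # use the default sound card
# device_index: -1 # use the default output device
sample_rate: 48000 # output sample rate: 48 kHz (iFLYTEK supports 48000)
channels: 2 # output channels: stereo, 2 channels (FL+FR)
volume: 1.0 # volume ratio (0.0-1.0; 0.2 means 20% volume)
echo_cancellation:
enabled: false # whether to enable echo cancellation (true/false)
max_duration_ms: 500 # maximum length of the reference signal buffer (milliseconds)
tts:
source_sample_rate: 22050 # fixed output sample rate of the TTS service (fixed by DashScope; do not change)
source_channels: 1 # fixed output channel count of the TTS service (fixed by DashScope; do not change)
ffmpeg_thread_queue_size: 4096 # ffmpeg input thread queue size; increase it to reduce stuttering
vad:
vad_mode: 3 # VAD mode (0-3; 3 is the strictest)
silence_duration_ms: 1000 # silence duration (milliseconds)
min_energy_threshold: 300 # minimum energy threshold
system:
use_llm: true # whether to use the LLM
use_wake_word: true # whether to enable wake-word detection
wake_word: "er gou" # wake word (pinyin)
session_timeout: 3.0 # session timeout (seconds)
shutup_keywords: "bi zui" # "shut up" command keywords (pinyin, comma-separated)
interrupt_command_queue_depth: 10 # queue depth (QoS) of the interrupt command subscription
sv_enabled: true # whether to enable voiceprint verification
sv_model_path: "~/hivecore_robot_os1/voice_model" # voiceprint model path
sv_threshold: 0.55 # voiceprint verification threshold (0.0-1.0; lower is more lenient, higher is stricter)
sv_speaker_db_path: "~/hivecore_robot_os1/config/speakers.json" # path where the voiceprint database is saved (JSON; relative to the ROS 2 package share directory)
sv_buffer_size: 240000 # recording buffer size for voiceprint verification, in samples (5 s at 48 kHz = 240000)
sv_registration_silence_threshold_ms: 500 # silence threshold while registering a voiceprint (milliseconds)
camera:
serial_number: "405622075404" # camera serial number (Intel RealSense D435)
rgb:
width: 640 # image width
height: 480 # image height
fps: 30 # frame rate (supported: 6, 10, 15, 30, 60)
format: "RGB8" # image format (RGB8, BGR8)
image:
jpeg_quality: 85 # JPEG compression quality (0-100; 85 balances quality and size)
max_size: "1280x720" # maximum size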
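
Several of the values in this file are linked by simple arithmetic, and the first commit message notes that `thread_queue_size` must be passed to ffmpeg as an input option. The sketch below works through both points. The helper names are hypothetical, and it assumes the TTS audio reaches ffmpeg as raw signed 16-bit PCM over pipes, which this diff does not confirm.

```python
def samples_for(duration_s: float, sample_rate: int) -> int:
    """Number of samples needed to hold duration_s seconds of mono audio."""
    return int(round(duration_s * sample_rate))


def chunk_duration_ms(chunk: int, sample_rate: int) -> float:
    """Duration of one capture chunk in milliseconds."""
    return 1000.0 * chunk / sample_rate


def build_ffmpeg_resample_args(src_rate, src_channels,
                               dst_rate, dst_channels, thread_queue_size):
    """Argument list that resamples raw PCM from stdin to stdout.

    Options placed before "-i" apply to the input, so -thread_queue_size
    and the source format come first; the output format follows "-i".
    """
    return [
        "ffmpeg", "-loglevel", "error",
        "-thread_queue_size", str(thread_queue_size),
        "-f", "s16le", "-ar", str(src_rate), "-ac", str(src_channels),
        "-i", "pipe:0",
        "-f", "s16le", "-ar", str(dst_rate), "-ac", str(dst_channels),
        "pipe:1",
    ]
```

With the values from this file, `samples_for(5, 48000)` reproduces `sv_buffer_size`, and a 1024-sample chunk at 48 kHz lasts a little over 21 ms.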

launch/voice.launch.py

@@ -0,0 +1,17 @@
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    """Launch the voice interaction node; all parameters are read from voice.yaml."""
    return LaunchDescription([
        Node(
            package='robot_speaker',
            executable='robot_speaker_node',
            name='robot_speaker_node',
            output='screen'
        ),
    ])


@@ -2,13 +2,22 @@
<?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
<package format="3">
<name>robot_speaker</name>
<version>0.0.0</version>
<description>TODO: Package description</description>
<version>0.0.1</version>
<description>ROS 2 package for speech recognition and synthesis</description>
<maintainer email="mzebra@foxmail.com">mzebra</maintainer>
<license>Apache-2.0</license>
<depend>rclpy</depend>
<depend>example_interfaces</depend>
<depend>std_msgs</depend>
<depend>ament_index_python</depend>
<depend>interfaces</depend>
<exec_depend>python3-pyaudio</exec_depend>
<exec_depend>python3-requests</exec_depend>
<exec_depend>python3-edge-tts</exec_depend>
<exec_depend>python3-webrtcvad</exec_depend>
<exec_depend>python3-yaml</exec_depend>
<exec_depend>python3-pypinyin</exec_depend>
<test_depend>ament_copyright</test_depend>
<test_depend>ament_flake8</test_depend>

requirements.txt

@@ -0,0 +1,17 @@
dashscope>=1.20.0
openai>=1.0.0
pyaudio>=0.2.11
webrtcvad>=2.0.10
pypinyin>=0.49.0
rclpy>=3.0.0
pyrealsense2>=2.54.0
Pillow>=10.0.0
numpy>=1.24.0
PyYAML>=6.0
aec-audio-processing
modelscope>=1.33.0
funasr>=1.0.0
datasets==3.6.0


@@ -0,0 +1,6 @@
# robot_speaker package


@@ -0,0 +1,2 @@
# Bridge package for connecting LLM outputs to brain execution.


@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""
Bridge LLM skill sequences to the cerebellum's ExecuteBtAction and forward feedback/results.
"""
import json
import os
import re

import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from std_msgs.msg import String
from ament_index_python.packages import get_package_share_directory
from interfaces.action import ExecuteBtAction


class SkillBridgeNode(Node):
    def __init__(self):
        super().__init__('skill_bridge_node')
        self._action_client = ActionClient(self, ExecuteBtAction, '/execute_bt_action')
        self._current_epoch = 1
        self._allowed_skills = self._load_allowed_skills()
        self.skill_seq_sub = self.create_subscription(
            String, '/llm_skill_sequence', self._on_skill_sequence_received, 10
        )
        self.feedback_pub = self.create_publisher(String, '/skill_execution_feedback', 10)
        self.result_pub = self.create_publisher(String, '/skill_execution_result', 10)
        self.get_logger().info('SkillBridgeNode started')

    def _on_skill_sequence_received(self, msg: String):
        raw = (msg.data or "").strip()
        if not raw:
            return
        if not self._allowed_skills:
            self.get_logger().warning("No skill whitelist loaded; reject all sequences")
            return
        sequence, invalid = self._extract_skill_sequence(raw)
        if invalid:
            self.get_logger().warning(f"Rejected sequence with invalid skills: {invalid}")
            return
        if not sequence:
            self.get_logger().warning(f"Invalid skill sequence: {raw}")
            return
        self._send_skill_sequence(sequence)

    def _load_allowed_skills(self) -> set[str]:
        try:
            brain_share = get_package_share_directory("brain")
            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
            if not os.path.exists(skill_path):
                return set()
            import yaml
            with open(skill_path, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f) or []
            return {str(entry["name"]) for entry in data if isinstance(entry, dict) and entry.get("name")}
        except Exception as e:
            self.get_logger().warning(f"Load skills failed: {e}")
            return set()

    def _extract_skill_sequence(self, text: str) -> tuple[str, list[str]]:
        # Accept CSV/space/semicolon and filter by CamelCase tokens
        tokens = re.split(r'[,\s;]+', text.strip())
        skills = [t for t in tokens if re.match(r'^[A-Z][A-Za-z0-9]*$', t)]
        if not skills:
            return "", []
        invalid = [s for s in skills if s not in self._allowed_skills]
        return ",".join(skills), invalid

    def _send_skill_sequence(self, skill_sequence: str):
        if not self._action_client.wait_for_server(timeout_sec=2.0):
            self.get_logger().error('ExecuteBtAction server unavailable')
            return
        goal = ExecuteBtAction.Goal()
        goal.epoch = self._current_epoch
        self._current_epoch += 1
        goal.action_name = skill_sequence
        goal.calls = []
        self.get_logger().info(f"Dispatch skill sequence: {skill_sequence}")
        send_future = self._action_client.send_goal_async(goal, feedback_callback=self._feedback_callback)
        # Blocks the subscription callback until the goal is answered (up to 5 s)
        rclpy.spin_until_future_complete(self, send_future, timeout_sec=5.0)
        if not send_future.done():
            self.get_logger().warning("Send goal timed out")
            return
        goal_handle = send_future.result()
        if not goal_handle or not goal_handle.accepted:
            self.get_logger().error("Goal rejected")
            return
        result_future = goal_handle.get_result_async()
        # No timeout here: waits until the action server returns a result
        rclpy.spin_until_future_complete(self, result_future)
        if result_future.done():
            self._handle_result(result_future.result())

    def _feedback_callback(self, feedback_msg):
        fb = feedback_msg.feedback
        payload = {
            "stage": fb.stage,
            "current_skill": fb.current_skill,
            "progress": float(fb.progress),
            "detail": fb.detail,
            "epoch": int(fb.epoch),
        }
        msg = String()
        msg.data = json.dumps(payload, ensure_ascii=True)
        self.feedback_pub.publish(msg)

    def _handle_result(self, result_wrapper):
        result = result_wrapper.result
        if not result:
            return
        payload = {
            "success": bool(result.success),
            "message": result.message,
            "total_skills": int(result.total_skills),
            "succeeded_skills": int(result.succeeded_skills),
        }
        msg = String()
        msg.data = json.dumps(payload, ensure_ascii=True)
        self.result_pub.publish(msg)


def main(args=None):
    rclpy.init(args=args)
    node = SkillBridgeNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
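
The whitelist check in `_extract_skill_sequence` can be tried without ROS. The sketch below lifts the same two regular expressions into a free function; `extract_skill_sequence` and the skill names in the usage note are hypothetical, since the contents of `robot_skills.yaml` are not part of this diff.

```python
import re

# Same token pattern as the node: a capital letter, then letters or digits.
SKILL_TOKEN = re.compile(r'^[A-Z][A-Za-z0-9]*$')


def extract_skill_sequence(text, allowed):
    """Return (comma-joined CamelCase tokens, tokens missing from the whitelist)."""
    tokens = re.split(r'[,\s;]+', text.strip())
    skills = [t for t in tokens if SKILL_TOKEN.match(t)]
    if not skills:
        return "", []
    invalid = [s for s in skills if s not in allowed]
    return ",".join(skills), invalid
```

With a whitelist of `{"MoveToBox", "GraspBox"}`, the input `"MoveToBox, GraspBox; ok"` yields `("MoveToBox,GraspBox", [])`: the lowercase token `ok` never reaches the whitelist check because the CamelCase filter drops it first. A capitalized token that is not whitelisted, such as `Fly`, is reported as invalid and the node rejects the whole sequence.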


@@ -0,0 +1,5 @@
"""Core modules."""


@@ -0,0 +1,10 @@
from enum import Enum


class ConversationState(Enum):
    """Conversation state machine."""
    IDLE = "idle"                # waiting for the user's wake word or voice
    CHECK_VOICE = "check_voice"  # user is speaking → check the voiceprint
    AUTHORIZED = "authorized"    # registered user
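
A transition table makes the three states easier to follow. In the sketch below only the first rule comes from this diff (the speech-start callback moves `IDLE` to `CHECK_VOICE`, or straight to `AUTHORIZED` when voiceprint verification is disabled). The remaining rules and all event names are assumptions added for illustration.

```python
from enum import Enum


class ConversationState(Enum):
    IDLE = "idle"
    CHECK_VOICE = "check_voice"
    AUTHORIZED = "authorized"


def next_state(state, event, sv_enabled=True):
    """Return the state that follows `state` when `event` occurs."""
    if state is ConversationState.IDLE and event == "speech_start":
        # From the diff: skip the voiceprint check when it is disabled.
        if sv_enabled:
            return ConversationState.CHECK_VOICE
        return ConversationState.AUTHORIZED
    # Everything below is assumed, not taken from the diff.
    if state is ConversationState.CHECK_VOICE and event == "sv_verified":
        return ConversationState.AUTHORIZED
    if state is ConversationState.CHECK_VOICE and event == "sv_rejected":
        return ConversationState.IDLE
    if state is ConversationState.AUTHORIZED and event == "session_timeout":
        return ConversationState.IDLE
    return state  # unknown events leave the state unchanged
```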


@@ -0,0 +1,158 @@
from dataclasses import dataclass
from typing import Optional
import os

import yaml
from ament_index_python.packages import get_package_share_directory
from pypinyin import pinyin, Style


@dataclass
class IntentResult:
    intent: str  # "skill_sequence" | "kb_qa" | "chat_text" | "chat_camera"
    text: str
    need_camera: bool
    camera_mode: Optional[str]  # "head" | "left_hand" | "right_hand" | None
    system_prompt: Optional[str]


class IntentRouter:
    def __init__(self):
        self.camera_capture_keywords = [
            "pai zhao", "pai ge zhao", "pai zhang zhao"
        ]
        self.skill_keywords = [
            "ban xiang zi"
        ]
        self.kb_keywords = [
            "ni shi shei", "ni de ming zi"
        ]
        self._cached_skill_names: list[str] | None = None

    def _load_brain_skill_names(self) -> list[str]:
        if self._cached_skill_names is not None:
            return self._cached_skill_names
        skill_names: list[str] = []
        try:
            brain_share = get_package_share_directory("brain")
            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
            with open(skill_path, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f) or []
            for entry in data:
                if isinstance(entry, dict) and entry.get("name"):
                    skill_names.append(str(entry["name"]))
        except Exception:
            skill_names = []
        self._cached_skill_names = skill_names
        return skill_names

    def to_pinyin(self, text: str) -> str:
        chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
        if not chars:
            return ""
        py_list = pinyin(''.join(chars), style=Style.NORMAL)
        return ' '.join([item[0] for item in py_list]).lower().strip()

    def is_skill_sequence_intent(self, text: str) -> bool:
        text_pinyin = self.to_pinyin(text)
        return any(k in text_pinyin for k in self.skill_keywords)

    def check_camera_command(self, text: str) -> tuple[bool, Optional[str]]:
        if not text:
            return False, None
        text_pinyin = self.to_pinyin(text)
        for keyword in self.camera_capture_keywords:
            if keyword in text_pinyin:
                return True, self.detect_camera_mode(text)
        return False, None

    def detect_camera_mode(self, text: str) -> str:
        text_pinyin = self.to_pinyin(text)
        left_keys = ["zuo shou", "zuo bi", "zuo bian"]
        right_keys = ["you shou", "you bi", "you bian"]
        head_keys = ["tou", "nao dai"]
        for kw in left_keys:
            if kw in text_pinyin:
                return "left_hand"
        for kw in right_keys:
            if kw in text_pinyin:
                return "right_hand"
        for kw in head_keys:
            if kw in text_pinyin:
                return "head"
        return "head"

    def build_skill_prompt(self) -> str:
        skills = self._load_brain_skill_names()
        skills_text = ", ".join(skills) if skills else ""
        skill_guard = (
            "【技能限制】只能使用以下技能名称:" + skills_text
            if skills_text
            else "【技能限制】技能列表不可用,请不要输出任何技能名称。"
        )
        return (
            "你是机器人任务规划器。\n"
            "本任务必须拍照。请根据用户请求选择使用哪个相机拍照(默认头部相机),并结合当前环境信息生成简洁、可执行的技能序列。\n"
            "【重要】如果对话历史中包含【执行结果】或【执行状态】,请参考上一轮技能序列的执行情况,根据成功/失败信息调整本次技能序列。\n"
            "【输出格式要求】只输出逗号分隔的技能名称,不要任何解释说明。\n"
            + skill_guard
        )

    def build_chat_prompt(self, need_camera: bool) -> str:
        if need_camera:
            return (
                "你是一个智能语音助手。\n"
                "请结合图片内容简短回答。"
            )
        return (
            "你是一个智能语音助手。\n"
            "请自然、简短地与用户对话。"
        )

    def build_kb_prompt(self) -> str:
        return (
            "你是蜂核科技的员工。\n"
            "请基于知识库信息回答用户问题,回答要准确简洁。"
        )

    def build_default_system_prompt(self) -> str:
        return (
            "你是一个智能语音助手。\n"
            "- 当用户发送图片时,请仔细观察图片内容,结合用户的问题或描述,提供简短、专业的回答。\n"
            "- 当用户没有发送图片时,请自然、友好地与用户对话。\n"
            "请根据对话模式调整你的回答风格。"
        )

    def route(self, text: str) -> IntentResult:
        need_camera, camera_mode = self.check_camera_command(text)
        text_pinyin = self.to_pinyin(text)
        if self.is_skill_sequence_intent(text):
            if camera_mode is None:
                camera_mode = "head"
            return IntentResult(
                intent="skill_sequence",
                text=text,
                need_camera=True,
                camera_mode=camera_mode,
                system_prompt=self.build_skill_prompt()
            )
        if any(k in text_pinyin for k in self.kb_keywords):
            return IntentResult(
                intent="kb_qa",
                text=text,
                need_camera=False,
                camera_mode=None,
                system_prompt=self.build_kb_prompt()
            )
        return IntentResult(
            intent="chat_camera" if need_camera else "chat_text",
            text=text,
            need_camera=need_camera,
            camera_mode=camera_mode,
            system_prompt=self.build_chat_prompt(need_camera)
        )
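
The priority order in `route` (skill sequence first, then knowledge base, then camera or plain chat) can be checked on pinyin strings directly. The sketch below copies the keyword lists from `IntentRouter` and skips the `pypinyin` conversion, so the input is assumed to be space-separated pinyin already; `route_pinyin` is a hypothetical helper.

```python
CAMERA_KEYWORDS = ("pai zhao", "pai ge zhao", "pai zhang zhao")
SKILL_KEYWORDS = ("ban xiang zi",)
KB_KEYWORDS = ("ni shi shei", "ni de ming zi")


def route_pinyin(text_pinyin: str) -> str:
    """Return the intent name for an utterance given as space-separated pinyin."""
    if any(k in text_pinyin for k in SKILL_KEYWORDS):
        return "skill_sequence"   # checked first, so it wins over the rest
    if any(k in text_pinyin for k in KB_KEYWORDS):
        return "kb_qa"            # knowledge-base questions never use the camera
    if any(k in text_pinyin for k in CAMERA_KEYWORDS):
        return "chat_camera"
    return "chat_text"
```

Because matching is by substring, an utterance that contains both a camera keyword and a knowledge-base keyword is routed to `kb_qa` and the camera request is dropped, which matches the order of the checks in `route`.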


@@ -0,0 +1,246 @@
import threading
import numpy as np
from robot_speaker.core.conversation_state import ConversationState
from robot_speaker.perception.speaker_verifier import SpeakerState
class NodeCallbacks:
# ==================== 初始化与内部工具 ====================
def __init__(self, node):
self.node = node
def _mark_utterance_processed(self) -> bool:
node = self.node
with node.utterance_lock:
if node.current_utterance_id == node.last_processed_utterance_id:
return False
node.last_processed_utterance_id = node.current_utterance_id
return True
def _trigger_sv_for_check_voice(self, source: str):
node = self.node
if not (node.sv_enabled and node.sv_client):
return
if not self._mark_utterance_processed():
return
if node._handle_empty_speaker_db():
node.get_logger().info(f"[声纹] CHECK_VOICE状态数据库为空跳过声纹验证来源: {source}")
return
if not node.sv_speech_end_event.is_set():
with node.sv_lock:
node.sv_recording = False
buffer_size = len(node.sv_audio_buffer)
node.get_logger().info(f"[声纹] {source}触发验证,缓冲区大小: {buffer_size} 样本({buffer_size/node.sample_rate:.2f}秒)")
if buffer_size > 0:
node.sv_speech_end_event.set()
else:
node.get_logger().debug(f"[声纹] 声纹验证已触发,跳过(来源: {source}")
# ==================== 业务逻辑代理 ====================
def handle_interrupt_command(self, msg):
return self.node._handle_interrupt_command(msg)
def check_interrupt_and_cancel_turn(self) -> bool:
return self.node._check_interrupt_and_cancel_turn()
def handle_wake_word(self, text: str) -> str:
return self.node._handle_wake_word(text)
def check_shutup_command(self, text: str) -> bool:
return self.node._check_shutup_command(text)
def check_camera_command(self, text: str):
return self.node.intent_router.check_camera_command(text)
def llm_process_stream_with_camera(self, user_text: str, need_camera: bool) -> str:
return self.node._llm_process_stream_with_camera(user_text, need_camera)
def put_tts_text(self, text: str):
return self.node._put_tts_text(text)
def force_stop_tts(self):
return self.node._force_stop_tts()
def drain_queue(self, q):
return self.node._drain_queue(q)
# ==================== 录音/VAD回调 ====================
def get_silence_threshold(self) -> int:
"""获取动态静音阈值(毫秒)"""
node = self.node
return node.silence_duration_ms
def should_put_audio_to_queue(self) -> bool:
"""
检查是否应该将音频放入队列用于ASR,根据状态机决定是否允许ASR
"""
node = self.node
state = node._get_state()
if state in [ConversationState.IDLE, ConversationState.CHECK_VOICE,
ConversationState.AUTHORIZED]:
return True
return False
def on_speech_start(self):
"""Called by the recording thread when speech onset is detected."""
node = self.node
node.get_logger().info("[Recorder] Speech detected, start recording")
with node.utterance_lock:
node.current_utterance_id += 1
state = node._get_state()
if state == ConversationState.IDLE:
# IDLE -> CHECK_VOICE
if node.sv_enabled and node.sv_client:
# Start buffering audio for speaker verification
with node.sv_lock:
node.sv_recording = True
node.sv_audio_buffer.clear()
node.get_logger().debug("[SV] Start recording for speaker verification")
node._change_state(ConversationState.CHECK_VOICE, "speech detected, checking voiceprint")
else:
node._change_state(ConversationState.AUTHORIZED, "speaker verification disabled, authorizing directly")
elif state == ConversationState.CHECK_VOICE:
# CHECK_VOICE: keep recording for speaker verification (the buffer is reset for the new segment)
if node.sv_enabled:
with node.sv_lock:
node.sv_recording = True
node.sv_audio_buffer.clear()
node.get_logger().debug("[SV] Continue recording for speaker verification")
elif state == ConversationState.AUTHORIZED:
# AUTHORIZED: record for speaker verification to re-verify the current user
if node.sv_enabled:
with node.sv_lock:
node.sv_recording = True
node.sv_audio_buffer.clear()
node.get_logger().debug("[SV] Start recording for speaker verification")
def on_audio_chunk_for_sv(self, audio_chunk: bytes):
"""Audio-chunk callback from the recording thread: append to the speaker-verification buffer only when needed."""
node = self.node
state = node._get_state()
# Record for speaker verification (CHECK_VOICE and AUTHORIZED states)
if node.sv_enabled and node.sv_recording:
try:
audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
with node.sv_lock:
node.sv_audio_buffer.extend(audio_array)
except Exception as e:
node.get_logger().debug(f"[SV] Recording failed: {e}")
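def on_speech_end(self):
"""Called by the recording thread when the end of speech is detected (after a period of silence)."""
node = self.node
node.get_logger().info("[Recorder] End of speech detected")
state = node._get_state()
node.get_logger().info(f"[Recorder] State at end of speech: {state}")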
if state == ConversationState.CHECK_VOICE:
if node.asr_client and node.asr_client.running:
node.asr_client.stop_current_recognition()
self._trigger_sv_for_check_voice("VAD")
return
elif state == ConversationState.AUTHORIZED:
if node.asr_client and node.asr_client.running:
node.asr_client.stop_current_recognition()
if node.sv_enabled:
with node.sv_lock:
node.sv_recording = False
buffer_size = len(node.sv_audio_buffer)
node.get_logger().debug(f"[SV] Stopped recording, buffer size: {buffer_size}")
node.sv_speech_end_event.set()
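# If TTS is playing, wait asynchronously for the speaker-verification result and interrupt TTS only if it passes.
# A separate thread is used so the recording thread is not blocked and TTS playback is not disturbed.
if node.tts_playing_event.is_set():
node.get_logger().info("[Interrupt] User finished speaking during TTS playback, waiting asynchronously for speaker verification...")
def _check_sv_and_interrupt():
# Wait up to 2 seconds for the speaker-verification result
with node.sv_result_cv:
current_seq = node.sv_result_seq
if node.sv_result_cv.wait_for(
lambda: node.sv_result_seq > current_seq,
timeout=2.0
):
# Verification finished, inspect the result
with node.sv_lock:
speaker_id = node.current_speaker_id
speaker_state = node.current_speaker_state
if speaker_id and speaker_state == SpeakerState.VERIFIED:
node.get_logger().info(f"[Interrupt] Speaker verified ({speaker_id}), interrupting TTS playback")
node._interrupt_tts("speech detected (authorized user, utterance ended)")
else:
node.get_logger().debug(f"[Interrupt] Speaker not verified, TTS not interrupted, state: {speaker_state.value}")
else:
node.get_logger().warning("[Interrupt] Speaker verification timed out, TTS not interrupted")
# Wait in a separate thread so the recording thread is not blocked
threading.Thread(target=_check_sv_and_interrupt, daemon=True, name="SVInterruptCheck").start()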
return
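def on_new_segment(self):
"""Called by the recording thread when a new speech segment is detected; audio is recorded for speaker verification and TTS is not interrupted immediately."""
node = self.node
state = node._get_state()
if state == ConversationState.AUTHORIZED:
# When speech is detected during TTS playback, do not interrupt immediately; record for speaker verification instead.
# Wait until the user stops speaking (speech_end) and interrupt TTS only if verification passes.
# This avoids false triggers from TTS echo while still allowing genuine user barge-in.
if node.tts_playing_event.is_set():
node.get_logger().debug("[Interrupt] Speech detected during TTS playback, recording for speaker verification and verifying after the utterance ends")
# Recording was already started in on_speech_start, nothing else to do here
else:
# TTS not playing: check the latest speaker-verification result and interrupt immediately if verified
if node.sv_enabled and node.sv_client:
with node.sv_lock:
current_speaker_id = node.current_speaker_id
speaker_state = node.current_speaker_state
if speaker_state == SpeakerState.VERIFIED and current_speaker_id:
node._interrupt_tts("speech detected (authorized user)")
node.get_logger().info(f"[Interrupt] Authorized user ({current_speaker_id}) is speaking, interrupting TTS playback")
else:
node.get_logger().debug(f"[Interrupt] Speech detected but speaker not verified or not matched, TTS not interrupted, state: {speaker_state.value}")
else:
# Speaker verification disabled: interrupt directly (original behavior)
node._interrupt_tts("speech detected (speaker verification disabled)")
node.get_logger().info("[Interrupt] Speech detected, interrupting TTS playback")
else:
node.get_logger().debug(f"[Interrupt] Speech detected, but the current state is {state.value}; only an authorized user may interrupt TTS")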
def on_heartbeat(self):
"""Silence heartbeat callback from the recording thread."""
self.node.get_logger().info("[Recorder] Silence")
# ==================== ASR callbacks ====================
def on_asr_sentence_end(self, text: str):
"""ASR sentence_end callback: push the recognized text onto the text queue."""
node = self.node
if not text or not text.strip():
return
text_clean = text.strip()
node.get_logger().info(f"[ASR] Recognition finished: {text_clean}")
state = node._get_state()
# Rule 2: in CHECK_VOICE, if ASR finishes before VAD fires speech_end, trigger speaker verification proactively
if state == ConversationState.CHECK_VOICE:
if node.sv_enabled and node.sv_client:
node.get_logger().info("[ASR] CHECK_VOICE: recognition finished, triggering speaker verification")
self._trigger_sv_for_check_voice("ASR")
# Then enqueue the text for the main thread
node.text_queue.put(text_clean, timeout=1.0)
def on_asr_text_update(self, text: str):
"""ASR partial-result callback, used for multi-turn prompts."""
if not text or not text.strip():
return
self.node.get_logger().debug(f"[ASR] 识别中: {text.strip()}")
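
The interrupt check in `on_speech_end` waits on a condition variable until a result sequence number advances. The following is a minimal, self-contained sketch of that pattern, not the package's implementation: the class `ResultGate` and its method names are hypothetical. It also snapshots the sequence number before the work is triggered, which is an assumption about how one might avoid missing a result that is published before the waiter starts waiting.

```python
import threading


class ResultGate:
    """Hypothetical helper: publish results under an increasing sequence number."""

    def __init__(self):
        self._cv = threading.Condition()
        self._seq = 0
        self._result = None

    def snapshot(self) -> int:
        """Return the current sequence number; call this BEFORE triggering the work."""
        with self._cv:
            return self._seq

    def publish(self, result) -> None:
        """Store a result, bump the sequence number, and wake every waiter."""
        with self._cv:
            self._result = result
            self._seq += 1
            self._cv.notify_all()

    def wait_newer(self, seen_seq: int, timeout: float):
        """Block until a result newer than seen_seq exists; return (ok, result)."""
        with self._cv:
            ok = self._cv.wait_for(lambda: self._seq > seen_seq, timeout=timeout)
            return ok, (self._result if ok else None)


def demo() -> tuple:
    gate = ResultGate()
    seen = gate.snapshot()  # taken before the worker can publish
    worker = threading.Thread(target=gate.publish, args=("VERIFIED",), daemon=True)
    worker.start()
    worker.join()  # the result is already published when the wait begins
    return gate.wait_newer(seen, timeout=2.0)
```

Because the snapshot is taken before the worker runs, the wait succeeds even though the result was published first. If the snapshot were taken after publishing, the same call would block until the timeout.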


@@ -0,0 +1,188 @@
import queue
import time
import numpy as np
from robot_speaker.core.conversation_state import ConversationState
from robot_speaker.perception.speaker_verifier import SpeakerState
class NodeWorkers:
def __init__(self, node):
self.node = node
def recording_worker(self):
"""Thread 1: recording thread, the only real-time thread."""
node = self.node
node.get_logger().info("[Recorder] Started")
node.audio_recorder.record_with_vad()
def asr_worker(self):
"""Thread 2: ASR inference thread, audio -> text only."""
node = self.node
node.get_logger().info("[ASR worker] Started")
while not node.stop_event.is_set():
try:
audio_chunk = node.audio_queue.get(timeout=0.1)
except queue.Empty:
continue
if node.interrupt_event.is_set():
continue
if node.callbacks.should_put_audio_to_queue() and node.asr_client and node.asr_client.running:
node.asr_client.send_audio(audio_chunk)
def process_worker(self):
"""Thread 3: main thread, handles business logic."""
node = self.node
node.get_logger().info("[Main] Started")
while not node.stop_event.is_set():
try:
text = node.text_queue.get(timeout=0.1)
except queue.Empty:
continue
node.get_logger().info(f"[Main] Received recognized text: {text}")
current_state = node._get_state()
if current_state == ConversationState.CHECK_VOICE:
if node.use_wake_word:
node.get_logger().info(f"[Main] CHECK_VOICE: checking wake word, text: {text}")
processed_text = node.callbacks.handle_wake_word(text)
if not processed_text:
node.get_logger().info(f"[Main] Wake word not detected (configured wake word: '{node.wake_word}'), returning to IDLE")
node._change_state(ConversationState.IDLE, "wake word not detected")
continue
node.get_logger().info(f"[Main] Wake word detected, processed text: {processed_text}")
text = processed_text
if node.sv_enabled and node.sv_client:
node.get_logger().info("[Main] CHECK_VOICE: waiting for speaker-verification result...")
with node.sv_result_cv:
current_seq = node.sv_result_seq
if not node.sv_result_cv.wait_for(
lambda: node.sv_result_seq > current_seq,
timeout=15.0
):
node.get_logger().warning("[Main] CHECK_VOICE: speaker-verification result not ready after 15 s, rejecting this turn")
with node.sv_lock:
node.sv_audio_buffer.clear()
node._change_state(ConversationState.IDLE, "speaker-verification result not ready")
continue
node.get_logger().info("[Main] CHECK_VOICE: speaker-verification result ready, continuing")
with node.sv_lock:
speaker_id = node.current_speaker_id
speaker_state = node.current_speaker_state
score = node.current_speaker_score
if speaker_id and speaker_state == SpeakerState.VERIFIED:
node.get_logger().info(f"[Main] Speaker verified: {speaker_id}, score: {score:.4f}")
node._change_state(ConversationState.AUTHORIZED, "speaker verified")
else:
node.get_logger().info(f"[Main] Speaker verification failed, score: {score:.4f}")
node.callbacks.put_tts_text("声纹验证失败")
node._change_state(ConversationState.IDLE, "speaker verification failed")
continue
else:
node._change_state(ConversationState.AUTHORIZED, "speaker verification disabled")
elif current_state == ConversationState.AUTHORIZED:
if node.tts_playing_event.is_set():
node.get_logger().debug("[Main] AUTHORIZED: TTS is playing, ignoring ASR result (only VAD-detected speech from an authorized user may interrupt)")
continue
elif current_state == ConversationState.IDLE:
node.get_logger().warning("[Main] Received text in IDLE state, ignoring")
continue
if node.use_wake_word and current_state == ConversationState.AUTHORIZED:
processed_text = node.callbacks.handle_wake_word(text)
if not processed_text:
node._change_state(ConversationState.IDLE, "wake word not detected")
continue
text = processed_text
if node.callbacks.check_shutup_command(text):
node.get_logger().info("[Main] Shut-up command detected")
node.interrupt_event.set()
node.callbacks.force_stop_tts()
node._change_state(ConversationState.IDLE, "user shut-up command")
continue
intent_payload = node.intent_router.route(text)
node._handle_intent(intent_payload)
if current_state == ConversationState.AUTHORIZED:
node.session_start_time = time.time()
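def sv_worker(self):
"""Thread 5: speaker-verification thread, non-real-time and low-frequency (CAM++)."""
node = self.node
node.get_logger().info("[SV worker] Started")
# Compute the minimum sample count so that at least 0.5 s of audio remains after downsampling to 16 kHz
target_sr = 16000 # target sample rate of the CAM++ model
min_duration_seconds = 0.5
min_samples_at_target_sr = int(target_sr * min_duration_seconds) # 8000 samples at 16 kHz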
if node.sample_rate >= target_sr:
downsample_step = int(node.sample_rate / target_sr)
min_audio_samples = min_samples_at_target_sr * downsample_step
else:
min_audio_samples = int(node.sample_rate * min_duration_seconds)
while not node.stop_event.is_set():
try:
if node.sv_speech_end_event.wait(timeout=0.1):
node.sv_speech_end_event.clear()
with node.sv_lock:
audio_list = list(node.sv_audio_buffer)
buffer_size = len(audio_list)
node.sv_audio_buffer.clear()
node.get_logger().info(f"[SV] Received speech_end event, recording length: {buffer_size} samples ({buffer_size/node.sample_rate:.2f} s)")
if node._handle_empty_speaker_db():
node.get_logger().info("[SV] Speaker database is empty, skipping verification and setting state to UNKNOWN")
continue
if buffer_size >= min_audio_samples:
audio_array = np.array(audio_list, dtype=np.int16)
embedding, success = node.sv_client.extract_embedding(
audio_array,
sample_rate=node.sample_rate
)
if not success or embedding is None:
node.get_logger().debug("[SV] Failed to extract embedding")
with node.sv_lock:
node.current_speaker_id = None
node.current_speaker_state = SpeakerState.ERROR
node.current_speaker_score = 0.0
else:
speaker_id, match_state, score, _ = node.sv_client.match_speaker(embedding)
with node.sv_lock:
node.current_speaker_id = speaker_id
node.current_speaker_state = match_state
node.current_speaker_score = score
if match_state == SpeakerState.VERIFIED:
node.get_logger().info(f"[SV] Identified speaker: {speaker_id}, similarity: {score:.4f}")
elif match_state == SpeakerState.REJECTED:
node.get_logger().info(f"[SV] No known speaker matched (similarity too low), similarity: {score:.4f}")
else:
node.get_logger().info(f"[SV] State: {match_state.value}, similarity: {score:.4f}")
else:
node.get_logger().debug(f"[SV] Recording too short: {buffer_size} < {min_audio_samples}, skipping")
with node.sv_lock:
node.current_speaker_id = None
node.current_speaker_state = SpeakerState.UNKNOWN
node.current_speaker_score = 0.0
with node.sv_result_cv:
node.sv_result_seq += 1
node.sv_result_cv.notify_all()
except Exception as e:
node.get_logger().error(f"[SV worker] Error: {e}")
time.sleep(0.1)
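
The minimum-length check at the top of `sv_worker` can be isolated as a small function. This is a sketch under stated assumptions: the function name `min_audio_samples` and its keyword defaults are illustrative, and it assumes, as the worker's arithmetic suggests, that downsampling uses an integer stride.

```python
def min_audio_samples(sample_rate: int, target_sr: int = 16000,
                      min_duration_s: float = 0.5) -> int:
    """Return how many captured samples are needed for min_duration_s at target_sr."""
    min_at_target = int(target_sr * min_duration_s)
    if sample_rate >= target_sr:
        # Integer stride, mirroring int(node.sample_rate / target_sr) in the worker
        step = int(sample_rate / target_sr)
        return min_at_target * step
    # Capture rate below the target: fall back to the duration at the capture rate
    return int(sample_rate * min_duration_s)
```

With an integer stride, capture rates that are not a multiple of 16 kHz yield a threshold shorter than 0.5 s of real audio. At 44100 Hz, for example, the threshold is 16000 samples, which is about 0.36 s of captured audio.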


@@ -0,0 +1,463 @@
"""
Standalone speaker-enrollment node: exits once enrollment completes.
"""
import collections
import os
import queue
import threading
import time
import yaml
import numpy as np
import rclpy
from rclpy.node import Node
from ament_index_python.packages import get_package_share_directory
from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient
from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
from robot_speaker.models.asr.dashscope import DashScopeASR
from robot_speaker.models.tts.dashscope import DashScopeTTSClient
from robot_speaker.core.types import TTSRequest
from pypinyin import pinyin, Style
class RegisterSpeakerNode(Node):
def __init__(self):
super().__init__('register_speaker_node')
self._load_config()
self.stop_event = threading.Event()
self.processing = False
self.buffer_lock = threading.Lock()
self.audio_buffer = collections.deque(maxlen=self.sv_buffer_size)
# State: waiting for wake word -> waiting for voiceprint speech
self.waiting_for_wake_word = True
self.waiting_for_voiceprint = False
# Audio queue and text queue for ASR
self.audio_queue = queue.Queue()
self.text_queue = queue.Queue()
self.vad_detector = VADDetector(
mode=self.vad_mode,
sample_rate=self.sample_rate
)
# Reference-signal buffer (for echo cancellation)
self.reference_signal_buffer = ReferenceSignalBuffer(
max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
sample_rate=self.sample_rate,
channels=self.output_channels
) if self.audio_echo_cancellation_enabled else None
self.audio_recorder = AudioRecorder(
device_index=self.input_device_index,
sample_rate=self.sample_rate,
channels=self.channels,
chunk=self.chunk,
vad_detector=self.vad_detector,
audio_queue=self.audio_queue, # fed to ASR for wake-word detection
silence_duration_ms=self.silence_duration_ms,
min_energy_threshold=self.min_energy_threshold,
heartbeat_interval=self.audio_microphone_heartbeat_interval,
on_heartbeat=self._on_heartbeat,
is_playing=lambda: False,
on_new_segment=None,
on_speech_start=self._on_speech_start,
on_speech_end=self._on_speech_end,
stop_flag=self.stop_event.is_set,
on_audio_chunk=self._on_audio_chunk,
should_put_to_queue=self._should_put_to_queue,
get_silence_threshold=lambda: self.silence_duration_ms,
enable_echo_cancellation=self.audio_echo_cancellation_enabled, # AEC setting from config, kept consistent with the main node
reference_signal_buffer=self.reference_signal_buffer,
logger=self.get_logger()
)
# ASR client, used for wake-word detection
self.asr_client = DashScopeASR(
api_key=self.dashscope_api_key,
sample_rate=self.sample_rate,
model=self.asr_model,
url=self.asr_url,
logger=self.get_logger()
)
self.asr_client.on_sentence_end = self._on_asr_sentence_end
self.asr_client.start()
# ASR worker thread
self.asr_thread = threading.Thread(
target=self._asr_worker,
name="RegisterASRThread",
daemon=True
)
self.asr_thread.start()
# Text-processing thread
self.text_thread = threading.Thread(
target=self._text_worker,
name="RegisterTextThread",
daemon=True
)
self.text_thread.start()
self.sv_client = SpeakerVerificationClient(
model_path=self.sv_model_path,
threshold=self.sv_threshold,
speaker_db_path=self.sv_speaker_db_path,
logger=self.get_logger()
)
self.tts_client = DashScopeTTSClient(
api_key=self.dashscope_api_key,
model=self.tts_model,
voice=self.tts_voice,
card_index=self.output_card_index,
device_index=self.output_device_index,
output_sample_rate=self.output_sample_rate,
output_channels=self.output_channels,
output_volume=self.output_volume,
tts_source_sample_rate=self.audio_tts_source_sample_rate,
tts_source_channels=self.audio_tts_source_channels,
tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
reference_signal_buffer=self.reference_signal_buffer,
logger=self.get_logger()
)
self.get_logger().info("Speaker-enrollment node started; say 'er gou...' to begin enrollment")
self.recording_thread = threading.Thread(
target=self.audio_recorder.record_with_vad,
name="RegisterRecordingThread",
daemon=True
)
self.recording_thread.start()
self.timer = self.create_timer(0.2, self._check_done)
def _load_config(self):
config_file = os.path.join(
get_package_share_directory('robot_speaker'),
'config',
'voice.yaml'
)
with open(config_file, 'r') as f:
config = yaml.safe_load(f)
dashscope = config['dashscope']
audio = config['audio']
mic = audio['microphone']
soundcard = audio['soundcard']
vad = config['vad']
system = config['system']
self.dashscope_api_key = dashscope['api_key']
self.asr_model = dashscope['asr']['model']
self.asr_url = dashscope['asr']['url']
self.tts_model = dashscope['tts']['model']
self.tts_voice = dashscope['tts']['voice']
self.input_device_index = mic['device_index']
self.sample_rate = mic['sample_rate']
self.channels = mic['channels']
self.chunk = mic['chunk']
self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']
self.output_card_index = soundcard['card_index']
self.output_device_index = soundcard['device_index']
self.output_sample_rate = soundcard['sample_rate']
self.output_channels = soundcard['channels']
self.output_volume = soundcard['volume']
echo = audio.get('echo_cancellation', {})
self.audio_echo_cancellation_enabled = echo.get('enabled', True) # enabled by default
self.audio_echo_cancellation_max_duration_ms = echo.get('max_duration_ms', 200)
tts_audio = audio.get('tts', {})
self.audio_tts_source_sample_rate = tts_audio.get('source_sample_rate', 22050)
self.audio_tts_source_channels = tts_audio.get('source_channels', 1)
self.audio_tts_ffmpeg_thread_queue_size = tts_audio.get('ffmpeg_thread_queue_size', 5)
self.vad_mode = vad['vad_mode']
self.silence_duration_ms = vad['silence_duration_ms']
self.min_energy_threshold = vad['min_energy_threshold']
self.sv_model_path = os.path.expanduser(system['sv_model_path'])
self.sv_threshold = system['sv_threshold']
self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path'])
self.sv_buffer_size = system['sv_buffer_size']
self.wake_word = system['wake_word']
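def _should_put_to_queue(self) -> bool:
"""Return whether audio should be queued for ASR (only while waiting for the wake word)."""
return self.waiting_for_wake_word
def _on_heartbeat(self):
if self.waiting_for_wake_word:
self.get_logger().info("[Enroll] Waiting for wake word 'er gou'...")
elif self.waiting_for_voiceprint:
self.get_logger().info("[Enroll] Waiting for voiceprint speech...")
def _on_speech_start(self):
if self.waiting_for_wake_word:
# While waiting for the wake word, start recording (the audio may contain the wake word)
self.get_logger().info("[Enroll] Speech detected, start recording")
elif self.waiting_for_voiceprint:
self.get_logger().info("[Enroll] Speech detected, continue recording (for enrollment)")
# Note: the buffer is not cleared, so audio containing the wake word is kept
def _on_audio_chunk(self, audio_chunk: bytes):
# Record all audio (including the wake word) for enrollment
try:
audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
with self.buffer_lock:
self.audio_buffer.extend(audio_array)
except Exception as e:
self.get_logger().debug(f"[Enroll] Recording failed: {e}")
def _on_speech_end(self):
# Still waiting for the wake word: nothing to do
if self.waiting_for_wake_word:
return
# Already processing: do not process again
if self.processing:
return
# Waiting for voiceprint speech and the user has stopped: use the current audio even if shorter than 3 s
if self.waiting_for_voiceprint:
self._process_voiceprint_audio(use_current_audio_if_short=True)
return # return right after processing to avoid a repeated call
def _process_voiceprint_audio(self, use_current_audio_if_short: bool = False):
"""Process voiceprint audio: enroll using the user's complete first utterance.
Args:
use_current_audio_if_short: if the audio is shorter than 3 s, whether to use it anyway (for when the user has already finished speaking)
"""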
if self.processing:
return
self.processing = True
with self.buffer_lock:
audio_list = list(self.audio_buffer)
buffer_size = len(audio_list)
buffer_sec = buffer_size / self.sample_rate
self.get_logger().info(f"[Enroll] Current audio length: {buffer_sec:.2f} s")
required_samples = int(self.sample_rate * 3)
# Audio shorter than 3 s
if buffer_size < required_samples:
if use_current_audio_if_short:
# The user has finished speaking: use the current audio even though it is shorter than 3 s
self.get_logger().info(f"[Enroll] Audio shorter than 3 s ({buffer_sec:.2f} s), but the user has finished; enrolling with the current audio")
audio_to_use = audio_list
else:
# Wait for more audio
self.get_logger().info(f"[Enroll] Audio shorter than 3 s ({buffer_sec:.2f} s), waiting for more audio...")
self.processing = False
return
else:
# Wake-word detection lags, so "er gou" may sit in the middle or latter part of the buffer.
# Keep the most recent 3.0 s (or everything when the buffer is shorter),
# which is intended to cover the wake word "二狗" while bounding the clip length.
target_samples = int(self.sample_rate * 3.0)
if buffer_size > target_samples:
audio_to_use = audio_list[-target_samples:]
else:
audio_to_use = audio_list
duration = len(audio_to_use) / self.sample_rate
self.get_logger().info(f"[Enroll] Using the most recent {duration:.2f} s of audio for enrollment (covering the wake word)")
# Clear the buffer
with self.buffer_lock:
self.audio_buffer.clear()
try:
audio_array = np.array(audio_to_use, dtype=np.int16)
embedding, success = self.sv_client.extract_embedding(
audio_array,
sample_rate=self.sample_rate
)
if not success or embedding is None:
self.get_logger().error("[Enroll] Failed to extract embedding")
self.processing = False
return
speaker_id = f"user_{int(time.time())}"
if self.sv_client.register_speaker(speaker_id, embedding):
self.get_logger().info(f"[Enroll] Enrollment succeeded, user ID: {speaker_id}, preparing to exit")
# Play the success prompt
try:
self.get_logger().info("[Enroll] Playing enrollment-success prompt")
request = TTSRequest(text="声纹注册成功", voice=self.tts_voice)
self.tts_client.synthesize(request)
time.sleep(5)
except Exception as e:
self.get_logger().error(f"[Enroll] Failed to play prompt: {e}")
self.stop_event.set()
else:
self.get_logger().error("[Enroll] Enrollment failed")
self.processing = False
except Exception as e:
self.get_logger().error(f"[Enroll] Enrollment error: {e}")
self.processing = False
def _extract_speech_segments(self, audio_array: np.ndarray, frame_size: int = 1024) -> list:
"""Extract speech frames with an energy detector (filters out silence)."""
speech_segments = []
frame_samples = frame_size
total_frames = 0
speech_frames = 0
# Use half of the configured energy threshold so quiet speech is not mistaken for silence
threshold = self.min_energy_threshold * 0.50
for i in range(0, len(audio_array), frame_samples):
frame = audio_array[i:i + frame_samples]
if len(frame) < frame_samples:
break
total_frames += 1
# Frame energy as RMS (int16 audio, computed in float32 to avoid overflow)
frame_float = frame.astype(np.float32)
energy = np.sqrt(np.mean(frame_float ** 2))
# Energy at or above the threshold counts as speech
if energy >= threshold:
speech_segments.append((i, i + frame_samples))
speech_frames += 1
# Debug information (logs the effective threshold, not the configured one)
if total_frames > 0:
speech_ratio = speech_frames / total_frames
self.get_logger().debug(f"[Enroll] Energy detection: total frames={total_frames}, speech frames={speech_frames}, speech ratio={speech_ratio:.2%}, threshold={threshold}")
return speech_segments
def _merge_speech_segments(self, audio_array: np.ndarray, segments: list, min_samples: int) -> np.ndarray:
"""Merge speech segments and return the selected speech audio."""
if not segments:
return np.array([], dtype=np.int16)
# Merge adjacent segments
merged_segments = []
current_start, current_end = segments[0]
for start, end in segments[1:]:
if start <= current_end + 1024: # allow a small gap of one frame (1024 samples)
current_end = end
else:
merged_segments.append((current_start, current_end))
current_start, current_end = start, end
merged_segments.append((current_start, current_end))
# Select segments from the end backwards until min_samples is reached
selected_audio = []
total_samples = 0
for start, end in reversed(merged_segments):
segment_audio = audio_array[start:end]
selected_audio.insert(0, segment_audio)
total_samples += len(segment_audio)
if total_samples >= min_samples:
break
if not selected_audio:
return np.array([], dtype=np.int16)
return np.concatenate(selected_audio)
def _asr_worker(self):
"""ASR worker thread."""
while not self.stop_event.is_set():
try:
audio_chunk = self.audio_queue.get(timeout=0.1)
if self.asr_client and self.asr_client.running:
self.asr_client.send_audio(audio_chunk)
except queue.Empty:
continue
except Exception as e:
self.get_logger().error(f"[Enroll ASR] Processing error: {e}")
def _on_asr_sentence_end(self, text: str):
"""Callback invoked when ASR finishes a sentence."""
if text and text.strip():
self.text_queue.put(text.strip())
def _text_worker(self):
"""Text-processing thread: detects the wake word."""
while not self.stop_event.is_set():
try:
text = self.text_queue.get(timeout=0.1)
if self.waiting_for_wake_word:
self._check_wake_word(text)
except queue.Empty:
continue
except Exception as e:
self.get_logger().error(f"[Enroll text] Processing error: {e}")
def _to_pinyin(self, text: str) -> str:
"""Convert Chinese text to pinyin."""
chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
if not chars:
return ""
py_list = pinyin(chars, style=Style.NORMAL)
return ' '.join([item[0] for item in py_list]).lower().strip()
def _check_wake_word(self, text: str):
"""Check whether the text contains the wake word."""
text_pinyin = self._to_pinyin(text)
wake_word_pinyin = self.wake_word.lower().strip()
self.get_logger().info(f"[Enroll wake word] Text: {text}, text pinyin: {text_pinyin}, wake-word pinyin: {wake_word_pinyin}")
if not wake_word_pinyin:
return
text_pinyin_parts = text_pinyin.split() if text_pinyin else []
wake_word_parts = wake_word_pinyin.split()
# Check whether the wake word occurs as a contiguous run of pinyin syllables
for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
self.get_logger().info(f"[Enroll wake word] Wake word '{self.wake_word}' detected")
self.get_logger().info("=" * 50)
self.get_logger().info("[Enroll] Starting enrollment, up to 3 s of audio will be used")
self.get_logger().info("=" * 50)
self.waiting_for_wake_word = False
self.waiting_for_voiceprint = True
# Stop ASR, recognition is no longer needed
if self.asr_client:
self.asr_client.stop_current_recognition()
# Process the buffered audio immediately:
# the user may already have finished speaking (one utterance that contains the wake word)
self._process_voiceprint_audio()
return
def _check_done(self):
if self.stop_event.is_set():
self.get_logger().info("Enrollment finished, node exiting")
# Release resources
if self.asr_client:
self.asr_client.stop()
self.destroy_node()
rclpy.shutdown()
def main(args=None):
rclpy.init(args=args)
node = RegisterSpeakerNode()
rclpy.spin(node)
if __name__ == '__main__':
main()
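
The wake-word check in `_check_wake_word` is a sliding-window comparison over pinyin syllables. A minimal sketch of only that matching step is shown below; it takes strings that are already converted to space-separated pinyin, so it does not call `pypinyin`, and the function name `contains_wake_word` is hypothetical.

```python
def contains_wake_word(text_pinyin: str, wake_word_pinyin: str) -> bool:
    """Return True if the wake-word syllables occur contiguously in the text syllables."""
    text_parts = text_pinyin.lower().split()
    wake_parts = wake_word_pinyin.lower().split()
    if not wake_parts:
        # An empty wake word never matches, mirroring the early return in the node
        return False
    n = len(wake_parts)
    # Compare every window of n syllables against the wake word
    return any(text_parts[i:i + n] == wake_parts
               for i in range(len(text_parts) - n + 1))
```

Matching whole syllables rather than substrings means a text such as `"ger gou"` does not trigger on the wake word `"er gou"`.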


@@ -0,0 +1,858 @@
"""
Voice interaction node
"""
import rclpy
from rclpy.node import Node
from std_msgs.msg import String
import threading
import queue
import time
import re
import base64
import io
import numpy as np
from PIL import Image
import subprocess
import collections
import os
import yaml
import json
from ament_index_python.packages import get_package_share_directory
from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
from robot_speaker.models.asr.dashscope import DashScopeASR
from robot_speaker.models.tts.dashscope import DashScopeTTSClient
from robot_speaker.models.llm.dashscope import DashScopeLLM
from robot_speaker.understanding.context_manager import ConversationHistory
from robot_speaker.core.types import LLMMessage, TTSRequest
from robot_speaker.perception.camera_client import CameraClient
from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient, SpeakerState
from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
from robot_speaker.core.conversation_state import ConversationState
from robot_speaker.core.node_workers import NodeWorkers
from robot_speaker.core.node_callbacks import NodeCallbacks
from robot_speaker.core.intent_router import IntentRouter, IntentResult
class RobotSpeakerNode(Node):
# ==================== Initialization ====================
def __init__(self):
super().__init__('robot_speaker_node')
# Load parameters directly from the config file
self._load_config()
# Queues for inter-thread communication
self.audio_queue = queue.Queue() # recording thread -> ASR thread
self.text_queue = queue.Queue() # ASR thread -> main thread
self.tts_queue = queue.Queue() # main thread -> TTS thread
# Thread-synchronization events
self.interrupt_event = threading.Event() # interrupt flag
self.stop_event = threading.Event() # stop flag
self.tts_playing_event = threading.Event() # TTS playback state
# Session management
self.session_active = False
self.session_start_time = 0.0
self.session_lock = threading.Lock()
# State machine
self.conversation_state = ConversationState.IDLE # current conversation state
self.state_lock = threading.Lock() # protects the state-machine state
# Shared speaker-verification state
self.current_speaker_id = None # current speaker ID (shared state; other threads only read it)
self.current_speaker_state = SpeakerState.UNKNOWN # current speaker state
self.current_speaker_score = 0.0 # current speaker similarity score
self.sv_lock = threading.Lock() # protects the shared speaker-verification state
self.sv_speech_end_event = threading.Event() # tells the SV thread to process (set on speech_end)
self.sv_result_ready_event = threading.Event() # kept for compatibility (no longer used for synchronization)
self.sv_result_lock = threading.Lock() # lock for the SV result sequence number
self.sv_result_cv = threading.Condition(self.sv_result_lock)
self.sv_result_seq = 0
# The SV buffer is created in _init_components, after the parameters are read
self.sv_audio_buffer = None # SV recording buffer, created in _init_components
self.sv_recording = False # whether audio is currently being recorded for SV
# Utterance tracking
self.utterance_lock = threading.Lock()
self.current_utterance_id = 0
self.last_processed_utterance_id = 0
self.intent_router = IntentRouter()
self.callbacks = NodeCallbacks(self)
# Initialize components (VAD, recorder, ASR, LLM, TTS)
self._init_components()
self.workers = NodeWorkers(self)
# Initial speaker-database check
if self.sv_enabled and self.sv_client:
speaker_count = self.sv_client.get_speaker_count()
if speaker_count == 0:
self.get_logger().info("Speaker database is empty, please enroll a voiceprint")
# ROS subscriptions and publishers
self.interrupt_sub = self.create_subscription(
String, 'interrupt_command', self.callbacks.handle_interrupt_command, self.system_interrupt_command_queue_depth
)
self.skill_sequence_pub = self.create_publisher(String, '/llm_skill_sequence', 10)
self.skill_feedback_sub = self.create_subscription(
String, '/skill_execution_feedback', self._on_skill_feedback, 10
)
self.skill_result_sub = self.create_subscription(
String, '/skill_execution_result', self._on_skill_result, 10
)
self.latest_skill_feedback = None
self.latest_skill_result = None
# Start threads
self._start_threads()
self.get_logger().info("Voice node started")
# ==================== Config loading ====================
def _load_config(self):
"""Load parameters directly from the voice.yaml config file."""
config_file = os.path.join(
get_package_share_directory('robot_speaker'),
'config',
'voice.yaml'
)
with open(config_file, 'r') as f:
config = yaml.safe_load(f)
# Audio parameters
audio = config['audio']
mic = audio['microphone']
soundcard = audio['soundcard']
echo = audio['echo_cancellation']
tts_audio = audio['tts']
self.input_device_index = mic['device_index']
self.output_card_index = soundcard['card_index']
self.output_device_index = soundcard['device_index']
self.sample_rate = mic['sample_rate']
self.channels = mic['channels']
self.chunk = mic['chunk']
self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']
self.output_sample_rate = soundcard['sample_rate']
self.output_channels = soundcard['channels']
self.output_volume = soundcard['volume']
self.audio_echo_cancellation_enabled = echo.get('enabled', True) # enabled by default
self.audio_echo_cancellation_max_duration_ms = echo['max_duration_ms']
self.audio_tts_source_sample_rate = tts_audio['source_sample_rate']
self.audio_tts_source_channels = tts_audio['source_channels']
self.audio_tts_ffmpeg_thread_queue_size = tts_audio['ffmpeg_thread_queue_size']
# VAD parameters
vad = config['vad']
self.vad_mode = vad['vad_mode']
self.silence_duration_ms = vad['silence_duration_ms']
self.min_energy_threshold = vad['min_energy_threshold']
# DashScope parameters
dashscope = config['dashscope']
self.dashscope_api_key = dashscope['api_key']
self.asr_model = dashscope['asr']['model']
self.asr_url = dashscope['asr']['url']
self.llm_model = dashscope['llm']['model']
self.llm_base_url = dashscope['llm']['base_url']
self.llm_temperature = dashscope['llm']['temperature']
self.llm_max_tokens = dashscope['llm']['max_tokens']
self.llm_max_history = dashscope['llm']['max_history']
self.llm_summary_trigger = dashscope['llm']['summary_trigger']
self.tts_model = dashscope['tts']['model']
self.tts_voice = dashscope['tts']['voice']
# System parameters
system = config['system']
self.use_llm = system['use_llm']
self.use_wake_word = system['use_wake_word']
self.wake_word = system['wake_word']
self.session_timeout = system['session_timeout']
self.system_shutup_keywords = system['shutup_keywords']
self.system_interrupt_command_queue_depth = system['interrupt_command_queue_depth']
self.sv_enabled = system['sv_enabled']
self.sv_model_path = os.path.expanduser(system['sv_model_path'])
self.sv_threshold = system['sv_threshold']
self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path']) # expand the user home directory
self.sv_buffer_size = system['sv_buffer_size']
# Camera parameters
camera = config['camera']
self.camera_serial_number = camera['serial_number']
self.camera_rgb_width = camera['rgb']['width']
self.camera_rgb_height = camera['rgb']['height']
self.camera_rgb_fps = camera['rgb']['fps']
self.camera_rgb_format = camera['rgb']['format']
self.camera_image_jpeg_quality = camera['image']['jpeg_quality']
self.camera_image_max_size = camera['image']['max_size']
self.knowledge_file = os.path.join(
get_package_share_directory('robot_speaker'),
'config',
'knowledge.json'
)
# ==================== Component initialization ====================
def _init_components(self):
"""Initialize all components."""
self.shutup_keywords = [k.strip() for k in self.system_shutup_keywords.split(',') if k.strip()]
self.kb_answers_map = {}
if self.knowledge_file and os.path.exists(self.knowledge_file):
try:
with open(self.knowledge_file, 'r') as f:
kb_data = json.load(f)
entries = kb_data["entries"]
for entry in entries:
patterns = entry["patterns"]
answer = entry["answer"]
if not answer.strip():
continue
for pattern in patterns:
key = pattern.strip().lower()
if key:
self.kb_answers_map[key] = answer.strip()
self.get_logger().info(f"Knowledge base loaded: {len(self.kb_answers_map)} patterns")
except Exception as e:
self.get_logger().warning(f"Failed to load knowledge base: {e}")
self.sv_audio_buffer = collections.deque(maxlen=self.sv_buffer_size)
self.vad_detector = VADDetector(
mode=self.vad_mode,
sample_rate=self.sample_rate
)
# Reference-signal buffer for echo cancellation; it uses the microphone rate (16 kHz) even though playback runs at 44100 Hz
self.reference_signal_buffer = ReferenceSignalBuffer(
max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
sample_rate=self.sample_rate,
channels=self.output_channels
) if self.audio_echo_cancellation_enabled else None
# Recorder: pushes audio chunks straight to the queue
self.audio_recorder = AudioRecorder(
device_index=self.input_device_index,
sample_rate=self.sample_rate,
channels=self.channels,
chunk=self.chunk,
vad_detector=self.vad_detector,
audio_queue=self.audio_queue,
silence_duration_ms=self.silence_duration_ms,
min_energy_threshold=self.min_energy_threshold,
heartbeat_interval=self.audio_microphone_heartbeat_interval,
on_heartbeat=self.callbacks.on_heartbeat,
is_playing=self.tts_playing_event.is_set,
on_new_segment=self.callbacks.on_new_segment,
on_speech_start=self.callbacks.on_speech_start,
on_speech_end=self.callbacks.on_speech_end,
stop_flag=self.stop_event.is_set,
on_audio_chunk=self.callbacks.on_audio_chunk_for_sv if self.sv_enabled else None, # SV recording callback
should_put_to_queue=self.callbacks.should_put_audio_to_queue, # decides whether audio is queued for ASR
get_silence_threshold=self.callbacks.get_silence_threshold, # silence-threshold callback
enable_echo_cancellation=self.audio_echo_cancellation_enabled, # read from the config file
reference_signal_buffer=self.reference_signal_buffer, # pass the reference-signal buffer
logger=self.get_logger()
)
# ASR client: streaming recognition
self.asr_client = DashScopeASR(
api_key=self.dashscope_api_key,
sample_rate=self.sample_rate,
model=self.asr_model,
url=self.asr_url,
logger=self.get_logger()
)
self.asr_client.on_sentence_end = self.callbacks.on_asr_sentence_end
self.asr_client.on_text_update = self.callbacks.on_asr_text_update
self.asr_client.start()
# LLM client
if self.use_llm:
self.llm_client = DashScopeLLM(
api_key=self.dashscope_api_key,
model=self.llm_model,
base_url=self.llm_base_url,
temperature=self.llm_temperature,
max_tokens=self.llm_max_tokens,
name="LLM-chat",
logger=self.get_logger()
)
self.history = ConversationHistory(
max_history=self.llm_max_history,
summary_trigger=self.llm_summary_trigger
)
else:
self.llm_client = None
self.history = None
# TTS client
self.get_logger().info(f"TTS config: model={self.tts_model}, voice={self.tts_voice}")
self.get_logger().info(f"Audio output config: sample_rate={self.output_sample_rate}, channels={self.output_channels}")
self.tts_client = DashScopeTTSClient(
api_key=self.dashscope_api_key,
model=self.tts_model,
voice=self.tts_voice,
card_index=self.output_card_index,
device_index=self.output_device_index,
output_sample_rate=self.output_sample_rate,
output_channels=self.output_channels,
output_volume=self.output_volume,
tts_source_sample_rate=self.audio_tts_source_sample_rate,
tts_source_channels=self.audio_tts_source_channels,
tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
reference_signal_buffer=self.reference_signal_buffer, # pass in the reference-signal buffer
logger=self.get_logger()
)
# Camera client (kept running by default)
try:
self.camera_client = CameraClient(
serial_number=self.camera_serial_number,
width=self.camera_rgb_width,
height=self.camera_rgb_height,
fps=self.camera_rgb_fps,
format=self.camera_rgb_format,
logger=self.get_logger()
)
self.camera_client.initialize()
except Exception as e:
self.get_logger().warning(f"Camera initialization failed: {e}; camera features will be unavailable")
self.camera_client = None
# Speaker-verification client
if self.sv_enabled and self.sv_model_path:
try:
self.sv_client = SpeakerVerificationClient(
model_path=self.sv_model_path,
threshold=self.sv_threshold,
speaker_db_path=self.sv_speaker_db_path,
logger=self.get_logger()
)
except Exception as e:
self.get_logger().warning(f"声纹识别初始化失败: {e},声纹功能将不可用")
self.sv_client = None
self.sv_enabled = False
else:
self.sv_client = None
# ==================== Thread startup ====================
def _start_threads(self):
"""Start the worker threads"""
# Thread 1: recording thread
self.recording_thread = threading.Thread(
target=self.workers.recording_worker,
name="RecordingThread",
daemon=True
)
self.recording_thread.start()
# Thread 2: ASR inference thread
self.asr_thread = threading.Thread(
target=self.workers.asr_worker,
name="ASRThread",
daemon=True
)
self.asr_thread.start()
# Thread 3: main processing thread (business logic)
self.process_thread = threading.Thread(
target=self.workers.process_worker,
name="ProcessThread",
daemon=True
)
self.process_thread.start()
# Thread 4: TTS playback thread
self.tts_thread = threading.Thread(
target=self._tts_worker,
name="TTSThread",
daemon=True
)
self.tts_thread.start()
# Thread 5: speaker-verification thread (if enabled)
if self.sv_enabled and self.sv_client:
self.sv_thread = threading.Thread(
target=self.workers.sv_worker,
name="SVThread",
daemon=True
)
self.sv_thread.start()
else:
self.sv_thread = None
# ==================== TTS playback thread ====================
def _tts_worker(self):
"""
Thread 4: TTS playback thread (playback only)
"""
self.get_logger().info("[TTS playback thread] Started")
while not self.stop_event.is_set():
try:
text = self.tts_queue.get(timeout=1.0)
except queue.Empty:
if self.interrupt_event.is_set():
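self.get_logger().debug("[TTS playback thread] Interrupt event detected")
continue
if self.interrupt_event.is_set():
self.get_logger().info("[TTS playback thread] Interrupted, skipping text")
continue
if not text or not str(text).strip():
continue
text_str = str(text).strip()
text_len = len(text_str)
self.get_logger().info(f"[TTS playback thread] Playing: {text_str[:100]}... (total length: {text_len} chars)")
self.tts_playing_event.set()
try:
request = TTSRequest(text=text_str, voice=None)
success = self.tts_client.synthesize(
request,
interrupt_check=self.interrupt_event.is_set
)
except Exception as e:
# A synthesis error must not kill the playback thread
self.get_logger().error(f"[TTS playback thread] Synthesis failed: {e}")
success = False
finally:
# Always clear the flag, otherwise the recorder keeps seeing is_playing() == True
self.tts_playing_event.clear()
if success:
self.get_logger().info("[TTS playback thread] Playback finished")
else:
self.get_logger().info("[TTS playback thread] Playback interrupted or failed")
if self.interrupt_event.is_set():
self.get_logger().info("[TTS playback thread] Interrupt detected after playback, draining queue")
self._drain_queue(self.tts_queue)
self.interrupt_event.clear()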
# ==================== State machine methods ====================
def _change_state(self, new_state: ConversationState, reason: str | None = None):
"""Change the conversation state"""
with self.state_lock:
old_state = self.conversation_state
self.conversation_state = new_state
if reason:
self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}: {reason}")
else:
self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}")
def _get_state(self) -> ConversationState:
"""Return the current state"""
with self.state_lock:
return self.conversation_state
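# ==================== LLM processing (with camera capture) ====================
def _encode_image_to_base64(self, image_data: np.ndarray, quality: int = 85) -> str:
"""Encode a numpy image array as a base64 JPEG string; returns an empty string on failure"""
try:
# Let Pillow infer the mode; indexing shape[2] would raise on 2-D grayscale arrays
pil_image = Image.fromarray(image_data)
if pil_image.mode not in ('RGB', 'L'):
# JPEG cannot store an alpha channel, so convert e.g. RGBA to RGB
pil_image = pil_image.convert('RGB')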
buffer = io.BytesIO()
pil_image.save(buffer, format='JPEG', quality=quality)
image_bytes = buffer.getvalue()
base64_str = base64.b64encode(image_bytes).decode('utf-8')
return base64_str
except Exception as e:
self.get_logger().error(f"图像编码失败: {e}")
return ""
def _llm_process_stream_with_camera(
self,
user_text: str,
need_camera: bool,
system_prompt: str | None = None,
suppress_tts: bool = False
) -> str:
"""Streaming LLM processing with multimodal support (text + images)"""
if not self.llm_client or not self.history:
return ""
messages = list(self.history.get_messages())
has_system_msg = any(msg.role == "system" for msg in messages)
if not has_system_msg:
if not system_prompt:
system_prompt = self.intent_router.build_default_system_prompt()
messages.insert(0, LLMMessage(role="system", content=system_prompt))
full_reply = ""
tts_text_buffer = ""
image_base64_list = []
def on_token(token: str):
nonlocal full_reply, tts_text_buffer
if self.interrupt_event.is_set():
self.get_logger().info("[LLM streaming] Interrupt detected in on_token callback, stopping")
return
full_reply += token
tts_text_buffer += token
if need_camera and self.camera_client:
with self.camera_client.capture_context() as image_data:
if image_data is not None:
image_base64 = self._encode_image_to_base64(
image_data,
quality=self.camera_image_jpeg_quality
)
if image_base64:
image_base64_list.append(image_base64)
self.get_logger().info("[相机] 已拍照")
if image_base64_list:
self.get_logger().info(
f"[多模态] 准备发送给LLM: {len(image_base64_list)}张图片,用户文本: {user_text[:50]}"
)
for idx, img_b64 in enumerate(image_base64_list):
self.get_logger().debug(f"[多模态] 图片#{idx+1} base64长度: {len(img_b64)}")
reply = self.llm_client.chat_stream(
messages,
on_token=on_token,
images=image_base64_list if image_base64_list else None,
interrupt_check=lambda: self.interrupt_event.is_set()
)
if self.interrupt_event.is_set() or (reply is None):
if self.interrupt_event.is_set():
self.get_logger().info("[LLM流式处理] 处理被中断")
return ""
if image_base64_list:
# Clearing the list releases the encoded frames; deleting the loop variable had no effect
image_base64_list.clear()
self.get_logger().info("[Camera] Captured images released")
if reply and reply.strip():
tts_text_to_send = reply.strip()
tts_buffer_len = len(tts_text_buffer.strip()) if tts_text_buffer else 0
reply_len = len(tts_text_to_send)
if tts_buffer_len != reply_len:
self.get_logger().info(
f"[Streaming TTS] tts_text_buffer ({tts_buffer_len} chars) and reply ({reply_len} chars) differ in length; using reply as the TTS text"
)
elif tts_text_buffer and tts_text_buffer.strip():
tts_text_to_send = tts_text_buffer.strip()
self.get_logger().warning(
f"[Streaming TTS] reply is empty; using tts_text_buffer ({len(tts_text_to_send)} chars) as the TTS text"
)
else:
tts_text_to_send = ""
self.get_logger().warning("[Streaming TTS] Both reply and tts_text_buffer are empty; nothing to send to TTS")
if not self.interrupt_event.is_set() and tts_text_to_send and not suppress_tts:
text_len = len(tts_text_to_send)
self.get_logger().info(
f"[Streaming TTS] Sending full text to the TTS queue: {tts_text_to_send[:100]}... (total length: {text_len} chars)"
)
if text_len > 100:
self.get_logger().debug(f"[Streaming TTS] Full text: {tts_text_to_send}")
self._put_tts_text(tts_text_to_send)
elif suppress_tts:
self.get_logger().info("[Streaming TTS] suppress_tts is enabled, skipping TTS output")
return reply.strip() if reply else ""
# ==================== Interrupt and TTS helpers ====================
def _force_stop_tts(self):
"""Force-stop TTS playback by killing the recorded ffmpeg process PID"""
self._drain_queue(self.tts_queue)
self.interrupt_event.set()
if self.tts_client and self.tts_client.current_ffmpeg_pid:
try:
pid = self.tts_client.current_ffmpeg_pid
os.kill(pid, 9) # SIGKILL
self.get_logger().info(f"[Force-stop TTS] Killed ffmpeg process PID={pid}")
self.tts_client.current_ffmpeg_pid = None
except ProcessLookupError:
self.get_logger().debug(f"[Force-stop TTS] ffmpeg process no longer exists, PID={pid}")
self.tts_client.current_ffmpeg_pid = None
except Exception as e:
self.get_logger().warning(f"[Force-stop TTS] Failed to kill ffmpeg process: {e}")
def _check_interrupt(self, auto_clear: bool = False) -> bool:
"""
Return True if the interrupt flag is set; clear it first when auto_clear is True
"""
if self.interrupt_event.is_set():
if auto_clear:
self.interrupt_event.clear()
return True
return False
def _check_interrupt_and_cancel_turn(self) -> bool:
"""Check for an interrupt and cancel the current turn (shared cleanup after an interrupt)"""
if self._check_interrupt(auto_clear=True):
if self.use_llm and self.history:
self.history.cancel_turn()
return True
return False
# ==================== Enrollment / session / wake word ====================
def _handle_empty_speaker_db(self) -> bool:
"""Handle the case where the speaker database is empty"""
if not (self.sv_enabled and self.sv_client):
return False
speaker_count = self.sv_client.get_speaker_count()
if speaker_count == 0:
with self.sv_lock:
self.current_speaker_id = None
self.current_speaker_state = SpeakerState.UNKNOWN
self.current_speaker_score = 0.0
self.sv_result_ready_event.set()
return True
return False
def _put_tts_text(self, text: str):
"""Put text on the TTS queue, with error handling"""
try:
self.tts_queue.put(text, timeout=0.2)
self.get_logger().debug(f"[TTS queue] Text queued: {text[:50]}... (queue size: {self.tts_queue.qsize()})")
except Exception as e:
self.get_logger().error(f"[TTS queue] Failed to queue text: {e}, text: {text[:50]}")
def _interrupt_tts(self, reason: str):
"""
Interrupt TTS playback. Only sets the interrupt event; the queue is left intact so the TTS thread can notice the event and stop on its own.
"""
self.get_logger().info(f"[Interrupt] {reason}")
self.interrupt_event.set()
@staticmethod
def _drain_queue(q: queue.Queue):
"""Remove every pending item from the queue without blocking"""
while True:
try:
q.get_nowait()
except queue.Empty:
break
def _start_session(self):
"""Start a session"""
with self.session_lock:
self.session_active = True
self.session_start_time = time.time()
def _reset_session(self):
"""Refresh the session timer so an active session does not time out"""
with self.session_lock:
self.session_start_time = time.time()
def _is_session_active(self) -> bool:
"""Return True while the session is active and has not timed out"""
with self.session_lock:
if not self.session_active:
return False
if time.time() - self.session_start_time >= self.session_timeout:
self.session_active = False
return False
return True
# ==================== Intent handling ====================
def _handle_wake_word(self, text: str) -> str:
"""Handle the wake word: convert the ASR text to pinyin and check whether it contains the wake word's pinyin"""
if not self.use_wake_word:
return text.strip()
if self._is_session_active():
self._reset_session()
return text.strip()
text_pinyin = self.intent_router.to_pinyin(text)
wake_word_pinyin = self.wake_word.lower().strip()
self.get_logger().info(f"[唤醒词] 原始文本: {text}, 文本拼音: {text_pinyin}, 唤醒词拼音: {wake_word_pinyin}")
if not wake_word_pinyin:
self.get_logger().info("[唤醒词] 唤醒词为空,过滤文本")
return ""
text_pinyin_parts = text_pinyin.split() if text_pinyin else []
wake_word_parts = wake_word_pinyin.split()
start_idx = -1
for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
start_idx = i
break
if start_idx == -1:
self.get_logger().info(f"[唤醒词] 未检测到唤醒词 '{self.wake_word}',过滤文本")
return ""
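# Rebuild the text without the wake word. This relies on to_pinyin() emitting
# one token per Chinese character, so a token index doubles as a character index.
han_idx = 0
new_text = ""
for c in text:
if '\u4e00' <= c <= '\u9fa5':
if han_idx < start_idx or han_idx >= start_idx + len(wake_word_parts):
new_text += c
han_idx += 1
else:
new_text += c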
self._start_session()
return new_text.strip()
def _check_shutup_command(self, text: str) -> bool:
"""Check whether the text contains a stop-talking ("shut up") command"""
if not text:
return False
text_lower = text.lower()
text_pinyin = self.intent_router.to_pinyin(text)
for keyword in self.shutup_keywords:
kw = keyword.lower().strip()
if not kw:
continue
if kw in text_lower or (text_pinyin and kw in text_pinyin):
return True
return False
def _handle_intent(self, intent_payload: IntentResult):
"""Route the request to the handler that matches its intent"""
intent = intent_payload.intent
text = intent_payload.text
need_camera = intent_payload.need_camera
system_prompt = intent_payload.system_prompt
if intent == "kb_qa":
answer = None
text_pinyin = self.intent_router.to_pinyin(text)
if text_pinyin:
answer = self.kb_answers_map.get(text_pinyin)
if answer:
if "{wake_word}" in answer:
answer = answer.replace("{wake_word}", self.wake_word or "")
self._put_tts_text(answer)
else:
pass
return
if self.use_llm and self.llm_client:
if self.history:
self.history.start_turn(text)
reply = self._llm_process_stream_with_camera(
text,
need_camera=need_camera,
system_prompt=system_prompt,
suppress_tts=(intent == "skill_sequence")
)
if reply:
if self.history:
self.history.commit_turn(reply)
if intent == "skill_sequence":
skill_msg = String()
skill_msg.data = reply.strip()
self.skill_sequence_pub.publish(skill_msg)
self.get_logger().info(f"[技能序列] 已发布: {skill_msg.data}")
else:
if self.history:
self.history.cancel_turn()
else:
self.get_logger().warning("[主线程] 未启用LLM无法处理文本")
# ==================== Resource cleanup ====================
def destroy_node(self):
"""Destroy the node"""
self.get_logger().info("Voice node is shutting down...")
self.stop_event.set()
self.interrupt_event.set()
self.get_logger().info("强制停止TTS播放...")
self._force_stop_tts()
self._drain_queue(self.tts_queue)
threads_to_join = [self.recording_thread, self.asr_thread, self.process_thread, self.tts_thread]
if self.sv_thread:
threads_to_join.append(self.sv_thread)
for thread in threads_to_join:
if thread and thread.is_alive():
thread.join(timeout=1.0)
self._force_stop_tts()
if hasattr(self, 'asr_client') and self.asr_client:
self.asr_client.stop()
if hasattr(self, 'audio_recorder') and self.audio_recorder:
self.audio_recorder.cleanup()
if hasattr(self, 'camera_client') and self.camera_client:
self.camera_client.cleanup()
if hasattr(self, 'sv_client') and self.sv_client:
try:
self.sv_client.save_speakers()
self.sv_client.cleanup()
except Exception as e:
self.get_logger().warning(f"清理声纹识别资源时出错: {e}")
super().destroy_node()
def _on_skill_feedback(self, msg: String):
try:
feedback = json.loads(msg.data)
self.latest_skill_feedback = feedback
feedback_text = (
f"【执行状态】阶段:{feedback.get('stage','')}, "
f"技能:{feedback.get('current_skill','')}, "
f"进度:{feedback.get('progress', 0):.1%}, "
f"详情:{feedback.get('detail','')}"
)
if self.history:
self.history.add_message("system", feedback_text)
except Exception as e:
self.get_logger().warning(f"[技能反馈] 解析失败: {e}")
def _on_skill_result(self, msg: String):
try:
result = json.loads(msg.data)
self.latest_skill_result = result
result_text = (
f"【执行结果】{'成功' if result.get('success') else '失败'}, "
f"总技能数:{result.get('total_skills', 0)}, "
f"成功数:{result.get('succeeded_skills', 0)}, "
f"消息:{result.get('message','')}"
)
if self.history:
self.history.add_message("system", result_text)
except Exception as e:
self.get_logger().warning(f"[技能结果] 解析失败: {e}")
def _init_ros(args):
rclpy.init(args=args)
def _create_node():
return RobotSpeakerNode()
def _run_node(node):
rclpy.spin(node)
def _cleanup_node(node):
if node:
node.destroy_node()
def _shutdown_ros():
if rclpy.ok():
rclpy.shutdown()
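# ==================== Entry point ====================
def main(args=None):
node = None
_init_ros(args)
try:
node = _create_node()
_run_node(node)
except KeyboardInterrupt:
# Ctrl+C is the normal way to stop the node
pass
finally:
# Release threads, devices and ROS resources even if startup or spin fails
_cleanup_node(node)
_shutdown_ros()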
if __name__ == '__main__':
main()
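
The wake-word check in `_handle_wake_word` comes down to finding the wake word's pinyin syllables as a contiguous run inside the utterance's syllables, then dropping the matching characters. A minimal standalone sketch of that search, assuming the syllables are already available as lists of strings; `find_token_run`, `strip_run` and the sample utterance are illustrative names and data, not part of the package:

```python
def find_token_run(tokens: list[str], pattern: list[str]) -> int:
    """Return the index where `pattern` first appears as a contiguous run in `tokens`, or -1."""
    if not pattern:
        return -1
    for i in range(len(tokens) - len(pattern) + 1):
        if tokens[i:i + len(pattern)] == pattern:
            return i
    return -1


def strip_run(chars: list[str], start: int, length: int) -> list[str]:
    """Drop `length` items starting at `start`; assumes one character per token."""
    return chars[:start] + chars[start + length:]


# Hypothetical utterance: four characters read as "ni hao xiao bai"
tokens = ["ni", "hao", "xiao", "bai"]
start = find_token_run(tokens, ["xiao", "bai"])
remaining = strip_run(list("你好小白"), start, 2)
```

Matching on syllables rather than characters lets homophones of the wake word pass, which is presumably why the node converts the ASR text to pinyin first.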

View File

@@ -0,0 +1,36 @@
"""
Shared data structure definitions
"""
from dataclasses import dataclass
@dataclass
class ASRResult:
"""ASR recognition result"""
text: str
confidence: float | None = None
language: str | None = None
@dataclass
class LLMMessage:
"""LLM message"""
role: str # "user", "assistant", "system"
content: str
@dataclass
class TTSRequest:
"""TTS request"""
text: str
voice: str | None = None # if None, use the default voice configured in the console
speed: float | None = None
pitch: float | None = None
@dataclass
class ImageMessage:
"""Image message for multimodal LLM requests"""
image_data: bytes # base64-encoded image data
image_format: str = "jpeg"
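
Because these types are plain dataclasses, they convert to the dict form expected by an OpenAI-compatible chat API without extra code. A small sketch, assuming a local copy of `LLMMessage` with the same two fields as above; the message texts are made up for illustration:

```python
from dataclasses import dataclass, asdict


@dataclass
class LLMMessage:
    """Local stand-in for robot_speaker.core.types.LLMMessage (same fields)."""
    role: str  # "user", "assistant", "system"
    content: str


messages = [
    LLMMessage(role="system", content="You are a robot assistant."),
    LLMMessage(role="user", content="Hello"),
]
# asdict() yields the same {"role": ..., "content": ...} shape that a manual dict comprehension builds
payload = [asdict(m) for m in messages]
```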

View File

@@ -0,0 +1,5 @@
"""Model layer"""

View File

@@ -0,0 +1,5 @@
"""ASR models"""

View File

@@ -0,0 +1,13 @@
class ASRClient:
def start(self) -> bool:
raise NotImplementedError
def stop(self) -> bool:
raise NotImplementedError
def send_audio(self, audio_data: bytes) -> bool:
raise NotImplementedError

View File

@@ -0,0 +1,218 @@
"""
ASR speech recognition module
"""
import base64
import time
import threading
import dashscope
from dashscope.audio.qwen_omni import OmniRealtimeConversation, OmniRealtimeCallback
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams, MultiModality
from robot_speaker.models.asr.base import ASRClient
class DashScopeASR(ASRClient):
"""Wrapper around the DashScope real-time ASR recognizer"""
def __init__(self, api_key: str,
sample_rate: int,
model: str,
url: str,
logger=None):
dashscope.api_key = api_key
self.sample_rate = sample_rate
self.model = model
self.url = url
self.logger = logger
self.conversation = None
self.running = False
self.on_sentence_end = None
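self.on_text_update = None # callback for live (partial) text updates
# Thread synchronization
self._stop_lock = threading.Lock() # prevents concurrent calls to stop_current_recognition
self._final_result_event = threading.Event() # signalled when the final callback has run
self._pending_commit = False # True while a commit is waiting for its final result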
def _log(self, level: str, msg: str):
"""Log a message by calling the ROS 2 logger method that matches the level"""
if self.logger:
# A ROS 2 logger cannot change severity dynamically, so call the matching method explicitly
if level == "debug":
self.logger.debug(msg)
elif level == "info":
self.logger.info(msg)
elif level == "warning":
self.logger.warning(msg)
elif level == "error":
self.logger.error(msg)
else:
self.logger.info(msg) # default to info
else:
print(f"[ASR] {msg}")
def start(self):
"""Start the ASR recognizer"""
if self.running:
return False
try:
callback = _ASRCallback(self)
self.conversation = OmniRealtimeConversation(
model=self.model,
url=self.url,
callback=callback
)
callback.conversation = self.conversation
self.conversation.connect()
transcription_params = TranscriptionParams(
language='zh',
sample_rate=self.sample_rate,
input_audio_format="pcm",
)
# Local VAD: only drives TTS interruption
# Server-side turn detection: only drives ASR output and LLM turn-taking
self.conversation.update_session(
output_modalities=[MultiModality.TEXT],
enable_input_audio_transcription=True,
transcription_params=transcription_params,
enable_turn_detection=True,
# Keep server-side turn detection enabled
turn_detection_type='server_vad', # server-side VAD
turn_detection_threshold=0.2, # tunable
turn_detection_silence_duration_ms=800
)
self.running = True
self._log("info", "ASR已启动")
return True
except Exception as e:
self.running = False
self._log("error", f"ASR启动失败: {e}")
if self.conversation:
try:
self.conversation.close()
except Exception:
pass
self.conversation = None
return False
def send_audio(self, audio_chunk: bytes):
"""Send one audio chunk to the ASR service"""
if not self.running or not self.conversation:
return False
try:
audio_b64 = base64.b64encode(audio_chunk).decode('ascii')
self.conversation.append_audio(audio_b64)
return True
except Exception:
# The connection is closed or another error occurred; fail silently to avoid log spam.
# The running flag is set correctly in stop_current_recognition.
return False
def stop_current_recognition(self):
"""
Commit the buffered audio to obtain the current recognition result without closing the connection
"""
if not self.running or not self.conversation:
return False
# Use the lock to prevent concurrent calls
if not self._stop_lock.acquire(blocking=False):
self._log("warning", "stop_current_recognition is already running, skipping this call")
return False
try:
# Reset the event before waiting for the final callback
self._final_result_event.clear()
self._pending_commit = True
# Trigger the commit and wait for the final result
self.conversation.commit()
# Wait up to 1 second for the final callback
if self._final_result_event.wait(timeout=1.0):
self._log("debug", "Final callback received")
else:
self._log("warning", "Timed out waiting for the final callback, continuing")
return True
except Exception as e:
self._log("error", f"Failed to commit the current recognition result: {e}")
# On error, try to restart the connection
self.running = False
try:
if self.conversation:
self.conversation.close()
except Exception:
pass
self.conversation = None
time.sleep(0.1)
return self.start()
finally:
self._pending_commit = False
self._stop_lock.release()
def stop(self):
"""Stop the ASR recognizer"""
# Wait for any in-flight stop_current_recognition to finish
with self._stop_lock:
self.running = False
self._final_result_event.set() # wake any thread that may still be waiting
if self.conversation:
try:
self.conversation.close()
except Exception as e:
self._log("warning", f"Error while closing the connection on stop: {e}")
self.conversation = None
self._log("info", "ASR stopped")
class _ASRCallback(OmniRealtimeCallback):
"""ASR callback handler"""
def __init__(self, asr_client: DashScopeASR):
self.asr_client = asr_client
self.conversation = None
def on_open(self):
self.asr_client._log("info", "ASR WebSocket已连接")
def on_close(self, code, msg):
self.asr_client._log("info", f"ASR WebSocket已关闭: code={code}, msg={msg}")
def on_event(self, response):
event_type = response.get('type', '')
if event_type == 'session.created':
session_id = response.get('session', {}).get('id', '')
self.asr_client._log("info", f"ASR会话已创建: {session_id}")
elif event_type == 'conversation.item.input_audio_transcription.completed':
# Final recognition result
transcript = response.get('transcript', '')
if transcript and transcript.strip() and self.asr_client.on_sentence_end:
self.asr_client.on_sentence_end(transcript.strip())
# If a commit is pending, notify the waiting thread
if self.asr_client._pending_commit:
self.asr_client._final_result_event.set()
elif event_type == 'conversation.item.input_audio_transcription.text':
# Live (partial) transcript update
transcript = response.get('transcript', '') or response.get('text', '')
if transcript and transcript.strip() and self.asr_client.on_text_update:
self.asr_client.on_text_update(transcript.strip())
elif event_type == 'input_audio_buffer.speech_started':
self.asr_client._log("info", "ASR检测到说话开始")
elif event_type == 'input_audio_buffer.speech_stopped':
self.asr_client._log("info", "ASR检测到说话结束")
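
The `stop_current_recognition` / callback pair above is a commit-then-wait handshake: one thread triggers the commit and blocks on a `threading.Event`, while the callback thread sets the event when the final transcript arrives. A self-contained sketch of that pattern follows. `CommitWaiter` and its methods are hypothetical names, a timer stands in for the service, and unlike the method above this sketch reports a timeout as `False`:

```python
import threading


class CommitWaiter:
    """Sketch of the commit/wait handshake; class and method names are hypothetical."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = threading.Event()
        self.result: str | None = None

    def on_final(self, text: str) -> None:
        # Called from the callback thread when the final transcript arrives
        self.result = text
        self._done.set()

    def commit_and_wait(self, commit, timeout: float = 1.0) -> bool:
        # A non-blocking acquire rejects overlapping calls instead of queueing them
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self._done.clear()
            commit()
            return self._done.wait(timeout=timeout)
        finally:
            self._lock.release()


waiter = CommitWaiter()
# Simulate the service answering 50 ms after the commit
ok = waiter.commit_and_wait(
    lambda: threading.Timer(0.05, waiter.on_final, args=("hello",)).start()
)
# Nothing ever answers this commit, so the wait times out
timed_out = not waiter.commit_and_wait(lambda: None, timeout=0.05)
```

Clearing the event before the commit, rather than after, avoids losing a final result that arrives immediately.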

View File

@@ -0,0 +1,5 @@
"""LLM models"""

View File

@@ -0,0 +1,15 @@
from robot_speaker.core.types import LLMMessage
class LLMClient:
def chat(self, messages: list[LLMMessage]) -> str | None:
raise NotImplementedError
def chat_stream(self, messages: list[LLMMessage],
on_token=None,
interrupt_check=None) -> str | None:
raise NotImplementedError

View File

@@ -0,0 +1,149 @@
"""
LLM (large language model) module
Supports multimodal input (text + images)
"""
from openai import OpenAI
from typing import Optional, List
from robot_speaker.core.types import LLMMessage
from robot_speaker.models.llm.base import LLMClient
class DashScopeLLM(LLMClient):
"""Wrapper around the DashScope LLM client"""
def __init__(self, api_key: str,
model: str,
base_url: str,
temperature: float,
max_tokens: int,
name: str = "LLM",
logger=None):
self.client = OpenAI(api_key=api_key, base_url=base_url)
self.model = model
self.temperature = temperature
self.max_tokens = max_tokens
self.name = name
self.logger = logger
def _log(self, level: str, msg: str):
"""Log a message by calling the ROS 2 logger method that matches the level"""
msg = f"[{self.name}] {msg}"
if self.logger:
# A ROS 2 logger cannot change severity dynamically, so call the matching method explicitly
if level == "debug":
self.logger.debug(msg)
elif level == "info":
self.logger.info(msg)
elif level == "warning":
self.logger.warning(msg)
elif level == "error":
self.logger.error(msg)
else:
self.logger.info(msg) # default to info
def chat(self, messages: list[LLMMessage]) -> str | None:
"""Non-streaming chat: used for task planning"""
payload_messages = [{"role": msg.role, "content": msg.content} for msg in messages]
response = self.client.chat.completions.create(
model=self.model,
messages=payload_messages,
temperature=self.temperature,
max_tokens=self.max_tokens,
stream=False
)
# message.content can be None, so guard before calling strip()
reply = (response.choices[0].message.content or "").strip()
return reply if reply else None
def chat_stream(self, messages: list[LLMMessage],
on_token=None,
images: Optional[List[str]] = None,
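interrupt_check=None) -> str | None:
"""
Streaming chat: used by the voice pipeline
Supports multimodal input (text + images)
Supports interruption: interrupt_check returns True when streaming should stop
"""
# Convert the message format, with multimodal support
# Images are attached only to the last user message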
payload_messages = []
last_user_idx = -1
for i, msg in enumerate(messages):
if msg.role == "user":
last_user_idx = i
has_images_in_message = False
for i, msg in enumerate(messages):
msg_dict = {"role": msg.role}
# If this is the last user message and images were supplied, build a multimodal content list
if i == last_user_idx and msg.role == "user" and images and len(images) > 0:
content_list = [{"type": "text", "text": msg.content}]
# Attach every image as a data URL
for img_idx, img_base64 in enumerate(images):
content_list.append({
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{img_base64}"
}
})
self._log("info", f"[Multimodal] Attached image #{img_idx+1} to the user message, base64 length: {len(img_base64)}")
msg_dict["content"] = content_list
has_images_in_message = True
else:
msg_dict["content"] = msg.content
payload_messages.append(msg_dict)
# Log multimodal details
if images and len(images) > 0:
if has_images_in_message:
# Inspect the content structure of the last user message
last_user_msg = payload_messages[last_user_idx] if last_user_idx >= 0 else None
if last_user_msg and isinstance(last_user_msg.get("content"), list):
content_items = last_user_msg["content"]
text_items = [item for item in content_items if item.get("type") == "text"]
image_items = [item for item in content_items if item.get("type") == "image_url"]
self._log("info", f"[Multimodal] Prepared multimodal request: {len(text_items)} text item(s) + {len(image_items)} image(s)")
self._log("debug", f"[Multimodal] User text: {text_items[0].get('text', '')[:50] if text_items else 'N/A'}")
else:
self._log("warning", "[Multimodal] Unexpected message format, cannot confirm that the images were attached")
else:
self._log("warning", f"[Multimodal] {len(images)} image(s) supplied but no user message was found, so the images were not attached")
else:
self._log("debug", "[Multimodal] Text-only request (no images)")
full_reply = ""
interrupted = False
stream = self.client.chat.completions.create(
model=self.model,
messages=payload_messages,
temperature=self.temperature,
max_tokens=self.max_tokens,
stream=True
)
for chunk in stream:
# Check the interrupt flag
if interrupt_check and interrupt_check():
self._log("info", "LLM streaming interrupted")
interrupted = True
break
if chunk.choices and chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
full_reply += content
if on_token:
on_token(content)
# Check again after on_token, since the callback may set the interrupt flag
if interrupt_check and interrupt_check():
self._log("info", "LLM streaming interrupted after the on_token callback")
interrupted = True
break
if interrupted:
return None # None signals that the reply was not completed
return full_reply.strip() if full_reply else None

View File

@@ -0,0 +1,5 @@
"""TTS models"""

View File

@@ -0,0 +1,14 @@
from robot_speaker.core.types import TTSRequest
class TTSClient:
"""Abstract base class for TTS clients"""
def synthesize(self, request: TTSRequest,
on_chunk=None,
interrupt_check=None) -> bool:
raise NotImplementedError

View File

@@ -0,0 +1,244 @@
"""
TTS speech synthesis module
"""
import subprocess
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat
from robot_speaker.core.types import TTSRequest
from robot_speaker.models.tts.base import TTSClient
class DashScopeTTSClient(TTSClient):
"""Wrapper around the DashScope streaming TTS client"""
def __init__(self, api_key: str,
model: str,
voice: str,
card_index: int,
device_index: int,
output_sample_rate: int = 44100,
output_channels: int = 2,
output_volume: float = 1.0,
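tts_source_sample_rate: int = 22050, # fixed sample rate of the audio returned by the TTS service
tts_source_channels: int = 1, # fixed channel count of the audio returned by the TTS service
tts_ffmpeg_thread_queue_size: int = 1024, # ffmpeg input thread queue size
reference_signal_buffer=None, # reference-signal buffer (for echo cancellation)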
logger=None):
dashscope.api_key = api_key
self.model = model
self.voice = voice
self.card_index = card_index
self.device_index = device_index
self.output_sample_rate = output_sample_rate
self.output_channels = output_channels
self.output_volume = output_volume
self.tts_source_sample_rate = tts_source_sample_rate
self.tts_source_channels = tts_source_channels
self.tts_ffmpeg_thread_queue_size = tts_ffmpeg_thread_queue_size
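self.reference_signal_buffer = reference_signal_buffer # reference-signal buffer
self.logger = logger
self.current_ffmpeg_pid = None # PID of the ffmpeg process currently playing
# Build the ALSA device name; the plughw plugin allows automatic resampling and channel conversion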
self.alsa_device = f"plughw:{card_index},{device_index}" if (
card_index >= 0 and device_index >= 0
) else "default"
def _log(self, level: str, msg: str):
"""Log a message by calling the ROS 2 logger method that matches the level"""
if self.logger:
# A ROS 2 logger cannot change severity dynamically, so call the matching method explicitly
if level == "debug":
self.logger.debug(msg)
elif level == "info":
self.logger.info(msg)
elif level == "warning":
self.logger.warning(msg)
elif level == "error":
self.logger.error(msg)
else:
self.logger.info(msg) # default to info
else:
print(f"[TTS] {msg}")
def synthesize(self, request: TTSRequest,
on_chunk=None,
interrupt_check=None) -> bool:
"""Main flow: stream the synthesis and play the audio"""
callback = _TTSCallback(self, interrupt_check, on_chunk, self.reference_signal_buffer)
# Fall back to the configured voice (self.voice) when request.voice is None or empty
voice_to_use = request.voice if request.voice and request.voice.strip() else self.voice
if not voice_to_use or not voice_to_use.strip():
self._log("error", f"Invalid voice parameter: '{voice_to_use}'")
return False
self._log("info", f"TTS started: text='{request.text[:50]}...', voice='{voice_to_use}'")
synthesizer = SpeechSynthesizer(
model=self.model,
voice=voice_to_use,
format=AudioFormat.PCM_22050HZ_MONO_16BIT, # must stay in sync with tts_source_sample_rate / tts_source_channels
callback=callback,
)
try:
synthesizer.streaming_call(request.text)
synthesizer.streaming_complete()
finally:
callback.cleanup()
return not callback._interrupted
class _TTSCallback(ResultCallback):
"""TTS callback handler: plays audio through ffmpeg, which handles sample-rate conversion"""
def __init__(self, tts_client: DashScopeTTSClient,
interrupt_check=None,
on_chunk=None,
reference_signal_buffer=None):
self.tts_client = tts_client
self.interrupt_check = interrupt_check
self.on_chunk = on_chunk
self.reference_signal_buffer = reference_signal_buffer # 参考信号缓冲区
self._proc = None
self._interrupted = False
self._cleaned_up = False
def on_open(self):
# Play through ffmpeg, which converts from the TTS source sample rate to the device sample rate.
# The TTS service emits a fixed sample rate and channel count; ffmpeg converts both for the output device.
ffmpeg_cmd = [
'ffmpeg',
'-f', 's16le', # raw PCM
'-ar', str(self.tts_client.tts_source_sample_rate), # TTS source sample rate (from config)
'-ac', str(self.tts_client.tts_source_channels), # TTS source channel count (from config)
# -thread_queue_size is an input option, so it must come before -i
'-thread_queue_size', str(self.tts_client.tts_ffmpeg_thread_queue_size),
'-i', 'pipe:0', # stdin
'-f', 'alsa', # output to ALSA
'-ar', str(self.tts_client.output_sample_rate), # output device sample rate (from config)
'-ac', str(self.tts_client.output_channels), # output device channel count (from config)
'-acodec', 'pcm_s16le', # output codec
'-fflags', 'nobuffer', # reduce buffering
'-flags', 'low_delay', # low latency
'-avioflags', 'direct', # try direct writes to ALSA to reduce latency
self.tts_client.alsa_device
]
# Add a volume filter when the volume is not 1.0
if self.tts_client.output_volume != 1.0:
# Insert "-af volume=..." just before the output codec option,
# that is, after the input and among the output options
acodec_idx = ffmpeg_cmd.index('-acodec')
ffmpeg_cmd.insert(acodec_idx, f'volume={self.tts_client.output_volume}')
ffmpeg_cmd.insert(acodec_idx, '-af')
self.tts_client._log("info", f"Starting ffmpeg playback: ALSA device={self.tts_client.alsa_device}, "
f"output sample rate={self.tts_client.output_sample_rate}Hz, "
f"output channels={self.tts_client.output_channels}, "
f"volume={self.tts_client.output_volume * 100:.0f}%")
self._proc = subprocess.Popen(
ffmpeg_cmd,
stdin=subprocess.PIPE,
stdout=subprocess.DEVNULL,
stderr=subprocess.PIPE # piped so that errors can be read back
)
# Record the ffmpeg process PID
self.tts_client.current_ffmpeg_pid = self._proc.pid
self.tts_client._log("debug", f"ffmpeg process started, PID={self._proc.pid}")
def on_complete(self):
pass
def on_error(self, message: str):
self.tts_client._log("error", f"TTS error: {message}")
def on_close(self):
self.cleanup()
def on_event(self, message):
pass
def on_data(self, data: bytes) -> None:
"""Receive audio data and play it"""
if self._interrupted:
return
if self.interrupt_check and self.interrupt_check():
# Stop playback only; the TTS request itself is not cancelled
self._interrupted = True
if self._proc:
self._proc.terminate()
return
# Write to ffmpeg first so that playback is not delayed
if self._proc and self._proc.stdin and not self._interrupted:
try:
self._proc.stdin.write(data)
self._proc.stdin.flush()
except BrokenPipeError:
# ffmpeg has probably exited; read back its error output
if self._proc.stderr:
error_msg = self._proc.stderr.read().decode('utf-8', errors='ignore')
self.tts_client._log("error", f"ffmpeg error: {error_msg}")
self._interrupted = True
# Feed the audio into the reference-signal buffer (for echo cancellation).
# This runs after the ffmpeg write so that it cannot delay playback.
if self.reference_signal_buffer and data:
try:
self.reference_signal_buffer.add_reference(
data,
source_sample_rate=self.tts_client.tts_source_sample_rate,
source_channels=self.tts_client.tts_source_channels
)
except Exception as e:
# A reference-signal failure must not affect playback
self.tts_client._log("warning", f"Reference-signal processing failed: {e}")
if self.on_chunk:
self.on_chunk(data)
def cleanup(self):
"""Release resources"""
if self._cleaned_up or not self._proc:
return
self._cleaned_up = True
# Close stdin so that ffmpeg can finish the remaining data
if self._proc.stdin and not self._proc.stdin.closed:
try:
self._proc.stdin.close()
except Exception:
pass
# Wait for ffmpeg to exit on its own, for at most 30 seconds.
# Long texts can take a while to finish playing.
if self._proc.poll() is None:
try:
self._proc.wait(timeout=30.0)
except subprocess.TimeoutExpired:
# Still running after the timeout: assume it is stuck and force it to stop
if self._proc.poll() is None:
self.tts_client._log("warning", "ffmpeg playback timed out, terminating")
try:
self._proc.terminate()
self._proc.wait(timeout=1.0)
except Exception:
try:
self._proc.kill()
self._proc.wait(timeout=0.1)
except Exception:
pass
# Clear the recorded PID
if self.tts_client.current_ffmpeg_pid == self._proc.pid:
self.tts_client.current_ffmpeg_pid = None
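
The shutdown sequence in `cleanup()` (close stdin, wait, then escalate) can be tried in isolation. The following is a minimal sketch, not the project's implementation: `stop_gracefully` is a hypothetical helper, and a small Python child process stands in for ffmpeg so the example runs without audio hardware.

```python
import subprocess
import sys


def stop_gracefully(proc, wait_s=5.0, term_s=1.0):
    """Close stdin and wait; escalate to terminate(), then kill().

    Returns "exited", "terminated" or "killed" depending on how the child ended.
    """
    if proc.stdin and not proc.stdin.closed:
        try:
            proc.stdin.close()  # EOF tells the child that no more audio is coming
        except OSError:
            pass  # e.g. BrokenPipeError if the child has already gone away
    try:
        proc.wait(timeout=wait_s)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.terminate()
    try:
        proc.wait(timeout=term_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return "killed"


# Stand-in for ffmpeg (assumption): a child that reads stdin until EOF, then exits.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"],
    stdin=subprocess.PIPE,
)
child.stdin.write(b"\x00" * 640)  # one 20 ms frame of 16 kHz mono int16 silence
child.stdin.flush()
result = stop_gracefully(child)
```

Catching `subprocess.TimeoutExpired` explicitly keeps unrelated errors visible, which a bare `except:` would hide.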


@@ -0,0 +1,5 @@
"""感知层"""


@@ -0,0 +1,304 @@
"""
音频处理模块:录音 + VAD + 回声消除
"""
import time
import pyaudio
import webrtcvad
import struct
import queue
from .echo_cancellation import EchoCanceller, ReferenceSignalBuffer
class VADDetector:
"""VAD语音检测器"""
def __init__(self, mode: int, sample_rate: int):
self.vad = webrtcvad.Vad(mode)
self.sample_rate = sample_rate
class AudioRecorder:
"""音频录音器 - 录音线程"""
def __init__(self, device_index: int, sample_rate: int, channels: int,
chunk: int, vad_detector: VADDetector,
audio_queue: queue.Queue, # 音频队列:录音线程 → ASR线程
silence_duration_ms: int = 1000,
min_energy_threshold: int = 300, # 音频能量 > 300有语音
heartbeat_interval: float = 2.0,
on_heartbeat=None,
is_playing=None,
on_new_segment=None, # 检测到新的人声段
on_speech_start=None, # 检测到人声开始
on_speech_end=None, # 检测到静音结束(说话结束)
stop_flag=None,
on_audio_chunk=None, # 音频chunk回调用于声纹录音等可选
should_put_to_queue=None, # 检查是否应该将音频放入队列用于阻止ASR可选
get_silence_threshold=None, # 获取动态静音阈值(毫秒,可选)
enable_echo_cancellation: bool = True, # 是否启用回声消除
reference_signal_buffer: ReferenceSignalBuffer = None, # 参考信号缓冲区(可选)
logger=None):
self.device_index = device_index
self.sample_rate = sample_rate
self.channels = channels
self.chunk = chunk
self.vad_detector = vad_detector
self.audio_queue = audio_queue
self.silence_duration_ms = int(silence_duration_ms)
self.min_energy_threshold = int(min_energy_threshold)
self.heartbeat_interval = heartbeat_interval
self.on_heartbeat = on_heartbeat
self.is_playing = is_playing or (lambda: False)
self.on_new_segment = on_new_segment
self.on_speech_start = on_speech_start
self.on_speech_end = on_speech_end
self.stop_flag = stop_flag or (lambda: False)
self.on_audio_chunk = on_audio_chunk # 音频chunk回调用于声纹录音等
self.should_put_to_queue = should_put_to_queue or (lambda: True) # 默认允许放入队列
self.get_silence_threshold = get_silence_threshold # 动态静音阈值回调
self.logger = logger
self.audio = pyaudio.PyAudio()
# 自动查找 iFLYTEK 麦克风设备
try:
count = self.audio.get_device_count()
found_index = -1
if self.logger:
self.logger.info(f"开始扫描音频设备 (总数: {count})...")
for i in range(count):
device_info = self.audio.get_device_info_by_index(i)
device_name = device_info.get('name', '')
max_input_channels = device_info.get('maxInputChannels', 0)
if self.logger:
try:
self.logger.info(f"扫描设备 [{i}]: Name='{device_name}', MaxInput={max_input_channels}, Rate={int(device_info.get('defaultSampleRate'))}")
except:
pass
# Match devices whose name contains "iFLYTEK" and that support recording (input channels > 0)
if 'iFLYTEK' in device_name and max_input_channels > 0:
found_index = i
if self.logger:
self.logger.info(f"已自动定位到麦克风设备: {device_name} (Index: {i})")
break
if found_index != -1:
self.device_index = found_index
else:
if self.logger:
self.logger.warning(f"未自动检测到 iFLYTEK 设备,将继续使用配置的索引: {self.device_index}")
except Exception as e:
if self.logger:
self.logger.error(f"设备自动检测过程出错: {e}")
self.format = pyaudio.paInt16
self._debug_counter = 0
# 回声消除相关
self.enable_echo_cancellation = enable_echo_cancellation
self.reference_signal_buffer = reference_signal_buffer
if enable_echo_cancellation:
# 初始化回声消除器(在录音线程中同步处理,不是单独线程)
# frame_size设置为chunk大小确保每次处理一个chunk
frame_size = chunk
try:
# 获取参考信号声道数从reference_signal_buffer获取因为它是根据播放声道数创建的
ref_channels = self.reference_signal_buffer.channels if self.reference_signal_buffer else 1
self.echo_canceller = EchoCanceller(
sample_rate=sample_rate,
frame_size=frame_size,
channels=self.channels, # 麦克风输入1声道
ref_channels=ref_channels, # 参考信号播放声道数2声道
logger=logger
)
if self.echo_canceller.aec is not None:
if logger:
logger.info(f"回声消除器已启用: sample_rate={sample_rate}, frame_size={frame_size}")
else:
if logger:
logger.warning("回声消除器初始化失败,将禁用回声消除功能")
self.enable_echo_cancellation = False
self.echo_canceller = None
except Exception as e:
if logger:
logger.warning(f"回声消除器初始化失败: {e},将禁用回声消除功能")
self.enable_echo_cancellation = False
self.echo_canceller = None
else:
self.echo_canceller = None
def record_with_vad(self):
"""录音线程VAD + 能量检测"""
if self.on_heartbeat:
self.on_heartbeat()
try:
stream = self.audio.open(
format=self.format,
channels=self.channels,
rate=self.sample_rate,
input=True,
input_device_index=self.device_index if self.device_index >= 0 else None,
frames_per_buffer=self.chunk
)
except Exception as e:
raise RuntimeError(f"无法打开音频输入设备: {e}")
# VAD检测窗口, 最快 0.5s 内发现说话
window_sec = 0.5
# 连续 1s 没有检测到语音,就判定为静音状态
no_speech_threshold = max(self.silence_duration_ms / 1000.0, 0.1)
last_heartbeat_time = time.time()
audio_buffer = [] # VAD 滑动窗口
last_active_time = time.time() # 静音计时基准
in_speech_segment = False # 是否处于语音段中(从检测到人声开始,直到静音超时结束)
try:
while not self.stop_flag():
# exception_on_overflow=False, 宁可丢帧,也不阻塞
data = stream.read(self.chunk, exception_on_overflow=False)
# 回声消除处理
processed_data = data
if self.enable_echo_cancellation and self.echo_canceller and self.reference_signal_buffer:
try:
# 获取参考信号(长度与麦克风信号匹配)
ref_signal = self.reference_signal_buffer.get_reference(num_samples=self.chunk)
# 执行回声消除
processed_data = self.echo_canceller.process(data, ref_signal)
except Exception as e:
if self.logger:
self.logger.warning(f"回声消除处理失败: {e},使用原始音频")
processed_data = data
# 检查是否应该将音频放入队列用于阻止ASR例如无声纹文件时需要注册
if self.should_put_to_queue():
# 队列满时丢弃最旧的数据ASR 跟不上时系统仍然听得见
if self.audio_queue.full():
self.audio_queue.get_nowait()
# 使用处理后的音频数据(经过回声消除)
self.audio_queue.put_nowait(processed_data)
# 音频chunk回调用于声纹录音等仅在需要时调用
if self.on_audio_chunk:
# 回调使用处理后的音频数据
self.on_audio_chunk(processed_data)
# VAD检测使用处理后的音频经过回声消除
audio_buffer.append(processed_data) # 只用于 VAD不用于 ASR
# VAD检测窗口
now = time.time()
if len(audio_buffer) * self.chunk / self.sample_rate >= window_sec:
raw_audio = b''.join(audio_buffer)
energy = self._calculate_energy(raw_audio)
vad_result = self._check_activity(raw_audio)
self._debug_counter += 1
if self._debug_counter >= 10:
if self.logger:
self.logger.info(f"[VAD调试] 能量={energy:.1f}, 阈值={self.min_energy_threshold}, VAD结果={vad_result}")
self._debug_counter = 0
if vad_result:
last_active_time = now
if not in_speech_segment: # 上一轮没说话,本轮开始说话
in_speech_segment = True
if self.on_speech_start:
self.on_speech_start()
# 检测当前 TTS 是否在播放
if self.is_playing() and self.on_new_segment:
self.on_new_segment() # 打断 TTS的回调
else:
if in_speech_segment:
# 处于语音段中,但当前帧为静音,检查静音时长
silence_duration = now - last_active_time
# 动态获取静音阈值(如果提供回调函数)
if self.get_silence_threshold:
current_silence_ms = self.get_silence_threshold()
current_no_speech_threshold = max(current_silence_ms / 1000.0, 0.1)
else:
current_no_speech_threshold = no_speech_threshold
# 添加调试日志
if self.logger and silence_duration < current_no_speech_threshold:
self.logger.debug(f"[VAD] 静音中: {silence_duration:.3f}秒 < {current_no_speech_threshold:.3f}秒阈值")
if silence_duration >= current_no_speech_threshold:
if self.on_speech_end:
if self.logger:
self.logger.debug(f"[VAD] 触发speech_end: 静音持续时间 {silence_duration:.3f}秒 >= 阈值 {current_no_speech_threshold:.3f}")
self.on_speech_end() # 通知系统用户停止说话
in_speech_segment = False
if self.on_heartbeat and now - last_heartbeat_time >= self.heartbeat_interval:
self.on_heartbeat()
last_heartbeat_time = now
audio_buffer = []
finally:
if stream.is_active():
stream.stop_stream()
stream.close()
@staticmethod
def _calculate_energy(audio_chunk: bytes) -> float:
"""计算音频能量RMS"""
if not audio_chunk:
return 0.0
# 计算样本数:音频字节数 // 2因为是16位PCM1个样本=2字节
n = len(audio_chunk) // 2
if n <= 0:
return 0.0
# 把字节数据解包为16位有符号整数小端序
samples = struct.unpack(f'<{n}h', audio_chunk[: n * 2])
if not samples:
return 0.0
return (sum(s * s for s in samples) / len(samples)) ** 0.5
def _check_activity(self, audio_data: bytes) -> bool:
"""VAD + 能量检测先VAD检测能量作为辅助判断"""
energy = self._calculate_energy(audio_data)
rate = 0.4 # 连续人声经验值
num = 0
# 采样率:16000 Hz, 帧时长:20ms=0.02s, 每帧采样点数=16000×0.02=320samples
# 每帧字节数=320×2=640bytes
bytes_per_sample = 2 # paInt16
frame_samples = int(self.sample_rate * 0.02)
frame_bytes = frame_samples * bytes_per_sample
if frame_bytes <= 0 or len(audio_data) < frame_bytes:
return False
total_frames = len(audio_data) // frame_bytes
required = max(1, int(total_frames * rate))
for i in range(0, len(audio_data), frame_bytes):
chunk = audio_data[i:i + frame_bytes]
if len(chunk) == frame_bytes:
if self.vad_detector.vad.is_speech(chunk, sample_rate=self.sample_rate):
num += 1
# 语音开头能量高, 中后段(拖音、尾音)能量下降
vad_result = num >= required
if vad_result and energy < self.min_energy_threshold * 0.5:
return False
return vad_result
def cleanup(self):
"""清理资源"""
if hasattr(self, 'audio') and self.audio:
self.audio.terminate()
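
The segment logic in `record_with_vad` (a speech segment starts on the first active window and ends once silence has lasted longer than the threshold) can be expressed as a small pure function. This is a sketch for illustration only: `segment_events` and its input format are assumptions, not part of the package.

```python
def segment_events(windows, silence_threshold_s=1.0):
    """Turn (timestamp, is_speech) windows into speech_start / speech_end events."""
    events = []
    in_segment = False
    last_active = None
    for now, is_speech in windows:
        if is_speech:
            last_active = now  # every active window resets the silence timer
            if not in_segment:
                in_segment = True
                events.append(("speech_start", now))
        elif in_segment and now - last_active >= silence_threshold_s:
            # Silence has lasted long enough: the speaker is considered done.
            in_segment = False
            events.append(("speech_end", now))
    return events


# One VAD decision every 0.5 s, matching the 0.5 s detection window.
windows = [(0.0, False), (0.5, True), (1.0, True),
           (1.5, False), (2.0, False), (2.5, False)]
events = segment_events(windows, silence_threshold_s=1.0)
```

With a 0.5 s window, the end of speech is detected at the first window boundary at or after the threshold, so the effective latency is the threshold rounded up to a multiple of the window length.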


@@ -0,0 +1,131 @@
"""
相机模块 - RealSense相机封装
"""
import numpy as np
import contextlib
class CameraClient:
def __init__(self,
serial_number: str | None,
width: int,
height: int,
fps: int,
format: str,
logger=None):
self.serial_number = serial_number
self.width = width
self.height = height
self.fps = fps
self.format = format
self.logger = logger
self.pipeline = None
self.config = None
self._is_initialized = False
self._rs = None
def _log(self, level: str, msg: str):
if self.logger:
getattr(self.logger, level, self.logger.info)(msg)
else:
print(f"[相机] {msg}")
def initialize(self) -> bool:
"""
初始化并启动相机管道
"""
if self._is_initialized:
return True
try:
import pyrealsense2 as rs
self._rs = rs
self.pipeline = rs.pipeline()
self.config = rs.config()
if self.serial_number:
self.config.enable_device(self.serial_number)
self.config.enable_stream(
rs.stream.color,
self.width,
self.height,
rs.format.rgb8 if self.format == 'RGB8' else rs.format.bgr8,
self.fps
)
self.pipeline.start(self.config)
self._is_initialized = True
self._log("info", f"相机已启动并保持运行: {self.width}x{self.height}@{self.fps}fps")
return True
except Exception as e:
self._log("error", f"相机初始化失败: {e}")
self.cleanup()
return False
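    def cleanup(self):
        """Stop the camera pipeline and release resources."""
        if self.pipeline:
            try:
                # stop() can raise if the pipeline was never started, for example
                # when initialize() failed before pipeline.start() and then called cleanup().
                self.pipeline.stop()
                self._log("info", "Camera stopped")
            except Exception as e:
                self._log("warning", f"Failed to stop camera pipeline: {e}")
        self.pipeline = None
        self.config = None
        self._is_initialized = False
    def capture_rgb(self) -> np.ndarray | None:
        """
        Capture one color frame from the running camera pipeline.
        The channel order follows the configured format (RGB8, otherwise BGR8).
        """
        if not self._is_initialized:
            self._log("error", "Camera is not initialized; cannot capture an image")
            return None
        try:
            frames = self.pipeline.wait_for_frames()
            color_frame = frames.get_color_frame()
            if not color_frame:
                # A frameset may arrive without a color frame; treat it as a failed capture.
                self._log("warning", "No color frame received")
                return None
            return np.asanyarray(color_frame.get_data())
        except Exception as e:
            self._log("error", f"Failed to capture image: {e}")
            return None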
@contextlib.contextmanager
def capture_context(self):
"""
上下文管理器:拍照并自动清理资源
"""
image_data = self.capture_rgb()
try:
yield image_data
finally:
if image_data is not None:
del image_data
def capture_multiple(self, count: int = 1) -> list[np.ndarray]:
"""
捕获多张图像(为未来扩展准备)
"""
images = []
for i in range(count):
img = self.capture_rgb()
if img is not None:
images.append(img)
else:
self._log("warning", f"Failed to capture image {i+1}")
return images
@contextlib.contextmanager
def capture_multiple_context(self, count: int = 1):
"""
上下文管理器:捕获多张图像并自动清理资源
"""
images = self.capture_multiple(count)
try:
yield images
finally:
for img in images:
del img
images.clear()


@@ -0,0 +1,98 @@
import collections
import threading
import numpy as np
class ReferenceSignalBuffer:
    """Reference signal buffer: holds the most recent playback audio for echo cancellation."""
    def __init__(self, sample_rate: int, channels: int, max_duration_ms: int | None = None,
                 buffer_seconds: float = 5.0):
        self.sample_rate = int(sample_rate)
        self.channels = int(channels)
        if max_duration_ms is not None:
            buffer_seconds = max(float(max_duration_ms) / 1000.0, 0.1)
        self.max_samples = int(self.sample_rate * buffer_seconds)
        self._buffer = collections.deque(maxlen=self.max_samples * self.channels)
        # add_reference() runs on the TTS playback thread and get_reference() on the
        # recording thread; copying a deque while another thread extends it can raise
        # RuntimeError, so both sides take this lock.
        self._lock = threading.Lock()
    def add_reference(self, data: bytes, source_sample_rate: int, source_channels: int):
        # Audio in a different format is dropped, not resampled.
        if source_sample_rate != self.sample_rate or source_channels != self.channels:
            return
        samples = np.frombuffer(data, dtype=np.int16)
        with self._lock:
            self._buffer.extend(samples.tolist())
    def get_reference(self, num_samples: int) -> bytes:
        needed = int(num_samples) * self.channels
        if needed <= 0:
            return b""
        with self._lock:
            data = list(self._buffer)
        if len(data) < needed:
            # Not enough playback history yet: pad with silence.
            data = data + [0] * (needed - len(data))
        else:
            data = data[-needed:]
        return np.array(data, dtype=np.int16).tobytes()
class EchoCanceller:
"""回声消除器(基于 aec-audio-processing"""
def __init__(self, sample_rate: int, frame_size: int, channels: int, ref_channels: int, logger=None):
self.sample_rate = int(sample_rate)
self.frame_size = int(frame_size)
self.channels = int(channels)
self.ref_channels = int(ref_channels)
self.logger = logger
self.aec = None
self._process_reverse = None
self._frame_bytes = int(self.sample_rate / 100) * self.channels * 2 # 10ms, int16
self._ref_frame_bytes = int(self.sample_rate / 100) * self.ref_channels * 2
try:
from aec_audio_processing import AudioProcessor
self.aec = AudioProcessor(enable_aec=True, enable_ns=False, enable_agc=False)
self.aec.set_stream_format(self.sample_rate, self.channels)
if hasattr(self.aec, "set_reverse_stream_format"):
self.aec.set_reverse_stream_format(self.sample_rate, self.ref_channels)
if hasattr(self.aec, "set_stream_delay"):
self.aec.set_stream_delay(0)
if hasattr(self.aec, "process_reverse_stream"):
self._process_reverse = self.aec.process_reverse_stream
elif hasattr(self.aec, "process_reverse"):
self._process_reverse = self.aec.process_reverse
except Exception:
self.aec = None
def process(self, mic_data: bytes, ref_data: bytes) -> bytes:
if not self.aec:
return mic_data
if not mic_data:
return mic_data
try:
out_chunks = []
total_len = len(mic_data)
frame_bytes = self._frame_bytes
ref_frame_bytes = self._ref_frame_bytes
frame_count = (total_len + frame_bytes - 1) // frame_bytes
for i in range(frame_count):
m_start = i * frame_bytes
m_end = m_start + frame_bytes
mic_frame = mic_data[m_start:m_end]
if len(mic_frame) < frame_bytes:
mic_frame = mic_frame + b"\x00" * (frame_bytes - len(mic_frame))
if ref_data:
r_start = i * ref_frame_bytes
r_end = r_start + ref_frame_bytes
ref_frame = ref_data[r_start:r_end]
if len(ref_frame) < ref_frame_bytes:
ref_frame = ref_frame + b"\x00" * (ref_frame_bytes - len(ref_frame))
if self._process_reverse:
self._process_reverse(ref_frame)
processed = self.aec.process_stream(mic_frame)
out_chunks.append(processed if processed is not None else mic_frame)
return b"".join(out_chunks)[:total_len]
except Exception as e:
if self.logger:
self.logger.warning(f"回声消除处理失败: {e},使用原始音频")
return mic_data
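
`EchoCanceller.process` cuts each microphone chunk into 10 ms frames and zero-pads the last one. The sketch below shows only that framing arithmetic; `split_frames` is a hypothetical helper and does not call the AEC library.

```python
def split_frames(pcm: bytes, sample_rate: int, channels: int = 1, frame_ms: int = 10):
    """Split int16 PCM into fixed-size frames, zero-padding the final frame."""
    # Samples per frame, times channels, times 2 bytes per int16 sample.
    frame_bytes = sample_rate * frame_ms // 1000 * channels * 2
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if len(frame) < frame_bytes:
            frame += b"\x00" * (frame_bytes - len(frame))
        frames.append(frame)
    return frames


chunk = b"\x01\x00" * 400          # 400 mono samples = 25 ms at 16 kHz
frames = split_frames(chunk, 16000)
restored = b"".join(frames)[:len(chunk)]  # trimming the padding restores the input length
```

At 16 kHz mono a 10 ms frame is 320 bytes, so a 25 ms chunk yields three frames, the last of which is half padding. Choosing a `chunk` size that is a multiple of 10 ms avoids feeding padded frames to the canceller.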


@@ -0,0 +1,304 @@
"""
声纹识别模块
"""
import numpy as np
import threading
import tempfile
import os
import wave
import time
import json
from enum import Enum
class SpeakerState(Enum):
"""说话人识别状态"""
UNKNOWN = "unknown"
VERIFIED = "verified"
REJECTED = "rejected"
ERROR = "error"
class SpeakerVerificationClient:
"""声纹识别客户端 - 非实时、低频处理"""
def __init__(self, model_path: str, threshold: float, speaker_db_path: str = None, logger=None):
self.model_path = model_path
self.threshold = threshold
self.speaker_db_path = speaker_db_path
self.logger = logger
self.speaker_db = {} # {speaker_id: {"embedding": np.ndarray, "env": str, "threshold": float, "registered_at": float}}
self._lock = threading.Lock()
# 优化CPU性能限制Torch使用的线程数防止多线程竞争导致性能骤降
import torch
torch.set_num_threads(1)
from funasr import AutoModel
model_path = os.path.expanduser(self.model_path)
# 禁用自动更新检查,防止每次初始化都联网检查
self.model = AutoModel(model=model_path, device="cpu", disable_update=True)
if self.logger:
self.logger.info(f"声纹模型已加载: {model_path}, 阈值: {self.threshold}")
if self.speaker_db_path:
self.load_speakers()
def _log(self, level: str, msg: str):
"""记录日志 - 修复ROS2 logger在多线程环境中的问题"""
if self.logger:
try:
log_methods = {
"debug": self.logger.debug,
"info": self.logger.info,
"warning": self.logger.warning,
"error": self.logger.error,
"fatal": self.logger.fatal
}
log_method = log_methods.get(level.lower(), self.logger.info)
log_method(msg)
except ValueError as e:
if "severity cannot be changed" in str(e):
try:
self.logger.info(f"[声纹-{level.upper()}] {msg}")
except:
print(f"[声纹-{level.upper()}] {msg}")
else:
raise
else:
print(f"[声纹] {msg}")
def _write_temp_wav(self, audio_data: np.ndarray, sample_rate: int = 16000):
"""将numpy音频数组写入临时wav文件"""
audio_int16 = audio_data.astype(np.int16)
fd, temp_path = tempfile.mkstemp(suffix='.wav', prefix='sv_')
os.close(fd)
with wave.open(temp_path, 'wb') as wav_file:
wav_file.setnchannels(1)
wav_file.setsampwidth(2)
wav_file.setframerate(sample_rate)
wav_file.writeframes(audio_int16.tobytes())
return temp_path
def extract_embedding(self, audio_data: np.ndarray, sample_rate: int = 16000):
"""
提取说话人embedding低频调用一句话只调用一次
"""
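        # Downsample towards 16000 Hz if needed.
        # CAM++ and similar models usually expect 16 kHz input; passing 48 kHz makes
        # internal resampling very slow or greatly increases the compute cost.
        target_sr = 16000
        if sample_rate > target_sr:
            # Naive decimation (no anti-aliasing filter): keep every `step`-th sample.
            step = sample_rate // target_sr
            audio_data = audio_data[::step]
            # The effective rate after decimation is sample_rate / step. This equals
            # 16 kHz only for integer multiples such as 48 kHz. For other rates
            # (e.g. 44.1 kHz -> 22.05 kHz) the real rate is kept so the WAV header is
            # correct; any remaining resampling is left to funasr.
            sample_rate = sample_rate // step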
if len(audio_data) < int(sample_rate * 0.5):
return None, False
temp_wav_path = None
try:
# 限制Torch在推理时使用单线程避免在多任务环境下尤其是一边录音一边识别
# 出现的极端CPU竞争和上下文切换开销
import torch
with torch.inference_mode():
# 临时设置,虽然全局已经设置了,但在调用前再次确保
# 注意set_num_threads 是全局的,这里再次确认
if torch.get_num_threads() != 1:
torch.set_num_threads(1)
temp_wav_path = self._write_temp_wav(audio_data, sample_rate)
result = self.model.generate(input=temp_wav_path)
embedding = result[0]['spk_embedding'].detach().cpu().numpy()[0] # shape [1, 192] -> [192]
embedding_dim = len(embedding)
if embedding_dim == 0:
return None, False
return embedding, True
except Exception as e:
self._log("error", f"提取embedding失败: {e}")
return None, False
finally:
if temp_wav_path and os.path.exists(temp_wav_path):
try:
os.unlink(temp_wav_path)
except:
pass
def register_speaker(self, speaker_id: str, embedding: np.ndarray,
env: str = "near", threshold: float = None) -> bool:
"""
注册说话人
"""
embedding_dim = len(embedding)
if embedding_dim == 0:
return False
embedding_norm = np.linalg.norm(embedding)
if embedding_norm == 0:
self._log("error", f"注册失败embedding范数为0")
return False
embedding_normalized = embedding / embedding_norm
speaker_threshold = threshold if threshold is not None else self.threshold
with self._lock:
self.speaker_db[speaker_id] = {
"embedding": embedding_normalized,
"env": env, # 添加 env 字段
"threshold": speaker_threshold,
"registered_at": time.time()
}
self._log("info", f"已注册说话人: {speaker_id}, 阈值: {speaker_threshold:.3f}, 维度: {embedding_dim}")
save_result = self.save_speakers()
if not save_result:
self._log("warning", f"Failed to save the speaker database; speaker is registered in memory only: {speaker_id}")
return True
def match_speaker(self, embedding: np.ndarray):
"""
匹配说话人(一句话只调用一次)
"""
if not self.speaker_db:
return None, SpeakerState.UNKNOWN, 0.0, self.threshold
embedding_dim = len(embedding)
if embedding_dim == 0:
return None, SpeakerState.ERROR, 0.0, self.threshold
embedding_norm = np.linalg.norm(embedding)
if embedding_norm == 0:
return None, SpeakerState.ERROR, 0.0, self.threshold
embedding_normalized = embedding / embedding_norm
best_match = None
best_score = -1.0
best_threshold = self.threshold
with self._lock:
for speaker_id, speaker_data in self.speaker_db.items():
ref_embedding = speaker_data["embedding"]
score = np.dot(embedding_normalized, ref_embedding)
if score > best_score:
best_score = score
best_match = speaker_id
best_threshold = speaker_data["threshold"]
state = SpeakerState.VERIFIED if best_score >= best_threshold else SpeakerState.REJECTED
return (best_match, state, best_score, best_threshold)
def is_available(self) -> bool:
return self.model is not None
def cleanup(self):
"""清理资源"""
pass
def get_speaker_count(self) -> int:
with self._lock:
return len(self.speaker_db)
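    def remove_speaker(self, speaker_id: str) -> bool:
        with self._lock:
            if speaker_id not in self.speaker_db:
                return False
            del self.speaker_db[speaker_id]
        # save_speakers() acquires self._lock itself. threading.Lock is not
        # re-entrant, so it must be called after the lock has been released.
        self.save_speakers()
        return True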
def load_speakers(self) -> bool:
"""
从文件加载已注册的声纹
"""
if not self.speaker_db_path:
return False
if not os.path.exists(self.speaker_db_path):
self._log("info", f"声纹数据库文件不存在: {self.speaker_db_path},将创建新数据库")
return False
try:
with open(self.speaker_db_path, 'r', encoding='utf-8') as f:
data = json.load(f)
with self._lock:
for speaker_id, speaker_data in data.items():
embedding_list = speaker_data["embedding"]
embedding_array = np.array(embedding_list, dtype=np.float32)
embedding_dim = len(embedding_array)
if embedding_dim == 0:
self._log("warning", f"跳过无效声纹: {speaker_id} (维度为0)")
continue
embedding_norm = np.linalg.norm(embedding_array)
if embedding_norm > 0:
embedding_array = embedding_array / embedding_norm
self.speaker_db[speaker_id] = {
"embedding": embedding_array,
"env": speaker_data["env"],
"threshold": speaker_data["threshold"],
"registered_at": speaker_data["registered_at"]
}
count = len(self.speaker_db)
self._log("info", f"已加载 {count} 个已注册说话人")
return True
except Exception as e:
self._log("error", f"加载声纹数据库失败: {e}")
return False
def save_speakers(self) -> bool:
"""
保存已注册的声纹到文件
"""
if not self.speaker_db_path:
self._log("warning", "声纹数据库路径未配置,无法保存到文件(说话人已注册到内存)")
return False
try:
db_dir = os.path.dirname(self.speaker_db_path)
if db_dir and not os.path.exists(db_dir):
os.makedirs(db_dir, exist_ok=True)
json_data = {}
with self._lock:
for speaker_id, speaker_data in self.speaker_db.items():
json_data[speaker_id] = {
"embedding": speaker_data["embedding"].tolist(), # numpy array -> list
"env": speaker_data.get("env", "near"), # 兼容旧数据,默认使用 "near"
"threshold": speaker_data["threshold"],
"registered_at": speaker_data["registered_at"]
}
temp_path = self.speaker_db_path + ".tmp"
with open(temp_path, 'w', encoding='utf-8') as f:
json.dump(json_data, f, indent=2, ensure_ascii=False)
os.replace(temp_path, self.speaker_db_path)
self._log("info", f"已保存 {len(json_data)} 个说话人到: {self.speaker_db_path}")
return True
except Exception as e:
import traceback
self._log("error", f"保存声纹数据库失败: {e}")
self._log("error", f"保存路径: {self.speaker_db_path}")
self._log("error", f"错误详情: {traceback.format_exc()}")
temp_path = self.speaker_db_path + ".tmp"
if os.path.exists(temp_path):
try:
os.unlink(temp_path)
except:
pass
return False
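
`match_speaker` scores a query embedding against each enrolled embedding with cosine similarity (the dot product of L2-normalized vectors) and compares the best score with that speaker's threshold. The following is a small standard-library sketch of the same idea. The names, the 2-D vectors, and the thresholds are made up for illustration; real embeddings would come from the model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("zero-norm embedding")
    return [x / norm for x in vec]


def best_match(query, enrolled, default_threshold=0.5):
    """Return (speaker_id, score, verified) for the closest enrolled speaker."""
    q = normalize(query)
    best_id, best_score, best_thr = None, -1.0, default_threshold
    for speaker_id, (ref, threshold) in enrolled.items():
        # Cosine similarity: dot product of two unit vectors.
        score = sum(a * b for a, b in zip(q, normalize(ref)))
        if score > best_score:
            best_id, best_score, best_thr = speaker_id, score, threshold
    return best_id, best_score, best_score >= best_thr


# Toy 2-D "embeddings" with a per-speaker threshold (illustrative values only).
enrolled = {"alice": ([1.0, 0.0], 0.7), "bob": ([0.0, 1.0], 0.7)}
speaker, score, verified = best_match([3.0, 4.0], enrolled)
```

Because the vectors are normalized first, the score does not depend on how loud or long the utterance was, only on the direction of the embedding.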


@@ -1,55 +0,0 @@
import rclpy
from rclpy.node import Node
from example_interfaces.msg import String
import threading
from queue import Queue
import time
import espeakng
import pyttsx3
class RobotSpeakerNode(Node):
def __init__(self, node_name):
super().__init__(node_name)
self.novels_queue_ = Queue()
self.novel_subscriber_ = self.create_subscription(
String, 'robot_msg', self.novel_callback, 10)
self.speech_thread_ = threading.Thread(target=self.speak_thread)
self.speech_thread_.start()
def novel_callback(self, msg):
self.novels_queue_.put(msg.data)
def speak_thread(self):
# 初始化引擎
engine = pyttsx3.init()
# 调整参数
engine.setProperty('rate', 150) # 语速150更自然
engine.setProperty('volume', 1.0) # 音量0.0-1.0
# 选择中文音色(修正:使用 languages 属性,且是列表)
voices = engine.getProperty('voices')
for voice in voices:
# 检查语音支持的语言列表中是否包含中文('zh' 或 'zh-CN' 等)
if any('zh' in lang for lang in voice.languages):
engine.setProperty('voice', voice.id)
self.get_logger().info(f'已选择中文语音:{voice.id}')
break
else:
self.get_logger().warning('未找到中文语音库,将使用默认语音')
while rclpy.ok():
if self.novels_queue_.qsize() > 0:
text = self.novels_queue_.get()
engine.say(text)
engine.runAndWait() # 等待语音播放完成
else:
time.sleep(0.5)
def main(args=None):
rclpy.init(args=args)
node = RobotSpeakerNode("robot_speaker_node")
rclpy.spin(node)
rclpy.shutdown()


@@ -0,0 +1,5 @@
"""理解层"""


@@ -0,0 +1,111 @@
"""
对话历史管理模块
"""
from robot_speaker.core.types import LLMMessage
import threading
class ConversationHistory:
"""对话历史管理器 - 实时语音"""
def __init__(self, max_history: int, summary_trigger: int):
self.max_history = max_history
self.summary_trigger = summary_trigger
self.conversation_history: list[LLMMessage] = []
self.summary: str | None = None
# 待确认机制
self._pending_user_message: LLMMessage | None = None # 待确认的用户消息
self._lock = threading.Lock() # 线程安全锁
def start_turn(self, user_content: str):
"""开始一个新的对话轮次,暂存用户消息等待LLM完成后确认写入历史"""
with self._lock:
self._pending_user_message = LLMMessage(role="user", content=user_content)
    def commit_turn(self, assistant_content: str) -> bool:
        """Confirm that the current turn is complete and write the user and assistant messages to the history."""
with self._lock:
if self._pending_user_message is None:
return False
if not assistant_content or not assistant_content.strip():
self._pending_user_message = None
return False
self.conversation_history.append(self._pending_user_message)
self.conversation_history.append(
LLMMessage(role="assistant", content=assistant_content.strip())
)
self._pending_user_message = None
self._maybe_compress()
return True
def cancel_turn(self):
"""取消当前待确认的轮次,丢弃待确认的用户消息,用于处理中断情况,防止不完整内容污染历史"""
with self._lock:
if self._pending_user_message is not None:
self._pending_user_message = None
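    def add_message(self, role: str, content: str):
        """Append a message directly."""
        with self._lock:
            # Drop any pending turn first. This is done inline rather than through
            # cancel_turn(), which would try to re-acquire the non-reentrant lock
            # held here and deadlock.
            self._pending_user_message = None
            self.conversation_history.append(LLMMessage(role=role, content=content))
            self._maybe_compress()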
def get_messages(self) -> list[LLMMessage]:
"""获取消息列表"""
with self._lock:
messages = []
if self.summary:
messages.append(LLMMessage(role="system", content=self.summary))
if self.max_history > 0:
messages.extend(self.conversation_history[-self.max_history * 2:])
if self._pending_user_message is not None:
messages.append(self._pending_user_message)
return messages
def has_pending_turn(self) -> bool:
"""检查是否有待确认的轮次"""
with self._lock:
return self._pending_user_message is not None
def _maybe_compress(self):
"""压缩对话历史"""
if self.max_history <= 0:
self.conversation_history.clear()
return
max_len = self.summary_trigger * 2
if len(self.conversation_history) <= max_len:
return
old = self.conversation_history[:-max_len]
self.conversation_history = self.conversation_history[-max_len:]
summary_text = []
for msg in old:
summary_text.append(f"{msg.role}: {msg.content}")
compressed = "对话摘要:\n" + "\n".join(summary_text[-10:])
if self.summary:
self.summary += "\n" + compressed
else:
self.summary = compressed
def clear(self):
"""清空历史和待确认消息"""
with self._lock:
self.conversation_history.clear()
self.summary = None
self._pending_user_message = None
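
The pending-turn mechanism keeps an interrupted exchange out of the history: the user message is held back until the assistant reply is complete. Below is a stripped-down sketch of that pattern. `TurnLog` is a hypothetical stand-in that uses plain tuples instead of `LLMMessage` and omits summarization.

```python
import threading


class TurnLog:
    """Minimal two-phase turn log: start_turn() stages, commit_turn() persists."""

    def __init__(self):
        self._lock = threading.Lock()
        self.history = []
        self._pending = None

    def start_turn(self, user_text):
        with self._lock:
            self._pending = ("user", user_text)

    def commit_turn(self, assistant_text):
        with self._lock:
            if self._pending is None or not assistant_text.strip():
                self._pending = None
                return False
            # Both messages are appended together, so the history never
            # contains a user message without its reply.
            self.history += [self._pending, ("assistant", assistant_text.strip())]
            self._pending = None
            return True

    def cancel_turn(self):
        with self._lock:
            self._pending = None


log = TurnLog()
log.start_turn("turn on the light")
log.cancel_turn()                       # the user interrupted playback
interrupted = log.commit_turn("ok")     # nothing is pending any more
log.start_turn("what time is it")
committed = log.commit_turn(" It is noon. ")
```

No method here calls another locked method while holding the lock, which is what keeps a plain `threading.Lock` safe in this design. If nested calls are needed, `threading.RLock` is the usual alternative.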


@@ -1,26 +1,36 @@
from setuptools import find_packages, setup
from setuptools import setup, find_packages
import os
from glob import glob
package_name = 'robot_speaker'
setup(
name=package_name,
version='0.0.0',
packages=[package_name],
version='0.0.1',
packages=find_packages(where='.'),
package_dir={'': '.'},
data_files=[
('share/ament_index/resource_index/packages',
['resource/' + package_name]),
('share/' + package_name, ['package.xml']),
(os.path.join('share', package_name, 'launch'), glob('launch/*.launch.py')),
(os.path.join('share', package_name, 'config'), glob('config/*.yaml') + glob('config/*.json')),
],
install_requires=[
'setuptools',
'pypinyin',
],
install_requires=['setuptools'],
zip_safe=True,
maintainer='mzebra',
maintainer_email='mzebra@foxmail.com',
description='TODO: Package description',
description='语音识别和合成ROS2包',
license='Apache-2.0',
tests_require=['pytest'],
entry_points={
'console_scripts': [
'robot_speaker_node=robot_speaker.robot_speaker_node:main'
'robot_speaker_node = robot_speaker.core.robot_speaker_node:main',
'register_speaker_node = robot_speaker.core.register_speaker_node:main',
'skill_bridge_node = robot_speaker.bridge.skill_bridge_node:main',
],
},
)

view_camera.py Executable file

@@ -0,0 +1,68 @@
#!/usr/bin/env python3
"""
查看相机画面的简单脚本
按空格键保存当前帧,按'q'键退出
"""
import sys
import cv2
import numpy as np
try:
import pyrealsense2 as rs
except ImportError:
print("错误: 未安装pyrealsense2请运行: pip install pyrealsense2")
sys.exit(1)
def main():
# 配置相机
pipeline = rs.pipeline()
config = rs.config()
# 启用彩色流
config.enable_stream(rs.stream.color, 640, 480, rs.format.rgb8, 30)
# 启动管道
pipeline.start(config)
print("相机已启动,按空格键保存图片,按'q'键退出")
frame_count = 0
try:
while True:
# 等待一帧
frames = pipeline.wait_for_frames()
color_frame = frames.get_color_frame()
if not color_frame:
continue
# 转换为numpy数组 (RGB格式)
color_image = np.asanyarray(color_frame.get_data())
# OpenCV使用BGR格式需要转换
bgr_image = cv2.cvtColor(color_image, cv2.COLOR_RGB2BGR)
# 显示图像
cv2.imshow('Camera View', bgr_image)
# 等待按键
key = cv2.waitKey(1) & 0xFF
if key == ord('q'):
print("退出...")
break
elif key == ord(' '): # 空格键保存
frame_count += 1
filename = f'camera_frame_{frame_count:04d}.jpg'
cv2.imwrite(filename, bgr_image)
print(f"已保存: {filename}")
except KeyboardInterrupt:
print("\n中断...")
finally:
pipeline.stop()
cv2.destroyAllWindows()
print("相机已关闭")
if __name__ == '__main__':
main()