Compare commits
15 commits: main...feature-de

| Author | SHA1 | Date |
|---|---|---|
| | e5714e3a8b | |
| | 293e69e9f2 | |
| | 0409ce0de4 | |
| | ce0d581770 | |
| | a1b91ed52f | |
| | 6d101b9d9e | |
| | c282f9b4de | |
| | 9fd658990c | |
| | 0c118412ec | |
| | eb91e2f139 | |
| | 838a4a357c | |
| | 9c775cff5c | |
| | 63a21999bb | |
| | 8fffd4ab42 | |
| | b90d84c325 | |
.gitignore (vendored, new file, 6 lines)

@@ -0,0 +1,6 @@
build/
install/
log/
__pycache__/
*.pyc
*.egg-info/
README.md (104 lines changed)

@@ -1,2 +1,104 @@
# hivecore_robot_voice
# ROS speech package (robot_speaker)

## Register with Alibaba Cloud Bailian to obtain an api_key
https://bailian.console.aliyun.com/?tab=model#/api-key
-> API key management
Put the key into config/voice.yaml

## Install dependencies
1. System dependencies
```bash
sudo apt-get update
sudo apt-get install -y python3-pyaudio portaudio19-dev alsa-utils ffmpeg swig meson ninja-build build-essential pkg-config libwebrtc-audio-processing-dev
```

2. Python dependencies
```bash
cd ~/ros_learn/hivecore_robot_voice
# On Python 3.10, aec-audio-processing must be installed separately so its version check is skipped
pip3 install aec-audio-processing --no-binary :all: --ignore-requires-python --break-system-packages
pip3 install -r requirements.txt --break-system-packages
```
## Build and launch
1. Register a voiceprint
- After starting the node, say "二狗今天天气真好" ("Ergou, the weather is great today") to begin voiceprint registration
- The correct way to register:
  Method A (recommended): pause briefly after the wake word, then speak one long sentence.
  User: "二狗"
  Robot: (the log indicates it is waiting for voiceprint audio)
  User: "我现在正在注册声纹,这是一段很长的测试语音,请把我的声音录进去。" (keep speaking for 3-5 seconds)
  Method B (in one breath): say one very long sentence without stopping.
  User: "二狗你好,我是你的主人,请记住我的声音,这是一段用来注册的长语音。"
- Note: include the wake word, do not pause mid-sentence, and keep the utterance longer than 1.5 seconds

```bash
cd ~/ros_learn/hivecore_robot_voice
colcon build
source install/setup.bash
ros2 run robot_speaker register_speaker_node
```

2. Main node
- After startup, every utterance must contain the wake word, with no pause between the wake word and the rest of the sentence
- Say "二狗拍照看看" ("Ergou, take a photo and look") to start image-text interaction
- Users with registered voiceprints can interrupt TTS playback

```bash
cd ~/ros_learn/hivecore_robot_voice
colcon build
source install/setup.bash
ros2 launch robot_speaker voice.launch.py
```
## Architecture
[Recording thread] - the only real-time thread
├─ captures PCM from the microphone
├─ VAD + energy detection
├─ on detecting speech → interrupts TTS immediately
├─ speech PCM → ASR audio queue
└─ speech PCM → voiceprint audio queue (bypass, non-blocking)

[ASR inference thread] - does audio → text only
└─ takes audio from the ASR queue → real-time / streaming ASR → text → text queue

[Voiceprint thread] - non-real-time, low-frequency (CAM++)
├─ receives audio chunks via a callback, buffers them, and processes on the speech_end event
├─ accumulates 1-2 s of valid speech (post-VAD)
├─ extracts the speaker embedding with CAM++
├─ matches or registers the voiceprint
└─ updates current_speaker_id (shared state: writes only, controls nothing)
Constraints on the voiceprint thread: it must not affect recording or ASR and must not control TTS; it only updates who is currently speaking

[Main/processing thread] - business logic
├─ takes ASR text from the text queue
├─ reads current_speaker_id (read-only)
├─ wake-word handling (combined with speaker_id)
├─ permission / identity check (whether to continue)
├─ VLM processing (text / multimodal)
└─ TTS playback (starts the TTS thread without waiting)

[TTS playback thread] - playback only (interruptible)
├─ receives the TTS audio stream
├─ plays it on the output device
└─ honors the interrupt flag (set by the recording thread)
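The recording-thread fan-out described above (the ASR queue must see every chunk, while the voiceprint queue is a drop-tolerant bypass that must never block recording) can be sketched with plain `queue` and `threading`. This is a minimal illustrative sketch, not the package's actual code; all names are made up.

```python
import queue
import threading

asr_queue: "queue.Queue[bytes]" = queue.Queue()
sv_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=8)  # bounded: voiceprint is low priority

def on_pcm_chunk(chunk: bytes) -> None:
    """Called from the recording thread for every captured PCM chunk."""
    asr_queue.put(chunk)               # ASR must see every chunk
    try:
        sv_queue.put_nowait(chunk)     # bypass: drop chunks rather than block recording
    except queue.Full:
        pass

def asr_worker(out: list) -> None:
    """Stand-in for the ASR inference thread (audio -> text)."""
    while True:
        chunk = asr_queue.get()
        if chunk == b"":               # empty sentinel ends the stream
            break
        out.append(len(chunk))         # real code would run streaming ASR here

texts: list = []
t = threading.Thread(target=asr_worker, args=(texts,))
t.start()
for _ in range(3):
    on_pcm_chunk(b"\x00" * 1024)
asr_queue.put(b"")
t.join()
print(texts)  # each 1024-byte chunk reached the ASR thread
```

The bounded `sv_queue` with `put_nowait` is what makes the voiceprint path "bypass, non-blocking": under load, voiceprint chunks are dropped instead of stalling the real-time recording loop.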
## Useful commands
1. Audio devices
```bash
# 1. List all audio devices
cat /proc/asound/cards
# 2. Show the stream info (device parameters) for card 1
cat /proc/asound/card1/stream0
```

2. Camera
```bash
# 1. Show all basic camera info (model, firmware version, serial number, etc.)
rs-enumerate-devices -c
```

3. Model download
```bash
modelscope download --model iic/speech_campplus_sv_zh-cn_16k-common --local_dir [target path]
```
config/knowledge.json (new file, 18 lines)

@@ -0,0 +1,18 @@
{
  "entries": [
    {
      "id": "robot_identity",
      "patterns": [
        "ni shi shei"
      ],
      "answer": "我叫二狗,是蜂核科技的机器人,很高兴为你服务"
    },
    {
      "id": "wake_word",
      "patterns": [
        "ni de ming zi"
      ],
      "answer": "我的名字是二狗"
    }
  ]
}
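Entries in `config/knowledge.json` pair toneless-pinyin patterns with canned answers; elsewhere in the package, ASR text is converted to pinyin (via pypinyin) and checked for substring matches. A minimal hypothetical sketch of that lookup (the `lookup` helper is illustrative and the answers are abbreviated):

```python
import json
from typing import Optional

# Abbreviated copy of the knowledge-base structure shown above.
KNOWLEDGE = json.loads("""
{"entries": [
  {"id": "robot_identity", "patterns": ["ni shi shei"], "answer": "..."},
  {"id": "wake_word", "patterns": ["ni de ming zi"], "answer": "..."}
]}
""")

def lookup(text_pinyin: str) -> Optional[str]:
    """Return the id of the first entry whose pattern occurs in the pinyin text."""
    for entry in KNOWLEDGE["entries"]:
        if any(p in text_pinyin for p in entry["patterns"]):
            return entry["id"]
    return None

print(lookup("qing wen ni shi shei ya"))  # matches the robot_identity entry
```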
config/speakers.json (new file, 599 lines)

@@ -0,0 +1,599 @@
{
  "user_1768311644": {
    "embedding": [0.017083248123526573, -0.01032772846519947, 0.0058503481559455395, ... 189 more values of the 192-value embedding omitted ...],
    "env": "near",
    "threshold": 0.4,
    "registered_at": 1768311644.5742264
  },
  "user_1768529827": {
    "embedding": [0.0077949948608875275, -0.012852567248046398, 0.0014490776229649782, ... 189 more values of the 192-value embedding omitted ...],
    "env": "near",
    "threshold": 0.55,
    "registered_at": 1768529827.4784193
  },
  "user_1768530001": {
    "embedding": [-0.02827363647520542, 0.04181317239999771, -0.07721243053674698, ... 189 more values of the 192-value embedding omitted ...],
    "env": "near",
    "threshold": 0.55,
    "registered_at": 1768530001.2158406
  }
}
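Each entry stores a fixed-length speaker embedding and a per-user `threshold`. A common way such fields are used (this is a hypothetical sketch, not the package's actual matching code) is cosine similarity against the stored vector, accepting the speaker when the score exceeds the threshold; the 3-dimensional vectors below stand in for the real 192-value embeddings:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

stored = [0.1, 0.2, -0.1]        # stands in for a stored 192-value embedding
candidate = [0.11, 0.19, -0.09]  # embedding extracted from incoming audio
threshold = 0.55                 # per-user threshold, as in the entries above
print(cosine(stored, candidate) > threshold)  # True: accepted as the same speaker
```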
config/voice.yaml (new file, 70 lines)

@@ -0,0 +1,70 @@
# Configuration file for the ROS speech package

dashscope:
  api_key: "sk-7215a5ab7a00469db4072e1672a0661e"
  asr:
    model: "qwen3-asr-flash-realtime"
    url: "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"
  llm:
    model: "qwen3-vl-flash"
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    temperature: 0.7
    max_tokens: 4096
    max_history: 10
    summary_trigger: 3
  tts:
    model: "cosyvoice-v3-flash"
    voice: "longanyang"

audio:
  microphone:
    device_index: 3          # points at iFLYTEK-M2 (hw:1,0)
    sample_rate: 48000       # use the hardware's native 48 kHz to avoid problems resampling can cause
    channels: 1              # input channels: mono (suitable for speech capture)
    chunk: 1024
    heartbeat_interval: 2.0  # heartbeat interval (s) for periodic recording-status logs
  soundcard:
    card_index: 1            # USB Audio Device (card 1)
    device_index: 0          # USB Audio [USB Audio] (device 0)
    # card_index: -1         # use the default sound card
    # device_index: -1       # use the default output device
    sample_rate: 48000       # output sample rate: 48 kHz (supported by the iFLYTEK device)
    channels: 2              # output channels: stereo (2 channels, FL+FR)
    volume: 1.0              # volume ratio (0.0-1.0; 0.2 means 20% volume)
  echo_cancellation:
    enabled: false           # enable echo cancellation (true/false)
    max_duration_ms: 500     # maximum reference-signal buffer length (ms)
  tts:
    source_sample_rate: 22050       # fixed TTS output sample rate (DashScope constant, do not change)
    source_channels: 1              # fixed TTS output channel count (DashScope constant, do not change)
    ffmpeg_thread_queue_size: 4096  # ffmpeg input thread queue size (larger reduces stutter)

vad:
  vad_mode: 3                # VAD mode: 0-3, 3 is the strictest
  silence_duration_ms: 1000  # silence duration (ms)
  min_energy_threshold: 300  # minimum energy threshold

system:
  use_llm: true              # whether to use the LLM
  use_wake_word: true        # whether to enable wake-word detection
  wake_word: "er gou"        # wake word (pinyin)
  session_timeout: 3.0       # session timeout (s)
  shutup_keywords: "bi zui"  # "shut up" command keywords (pinyin, comma-separated)
  interrupt_command_queue_depth: 10  # QoS queue depth of the interrupt-command subscription
  sv_enabled: true           # whether to enable speaker verification
  sv_model_path: "~/hivecore_robot_os1/voice_model"  # speaker-verification model path
  sv_threshold: 0.55         # speaker-verification threshold (0.0-1.0; lower is more permissive, higher is stricter)
  sv_speaker_db_path: "~/hivecore_robot_os1/config/speakers.json"  # speaker database path (JSON, relative to the ROS2 package share directory)
  sv_buffer_size: 240000     # speaker-verification recording buffer size (samples; 5 s at 48 kHz = 240000)
  sv_registration_silence_threshold_ms: 500  # silence threshold (ms) while in voiceprint-registration mode

camera:
  serial_number: "405622075404"  # camera serial number (Intel RealSense D435)
  rgb:
    width: 640       # image width
    height: 480      # image height
    fps: 30          # frame rate (supported: 6, 10, 15, 30, 60)
    format: "RGB8"   # image format: RGB8, BGR8
  image:
    jpeg_quality: 85      # JPEG compression quality (0-100; 85 balances quality and size)
    max_size: "1280x720"  # maximum size
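Several of the values above follow directly from the sample rate; this quick check assumes the config's numbers (48 kHz capture, a 5 s voiceprint buffer, 1024-sample chunks, and a 1000 ms silence window):

```python
SAMPLE_RATE = 48_000

# 5 seconds of mono samples, matching sv_buffer_size: 240000
sv_buffer_size = SAMPLE_RATE * 5

# How many 1024-sample chunks the VAD sees per second (about 46.9)
chunks_per_second = SAMPLE_RATE / 1024

# Whole chunks inside the 1000 ms silence_duration_ms window
silence_chunks = int(1.0 * SAMPLE_RATE / 1024)

print(sv_buffer_size, silence_chunks)  # 240000 46
```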
launch/voice.launch.py (new file, 17 lines)

@@ -0,0 +1,17 @@
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    """Start the voice-interaction node; all parameters are read from voice.yaml."""
    return LaunchDescription([
        Node(
            package='robot_speaker',
            executable='robot_speaker_node',
            name='robot_speaker_node',
            output='screen'
        ),
    ])
package.xml (15 lines changed)

@@ -2,13 +2,22 @@
  <?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
  <package format="3">
    <name>robot_speaker</name>
-   <version>0.0.0</version>
-   <description>TODO: Package description</description>
+   <version>0.0.1</version>
+   <description>ROS2 package for speech recognition and synthesis</description>
    <maintainer email="mzebra@foxmail.com">mzebra</maintainer>
    <license>Apache-2.0</license>

    <depend>rclpy</depend>
    <depend>example_interfaces</depend>
    <depend>std_msgs</depend>
    <depend>ament_index_python</depend>
    <depend>interfaces</depend>

    <exec_depend>python3-pyaudio</exec_depend>
    <exec_depend>python3-requests</exec_depend>
    <exec_depend>python3-edge-tts</exec_depend>
    <exec_depend>python3-webrtcvad</exec_depend>
    <exec_depend>python3-yaml</exec_depend>
    <exec_depend>python3-pypinyin</exec_depend>

    <test_depend>ament_copyright</test_depend>
    <test_depend>ament_flake8</test_depend>
requirements.txt (new file, 17 lines)

@@ -0,0 +1,17 @@
dashscope>=1.20.0
openai>=1.0.0
pyaudio>=0.2.11
webrtcvad>=2.0.10
pypinyin>=0.49.0
rclpy>=3.0.0
pyrealsense2>=2.54.0
Pillow>=10.0.0
numpy>=1.24.0
PyYAML>=6.0
aec-audio-processing
modelscope>=1.33.0
funasr>=1.0.0
datasets==3.6.0
@@ -0,0 +1,6 @@
# robot_speaker package
robot_speaker/bridge/__init__.py (new file, 2 lines)

@@ -0,0 +1,2 @@
# Bridge package for connecting LLM outputs to brain execution.
robot_speaker/bridge/skill_bridge_node.py (new file, 136 lines)

@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""
Bridge LLM skill sequences to the cerebellum's ExecuteBtAction and forward feedback/results.
"""
import json
import os
import re

import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from std_msgs.msg import String
from ament_index_python.packages import get_package_share_directory

from interfaces.action import ExecuteBtAction


class SkillBridgeNode(Node):
    def __init__(self):
        super().__init__('skill_bridge_node')
        self._action_client = ActionClient(self, ExecuteBtAction, '/execute_bt_action')
        self._current_epoch = 1
        self._allowed_skills = self._load_allowed_skills()

        self.skill_seq_sub = self.create_subscription(
            String, '/llm_skill_sequence', self._on_skill_sequence_received, 10
        )
        self.feedback_pub = self.create_publisher(String, '/skill_execution_feedback', 10)
        self.result_pub = self.create_publisher(String, '/skill_execution_result', 10)

        self.get_logger().info('SkillBridgeNode started')

    def _on_skill_sequence_received(self, msg: String):
        raw = (msg.data or "").strip()
        if not raw:
            return
        if not self._allowed_skills:
            self.get_logger().warning("No skill whitelist loaded; rejecting all sequences")
            return
        sequence, invalid = self._extract_skill_sequence(raw)
        if invalid:
            self.get_logger().warning(f"Rejected sequence with invalid skills: {invalid}")
            return
        if not sequence:
            self.get_logger().warning(f"Invalid skill sequence: {raw}")
            return
        self._send_skill_sequence(sequence)

    def _load_allowed_skills(self) -> set[str]:
        try:
            brain_share = get_package_share_directory("brain")
            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
            if not os.path.exists(skill_path):
                return set()
            import yaml
            with open(skill_path, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f) or []
            return {str(entry["name"]) for entry in data if isinstance(entry, dict) and entry.get("name")}
        except Exception as e:
            self.get_logger().warning(f"Load skills failed: {e}")
            return set()

    def _extract_skill_sequence(self, text: str) -> tuple[str, list[str]]:
        # Accept comma/space/semicolon separators and keep only CamelCase tokens
        tokens = re.split(r'[,\s;]+', text.strip())
        skills = [t for t in tokens if re.match(r'^[A-Z][A-Za-z0-9]*$', t)]
        if not skills:
            return "", []
        invalid = [s for s in skills if s not in self._allowed_skills]
        return ",".join(skills), invalid

    def _send_skill_sequence(self, skill_sequence: str):
        if not self._action_client.wait_for_server(timeout_sec=2.0):
            self.get_logger().error('ExecuteBtAction server unavailable')
            return
        goal = ExecuteBtAction.Goal()
        goal.epoch = self._current_epoch
        self._current_epoch += 1
        goal.action_name = skill_sequence
        goal.calls = []

        self.get_logger().info(f"Dispatching skill sequence: {skill_sequence}")
        send_future = self._action_client.send_goal_async(goal, feedback_callback=self._feedback_callback)
        rclpy.spin_until_future_complete(self, send_future, timeout_sec=5.0)
        if not send_future.done():
            self.get_logger().warning("Send goal timed out")
            return
        goal_handle = send_future.result()
        if not goal_handle or not goal_handle.accepted:
            self.get_logger().error("Goal rejected")
            return
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self, result_future)
        if result_future.done():
            self._handle_result(result_future.result())

    def _feedback_callback(self, feedback_msg):
        fb = feedback_msg.feedback
        payload = {
            "stage": fb.stage,
            "current_skill": fb.current_skill,
            "progress": float(fb.progress),
            "detail": fb.detail,
            "epoch": int(fb.epoch),
        }
        msg = String()
        msg.data = json.dumps(payload, ensure_ascii=True)
        self.feedback_pub.publish(msg)

    def _handle_result(self, result_wrapper):
        result = result_wrapper.result
        if not result:
            return
        payload = {
            "success": bool(result.success),
            "message": result.message,
            "total_skills": int(result.total_skills),
            "succeeded_skills": int(result.succeeded_skills),
        }
        msg = String()
        msg.data = json.dumps(payload, ensure_ascii=True)
        self.result_pub.publish(msg)


def main(args=None):
    rclpy.init(args=args)
    node = SkillBridgeNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
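The whitelist filtering in `_extract_skill_sequence` can be exercised standalone: split on commas, whitespace, or semicolons, keep only CamelCase-looking tokens, and flag anything outside the whitelist. The skill names below are made up for illustration, not taken from the real `robot_skills.yaml`:

```python
import re

ALLOWED = {"PickBox", "PlaceBox", "GoHome"}  # hypothetical whitelist

def extract(text: str) -> tuple:
    """Return (comma-joined CamelCase skills, skills not in the whitelist)."""
    tokens = re.split(r'[,\s;]+', text.strip())
    skills = [t for t in tokens if re.match(r'^[A-Z][A-Za-z0-9]*$', t)]
    if not skills:
        return "", []
    invalid = [s for s in skills if s not in ALLOWED]
    return ",".join(skills), invalid

print(extract("PickBox; PlaceBox GoHome"))  # all three pass the whitelist
print(extract("PickBox, fly_to_moon"))      # lowercase token is silently dropped
```

Note the two failure modes the node distinguishes: a non-CamelCase token is simply dropped by the regex, while a CamelCase token missing from the whitelist marks the whole sequence invalid.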
robot_speaker/core/__init__.py (new file, 5 lines)

@@ -0,0 +1,5 @@
"""Core modules"""
robot_speaker/core/conversation_state.py (new file, 10 lines)

@@ -0,0 +1,10 @@
from enum import Enum


class ConversationState(Enum):
    """Conversation state machine"""
    IDLE = "idle"                # waiting for the wake word or speech
    CHECK_VOICE = "check_voice"  # user is speaking -> verify the voiceprint
    AUTHORIZED = "authorized"    # registered user
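The enum above defines the states, but this diff does not show the transition logic. A hypothetical sketch of how the three states might be driven, with event names invented for illustration:

```python
from enum import Enum

class ConversationState(Enum):
    IDLE = "idle"
    CHECK_VOICE = "check_voice"
    AUTHORIZED = "authorized"

def next_state(state: ConversationState, event: str) -> ConversationState:
    """Illustrative transition table; the real node's logic may differ."""
    if state is ConversationState.IDLE and event == "speech":
        return ConversationState.CHECK_VOICE          # speech heard: check the voiceprint
    if state is ConversationState.CHECK_VOICE and event == "speaker_matched":
        return ConversationState.AUTHORIZED           # registered user recognized
    if event == "session_timeout":
        return ConversationState.IDLE                 # any state falls back to idle
    return state                                      # otherwise stay put

print(next_state(ConversationState.IDLE, "speech"))  # moves to CHECK_VOICE
```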
158
robot_speaker/core/intent_router.py
Normal file
@@ -0,0 +1,158 @@
from dataclasses import dataclass
from typing import Optional
import os
import yaml
from ament_index_python.packages import get_package_share_directory

from pypinyin import pinyin, Style


@dataclass
class IntentResult:
    intent: str  # "skill_sequence" | "kb_qa" | "chat_text" | "chat_camera"
    text: str
    need_camera: bool
    camera_mode: Optional[str]  # "head" | "left_hand" | "right_hand" | None
    system_prompt: Optional[str]


class IntentRouter:
    def __init__(self):
        self.camera_capture_keywords = [
            "pai zhao", "pai ge zhao", "pai zhang zhao"
        ]
        self.skill_keywords = [
            "ban xiang zi"
        ]
        self.kb_keywords = [
            "ni shi shei", "ni de ming zi"
        ]
        self._cached_skill_names: list[str] | None = None

    def _load_brain_skill_names(self) -> list[str]:
        if self._cached_skill_names is not None:
            return self._cached_skill_names
        skill_names: list[str] = []
        try:
            brain_share = get_package_share_directory("brain")
            skill_path = os.path.join(brain_share, "config", "robot_skills.yaml")
            with open(skill_path, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f) or []
            for entry in data:
                if isinstance(entry, dict) and entry.get("name"):
                    skill_names.append(str(entry["name"]))
        except Exception:
            skill_names = []
        self._cached_skill_names = skill_names
        return skill_names

    def to_pinyin(self, text: str) -> str:
        chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
        if not chars:
            return ""
        py_list = pinyin(''.join(chars), style=Style.NORMAL)
        return ' '.join([item[0] for item in py_list]).lower().strip()

    def is_skill_sequence_intent(self, text: str) -> bool:
        text_pinyin = self.to_pinyin(text)
        return any(k in text_pinyin for k in self.skill_keywords)

    def check_camera_command(self, text: str) -> tuple[bool, Optional[str]]:
        if not text:
            return False, None
        text_pinyin = self.to_pinyin(text)
        for keyword in self.camera_capture_keywords:
            if keyword in text_pinyin:
                return True, self.detect_camera_mode(text)
        return False, None

    def detect_camera_mode(self, text: str) -> str:
        text_pinyin = self.to_pinyin(text)
        left_keys = ["zuo shou", "zuo bi", "zuo bian"]
        right_keys = ["you shou", "you bi", "you bian"]
        head_keys = ["tou", "nao dai"]
        for kw in left_keys:
            if kw in text_pinyin:
                return "left_hand"
        for kw in right_keys:
            if kw in text_pinyin:
                return "right_hand"
        for kw in head_keys:
            if kw in text_pinyin:
                return "head"
        return "head"

    def build_skill_prompt(self) -> str:
        skills = self._load_brain_skill_names()
        skills_text = ", ".join(skills) if skills else ""
        skill_guard = (
            "【技能限制】只能使用以下技能名称:" + skills_text
            if skills_text
            else "【技能限制】技能列表不可用,请不要输出任何技能名称。"
        )
        return (
            "你是机器人任务规划器。\n"
            "本任务必须拍照。请根据用户请求选择使用哪个相机拍照(默认头部相机),并结合当前环境信息生成简洁、可执行的技能序列。\n"
            "【重要】如果对话历史中包含【执行结果】或【执行状态】,请参考上一轮技能序列的执行情况,根据成功/失败信息调整本次技能序列。\n"
            "【输出格式要求】只输出逗号分隔的技能名称,不要任何解释说明。\n"
            + skill_guard
        )

    def build_chat_prompt(self, need_camera: bool) -> str:
        if need_camera:
            return (
                "你是一个智能语音助手。\n"
                "请结合图片内容简短回答。"
            )
        return (
            "你是一个智能语音助手。\n"
            "请自然、简短地与用户对话。"
        )

    def build_kb_prompt(self) -> str:
        return (
            "你是蜂核科技的员工。\n"
            "请基于知识库信息回答用户问题,回答要准确简洁。"
        )

    def build_default_system_prompt(self) -> str:
        return (
            "你是一个智能语音助手。\n"
            "- 当用户发送图片时,请仔细观察图片内容,结合用户的问题或描述,提供简短、专业的回答。\n"
            "- 当用户没有发送图片时,请自然、友好地与用户对话。\n"
            "请根据对话模式调整你的回答风格。"
        )

    def route(self, text: str) -> IntentResult:
        need_camera, camera_mode = self.check_camera_command(text)
        text_pinyin = self.to_pinyin(text)

        if self.is_skill_sequence_intent(text):
            if camera_mode is None:
                camera_mode = "head"
            return IntentResult(
                intent="skill_sequence",
                text=text,
                need_camera=True,
                camera_mode=camera_mode,
                system_prompt=self.build_skill_prompt()
            )

        if any(k in text_pinyin for k in self.kb_keywords):
            return IntentResult(
                intent="kb_qa",
                text=text,
                need_camera=False,
                camera_mode=None,
                system_prompt=self.build_kb_prompt()
            )

        return IntentResult(
            intent="chat_camera" if need_camera else "chat_text",
            text=text,
            need_camera=need_camera,
            camera_mode=camera_mode,
            system_prompt=self.build_chat_prompt(need_camera)
        )
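All keyword matching in `IntentRouter` happens in pinyin space: Chinese characters are romanized with pypinyin and keywords are matched as space-joined substrings, which makes the router robust to homophone ASR errors. A minimal, dependency-free sketch of that idea, with a tiny hardcoded character table standing in for pypinyin (illustration only; the table and function names are not from the package):

```python
# Minimal sketch of pinyin keyword matching. A tiny hardcoded
# character-to-pinyin table replaces pypinyin for illustration.
PINYIN = {"拍": "pai", "照": "zhao", "左": "zuo", "手": "shou"}

def to_pinyin_demo(text: str) -> str:
    # Keep only CJK characters, then join their pinyin with spaces,
    # mirroring IntentRouter.to_pinyin.
    chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
    return ' '.join(PINYIN.get(c, '?') for c in chars)

def is_capture_command(text: str) -> bool:
    # Substring match in pinyin space, like check_camera_command.
    return "pai zhao" in to_pinyin_demo(text)
```

For example, `to_pinyin_demo("左手拍照")` yields `"zuo shou pai zhao"`, so the capture keyword matches regardless of which homophone characters the ASR emitted.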
246
robot_speaker/core/node_callbacks.py
Normal file
@@ -0,0 +1,246 @@
import threading
import numpy as np

from robot_speaker.core.conversation_state import ConversationState
from robot_speaker.perception.speaker_verifier import SpeakerState


class NodeCallbacks:
    # ==================== Initialization and internal helpers ====================
    def __init__(self, node):
        self.node = node

    def _mark_utterance_processed(self) -> bool:
        node = self.node
        with node.utterance_lock:
            if node.current_utterance_id == node.last_processed_utterance_id:
                return False
            node.last_processed_utterance_id = node.current_utterance_id
            return True

    def _trigger_sv_for_check_voice(self, source: str):
        node = self.node
        if not (node.sv_enabled and node.sv_client):
            return
        if not self._mark_utterance_processed():
            return
        if node._handle_empty_speaker_db():
            node.get_logger().info(f"[声纹] CHECK_VOICE状态,数据库为空,跳过声纹验证(来源: {source})")
            return
        if not node.sv_speech_end_event.is_set():
            with node.sv_lock:
                node.sv_recording = False
                buffer_size = len(node.sv_audio_buffer)
                node.get_logger().info(f"[声纹] {source}触发验证,缓冲区大小: {buffer_size} 样本({buffer_size/node.sample_rate:.2f}秒)")
                if buffer_size > 0:
                    node.sv_speech_end_event.set()
        else:
            node.get_logger().debug(f"[声纹] 声纹验证已触发,跳过(来源: {source})")

    # ==================== Business-logic delegation ====================
    def handle_interrupt_command(self, msg):
        return self.node._handle_interrupt_command(msg)

    def check_interrupt_and_cancel_turn(self) -> bool:
        return self.node._check_interrupt_and_cancel_turn()

    def handle_wake_word(self, text: str) -> str:
        return self.node._handle_wake_word(text)

    def check_shutup_command(self, text: str) -> bool:
        return self.node._check_shutup_command(text)

    def check_camera_command(self, text: str):
        return self.node.intent_router.check_camera_command(text)

    def llm_process_stream_with_camera(self, user_text: str, need_camera: bool) -> str:
        return self.node._llm_process_stream_with_camera(user_text, need_camera)

    def put_tts_text(self, text: str):
        return self.node._put_tts_text(text)

    def force_stop_tts(self):
        return self.node._force_stop_tts()

    def drain_queue(self, q):
        return self.node._drain_queue(q)

    # ==================== Recording/VAD callbacks ====================
    def get_silence_threshold(self) -> int:
        """Return the dynamic silence threshold in milliseconds."""
        node = self.node
        return node.silence_duration_ms

    def should_put_audio_to_queue(self) -> bool:
        """
        Decide whether audio should be queued for ASR; the state machine controls whether ASR is allowed.
        """
        node = self.node
        state = node._get_state()
        if state in [ConversationState.IDLE, ConversationState.CHECK_VOICE,
                     ConversationState.AUTHORIZED]:
            return True
        return False

    def on_speech_start(self):
        """Recording thread detected the start of speech."""
        node = self.node
        node.get_logger().info("[录音线程] 检测到人声,开始录音")

        with node.utterance_lock:
            node.current_utterance_id += 1

        state = node._get_state()

        if state == ConversationState.IDLE:
            # Idle -> CheckVoice
            if node.sv_enabled and node.sv_client:
                # Start recording for speaker verification
                with node.sv_lock:
                    node.sv_recording = True
                    node.sv_audio_buffer.clear()
                node.get_logger().debug("[声纹] 开始录音用于声纹验证")
                node._change_state(ConversationState.CHECK_VOICE, "检测到语音,开始检查声纹")
            else:
                node._change_state(ConversationState.AUTHORIZED, "未启用声纹,直接授权")

        elif state == ConversationState.CHECK_VOICE:
            # CHECK_VOICE: keep recording for speaker verification
            if node.sv_enabled:
                with node.sv_lock:
                    node.sv_recording = True
                    node.sv_audio_buffer.clear()
                node.get_logger().debug("[声纹] 继续录音用于声纹验证")

        elif state == ConversationState.AUTHORIZED:
            # AUTHORIZED: start recording to re-verify the current user
            if node.sv_enabled:
                with node.sv_lock:
                    node.sv_recording = True
                    node.sv_audio_buffer.clear()
                node.get_logger().debug("[声纹] 开始录音用于声纹验证")

    def on_audio_chunk_for_sv(self, audio_chunk: bytes):
        """Audio-chunk callback from the recording thread - buffer audio for speaker verification only when needed."""
        node = self.node
        state = node._get_state()

        # Speaker-verification recording (CHECK_VOICE and AUTHORIZED states)
        if node.sv_enabled and node.sv_recording:
            try:
                audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
                with node.sv_lock:
                    node.sv_audio_buffer.extend(audio_array)
            except Exception as e:
                node.get_logger().debug(f"[声纹] 录音失败: {e}")

    def on_speech_end(self):
        """Recording thread detected the end of speech (a stretch of silence)."""
        node = self.node
        node.get_logger().info("[录音线程] 检测到说话结束")

        state = node._get_state()
        node.get_logger().info(f"[录音线程] 说话结束时的状态: {state}")

        if state == ConversationState.CHECK_VOICE:
            if node.asr_client and node.asr_client.running:
                node.asr_client.stop_current_recognition()
            self._trigger_sv_for_check_voice("VAD")
            return

        elif state == ConversationState.AUTHORIZED:
            if node.asr_client and node.asr_client.running:
                node.asr_client.stop_current_recognition()
            if node.sv_enabled:
                with node.sv_lock:
                    node.sv_recording = False
                    buffer_size = len(node.sv_audio_buffer)
                node.get_logger().debug(f"[声纹] 停止录音,缓冲区大小: {buffer_size}")
                node.sv_speech_end_event.set()

                # If TTS is playing, wait for the speaker-verification result
                # asynchronously and interrupt TTS only if verification passes.
                # A separate thread avoids blocking the recording thread (and TTS playback).
                if node.tts_playing_event.is_set():
                    node.get_logger().info("[打断] TTS播放中,用户说话结束,异步等待声纹验证结果...")
                    def _check_sv_and_interrupt():
                        # Wait for the verification result (at most 2 seconds)
                        with node.sv_result_cv:
                            current_seq = node.sv_result_seq
                            if node.sv_result_cv.wait_for(
                                lambda: node.sv_result_seq > current_seq,
                                timeout=2.0
                            ):
                                # Verification finished; check the result
                                with node.sv_lock:
                                    speaker_id = node.current_speaker_id
                                    speaker_state = node.current_speaker_state

                                if speaker_id and speaker_state == SpeakerState.VERIFIED:
                                    node.get_logger().info(f"[打断] 声纹验证通过({speaker_id}),中断TTS播放")
                                    node._interrupt_tts("检测到人声(已授权用户,说话结束)")
                                else:
                                    node.get_logger().debug(f"[打断] 声纹验证未通过,不中断TTS(状态: {speaker_state.value})")
                            else:
                                node.get_logger().warning("[打断] 声纹验证超时,不中断TTS")
                    # Wait in a separate thread so the recording thread is not blocked
                    threading.Thread(target=_check_sv_and_interrupt, daemon=True, name="SVInterruptCheck").start()
            return

    def on_new_segment(self):
        """Recording thread detected a new speech segment from an authorized user; start recording for verification (no immediate interrupt)."""
        node = self.node
        state = node._get_state()
        if state == ConversationState.AUTHORIZED:
            # While TTS is playing, don't interrupt immediately on detected voice;
            # record for speaker verification instead, and interrupt only after
            # speech_end if verification passes. This avoids false triggers from
            # TTS echo while still supporting genuine user barge-in.
            if node.tts_playing_event.is_set():
                node.get_logger().debug("[打断] TTS播放中,检测到人声,开始录音用于声纹验证(等待说话结束后验证)")
                # Recording already started in on_speech_start; nothing extra to do here
            else:
                # TTS not playing: check the verification result and interrupt immediately
                if node.sv_enabled and node.sv_client:
                    with node.sv_lock:
                        current_speaker_id = node.current_speaker_id
                        speaker_state = node.current_speaker_state
                    if speaker_state == SpeakerState.VERIFIED and current_speaker_id:
                        node._interrupt_tts("检测到人声(已授权用户)")
                        node.get_logger().info(f"[打断] 已授权用户({current_speaker_id})发言,中断TTS播放")
                    else:
                        node.get_logger().debug(f"[打断] 检测到人声,但声纹未验证或未匹配,不中断TTS(当前状态: {speaker_state.value})")
                else:
                    # Speaker verification disabled: interrupt directly (original behavior)
                    node._interrupt_tts("检测到人声(未启用声纹)")
                    node.get_logger().info("[打断] 检测到人声,中断TTS播放")
        else:
            node.get_logger().debug(f"[打断] 检测到人声,但当前状态为 {state.value},非已授权用户,不允许打断TTS")

    def on_heartbeat(self):
        """Silence heartbeat callback from the recording thread."""
        self.node.get_logger().info("[录音线程] 静音中")

    # ==================== ASR callbacks ====================
    def on_asr_sentence_end(self, text: str):
        """ASR sentence_end callback - enqueue the recognized text."""
        node = self.node
        if not text or not text.strip():
            return
        text_clean = text.strip()
        node.get_logger().info(f"[ASR] 识别完成: {text_clean}")

        state = node._get_state()

        # Rule 2: in CHECK_VOICE, if ASR finished but VAD has not yet fired
        # speech_end, proactively trigger speaker verification
        if state == ConversationState.CHECK_VOICE:
            if node.sv_enabled and node.sv_client:
                node.get_logger().info("[ASR] CHECK_VOICE状态,ASR识别完成,主动触发声纹验证")
                self._trigger_sv_for_check_voice("ASR")

        # In all cases, hand the text to the processing queue
        node.text_queue.put(text_clean, timeout=1.0)

    def on_asr_text_update(self, text: str):
        """ASR streaming text-update callback - used for multi-turn prompts."""
        if not text or not text.strip():
            return
        self.node.get_logger().debug(f"[ASR] 识别中: {text.strip()}")
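The asynchronous interrupt check in `on_speech_end` waits on `sv_result_cv` until `sv_result_seq` advances past a value sampled under the lock, which makes the handshake immune to missed notifications and to results published before the waiter arrived. A minimal standalone sketch of that pattern (class and method names are made up for illustration):

```python
import threading

# Sketch of the sequence-number handshake used around sv_result_cv:
# the worker bumps a counter under the condition; waiters block until
# the counter moves past the value they sampled beforehand.
class ResultSignal:
    def __init__(self):
        self.cv = threading.Condition()
        self.seq = 0

    def publish(self):
        # Worker side: advance the sequence and wake all waiters.
        with self.cv:
            self.seq += 1
            self.cv.notify_all()

    def wait_past(self, start, timeout):
        # Waiter side: wait_for re-checks the predicate on every notify
        # and returns False on timeout.
        with self.cv:
            return self.cv.wait_for(lambda: self.seq > start, timeout=timeout)

sig = ResultSignal()
start = sig.seq                      # sample before the worker runs
t = threading.Thread(target=sig.publish)
t.start()
ok = sig.wait_past(start, timeout=2.0)
t.join()
```

Because the baseline is sampled before the worker starts, the waiter sees the result whether `publish()` runs before or after `wait_past()` begins waiting, with no race.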
188
robot_speaker/core/node_workers.py
Normal file
@@ -0,0 +1,188 @@
import queue
import time
import numpy as np

from robot_speaker.core.conversation_state import ConversationState
from robot_speaker.perception.speaker_verifier import SpeakerState


class NodeWorkers:
    def __init__(self, node):
        self.node = node

    def recording_worker(self):
        """Thread 1: recording thread - the only real-time thread."""
        node = self.node
        node.get_logger().info("[录音线程] 启动")
        node.audio_recorder.record_with_vad()

    def asr_worker(self):
        """Thread 2: ASR inference thread - audio → text only."""
        node = self.node
        node.get_logger().info("[ASR推理线程] 启动")
        while not node.stop_event.is_set():
            try:
                audio_chunk = node.audio_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            if node.interrupt_event.is_set():
                continue
            if node.callbacks.should_put_audio_to_queue() and node.asr_client and node.asr_client.running:
                node.asr_client.send_audio(audio_chunk)

    def process_worker(self):
        """Thread 3: main thread - business logic."""
        node = self.node
        node.get_logger().info("[主线程] 启动")
        while not node.stop_event.is_set():
            try:
                text = node.text_queue.get(timeout=0.1)
            except queue.Empty:
                continue

            node.get_logger().info(f"[主线程] 收到识别文本: {text}")
            current_state = node._get_state()

            if current_state == ConversationState.CHECK_VOICE:
                if node.use_wake_word:
                    node.get_logger().info(f"[主线程] CHECK_VOICE状态,检查唤醒词,文本: {text}")
                    processed_text = node.callbacks.handle_wake_word(text)
                    if not processed_text:
                        node.get_logger().info(f"[主线程] 未检测到唤醒词(唤醒词配置: '{node.wake_word}'),回到Idle状态")
                        node._change_state(ConversationState.IDLE, "未检测到唤醒词")
                        continue
                    node.get_logger().info(f"[主线程] 检测到唤醒词,处理后的文本: {processed_text}")
                    text = processed_text

                if node.sv_enabled and node.sv_client:
                    node.get_logger().info("[主线程] CHECK_VOICE状态:等待声纹验证结果...")
                    with node.sv_result_cv:
                        current_seq = node.sv_result_seq
                        if not node.sv_result_cv.wait_for(
                            lambda: node.sv_result_seq > current_seq,
                            timeout=15.0
                        ):
                            node.get_logger().warning("[主线程] CHECK_VOICE状态:声纹结果未ready(超时15秒),拒绝本轮")
                            with node.sv_lock:
                                node.sv_audio_buffer.clear()
                            node._change_state(ConversationState.IDLE, "声纹结果未ready")
                            continue
                        node.get_logger().info("[主线程] CHECK_VOICE状态:声纹结果ready,继续处理")

                    with node.sv_lock:
                        speaker_id = node.current_speaker_id
                        speaker_state = node.current_speaker_state
                        score = node.current_speaker_score

                    if speaker_id and speaker_state == SpeakerState.VERIFIED:
                        node.get_logger().info(f"[主线程] 声纹验证成功: {speaker_id}, 得分: {score:.4f}")
                        node._change_state(ConversationState.AUTHORIZED, "声纹验证成功")
                    else:
                        node.get_logger().info(f"[主线程] 声纹验证失败,得分: {score:.4f}")
                        node.callbacks.put_tts_text("声纹验证失败")
                        node._change_state(ConversationState.IDLE, "声纹验证失败")
                        continue
                else:
                    node._change_state(ConversationState.AUTHORIZED, "未启用声纹")

            elif current_state == ConversationState.AUTHORIZED:
                if node.tts_playing_event.is_set():
                    node.get_logger().debug("[主线程] AUTHORIZED状态,TTS播放中,忽略ASR识别结果(只有VAD检测到已授权用户人声才能中断)")
                    continue

            elif current_state == ConversationState.IDLE:
                node.get_logger().warning("[主线程] Idle状态收到文本,忽略")
                continue

            if node.use_wake_word and current_state == ConversationState.AUTHORIZED:
                processed_text = node.callbacks.handle_wake_word(text)
                if not processed_text:
                    node._change_state(ConversationState.IDLE, "未检测到唤醒词")
                    continue
                text = processed_text

            if node.callbacks.check_shutup_command(text):
                node.get_logger().info("[主线程] 检测到闭嘴指令")
                node.interrupt_event.set()
                node.callbacks.force_stop_tts()
                node._change_state(ConversationState.IDLE, "用户闭嘴指令")
                continue

            intent_payload = node.intent_router.route(text)
            node._handle_intent(intent_payload)

            if current_state == ConversationState.AUTHORIZED:
                node.session_start_time = time.time()

    def sv_worker(self):
        """Thread 5: speaker-recognition thread - non-real-time, low frequency (CAM++)."""
        node = self.node
        node.get_logger().info("[声纹识别线程] 启动")

        # Dynamically compute the minimum sample count so the audio is
        # still ≥0.5 s after downsampling to 16 kHz
        target_sr = 16000  # CAM++ model target sample rate
        min_duration_seconds = 0.5
        min_samples_at_target_sr = int(target_sr * min_duration_seconds)  # 8000 samples @ 16 kHz

        if node.sample_rate >= target_sr:
            downsample_step = int(node.sample_rate / target_sr)
            min_audio_samples = min_samples_at_target_sr * downsample_step
        else:
            min_audio_samples = int(node.sample_rate * min_duration_seconds)

        while not node.stop_event.is_set():
            try:
                if node.sv_speech_end_event.wait(timeout=0.1):
                    node.sv_speech_end_event.clear()
                    with node.sv_lock:
                        audio_list = list(node.sv_audio_buffer)
                        buffer_size = len(audio_list)
                        node.sv_audio_buffer.clear()

                    node.get_logger().info(f"[声纹识别] 收到speech_end事件,录音长度: {buffer_size} 样本({buffer_size/node.sample_rate:.2f}秒)")

                    if node._handle_empty_speaker_db():
                        node.get_logger().info("[声纹识别] 数据库为空,跳过验证,直接设置UNKNOWN状态")
                        continue

                    if buffer_size >= min_audio_samples:
                        audio_array = np.array(audio_list, dtype=np.int16)
                        embedding, success = node.sv_client.extract_embedding(
                            audio_array,
                            sample_rate=node.sample_rate
                        )

                        if not success or embedding is None:
                            node.get_logger().debug("[声纹识别] 提取embedding失败")
                            with node.sv_lock:
                                node.current_speaker_id = None
                                node.current_speaker_state = SpeakerState.ERROR
                                node.current_speaker_score = 0.0
                        else:
                            speaker_id, match_state, score, _ = node.sv_client.match_speaker(embedding)
                            with node.sv_lock:
                                node.current_speaker_id = speaker_id
                                node.current_speaker_state = match_state
                                node.current_speaker_score = score

                            if match_state == SpeakerState.VERIFIED:
                                node.get_logger().info(f"[声纹识别] 识别到说话人: {speaker_id}, 相似度: {score:.4f}")
                            elif match_state == SpeakerState.REJECTED:
                                node.get_logger().info(f"[声纹识别] 未匹配到已知说话人(相似度不足), 相似度: {score:.4f}")
                            else:
                                node.get_logger().info(f"[声纹识别] 状态: {match_state.value}, 相似度: {score:.4f}")
                    else:
                        node.get_logger().debug(f"[声纹识别] 录音太短: {buffer_size} < {min_audio_samples},跳过处理")
                        with node.sv_lock:
                            node.current_speaker_id = None
                            node.current_speaker_state = SpeakerState.UNKNOWN
                            node.current_speaker_score = 0.0

                    with node.sv_result_cv:
                        node.sv_result_seq += 1
                        node.sv_result_cv.notify_all()

            except Exception as e:
                node.get_logger().error(f"[声纹识别线程] 错误: {e}")
                time.sleep(0.1)
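The minimum-length guard in `sv_worker` can be reproduced standalone (the function name below is illustrative):

```python
# Reproducing sv_worker's minimum-length computation: audio fed to the
# CAM++ model must still be at least 0.5 s after step-decimation to 16 kHz.
def min_audio_samples(sample_rate: int, target_sr: int = 16000,
                      min_duration_s: float = 0.5) -> int:
    min_at_target = int(target_sr * min_duration_s)  # 8000 samples @ 16 kHz
    if sample_rate >= target_sr:
        step = int(sample_rate / target_sr)  # integer decimation step
        return min_at_target * step
    return int(sample_rate * min_duration_s)
```

At a 48 kHz mic this yields a step of 3 and a 24000-sample floor. Note that the integer `step` under-counts for non-integer ratios: at 44.1 kHz the step truncates to 2, so 16000 samples (about 0.36 s) would pass the guard, slightly below the intended 0.5 s.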
463
robot_speaker/core/register_speaker_node.py
Normal file
@@ -0,0 +1,463 @@
"""
Standalone speaker-registration node: exits once registration completes
"""
import collections
import os
import queue
import threading
import time
import yaml

import numpy as np
import rclpy
from rclpy.node import Node
from ament_index_python.packages import get_package_share_directory

from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient
from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
from robot_speaker.models.asr.dashscope import DashScopeASR
from robot_speaker.models.tts.dashscope import DashScopeTTSClient
from robot_speaker.core.types import TTSRequest
from pypinyin import pinyin, Style


class RegisterSpeakerNode(Node):
    def __init__(self):
        super().__init__('register_speaker_node')
        self._load_config()

        self.stop_event = threading.Event()
        self.processing = False
        self.buffer_lock = threading.Lock()
        self.audio_buffer = collections.deque(maxlen=self.sv_buffer_size)

        # State: waiting for the wake word -> waiting for voiceprint speech
        self.waiting_for_wake_word = True
        self.waiting_for_voiceprint = False

        # Audio and text queues (for ASR)
        self.audio_queue = queue.Queue()
        self.text_queue = queue.Queue()

        self.vad_detector = VADDetector(
            mode=self.vad_mode,
            sample_rate=self.sample_rate
        )

        # Reference-signal buffer (for echo cancellation)
        self.reference_signal_buffer = ReferenceSignalBuffer(
            max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
            sample_rate=self.sample_rate,
            channels=self.output_channels
        ) if self.audio_echo_cancellation_enabled else None

        self.audio_recorder = AudioRecorder(
            device_index=self.input_device_index,
            sample_rate=self.sample_rate,
            channels=self.channels,
            chunk=self.chunk,
            vad_detector=self.vad_detector,
            audio_queue=self.audio_queue,  # fed to ASR for wake-word detection
            silence_duration_ms=self.silence_duration_ms,
            min_energy_threshold=self.min_energy_threshold,
            heartbeat_interval=self.audio_microphone_heartbeat_interval,
            on_heartbeat=self._on_heartbeat,
            is_playing=lambda: False,
            on_new_segment=None,
            on_speech_start=self._on_speech_start,
            on_speech_end=self._on_speech_end,
            stop_flag=self.stop_event.is_set,
            on_audio_chunk=self._on_audio_chunk,
            should_put_to_queue=self._should_put_to_queue,
            get_silence_threshold=lambda: self.silence_duration_ms,
            enable_echo_cancellation=self.audio_echo_cancellation_enabled,  # enable echo cancellation, matching the main program
            reference_signal_buffer=self.reference_signal_buffer,
            logger=self.get_logger()
        )

        # ASR client - used for wake-word detection
        self.asr_client = DashScopeASR(
            api_key=self.dashscope_api_key,
            sample_rate=self.sample_rate,
            model=self.asr_model,
            url=self.asr_url,
            logger=self.get_logger()
        )
        self.asr_client.on_sentence_end = self._on_asr_sentence_end
        self.asr_client.start()

        # ASR processing thread
        self.asr_thread = threading.Thread(
            target=self._asr_worker,
            name="RegisterASRThread",
            daemon=True
        )
        self.asr_thread.start()

        # Text processing thread
        self.text_thread = threading.Thread(
            target=self._text_worker,
            name="RegisterTextThread",
            daemon=True
        )
        self.text_thread.start()

        self.sv_client = SpeakerVerificationClient(
            model_path=self.sv_model_path,
            threshold=self.sv_threshold,
            speaker_db_path=self.sv_speaker_db_path,
            logger=self.get_logger()
        )

        self.tts_client = DashScopeTTSClient(
            api_key=self.dashscope_api_key,
            model=self.tts_model,
            voice=self.tts_voice,
            card_index=self.output_card_index,
            device_index=self.output_device_index,
            output_sample_rate=self.output_sample_rate,
            output_channels=self.output_channels,
            output_volume=self.output_volume,
            tts_source_sample_rate=self.audio_tts_source_sample_rate,
            tts_source_channels=self.audio_tts_source_channels,
            tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
            reference_signal_buffer=self.reference_signal_buffer,
            logger=self.get_logger()
        )

        self.get_logger().info("声纹注册节点启动,请说'er gou......'唤醒注册")
        self.recording_thread = threading.Thread(
            target=self.audio_recorder.record_with_vad,
            name="RegisterRecordingThread",
            daemon=True
        )
        self.recording_thread.start()

        self.timer = self.create_timer(0.2, self._check_done)

    def _load_config(self):
        config_file = os.path.join(
            get_package_share_directory('robot_speaker'),
            'config',
            'voice.yaml'
        )
        with open(config_file, 'r') as f:
            config = yaml.safe_load(f)

        dashscope = config['dashscope']
        audio = config['audio']
        mic = audio['microphone']
        soundcard = audio['soundcard']
        vad = config['vad']
        system = config['system']

        self.dashscope_api_key = dashscope['api_key']
        self.asr_model = dashscope['asr']['model']
        self.asr_url = dashscope['asr']['url']
        self.tts_model = dashscope['tts']['model']
        self.tts_voice = dashscope['tts']['voice']

        self.input_device_index = mic['device_index']
        self.sample_rate = mic['sample_rate']
        self.channels = mic['channels']
        self.chunk = mic['chunk']
        self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']

        self.output_card_index = soundcard['card_index']
        self.output_device_index = soundcard['device_index']
        self.output_sample_rate = soundcard['sample_rate']
        self.output_channels = soundcard['channels']
        self.output_volume = soundcard['volume']

        echo = audio.get('echo_cancellation', {})
        self.audio_echo_cancellation_enabled = echo.get('enabled', True)  # enabled by default
        self.audio_echo_cancellation_max_duration_ms = echo.get('max_duration_ms', 200)

        tts_audio = audio.get('tts', {})
        self.audio_tts_source_sample_rate = tts_audio.get('source_sample_rate', 22050)
        self.audio_tts_source_channels = tts_audio.get('source_channels', 1)
        self.audio_tts_ffmpeg_thread_queue_size = tts_audio.get('ffmpeg_thread_queue_size', 5)

        self.vad_mode = vad['vad_mode']
        self.silence_duration_ms = vad['silence_duration_ms']
        self.min_energy_threshold = vad['min_energy_threshold']

        self.sv_model_path = os.path.expanduser(system['sv_model_path'])
        self.sv_threshold = system['sv_threshold']
        self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path'])
        self.sv_buffer_size = system['sv_buffer_size']
        self.wake_word = system['wake_word']

    def _should_put_to_queue(self) -> bool:
        """Queue audio for ASR only while waiting for the wake word."""
        return self.waiting_for_wake_word

    def _on_heartbeat(self):
        if self.waiting_for_wake_word:
            self.get_logger().info("[注册录音] 等待唤醒词'er gou'...")
        elif self.waiting_for_voiceprint:
            self.get_logger().info("[注册录音] 等待声纹语音...")

    def _on_speech_start(self):
        if self.waiting_for_wake_word:
            # While waiting for the wake word, start recording (the audio may contain it)
            self.get_logger().info("[注册录音] 检测到人声,开始录音")
        elif self.waiting_for_voiceprint:
            self.get_logger().info("[注册录音] 检测到人声,继续录音(用于声纹注册)")
            # Note: do not clear the buffer; keep the audio containing the wake word

    def _on_audio_chunk(self, audio_chunk: bytes):
        # Record all audio (wake word included) for voiceprint registration
        try:
            audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
            with self.buffer_lock:
                self.audio_buffer.extend(audio_array)
        except Exception as e:
            self.get_logger().debug(f"[注册录音] 录音失败: {e}")

    def _on_speech_end(self):
        # Still waiting for the wake word: nothing to do
        if self.waiting_for_wake_word:
            return
        # Already processing: avoid re-entry
        if self.processing:
            return

        # Waiting for voiceprint speech and the user just finished talking:
        # use the current audio even if it is shorter than 3 seconds
        if self.waiting_for_voiceprint:
            self._process_voiceprint_audio(use_current_audio_if_short=True)
            return  # return immediately to prevent duplicate calls

    def _process_voiceprint_audio(self, use_current_audio_if_short: bool = False):
        """Process voiceprint audio: register with the user's complete first utterance.

        Args:
            use_current_audio_if_short: if the audio is shorter than 3 s, use it anyway (the user has already finished speaking)
        """
        if self.processing:
            return
        self.processing = True
        with self.buffer_lock:
            audio_list = list(self.audio_buffer)

        buffer_size = len(audio_list)
        buffer_sec = buffer_size / self.sample_rate
        self.get_logger().info(f"[注册录音] 当前音频长度: {buffer_sec:.2f}秒")

        required_samples = int(self.sample_rate * 3)

        # Audio shorter than 3 seconds
        if buffer_size < required_samples:
            if use_current_audio_if_short:
                # The user already finished; use the current audio even though it is short
                self.get_logger().info(f"[注册录音] 音频不足3秒(当前{buffer_sec:.2f}秒),但用户已说完,使用当前音频进行注册")
                audio_to_use = audio_list
            else:
                # Keep recording
                self.get_logger().info(f"[注册录音] 音频不足3秒(当前{buffer_sec:.2f}秒),等待继续录音...")
                self.processing = False
                return
        else:
            # Strategy note: don't blindly cut a fixed window, because wake-word
            # detection is delayed and "er gou" may sit mid-to-late in the buffer.
            # To avoid grabbing trailing silence while still covering the full
            # wake word, take the most recent 3.0 s (or everything, if shorter),
            # which keeps the useful speech "二狗" with high probability.
            target_samples = int(self.sample_rate * 3.0)
            if buffer_size > target_samples:
                audio_to_use = audio_list[-target_samples:]
            else:
                audio_to_use = audio_list

        duration = len(audio_to_use) / self.sample_rate
        self.get_logger().info(f"[注册录音] 使用最近 {duration:.2f} 秒音频用于注册(覆盖唤醒词)")

        # Clear the buffer
        with self.buffer_lock:
            self.audio_buffer.clear()

        try:
            audio_array = np.array(audio_to_use, dtype=np.int16)
            embedding, success = self.sv_client.extract_embedding(
                audio_array,
                sample_rate=self.sample_rate
            )
            if not success or embedding is None:
                self.get_logger().error("[注册录音] 提取embedding失败")
                self.processing = False
                return

            speaker_id = f"user_{int(time.time())}"
            if self.sv_client.register_speaker(speaker_id, embedding):
                self.get_logger().info(f"[注册录音] 注册成功,用户ID: {speaker_id},准备退出")

                # Play the success prompt
                try:
                    self.get_logger().info("[注册录音] 播放注册成功提示")
                    request = TTSRequest(text="声纹注册成功", voice=self.tts_voice)
                    self.tts_client.synthesize(request)
                    time.sleep(5)
                except Exception as e:
                    self.get_logger().error(f"[注册录音] 播放提示失败: {e}")

                self.stop_event.set()
            else:
                self.get_logger().error("[注册录音] 注册失败")
                self.processing = False
        except Exception as e:
            self.get_logger().error(f"[注册录音] 注册异常: {e}")
            self.processing = False
def _extract_speech_segments(self, audio_array: np.ndarray, frame_size: int = 1024) -> list:
|
||||
"""使用能量检测提取人声片段(过滤静音)"""
|
||||
speech_segments = []
|
||||
frame_samples = frame_size
|
||||
total_frames = 0
|
||||
speech_frames = 0
|
||||
|
||||
for i in range(0, len(audio_array), frame_samples):
|
||||
frame = audio_array[i:i + frame_samples]
|
||||
if len(frame) < frame_samples:
|
||||
break
|
||||
|
||||
total_frames += 1
|
||||
# 计算帧的能量(RMS,对于int16音频)
|
||||
frame_float = frame.astype(np.float32)
|
||||
energy = np.sqrt(np.mean(frame_float ** 2))
|
||||
|
||||
# 使用更低的阈值来检测人声(降低阈值,避免误判静音)
|
||||
# 阈值可以动态调整,或者使用自适应阈值
|
||||
threshold = self.min_energy_threshold * 0.50 # 降低阈值到原来的50%
|
||||
|
||||
# 如果能量超过阈值,认为是人声
|
||||
if energy >= threshold:
|
||||
speech_segments.append((i, i + frame_samples))
|
||||
speech_frames += 1
|
||||
|
||||
# 调试信息
|
||||
if total_frames > 0:
|
||||
speech_ratio = speech_frames / total_frames
|
||||
self.get_logger().debug(f"[注册录音] 能量检测: 总帧数={total_frames}, 人声帧数={speech_frames}, 人声比例={speech_ratio:.2%}, 阈值={self.min_energy_threshold}")
|
||||
|
||||
return speech_segments
|
||||
|
||||
def _merge_speech_segments(self, audio_array: np.ndarray, segments: list, min_samples: int) -> np.ndarray:
|
||||
"""合并人声片段,返回连续的人声音频"""
|
||||
if not segments:
|
||||
return np.array([], dtype=np.int16)
|
||||
|
||||
# 合并相邻的片段
|
||||
merged_segments = []
|
||||
current_start, current_end = segments[0]
|
||||
|
||||
for start, end in segments[1:]:
|
||||
if start <= current_end + 1024: # 允许小间隙(1帧)
|
||||
current_end = end
|
||||
else:
|
||||
merged_segments.append((current_start, current_end))
|
||||
current_start, current_end = start, end
|
||||
merged_segments.append((current_start, current_end))
|
||||
|
||||
# 从后往前选择片段,直到达到3秒
|
||||
selected_audio = []
|
||||
total_samples = 0
|
||||
|
||||
for start, end in reversed(merged_segments):
|
||||
segment_audio = audio_array[start:end]
|
||||
selected_audio.insert(0, segment_audio)
|
||||
total_samples += len(segment_audio)
|
||||
if total_samples >= min_samples:
|
||||
break
|
||||
|
||||
if not selected_audio:
|
||||
return np.array([], dtype=np.int16)
|
||||
|
||||
return np.concatenate(selected_audio)
|
||||
|
||||
def _asr_worker(self):
|
||||
"""ASR处理线程"""
|
||||
while not self.stop_event.is_set():
|
||||
try:
|
||||
audio_chunk = self.audio_queue.get(timeout=0.1)
|
||||
if self.asr_client and self.asr_client.running:
|
||||
self.asr_client.send_audio(audio_chunk)
|
||||
except queue.Empty:
|
||||
continue
|
||||
except Exception as e:
|
||||
self.get_logger().error(f"[注册ASR] 处理异常: {e}")
|
||||
|
||||
def _on_asr_sentence_end(self, text: str):
|
||||
"""ASR识别完成回调"""
|
||||
if text and text.strip():
|
||||
self.text_queue.put(text.strip())
|
||||
|
||||
def _text_worker(self):
|
||||
"""文本处理线程:检测唤醒词"""
|
||||
while not self.stop_event.is_set():
|
||||
try:
|
||||
text = self.text_queue.get(timeout=0.1)
|
||||
if self.waiting_for_wake_word:
|
||||
self._check_wake_word(text)
|
||||
except queue.Empty:
|
||||
continue
|
||||
except Exception as e:
|
||||
self.get_logger().error(f"[注册文本] 处理异常: {e}")
|
||||
|
||||
def _to_pinyin(self, text: str) -> str:
|
||||
"""将中文文本转换为拼音"""
|
||||
chars = [c for c in text if '\u4e00' <= c <= '\u9fa5']
|
||||
if not chars:
|
||||
return ""
|
||||
py_list = pinyin(chars, style=Style.NORMAL)
|
||||
return ' '.join([item[0] for item in py_list]).lower().strip()
|
||||
|
||||
def _check_wake_word(self, text: str):
|
||||
"""检查是否包含唤醒词"""
|
||||
text_pinyin = self._to_pinyin(text)
|
||||
wake_word_pinyin = self.wake_word.lower().strip()
|
||||
self.get_logger().info(f"[注册唤醒词] 原始文本: {text}, 文本拼音: {text_pinyin}, 唤醒词拼音: {wake_word_pinyin}")
|
||||
|
||||
if not wake_word_pinyin:
|
||||
return
|
||||
|
||||
text_pinyin_parts = text_pinyin.split() if text_pinyin else []
|
||||
wake_word_parts = wake_word_pinyin.split()
|
||||
|
||||
# 检查是否包含唤醒词
|
||||
for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
|
||||
if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
|
||||
self.get_logger().info(f"[注册唤醒词] 检测到唤醒词 '{self.wake_word}'")
|
||||
self.get_logger().info("=" * 50)
|
||||
self.get_logger().info("[声纹注册] 开始注册声纹,将截取3秒音频用于注册")
|
||||
self.get_logger().info("=" * 50)
|
||||
self.waiting_for_wake_word = False
|
||||
self.waiting_for_voiceprint = True
|
||||
# 停止ASR,不再需要识别
|
||||
if self.asr_client:
|
||||
self.asr_client.stop_current_recognition()
|
||||
# 立即处理当前音频缓冲区中的完整音频
|
||||
# 用户可能已经说完了(包含唤醒词的整段语音)
|
||||
self._process_voiceprint_audio()
|
||||
return
|
||||
|
||||
def _check_done(self):
|
||||
if self.stop_event.is_set():
|
||||
self.get_logger().info("注册完成,节点退出")
|
||||
# 清理资源
|
||||
if self.asr_client:
|
||||
self.asr_client.stop()
|
||||
self.destroy_node()
|
||||
rclpy.shutdown()
|
||||
|
||||
|
||||
def main(args=None):
|
||||
rclpy.init(args=args)
|
||||
node = RegisterSpeakerNode()
|
||||
rclpy.spin(node)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
||||
|
||||
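The registration flow above gates audio on frame energy and then merges near-adjacent keeps: frame the signal, keep frames whose RMS exceeds a threshold, join segments separated by at most one frame. A minimal standalone sketch of just that logic (function names, threshold, and the synthetic signal are illustrative, not part of the package):

```python
import numpy as np

def extract_speech_segments(audio, threshold, frame_size=1024):
    """Return (start, end) sample ranges whose RMS energy exceeds threshold."""
    segments = []
    for i in range(0, len(audio) - frame_size + 1, frame_size):
        frame = audio[i:i + frame_size].astype(np.float32)
        rms = np.sqrt(np.mean(frame ** 2))
        if rms >= threshold:
            segments.append((i, i + frame_size))
    return segments

def merge_segments(segments, gap=1024):
    """Merge segments separated by at most `gap` samples."""
    if not segments:
        return []
    merged = [segments[0]]
    for start, end in segments[1:]:
        if start <= merged[-1][1] + gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

# Synthetic demo: 1024 samples of silence, 2048 of "speech", 1024 of silence.
audio = np.concatenate([
    np.zeros(1024, dtype=np.int16),
    (1000 * np.ones(2048)).astype(np.int16),
    np.zeros(1024, dtype=np.int16),
])
segs = merge_segments(extract_speech_segments(audio, threshold=500.0))
print(segs)  # [(1024, 3072)]
```

The node's version differs only in bookkeeping (frame/speech counters, logging) and in scaling the configured threshold by 0.5.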
858 robot_speaker/core/robot_speaker_node.py (new file)
@@ -0,0 +1,858 @@
"""
|
||||
语音交互节点
|
||||
"""
|
||||
import rclpy
|
||||
from rclpy.node import Node
|
||||
from std_msgs.msg import String
|
||||
import threading
|
||||
import queue
|
||||
import time
|
||||
import re
|
||||
import base64
|
||||
import io
|
||||
import numpy as np
|
||||
from PIL import Image
|
||||
import subprocess
|
||||
import collections
|
||||
import os
|
||||
import yaml
|
||||
import json
|
||||
from ament_index_python.packages import get_package_share_directory
|
||||
from robot_speaker.perception.audio_pipeline import VADDetector, AudioRecorder
|
||||
from robot_speaker.models.asr.dashscope import DashScopeASR
|
||||
from robot_speaker.models.tts.dashscope import DashScopeTTSClient
|
||||
from robot_speaker.models.llm.dashscope import DashScopeLLM
|
||||
from robot_speaker.understanding.context_manager import ConversationHistory
|
||||
from robot_speaker.core.types import LLMMessage, TTSRequest
|
||||
from robot_speaker.perception.camera_client import CameraClient
|
||||
from robot_speaker.perception.speaker_verifier import SpeakerVerificationClient, SpeakerState
|
||||
from robot_speaker.perception.echo_cancellation import ReferenceSignalBuffer
|
||||
from robot_speaker.core.conversation_state import ConversationState
|
||||
from robot_speaker.core.node_workers import NodeWorkers
|
||||
from robot_speaker.core.node_callbacks import NodeCallbacks
|
||||
from robot_speaker.core.intent_router import IntentRouter, IntentResult
|
||||
|
||||
|
||||
class RobotSpeakerNode(Node):
|
||||
# ==================== 初始化 ====================
|
||||
def __init__(self):
|
||||
super().__init__('robot_speaker_node')
|
||||
|
||||
# 直接从配置文件加载参数
|
||||
self._load_config()
|
||||
|
||||
# 初始化队列(线程间通信)
|
||||
self.audio_queue = queue.Queue() # 录音线程 → ASR线程
|
||||
self.text_queue = queue.Queue() # ASR线程 → 主线程
|
||||
self.tts_queue = queue.Queue() # 主线程 → TTS线程
|
||||
|
||||
# 初始化线程同步事件
|
||||
self.interrupt_event = threading.Event() # 中断标志
|
||||
self.stop_event = threading.Event() # 停止标志
|
||||
self.tts_playing_event = threading.Event() # TTS播放状态
|
||||
|
||||
# 初始化会话管理
|
||||
self.session_active = False
|
||||
self.session_start_time = 0.0
|
||||
self.session_lock = threading.Lock()
|
||||
|
||||
# 状态机状态
|
||||
self.conversation_state = ConversationState.IDLE # 当前会话状态
|
||||
self.state_lock = threading.Lock() # 保护状态机状态
|
||||
|
||||
# 声纹识别共享状态
|
||||
self.current_speaker_id = None # 当前说话人ID(共享状态,只读)
|
||||
self.current_speaker_state = SpeakerState.UNKNOWN # 当前说话人状态
|
||||
self.current_speaker_score = 0.0 # 当前说话人相似度得分
|
||||
self.sv_lock = threading.Lock() # 保护声纹识别共享状态
|
||||
self.sv_speech_end_event = threading.Event() # 通知声纹线程处理(speech_end触发)
|
||||
self.sv_result_ready_event = threading.Event() # 保留兼容(已不用于同步)
|
||||
self.sv_result_lock = threading.Lock() # 声纹结果序号锁
|
||||
self.sv_result_cv = threading.Condition(self.sv_result_lock)
|
||||
self.sv_result_seq = 0
|
||||
# 声纹缓冲区大小将在_init_components中初始化(需要先读取参数)
|
||||
self.sv_audio_buffer = None # 声纹验证录音缓冲区(将在_init_components中初始化)
|
||||
self.sv_recording = False # 是否正在为声纹验证录音
|
||||
|
||||
# 声纹注册状态
|
||||
self.utterance_lock = threading.Lock()
|
||||
self.current_utterance_id = 0
|
||||
self.last_processed_utterance_id = 0
|
||||
|
||||
self.intent_router = IntentRouter()
|
||||
self.callbacks = NodeCallbacks(self)
|
||||
|
||||
# 初始化组件(VAD、录音器、ASR、LLM、TTS)
|
||||
self._init_components()
|
||||
self.workers = NodeWorkers(self)
|
||||
|
||||
# 状态机初始状态
|
||||
if self.sv_enabled and self.sv_client:
|
||||
speaker_count = self.sv_client.get_speaker_count()
|
||||
if speaker_count == 0:
|
||||
self.get_logger().info("声纹数据库为空,请注册声纹")
|
||||
|
||||
# ROS订阅
|
||||
self.interrupt_sub = self.create_subscription(
|
||||
String, 'interrupt_command', self.callbacks.handle_interrupt_command, self.system_interrupt_command_queue_depth
|
||||
)
|
||||
self.skill_sequence_pub = self.create_publisher(String, '/llm_skill_sequence', 10)
|
||||
self.skill_feedback_sub = self.create_subscription(
|
||||
String, '/skill_execution_feedback', self._on_skill_feedback, 10
|
||||
)
|
||||
self.skill_result_sub = self.create_subscription(
|
||||
String, '/skill_execution_result', self._on_skill_result, 10
|
||||
)
|
||||
|
||||
self.latest_skill_feedback = None
|
||||
self.latest_skill_result = None
|
||||
|
||||
# 启动线程
|
||||
self._start_threads()
|
||||
self.get_logger().info("语音节点已启动")
|
||||
|
||||
    # ==================== Config loading ====================
    def _load_config(self):
        """Load parameters directly from the voice.yaml config file."""
        config_file = os.path.join(
            get_package_share_directory('robot_speaker'),
            'config',
            'voice.yaml'
        )
        with open(config_file, 'r') as f:
            config = yaml.safe_load(f)

        # Audio parameters
        audio = config['audio']
        mic = audio['microphone']
        soundcard = audio['soundcard']
        echo = audio['echo_cancellation']
        tts_audio = audio['tts']

        self.input_device_index = mic['device_index']
        self.output_card_index = soundcard['card_index']
        self.output_device_index = soundcard['device_index']
        self.sample_rate = mic['sample_rate']
        self.channels = mic['channels']
        self.chunk = mic['chunk']
        self.audio_microphone_heartbeat_interval = mic['heartbeat_interval']
        self.output_sample_rate = soundcard['sample_rate']
        self.output_channels = soundcard['channels']
        self.output_volume = soundcard['volume']
        self.audio_echo_cancellation_enabled = echo.get('enabled', True)  # enabled by default
        self.audio_echo_cancellation_max_duration_ms = echo['max_duration_ms']
        self.audio_tts_source_sample_rate = tts_audio['source_sample_rate']
        self.audio_tts_source_channels = tts_audio['source_channels']
        self.audio_tts_ffmpeg_thread_queue_size = tts_audio['ffmpeg_thread_queue_size']

        # VAD parameters
        vad = config['vad']
        self.vad_mode = vad['vad_mode']
        self.silence_duration_ms = vad['silence_duration_ms']
        self.min_energy_threshold = vad['min_energy_threshold']

        # DashScope parameters
        dashscope = config['dashscope']
        self.dashscope_api_key = dashscope['api_key']
        self.asr_model = dashscope['asr']['model']
        self.asr_url = dashscope['asr']['url']
        self.llm_model = dashscope['llm']['model']
        self.llm_base_url = dashscope['llm']['base_url']
        self.llm_temperature = dashscope['llm']['temperature']
        self.llm_max_tokens = dashscope['llm']['max_tokens']
        self.llm_max_history = dashscope['llm']['max_history']
        self.llm_summary_trigger = dashscope['llm']['summary_trigger']
        self.tts_model = dashscope['tts']['model']
        self.tts_voice = dashscope['tts']['voice']

        # System parameters
        system = config['system']
        self.use_llm = system['use_llm']
        self.use_wake_word = system['use_wake_word']
        self.wake_word = system['wake_word']
        self.session_timeout = system['session_timeout']
        self.system_shutup_keywords = system['shutup_keywords']
        self.system_interrupt_command_queue_depth = system['interrupt_command_queue_depth']
        self.sv_enabled = system['sv_enabled']
        self.sv_model_path = os.path.expanduser(system['sv_model_path'])
        self.sv_threshold = system['sv_threshold']
        self.sv_speaker_db_path = os.path.expanduser(system['sv_speaker_db_path'])  # expand the user home directory
        self.sv_buffer_size = system['sv_buffer_size']

        # Camera parameters
        camera = config['camera']
        self.camera_serial_number = camera['serial_number']
        self.camera_rgb_width = camera['rgb']['width']
        self.camera_rgb_height = camera['rgb']['height']
        self.camera_rgb_fps = camera['rgb']['fps']
        self.camera_rgb_format = camera['rgb']['format']
        self.camera_image_jpeg_quality = camera['image']['jpeg_quality']
        self.camera_image_max_size = camera['image']['max_size']

        self.knowledge_file = os.path.join(
            get_package_share_directory('robot_speaker'),
            'config',
            'knowledge.json'
        )
    # ==================== Component initialization ====================
    def _init_components(self):
        """Initialize all components."""
        self.shutup_keywords = [k.strip() for k in self.system_shutup_keywords.split(',') if k.strip()]

        self.kb_answers_map = {}
        if self.knowledge_file and os.path.exists(self.knowledge_file):
            try:
                with open(self.knowledge_file, 'r') as f:
                    kb_data = json.load(f)
                entries = kb_data["entries"]
                for entry in entries:
                    patterns = entry["patterns"]
                    answer = entry["answer"]
                    if not answer.strip():
                        continue
                    for pattern in patterns:
                        key = pattern.strip().lower()
                        if key:
                            self.kb_answers_map[key] = answer.strip()
                self.get_logger().info(f"知识库已加载: {len(self.kb_answers_map)} 条")
            except Exception as e:
                self.get_logger().warning(f"知识库加载失败: {e}")

        self.sv_audio_buffer = collections.deque(maxlen=self.sv_buffer_size)

        self.vad_detector = VADDetector(
            mode=self.vad_mode,
            sample_rate=self.sample_rate
        )

        # Reference-signal buffer for echo cancellation; playback runs at 44100 Hz,
        # but the microphone input is 16 kHz
        self.reference_signal_buffer = ReferenceSignalBuffer(
            max_duration_ms=self.audio_echo_cancellation_max_duration_ms,
            sample_rate=self.sample_rate,
            channels=self.output_channels
        ) if self.audio_echo_cancellation_enabled else None

        # Recorder - sends audio chunks straight to the queue
        self.audio_recorder = AudioRecorder(
            device_index=self.input_device_index,
            sample_rate=self.sample_rate,
            channels=self.channels,
            chunk=self.chunk,
            vad_detector=self.vad_detector,
            audio_queue=self.audio_queue,
            silence_duration_ms=self.silence_duration_ms,
            min_energy_threshold=self.min_energy_threshold,
            heartbeat_interval=self.audio_microphone_heartbeat_interval,
            on_heartbeat=self.callbacks.on_heartbeat,
            is_playing=self.tts_playing_event.is_set,
            on_new_segment=self.callbacks.on_new_segment,
            on_speech_start=self.callbacks.on_speech_start,
            on_speech_end=self.callbacks.on_speech_end,
            stop_flag=self.stop_event.is_set,
            on_audio_chunk=self.callbacks.on_audio_chunk_for_sv if self.sv_enabled else None,  # SV recording callback
            should_put_to_queue=self.callbacks.should_put_audio_to_queue,  # gate audio before it enters the queue
            get_silence_threshold=self.callbacks.get_silence_threshold,    # dynamic silence-threshold callback
            enable_echo_cancellation=self.audio_echo_cancellation_enabled,  # read from the config file
            reference_signal_buffer=self.reference_signal_buffer,  # pass the reference-signal buffer
            logger=self.get_logger()
        )

        # ASR client - streaming recognition
        self.asr_client = DashScopeASR(
            api_key=self.dashscope_api_key,
            sample_rate=self.sample_rate,
            model=self.asr_model,
            url=self.asr_url,
            logger=self.get_logger()
        )
        self.asr_client.on_sentence_end = self.callbacks.on_asr_sentence_end
        self.asr_client.on_text_update = self.callbacks.on_asr_text_update
        self.asr_client.start()

        # LLM client
        if self.use_llm:
            self.llm_client = DashScopeLLM(
                api_key=self.dashscope_api_key,
                model=self.llm_model,
                base_url=self.llm_base_url,
                temperature=self.llm_temperature,
                max_tokens=self.llm_max_tokens,
                name="LLM-chat",
                logger=self.get_logger()
            )
            self.history = ConversationHistory(
                max_history=self.llm_max_history,
                summary_trigger=self.llm_summary_trigger
            )
        else:
            self.llm_client = None
            self.history = None

        # TTS client
        self.get_logger().info(f"TTS配置: model={self.tts_model}, voice={self.tts_voice}")
        self.get_logger().info(f"音频输出配置: sample_rate={self.output_sample_rate}, channels={self.output_channels}")
        self.tts_client = DashScopeTTSClient(
            api_key=self.dashscope_api_key,
            model=self.tts_model,
            voice=self.tts_voice,
            card_index=self.output_card_index,
            device_index=self.output_device_index,
            output_sample_rate=self.output_sample_rate,
            output_channels=self.output_channels,
            output_volume=self.output_volume,
            tts_source_sample_rate=self.audio_tts_source_sample_rate,
            tts_source_channels=self.audio_tts_source_channels,
            tts_ffmpeg_thread_queue_size=self.audio_tts_ffmpeg_thread_queue_size,
            reference_signal_buffer=self.reference_signal_buffer,  # pass the reference-signal buffer
            logger=self.get_logger()
        )

        # Camera client (runs continuously by default)
        try:
            self.camera_client = CameraClient(
                serial_number=self.camera_serial_number,
                width=self.camera_rgb_width,
                height=self.camera_rgb_height,
                fps=self.camera_rgb_fps,
                format=self.camera_rgb_format,
                logger=self.get_logger()
            )
            self.camera_client.initialize()
        except Exception as e:
            self.get_logger().warning(f"相机初始化失败: {e},相机功能将不可用")
            self.camera_client = None

        # Speaker-verification client
        if self.sv_enabled and self.sv_model_path:
            try:
                self.sv_client = SpeakerVerificationClient(
                    model_path=self.sv_model_path,
                    threshold=self.sv_threshold,
                    speaker_db_path=self.sv_speaker_db_path,
                    logger=self.get_logger()
                )
            except Exception as e:
                self.get_logger().warning(f"声纹识别初始化失败: {e},声纹功能将不可用")
                self.sv_client = None
                self.sv_enabled = False
        else:
            self.sv_client = None
    # ==================== Thread startup ====================
    def _start_threads(self):
        """Start worker threads."""
        # Thread 1: recording thread
        self.recording_thread = threading.Thread(
            target=self.workers.recording_worker,
            name="RecordingThread",
            daemon=True
        )
        self.recording_thread.start()

        # Thread 2: ASR inference thread
        self.asr_thread = threading.Thread(
            target=self.workers.asr_worker,
            name="ASRThread",
            daemon=True
        )
        self.asr_thread.start()

        # Thread 3: main processing thread - business logic
        self.process_thread = threading.Thread(
            target=self.workers.process_worker,
            name="ProcessThread",
            daemon=True
        )
        self.process_thread.start()

        # Thread 4: TTS playback thread
        self.tts_thread = threading.Thread(
            target=self._tts_worker,
            name="TTSThread",
            daemon=True
        )
        self.tts_thread.start()

        # Thread 5: speaker-verification thread (if enabled)
        if self.sv_enabled and self.sv_client:
            self.sv_thread = threading.Thread(
                target=self.workers.sv_worker,
                name="SVThread",
                daemon=True
            )
            self.sv_thread.start()
        else:
            self.sv_thread = None
    # ==================== TTS playback thread ====================
    def _tts_worker(self):
        """
        Thread 4: TTS playback thread - playback only.
        """
        self.get_logger().info("[TTS播放线程] 启动")
        while not self.stop_event.is_set():
            try:
                text = self.tts_queue.get(timeout=1.0)
            except queue.Empty:
                if self.interrupt_event.is_set():
                    self.get_logger().debug("[TTS播放线程] 检测到中断事件")
                continue

            if self.interrupt_event.is_set():
                self.get_logger().info("[TTS播放线程] 中断播放,跳过文本")
                continue

            if not text or not str(text).strip():
                continue

            text_str = str(text).strip()
            text_len = len(text_str)
            self.get_logger().info(f"[TTS播放线程] 开始播放: {text_str[:100]}... (总长度: {text_len}字符)")
            self.tts_playing_event.set()

            request = TTSRequest(text=text_str, voice=None)
            success = self.tts_client.synthesize(
                request,
                interrupt_check=lambda: self.interrupt_event.is_set()
            )
            if success:
                self.get_logger().info("[TTS播放线程] 播放完成")
            else:
                self.get_logger().info("[TTS播放线程] 播放被中断")

            self.tts_playing_event.clear()

            if self.interrupt_event.is_set():
                self.get_logger().info("[TTS播放线程] 播放完成后检测到中断,清空队列")
                self._drain_queue(self.tts_queue)
                self.interrupt_event.clear()
    # ==================== State-machine methods ====================
    def _change_state(self, new_state: ConversationState, reason: str | None = None):
        """Change the state-machine state."""
        with self.state_lock:
            old_state = self.conversation_state
            self.conversation_state = new_state
            if reason:
                self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}: {reason}")
            else:
                self.get_logger().info(f"[状态机] {old_state.value} -> {new_state.value}")

    def _get_state(self) -> ConversationState:
        """Return the current state."""
        with self.state_lock:
            return self.conversation_state

    # ==================== LLM processing (with camera capture) ====================
    def _encode_image_to_base64(self, image_data: np.ndarray, quality: int = 85) -> str:
        """Encode a numpy image array as a base64 string."""
        try:
            if image_data.shape[2] == 3:
                pil_image = Image.fromarray(image_data, 'RGB')
            else:
                pil_image = Image.fromarray(image_data)

            buffer = io.BytesIO()
            pil_image.save(buffer, format='JPEG', quality=quality)
            image_bytes = buffer.getvalue()
            base64_str = base64.b64encode(image_bytes).decode('utf-8')
            return base64_str
        except Exception as e:
            self.get_logger().error(f"图像编码失败: {e}")
            return ""
    def _llm_process_stream_with_camera(
        self,
        user_text: str,
        need_camera: bool,
        system_prompt: str | None = None,
        suppress_tts: bool = False
    ) -> str:
        """Streamed LLM processing - supports multimodal input (text + images)."""
        if not self.llm_client or not self.history:
            return ""

        messages = list(self.history.get_messages())

        has_system_msg = any(msg.role == "system" for msg in messages)
        if not has_system_msg:
            if not system_prompt:
                system_prompt = self.intent_router.build_default_system_prompt()
            messages.insert(0, LLMMessage(role="system", content=system_prompt))

        full_reply = ""
        tts_text_buffer = ""
        image_base64_list = []

        def on_token(token: str):
            nonlocal full_reply, tts_text_buffer
            if self.interrupt_event.is_set():
                self.get_logger().info("[LLM流式处理] on_token回调中检测到中断,停止处理")
                return

            full_reply += token
            tts_text_buffer += token

        if need_camera and self.camera_client:
            with self.camera_client.capture_context() as image_data:
                if image_data is not None:
                    image_base64 = self._encode_image_to_base64(
                        image_data,
                        quality=self.camera_image_jpeg_quality
                    )
                    if image_base64:
                        image_base64_list.append(image_base64)
                        self.get_logger().info("[相机] 已拍照")

        if image_base64_list:
            self.get_logger().info(
                f"[多模态] 准备发送给LLM: {len(image_base64_list)}张图片,用户文本: {user_text[:50]}"
            )
            for idx, img_b64 in enumerate(image_base64_list):
                self.get_logger().debug(f"[多模态] 图片#{idx+1} base64长度: {len(img_b64)}")

        reply = self.llm_client.chat_stream(
            messages,
            on_token=on_token,
            images=image_base64_list if image_base64_list else None,
            interrupt_check=lambda: self.interrupt_event.is_set()
        )

        if self.interrupt_event.is_set() or (reply is None):
            if self.interrupt_event.is_set():
                self.get_logger().info("[LLM流式处理] 处理被中断")
            return ""

        if image_base64_list:
            for img_b64 in image_base64_list:
                del img_b64
            image_base64_list.clear()
            self.get_logger().info("[相机] 已删除照片")

        if reply and reply.strip():
            tts_text_to_send = reply.strip()
            tts_buffer_len = len(tts_text_buffer.strip()) if tts_text_buffer else 0
            reply_len = len(tts_text_to_send)
            if tts_buffer_len != reply_len:
                self.get_logger().info(
                    f"[流式TTS] tts_text_buffer({tts_buffer_len}字符)和reply({reply_len}字符)长度不一致,使用reply作为TTS文本"
                )
        elif tts_text_buffer and tts_text_buffer.strip():
            tts_text_to_send = tts_text_buffer.strip()
            self.get_logger().warning(
                f"[流式TTS] reply为空,使用tts_text_buffer({len(tts_text_to_send)}字符)作为TTS文本"
            )
        else:
            tts_text_to_send = ""
            self.get_logger().warning("[流式TTS] reply和tts_text_buffer都为空,无法发送TTS文本")

        if not self.interrupt_event.is_set() and tts_text_to_send and not suppress_tts:
            text_len = len(tts_text_to_send)
            self.get_logger().info(
                f"[流式TTS] 发送完整文本到TTS队列: {tts_text_to_send[:100]}... (总长度: {text_len}字符)"
            )
            if text_len > 100:
                self.get_logger().debug(f"[流式TTS] 完整文本内容: {tts_text_to_send}")
            self._put_tts_text(tts_text_to_send)
        elif suppress_tts:
            self.get_logger().info("[流式TTS] suppress_tts开启,跳过TTS输出")

        return reply.strip() if reply else ""
    # ==================== Interrupt and TTS utilities ====================
    def _force_stop_tts(self):
        """Force-stop TTS playback - kill the recorded ffmpeg process by PID."""
        self._drain_queue(self.tts_queue)
        self.interrupt_event.set()

        if self.tts_client and self.tts_client.current_ffmpeg_pid:
            try:
                pid = self.tts_client.current_ffmpeg_pid
                os.kill(pid, 9)  # SIGKILL
                self.get_logger().info(f"[强制停止TTS] 已终止ffmpeg进程,PID={pid}")
                self.tts_client.current_ffmpeg_pid = None
            except ProcessLookupError:
                self.get_logger().debug(f"[强制停止TTS] ffmpeg进程已不存在,PID={pid}")
                self.tts_client.current_ffmpeg_pid = None
            except Exception as e:
                self.get_logger().warning(f"[强制停止TTS] 终止ffmpeg进程失败: {e}")

    def _check_interrupt(self, auto_clear: bool = False) -> bool:
        """
        Check the interrupt flag.
        """
        if self.interrupt_event.is_set():
            if auto_clear:
                self.interrupt_event.clear()
            return True
        return False

    def _check_interrupt_and_cancel_turn(self) -> bool:
        """Check for an interrupt and cancel the turn (unified post-interrupt cleanup)."""
        if self._check_interrupt(auto_clear=True):
            if self.use_llm and self.history:
                self.history.cancel_turn()
            return True
        return False
    # ==================== Registration / session / wake word ====================
    def _handle_empty_speaker_db(self) -> bool:
        """Handle an empty speaker database (unified handling)."""
        if not (self.sv_enabled and self.sv_client):
            return False

        speaker_count = self.sv_client.get_speaker_count()
        if speaker_count == 0:
            with self.sv_lock:
                self.current_speaker_id = None
                self.current_speaker_state = SpeakerState.UNKNOWN
                self.current_speaker_score = 0.0
            self.sv_result_ready_event.set()
            return True
        return False

    def _put_tts_text(self, text: str):
        """Unified put onto the TTS queue (with exception handling)."""
        try:
            self.tts_queue.put(text, timeout=0.2)
            self.get_logger().debug(f"[TTS队列] 文本已成功放入队列: {text[:50]}... (队列大小: {self.tts_queue.qsize()})")
        except Exception as e:
            self.get_logger().error(f"[TTS队列] 放入队列失败: {e}, 文本: {text[:50]}")

    def _interrupt_tts(self, reason: str):
        """
        Interrupt TTS playback: only set the interrupt event without draining the
        queue; the TTS thread checks the event and stops playback itself.
        """
        self.get_logger().info(f"[中断] {reason}")
        self.interrupt_event.set()

    @staticmethod
    def _drain_queue(q: queue.Queue):
        """Drain a queue."""
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                break

    def _start_session(self):
        """Start a session."""
        with self.session_lock:
            self.session_active = True
            self.session_start_time = time.time()

    def _reset_session(self):
        """Reset the session timer."""
        with self.session_lock:
            self.session_start_time = time.time()

    def _is_session_active(self) -> bool:
        """Check whether the session is still active."""
        with self.session_lock:
            if not self.session_active:
                return False
            if time.time() - self.session_start_time >= self.session_timeout:
                self.session_active = False
                return False
            return True
    # ==================== Intent handling ====================
    def _handle_wake_word(self, text: str) -> str:
        """Handle the wake word: convert the ASR text to pinyin and check for the wake-word pinyin."""
        if not self.use_wake_word:
            return text.strip()

        if self._is_session_active():
            self._reset_session()
            return text.strip()

        text_pinyin = self.intent_router.to_pinyin(text)
        wake_word_pinyin = self.wake_word.lower().strip()
        self.get_logger().info(f"[唤醒词] 原始文本: {text}, 文本拼音: {text_pinyin}, 唤醒词拼音: {wake_word_pinyin}")
        if not wake_word_pinyin:
            self.get_logger().info("[唤醒词] 唤醒词为空,过滤文本")
            return ""

        text_pinyin_parts = text_pinyin.split() if text_pinyin else []
        wake_word_parts = wake_word_pinyin.split()

        start_idx = -1
        for i in range(len(text_pinyin_parts) - len(wake_word_parts) + 1):
            if text_pinyin_parts[i:i + len(wake_word_parts)] == wake_word_parts:
                start_idx = i
                break

        if start_idx == -1:
            self.get_logger().info(f"[唤醒词] 未检测到唤醒词 '{self.wake_word}',过滤文本")
            return ""

        # Strip the wake-word characters (counted over Chinese characters only)
        removed = 0
        new_text = ""
        for c in text:
            if '\u4e00' <= c <= '\u9fa5':
                if removed < start_idx or removed >= start_idx + len(wake_word_parts):
                    new_text += c
                removed += 1
            else:
                new_text += c

        self._start_session()
        return new_text.strip()

    def _check_shutup_command(self, text: str) -> bool:
        """Check for a "shut up" command."""
        if not text:
            return False
        text_lower = text.lower()
        text_pinyin = self.intent_router.to_pinyin(text)
        for keyword in self.shutup_keywords:
            kw = keyword.lower().strip()
            if not kw:
                continue
            if kw in text_lower or (text_pinyin and kw in text_pinyin):
                return True
        return False

    def _handle_intent(self, intent_payload: IntentResult):
        """Route the intent to the appropriate handler."""
        intent = intent_payload.intent
        text = intent_payload.text
        need_camera = intent_payload.need_camera
        system_prompt = intent_payload.system_prompt

        if intent == "kb_qa":
            answer = None
            text_pinyin = self.intent_router.to_pinyin(text)
            if text_pinyin:
                answer = self.kb_answers_map.get(text_pinyin)
            if answer:
                if "{wake_word}" in answer:
                    answer = answer.replace("{wake_word}", self.wake_word or "")
                self._put_tts_text(answer)
            return

        if self.use_llm and self.llm_client:
            if self.history:
                self.history.start_turn(text)

            reply = self._llm_process_stream_with_camera(
                text,
                need_camera=need_camera,
                system_prompt=system_prompt,
                suppress_tts=(intent == "skill_sequence")
            )
            if reply:
                if self.history:
                    self.history.commit_turn(reply)
                if intent == "skill_sequence":
                    skill_msg = String()
                    skill_msg.data = reply.strip()
                    self.skill_sequence_pub.publish(skill_msg)
                    self.get_logger().info(f"[技能序列] 已发布: {skill_msg.data}")
            else:
                if self.history:
                    self.history.cancel_turn()
        else:
            self.get_logger().warning("[主线程] 未启用LLM,无法处理文本")
# ==================== 资源清理 ====================
|
||||
def destroy_node(self):
|
||||
"""销毁节点"""
|
||||
self.get_logger().info("语音节点正在关闭...")
|
||||
self.stop_event.set()
|
||||
self.interrupt_event.set()
|
||||
self.get_logger().info("强制停止TTS播放...")
|
||||
self._force_stop_tts()
|
||||
|
||||
self._drain_queue(self.tts_queue)
|
||||
|
||||
threads_to_join = [self.recording_thread, self.asr_thread, self.process_thread, self.tts_thread]
|
||||
if self.sv_thread:
|
||||
threads_to_join.append(self.sv_thread)
|
||||
for thread in threads_to_join:
|
||||
if thread and thread.is_alive():
|
||||
thread.join(timeout=1.0)
|
||||
|
||||
self._force_stop_tts()
|
||||
|
||||
if hasattr(self, 'asr_client') and self.asr_client:
|
||||
self.asr_client.stop()
|
||||
|
||||
if hasattr(self, 'audio_recorder') and self.audio_recorder:
|
||||
self.audio_recorder.cleanup()
|
||||
|
||||
if hasattr(self, 'camera_client') and self.camera_client:
|
||||
self.camera_client.cleanup()
|
||||
|
||||
if hasattr(self, 'sv_client') and self.sv_client:
|
||||
try:
|
||||
self.sv_client.save_speakers()
|
||||
self.sv_client.cleanup()
|
||||
except Exception as e:
|
||||
self.get_logger().warning(f"清理声纹识别资源时出错: {e}")
|
||||
|
||||
super().destroy_node()
|
||||
|
||||
def _on_skill_feedback(self, msg: String):
|
||||
try:
|
||||
feedback = json.loads(msg.data)
|
||||
self.latest_skill_feedback = feedback
|
||||
feedback_text = (
|
||||
f"【执行状态】阶段:{feedback.get('stage','')}, "
|
||||
f"技能:{feedback.get('current_skill','')}, "
|
||||
f"进度:{feedback.get('progress', 0):.1%}, "
|
||||
f"详情:{feedback.get('detail','')}"
|
||||
)
|
||||
if self.history:
|
||||
self.history.add_message("system", feedback_text)
|
||||
except Exception as e:
|
||||
self.get_logger().warning(f"[技能反馈] 解析失败: {e}")
|
||||
|
||||
def _on_skill_result(self, msg: String):
|
||||
try:
|
||||
result = json.loads(msg.data)
|
||||
self.latest_skill_result = result
|
||||
result_text = (
|
||||
f"【执行结果】{'成功' if result.get('success') else '失败'}, "
|
||||
f"总技能数:{result.get('total_skills', 0)}, "
|
||||
f"成功数:{result.get('succeeded_skills', 0)}, "
|
||||
f"消息:{result.get('message','')}"
|
||||
)
|
||||
if self.history:
|
||||
self.history.add_message("system", result_text)
|
||||
except Exception as e:
|
||||
self.get_logger().warning(f"[技能结果] 解析失败: {e}")
|
||||
|
||||
|
||||
def _init_ros(args):
|
||||
rclpy.init(args=args)
|
||||
|
||||
def _create_node():
|
||||
return RobotSpeakerNode()
|
||||
|
||||
def _run_node(node):
|
||||
rclpy.spin(node)
|
||||
|
||||
def _cleanup_node(node):
|
||||
if node:
|
||||
node.destroy_node()
|
||||
|
||||
def _shutdown_ros():
|
||||
if rclpy.ok():
|
||||
rclpy.shutdown()
|
||||
|
||||
# ==================== 入口 ====================
|
||||
def main(args=None):
|
||||
node = None
|
||||
_init_ros(args)
|
||||
node = _create_node()
|
||||
_run_node(node)
|
||||
_cleanup_node(node)
|
||||
_shutdown_ros()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
36
robot_speaker/core/types.py
Normal file
@@ -0,0 +1,36 @@
"""
Shared data structure definitions.
"""
from dataclasses import dataclass


@dataclass
class ASRResult:
    """ASR recognition result."""
    text: str
    confidence: float | None = None
    language: str | None = None


@dataclass
class LLMMessage:
    """LLM message."""
    role: str  # "user", "assistant", "system"
    content: str


@dataclass
class TTSRequest:
    """TTS request."""
    text: str
    voice: str | None = None  # None means use the default voice configured in the console
    speed: float | None = None
    pitch: float | None = None


@dataclass
class ImageMessage:
    """Image message - for multimodal LLM input."""
    image_data: bytes  # base64-encoded image data
    image_format: str = "jpeg"
5
robot_speaker/models/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Model layer."""
5
robot_speaker/models/asr/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""ASR models."""
13
robot_speaker/models/asr/base.py
Normal file
@@ -0,0 +1,13 @@
class ASRClient:
    """ASR client abstract base class."""

    def start(self) -> bool:
        raise NotImplementedError

    def stop(self) -> bool:
        raise NotImplementedError

    def send_audio(self, audio_data: bytes) -> bool:
        raise NotImplementedError
218
robot_speaker/models/asr/dashscope.py
Normal file
@@ -0,0 +1,218 @@
"""
ASR speech-recognition module.
"""
import base64
import time
import threading
import dashscope
from dashscope.audio.qwen_omni import OmniRealtimeConversation, OmniRealtimeCallback
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams, MultiModality
from robot_speaker.models.asr.base import ASRClient


class DashScopeASR(ASRClient):
    """Wrapper around the DashScope realtime ASR recognizer."""

    def __init__(self, api_key: str,
                 sample_rate: int,
                 model: str,
                 url: str,
                 logger=None):
        dashscope.api_key = api_key
        self.sample_rate = sample_rate
        self.model = model
        self.url = url
        self.logger = logger

        self.conversation = None
        self.running = False
        self.on_sentence_end = None
        self.on_text_update = None  # realtime text-update callback

        # Thread synchronization
        self._stop_lock = threading.Lock()  # guards against concurrent stop_current_recognition calls
        self._final_result_event = threading.Event()  # wait for the final callback to complete
        self._pending_commit = False  # marks whether a commit is pending

    def _log(self, level: str, msg: str):
        """Log a message, dispatching to the matching ROS2 logger method."""
        if self.logger:
            # A ROS2 logger cannot change severity dynamically; call the method explicitly
            if level == "debug":
                self.logger.debug(msg)
            elif level == "info":
                self.logger.info(msg)
            elif level == "warning":
                self.logger.warn(msg)
            elif level == "error":
                self.logger.error(msg)
            else:
                self.logger.info(msg)  # default to info
        else:
            print(f"[ASR] {msg}")

    def start(self):
        """Start the ASR recognizer."""
        if self.running:
            return False

        try:
            callback = _ASRCallback(self)
            self.conversation = OmniRealtimeConversation(
                model=self.model,
                url=self.url,
                callback=callback
            )
            callback.conversation = self.conversation

            self.conversation.connect()

            transcription_params = TranscriptionParams(
                language='zh',
                sample_rate=self.sample_rate,
                input_audio_format="pcm",
            )

            # Local VAD                -> only controls TTS interruption
            # Server-side turn detection -> only controls ASR output and LLM turn-taking
            self.conversation.update_session(
                output_modalities=[MultiModality.TEXT],
                enable_input_audio_transcription=True,
                transcription_params=transcription_params,
                enable_turn_detection=True,
                # keep server-side turn detection
                turn_detection_type='server_vad',  # server-side VAD
                turn_detection_threshold=0.2,  # tunable
                turn_detection_silence_duration_ms=800
            )

            self.running = True
            self._log("info", "ASR started")
            return True
        except Exception as e:
            self.running = False
            self._log("error", f"ASR failed to start: {e}")
            if self.conversation:
                try:
                    self.conversation.close()
                except Exception:
                    pass
            self.conversation = None
            return False

    def send_audio(self, audio_chunk: bytes):
        """Send one audio chunk to the ASR service."""
        if not self.running or not self.conversation:
            return False
        try:
            audio_b64 = base64.b64encode(audio_chunk).decode('ascii')
            self.conversation.append_audio(audio_b64)
            return True
        except Exception:
            # Connection closed or other error: fail silently to avoid log spam;
            # the running flag is reset properly in stop_current_recognition
            return False

    def stop_current_recognition(self):
        """
        Commit to fetch the current recognition result without closing the connection.
        """
        if not self.running or not self.conversation:
            return False

        # Use the lock to guard against concurrent calls
        if not self._stop_lock.acquire(blocking=False):
            self._log("warning", "stop_current_recognition already in progress; skipping this call")
            return False

        try:
            # Reset the event before waiting for the final callback
            self._final_result_event.clear()
            self._pending_commit = True

            # Trigger the commit and wait for the final result
            self.conversation.commit()

            # Wait for the final callback (at most 1 second)
            if self._final_result_event.wait(timeout=1.0):
                self._log("debug", "final callback received")
            else:
                self._log("warning", "timed out waiting for the final callback; continuing")

            return True

        except Exception as e:
            self._log("error", f"failed to commit the current recognition result: {e}")
            # On error, try to restart the connection
            self.running = False
            try:
                if self.conversation:
                    self.conversation.close()
            except Exception:
                pass
            self.conversation = None
            time.sleep(0.1)
            return self.start()

        finally:
            self._pending_commit = False
            self._stop_lock.release()

    def stop(self):
        """Stop the ASR recognizer."""
        # Wait for any in-flight stop_current_recognition to finish
        with self._stop_lock:
            self.running = False
            self._final_result_event.set()  # wake any thread that may be waiting
            if self.conversation:
                try:
                    self.conversation.close()
                except Exception as e:
                    self._log("warning", f"error closing the connection during stop: {e}")
            self.conversation = None
            self._log("info", "ASR stopped")


class _ASRCallback(OmniRealtimeCallback):
    """ASR callback handling."""

    def __init__(self, asr_client: DashScopeASR):
        self.asr_client = asr_client
        self.conversation = None

    def on_open(self):
        self.asr_client._log("info", "ASR WebSocket connected")

    def on_close(self, code, msg):
        self.asr_client._log("info", f"ASR WebSocket closed: code={code}, msg={msg}")

    def on_event(self, response):
        event_type = response.get('type', '')

        if event_type == 'session.created':
            session_id = response.get('session', {}).get('id', '')
            self.asr_client._log("info", f"ASR session created: {session_id}")

        elif event_type == 'conversation.item.input_audio_transcription.completed':
            # Final recognition result
            transcript = response.get('transcript', '')
            if transcript and transcript.strip() and self.asr_client.on_sentence_end:
                self.asr_client.on_sentence_end(transcript.strip())

            # If a commit is pending, notify the waiting thread
            if self.asr_client._pending_commit:
                self.asr_client._final_result_event.set()

        elif event_type == 'conversation.item.input_audio_transcription.text':
            # Realtime transcription update (multi-turn hints)
            transcript = response.get('transcript', '') or response.get('text', '')
            if transcript and transcript.strip() and self.asr_client.on_text_update:
                self.asr_client.on_text_update(transcript.strip())

        elif event_type == 'input_audio_buffer.speech_started':
            self.asr_client._log("info", "ASR detected speech start")

        elif event_type == 'input_audio_buffer.speech_stopped':
            self.asr_client._log("info", "ASR detected speech end")
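The commit-then-wait handshake in `stop_current_recognition` can be sketched in isolation (a toy worker thread stands in for the server's `transcription.completed` event; timings are illustrative):

```python
import threading
import time

final_event = threading.Event()

def fake_final_callback():
    # Stands in for the final-transcription event arriving from the server
    time.sleep(0.1)
    final_event.set()

final_event.clear()
worker = threading.Thread(target=fake_final_callback)
worker.start()

# Same pattern as stop_current_recognition: a bounded wait for the final result
got_final = final_event.wait(timeout=1.0)
worker.join()
print(got_final)  # → True
```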
5
robot_speaker/models/llm/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""LLM models."""
15
robot_speaker/models/llm/base.py
Normal file
@@ -0,0 +1,15 @@
from robot_speaker.core.types import LLMMessage


class LLMClient:
    """LLM client abstract base class."""

    def chat(self, messages: list[LLMMessage]) -> str | None:
        raise NotImplementedError

    def chat_stream(self, messages: list[LLMMessage],
                    on_token=None,
                    interrupt_check=None) -> str | None:
        raise NotImplementedError
149
robot_speaker/models/llm/dashscope.py
Normal file
@@ -0,0 +1,149 @@
"""
LLM (large language model) module.
Supports multimodal input (text + images).
"""
from openai import OpenAI
from typing import Optional, List
from robot_speaker.core.types import LLMMessage
from robot_speaker.models.llm.base import LLMClient


class DashScopeLLM(LLMClient):
    """DashScope LLM client wrapper."""

    def __init__(self, api_key: str,
                 model: str,
                 base_url: str,
                 temperature: float,
                 max_tokens: int,
                 name: str = "LLM",
                 logger=None):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.name = name
        self.logger = logger

    def _log(self, level: str, msg: str):
        """Log a message, dispatching to the matching ROS2 logger method."""
        msg = f"[{self.name}] {msg}"
        if self.logger:
            # A ROS2 logger cannot change severity dynamically; call the method explicitly
            if level == "debug":
                self.logger.debug(msg)
            elif level == "info":
                self.logger.info(msg)
            elif level == "warning":
                self.logger.warn(msg)
            elif level == "error":
                self.logger.error(msg)
            else:
                self.logger.info(msg)  # default to info

    def chat(self, messages: list[LLMMessage]) -> str | None:
        """Non-streaming chat: used for task planning."""
        payload_messages = [{"role": msg.role, "content": msg.content} for msg in messages]
        response = self.client.chat.completions.create(
            model=self.model,
            messages=payload_messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            stream=False
        )
        reply = response.choices[0].message.content.strip()
        return reply if reply else None

    def chat_stream(self, messages: list[LLMMessage],
                    on_token=None,
                    images: Optional[List[str]] = None,
                    interrupt_check=None) -> str | None:
        """
        Streaming chat: used by the voice pipeline.
        Supports multimodal input (text + images).
        Supports interruption (interrupt_check returning True aborts the stream).
        """
        # Convert the message format, with multimodal support;
        # images are attached only to the last user message
        payload_messages = []
        last_user_idx = -1
        for i, msg in enumerate(messages):
            if msg.role == "user":
                last_user_idx = i

        has_images_in_message = False
        for i, msg in enumerate(messages):
            msg_dict = {"role": msg.role}

            # If this is the last user message and images are present, build multimodal content
            if i == last_user_idx and msg.role == "user" and images and len(images) > 0:
                content_list = [{"type": "text", "text": msg.content}]
                # Append every image
                for img_idx, img_base64 in enumerate(images):
                    content_list.append({
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{img_base64}"
                        }
                    })
                    self._log("info", f"[multimodal] attached image #{img_idx+1} to the user message, base64 length: {len(img_base64)}")
                msg_dict["content"] = content_list
                has_images_in_message = True
            else:
                msg_dict["content"] = msg.content

            payload_messages.append(msg_dict)

        # Log multimodal details
        if images and len(images) > 0:
            if has_images_in_message:
                # Locate the last user message and record its content structure
                last_user_msg = payload_messages[last_user_idx] if last_user_idx >= 0 else None
                if last_user_msg and isinstance(last_user_msg.get("content"), list):
                    content_items = last_user_msg["content"]
                    text_items = [item for item in content_items if item.get("type") == "text"]
                    image_items = [item for item in content_items if item.get("type") == "image_url"]
                    self._log("info", f"[multimodal] sending multimodal request: {len(text_items)} text part(s) + {len(image_items)} image(s)")
                    self._log("debug", f"[multimodal] user text: {text_items[0].get('text', '')[:50] if text_items else 'N/A'}")
                else:
                    self._log("warning", "[multimodal] unexpected message format; cannot confirm the images were attached")
            else:
                self._log("warning", f"[multimodal] {len(images)} image(s) provided but no user message found; the images were not attached")
        else:
            self._log("debug", "[multimodal] plain-text request (no images)")

        full_reply = ""
        interrupted = False

        stream = self.client.chat.completions.create(
            model=self.model,
            messages=payload_messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
            stream=True
        )

        for chunk in stream:
            # Check the interrupt flag
            if interrupt_check and interrupt_check():
                self._log("info", "LLM streaming interrupted")
                interrupted = True
                break

            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                full_reply += content
                if on_token:
                    on_token(content)
                # Check again after the on_token callback (it may set the interrupt flag)
                if interrupt_check and interrupt_check():
                    self._log("info", "LLM streaming interrupted after the on_token callback")
                    interrupted = True
                    break

        if interrupted:
            return None  # None signals the stream was interrupted before completing

        return full_reply.strip() if full_reply else None
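The content structure built above (one text part plus `image_url` parts, attached only to the final user message) can be sketched standalone; the payload values are invented stand-ins:

```python
import base64

# Toy stand-ins: a fake JPEG payload and the user's text
img_base64 = base64.b64encode(b"\xff\xd8fake-jpeg").decode("ascii")
user_text = "What do you see?"

# Same shape as the OpenAI-compatible multimodal message built in chat_stream
content_list = [{"type": "text", "text": user_text}]
content_list.append({
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
})
message = {"role": "user", "content": content_list}

print(message["content"][0]["type"])  # → text
print(message["content"][1]["image_url"]["url"].startswith("data:image/jpeg;base64,"))  # → True
```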
5
robot_speaker/models/tts/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""TTS models."""
14
robot_speaker/models/tts/base.py
Normal file
@@ -0,0 +1,14 @@
from robot_speaker.core.types import TTSRequest


class TTSClient:
    """TTS client abstract base class."""

    def synthesize(self, request: TTSRequest,
                   on_chunk=None,
                   interrupt_check=None) -> bool:
        raise NotImplementedError
244
robot_speaker/models/tts/dashscope.py
Normal file
@@ -0,0 +1,244 @@
"""
TTS speech-synthesis module.
"""
import subprocess
import dashscope
from dashscope.audio.tts_v2 import SpeechSynthesizer, ResultCallback, AudioFormat
from robot_speaker.core.types import TTSRequest
from robot_speaker.models.tts.base import TTSClient


class DashScopeTTSClient(TTSClient):
    """DashScope streaming TTS client wrapper."""

    def __init__(self, api_key: str,
                 model: str,
                 voice: str,
                 card_index: int,
                 device_index: int,
                 output_sample_rate: int = 44100,
                 output_channels: int = 2,
                 output_volume: float = 1.0,
                 tts_source_sample_rate: int = 22050,  # fixed sample rate of the TTS service output
                 tts_source_channels: int = 1,  # fixed channel count of the TTS service output
                 tts_ffmpeg_thread_queue_size: int = 1024,  # ffmpeg input thread-queue size
                 reference_signal_buffer=None,  # reference-signal buffer (for echo cancellation)
                 logger=None):
        dashscope.api_key = api_key
        self.model = model
        self.voice = voice
        self.card_index = card_index
        self.device_index = device_index
        self.output_sample_rate = output_sample_rate
        self.output_channels = output_channels
        self.output_volume = output_volume
        self.tts_source_sample_rate = tts_source_sample_rate
        self.tts_source_channels = tts_source_channels
        self.tts_ffmpeg_thread_queue_size = tts_ffmpeg_thread_queue_size
        self.reference_signal_buffer = reference_signal_buffer  # reference-signal buffer
        self.logger = logger
        self.current_ffmpeg_pid = None  # PID of the current ffmpeg process

        # Build the ALSA device string; plughw lets ffmpeg resample/remix automatically
        self.alsa_device = f"plughw:{card_index},{device_index}" if (
            card_index >= 0 and device_index >= 0
        ) else "default"

    def _log(self, level: str, msg: str):
        """Log a message, dispatching to the matching ROS2 logger method."""
        if self.logger:
            # A ROS2 logger cannot change severity dynamically; call the method explicitly
            if level == "debug":
                self.logger.debug(msg)
            elif level == "info":
                self.logger.info(msg)
            elif level == "warning":
                self.logger.warn(msg)
            elif level == "error":
                self.logger.error(msg)
            else:
                self.logger.info(msg)  # default to info
        else:
            print(f"[TTS] {msg}")

    def synthesize(self, request: TTSRequest,
                   on_chunk=None,
                   interrupt_check=None) -> bool:
        """Main flow: stream synthesis and play it back."""
        callback = _TTSCallback(self, interrupt_check, on_chunk, self.reference_signal_buffer)
        # Use the configured voice; fall back to self.voice when request.voice is None or empty
        voice_to_use = request.voice if request.voice and request.voice.strip() else self.voice

        if not voice_to_use or not voice_to_use.strip():
            self._log("error", f"Invalid voice parameter: '{voice_to_use}'")
            return False

        self._log("info", f"TTS started: text='{request.text[:50]}...', voice='{voice_to_use}'")
        synthesizer = SpeechSynthesizer(
            model=self.model,
            voice=voice_to_use,
            format=AudioFormat.PCM_22050HZ_MONO_16BIT,
            callback=callback,
        )

        try:
            synthesizer.streaming_call(request.text)
            synthesizer.streaming_complete()
        finally:
            callback.cleanup()

        return not callback._interrupted


class _TTSCallback(ResultCallback):
    """TTS callback handling - plays via ffmpeg, which handles sample-rate conversion."""

    def __init__(self, tts_client: DashScopeTTSClient,
                 interrupt_check=None,
                 on_chunk=None,
                 reference_signal_buffer=None):
        self.tts_client = tts_client
        self.interrupt_check = interrupt_check
        self.on_chunk = on_chunk
        self.reference_signal_buffer = reference_signal_buffer  # reference-signal buffer
        self._proc = None
        self._interrupted = False
        self._cleaned_up = False

    def on_open(self):
        # Play via ffmpeg, which converts the TTS source rate to the device rate automatically.
        # The TTS service outputs a fixed sample rate and channel count; ffmpeg converts
        # them to the device's sample rate and channel count.
        ffmpeg_cmd = [
            'ffmpeg',
            '-f', 's16le',  # raw PCM
            '-ar', str(self.tts_client.tts_source_sample_rate),  # TTS output sample rate (from config)
            '-ac', str(self.tts_client.tts_source_channels),  # TTS output channel count (from config)
            '-i', 'pipe:0',  # stdin
            '-f', 'alsa',  # output to ALSA
            '-ar', str(self.tts_client.output_sample_rate),  # output device sample rate (from config)
            '-ac', str(self.tts_client.output_channels),  # output device channel count (from config)
            '-acodec', 'pcm_s16le',  # output codec
            '-fflags', 'nobuffer',  # reduce buffering
            '-flags', 'low_delay',  # low latency
            '-avioflags', 'direct',  # try direct writes to ALSA to cut latency
            self.tts_client.alsa_device
        ]

        # -thread_queue_size must precede the input file
        insert_pos = ffmpeg_cmd.index('-i')
        ffmpeg_cmd.insert(insert_pos, str(self.tts_client.tts_ffmpeg_thread_queue_size))
        ffmpeg_cmd.insert(insert_pos, '-thread_queue_size')

        # Add a volume filter when the volume is not 1.0
        if self.tts_client.output_volume != 1.0:
            # The volume filter goes after the input and before the output codec
            acodec_idx = ffmpeg_cmd.index('-acodec')
            ffmpeg_cmd.insert(acodec_idx, f'volume={self.tts_client.output_volume}')
            ffmpeg_cmd.insert(acodec_idx, '-af')

        self.tts_client._log("info", f"starting ffmpeg playback: ALSA device={self.tts_client.alsa_device}, "
                             f"output sample rate={self.tts_client.output_sample_rate}Hz, "
                             f"output channels={self.tts_client.output_channels}, "
                             f"volume={self.tts_client.output_volume * 100:.0f}%")
        self._proc = subprocess.Popen(
            ffmpeg_cmd,
            stdin=subprocess.PIPE,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE  # PIPE so errors can be captured
        )
        # Record the ffmpeg process PID
        self.tts_client.current_ffmpeg_pid = self._proc.pid
        self.tts_client._log("debug", f"ffmpeg process started, PID={self._proc.pid}")

    def on_complete(self):
        pass

    def on_error(self, message: str):
        self.tts_client._log("error", f"TTS error: {message}")

    def on_close(self):
        self.cleanup()

    def on_event(self, message):
        pass

    def on_data(self, data: bytes) -> None:
        """Receive audio data and play it."""
        if self._interrupted:
            return

        if self.interrupt_check and self.interrupt_check():
            # Stop playback but not the TTS stream
            self._interrupted = True
            if self._proc:
                self._proc.terminate()
            return

        # Write to ffmpeg first so playback is never blocked
        if self._proc and self._proc.stdin and not self._interrupted:
            try:
                self._proc.stdin.write(data)
                self._proc.stdin.flush()
            except BrokenPipeError:
                # The ffmpeg process may have exited; check its error output
                if self._proc.stderr:
                    error_msg = self._proc.stderr.read().decode('utf-8', errors='ignore')
                    self.tts_client._log("error", f"ffmpeg error: {error_msg}")
                self._interrupted = True

        # Feed the audio into the reference-signal buffer (for echo cancellation);
        # done after writing to ffmpeg so playback is not blocked
        if self.reference_signal_buffer and data:
            try:
                self.reference_signal_buffer.add_reference(
                    data,
                    source_sample_rate=self.tts_client.tts_source_sample_rate,
                    source_channels=self.tts_client.tts_source_channels
                )
            except Exception as e:
                # A reference-signal failure must not affect playback
                self.tts_client._log("warning", f"reference-signal processing failed: {e}")

        if self.on_chunk:
            self.on_chunk(data)

    def cleanup(self):
        """Clean up resources."""
        if self._cleaned_up or not self._proc:
            return
        self._cleaned_up = True

        # Close stdin so ffmpeg can drain the remaining data
        if self._proc.stdin and not self._proc.stdin.closed:
            try:
                self._proc.stdin.close()
            except Exception:
                pass

        # Wait for the process to finish naturally (estimated from text length,
        # at least 10s, at most 30s; assumes roughly 3-4 characters/second plus headroom)
        if self._proc.poll() is None:
            try:
                # Generous timeout so ffmpeg can finish playback; long texts may need longer
                self._proc.wait(timeout=30.0)
            except Exception:
                # On timeout, if the process is still running it is probably stuck: force-terminate
                if self._proc.poll() is None:
                    self.tts_client._log("warning", "ffmpeg playback timed out; terminating")
                    try:
                        self._proc.terminate()
                        self._proc.wait(timeout=1.0)
                    except Exception:
                        try:
                            self._proc.kill()
                            self._proc.wait(timeout=0.1)
                        except Exception:
                            pass

        # Clear the recorded PID
        if self.tts_client.current_ffmpeg_pid == self._proc.pid:
            self.tts_client.current_ffmpeg_pid = None
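A minimal sketch of the argv assembly in `on_open` (constant stand-ins replace the configured values), showing that `-thread_queue_size` lands before `-i` and the volume filter before `-acodec`:

```python
tts_rate, tts_ch = 22050, 1           # TTS source format (config stand-ins)
out_rate, out_ch, volume = 44100, 2, 0.8
alsa_device = "plughw:0,0"

cmd = ['ffmpeg',
       '-f', 's16le', '-ar', str(tts_rate), '-ac', str(tts_ch),
       '-i', 'pipe:0',
       '-f', 'alsa', '-ar', str(out_rate), '-ac', str(out_ch),
       '-acodec', 'pcm_s16le', alsa_device]

# Input options must precede '-i'
pos = cmd.index('-i')
cmd.insert(pos, '1024')
cmd.insert(pos, '-thread_queue_size')

# The volume filter goes before the output codec
if volume != 1.0:
    pos = cmd.index('-acodec')
    cmd.insert(pos, f'volume={volume}')
    cmd.insert(pos, '-af')

print(cmd.index('-thread_queue_size') < cmd.index('-i'))  # → True
print(cmd.index('-af') < cmd.index('-acodec'))  # → True
```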
5
robot_speaker/perception/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Perception layer."""
304
robot_speaker/perception/audio_pipeline.py
Normal file
@@ -0,0 +1,304 @@
"""
Audio processing module: recording + VAD + echo cancellation.
"""
import time
import pyaudio
import webrtcvad
import struct
import queue
from .echo_cancellation import EchoCanceller, ReferenceSignalBuffer


class VADDetector:
    """VAD speech detector."""

    def __init__(self, mode: int, sample_rate: int):
        self.vad = webrtcvad.Vad(mode)
        self.sample_rate = sample_rate


class AudioRecorder:
    """Audio recorder - recording thread."""

    def __init__(self, device_index: int, sample_rate: int, channels: int,
                 chunk: int, vad_detector: VADDetector,
                 audio_queue: queue.Queue,  # audio queue: recording thread -> ASR thread
                 silence_duration_ms: int = 1000,
                 min_energy_threshold: int = 300,  # audio energy > 300: speech present
                 heartbeat_interval: float = 2.0,
                 on_heartbeat=None,
                 is_playing=None,
                 on_new_segment=None,  # a new voice segment was detected
                 on_speech_start=None,  # speech onset detected
                 on_speech_end=None,  # trailing silence detected (speech ended)
                 stop_flag=None,
                 on_audio_chunk=None,  # per-chunk callback (e.g. voiceprint recording; optional)
                 should_put_to_queue=None,  # whether audio may enter the queue (to block ASR; optional)
                 get_silence_threshold=None,  # dynamic silence threshold in ms (optional)
                 enable_echo_cancellation: bool = True,  # enable echo cancellation
                 reference_signal_buffer: ReferenceSignalBuffer = None,  # reference-signal buffer (optional)
                 logger=None):
        self.device_index = device_index
        self.sample_rate = sample_rate
        self.channels = channels
        self.chunk = chunk
        self.vad_detector = vad_detector
        self.audio_queue = audio_queue
        self.silence_duration_ms = int(silence_duration_ms)
        self.min_energy_threshold = int(min_energy_threshold)
        self.heartbeat_interval = heartbeat_interval

        self.on_heartbeat = on_heartbeat
        self.is_playing = is_playing or (lambda: False)
        self.on_new_segment = on_new_segment
        self.on_speech_start = on_speech_start
        self.on_speech_end = on_speech_end
        self.stop_flag = stop_flag or (lambda: False)
        self.on_audio_chunk = on_audio_chunk  # per-chunk callback (e.g. voiceprint recording)
        self.should_put_to_queue = should_put_to_queue or (lambda: True)  # default: allow queueing
        self.get_silence_threshold = get_silence_threshold  # dynamic silence-threshold callback
        self.logger = logger
        self.audio = pyaudio.PyAudio()

        # Auto-detect the iFLYTEK microphone device
        try:
            count = self.audio.get_device_count()
            found_index = -1
            if self.logger:
                self.logger.info(f"Scanning audio devices (total: {count})...")

            for i in range(count):
                device_info = self.audio.get_device_info_by_index(i)
                device_name = device_info.get('name', '')
                max_input_channels = device_info.get('maxInputChannels', 0)

                if self.logger:
                    try:
                        self.logger.info(f"Device [{i}]: Name='{device_name}', MaxInput={max_input_channels}, Rate={int(device_info.get('defaultSampleRate'))}")
                    except Exception:
                        pass

                # Match devices whose name contains iFLYTEK and that can record (input channels > 0)
                if 'iFLYTEK' in device_name and max_input_channels > 0:
                    found_index = i
                    if self.logger:
                        self.logger.info(f"Microphone auto-detected: {device_name} (Index: {i})")
                    break

            if found_index != -1:
                self.device_index = found_index
            else:
                if self.logger:
                    self.logger.warning(f"No iFLYTEK device detected; falling back to the configured index: {self.device_index}")

        except Exception as e:
            if self.logger:
                self.logger.error(f"Device auto-detection failed: {e}")

        self.format = pyaudio.paInt16
        self._debug_counter = 0

        # Echo cancellation
        self.enable_echo_cancellation = enable_echo_cancellation
        self.reference_signal_buffer = reference_signal_buffer
        if enable_echo_cancellation:
            # The echo canceller runs synchronously inside the recording thread (no extra thread);
            # frame_size equals the chunk size so exactly one chunk is processed per call
            frame_size = chunk
            try:
                # The reference-signal channel count comes from reference_signal_buffer,
                # which is created with the playback channel count
                ref_channels = self.reference_signal_buffer.channels if self.reference_signal_buffer else 1
                self.echo_canceller = EchoCanceller(
                    sample_rate=sample_rate,
                    frame_size=frame_size,
                    channels=self.channels,  # microphone input: 1 channel
                    ref_channels=ref_channels,  # reference signal: playback channel count (2 channels)
                    logger=logger
                )
                if self.echo_canceller.aec is not None:
                    if logger:
                        logger.info(f"Echo canceller enabled: sample_rate={sample_rate}, frame_size={frame_size}")
                else:
                    if logger:
                        logger.warning("Echo canceller failed to initialize; echo cancellation disabled")
                    self.enable_echo_cancellation = False
                    self.echo_canceller = None
            except Exception as e:
                if logger:
                    logger.warning(f"Echo canceller failed to initialize: {e}; echo cancellation disabled")
                self.enable_echo_cancellation = False
                self.echo_canceller = None
        else:
            self.echo_canceller = None

    def record_with_vad(self):
        """Recording thread: VAD + energy detection."""
        if self.on_heartbeat:
            self.on_heartbeat()

        try:
            stream = self.audio.open(
                format=self.format,
                channels=self.channels,
                rate=self.sample_rate,
                input=True,
                input_device_index=self.device_index if self.device_index >= 0 else None,
                frames_per_buffer=self.chunk
            )
        except Exception as e:
            raise RuntimeError(f"Unable to open the audio input device: {e}")

        # VAD detection window: speech is noticed within 0.5s at the fastest
        window_sec = 0.5
        # No speech for one continuous second is treated as silence
        no_speech_threshold = max(self.silence_duration_ms / 1000.0, 0.1)

        last_heartbeat_time = time.time()

        audio_buffer = []  # VAD sliding window
        last_active_time = time.time()  # silence-timer baseline
        in_speech_segment = False  # inside a speech segment (from onset until the silence timeout)

        try:
            while not self.stop_flag():
                # exception_on_overflow=False: better to drop frames than to block
                data = stream.read(self.chunk, exception_on_overflow=False)

                # Echo cancellation
                processed_data = data
                if self.enable_echo_cancellation and self.echo_canceller and self.reference_signal_buffer:
                    try:
                        # Fetch a reference signal matching the microphone chunk length
                        ref_signal = self.reference_signal_buffer.get_reference(num_samples=self.chunk)
                        # Run echo cancellation
                        processed_data = self.echo_canceller.process(data, ref_signal)
                    except Exception as e:
                        if self.logger:
                            self.logger.warning(f"Echo cancellation failed: {e}; using the raw audio")
                        processed_data = data

                # Check whether the audio may enter the queue
                # (used to block ASR, e.g. when voiceprint registration is required first)
                if self.should_put_to_queue():
                    # When the queue is full, drop the oldest data so the system
                    # keeps "hearing" even if ASR falls behind
                    if self.audio_queue.full():
                        self.audio_queue.get_nowait()
                    # Queue the processed (echo-cancelled) audio
                    self.audio_queue.put_nowait(processed_data)

                # Per-chunk callback (e.g. voiceprint recording; only invoked when needed)
                if self.on_audio_chunk:
                    # The callback receives the processed audio
                    self.on_audio_chunk(processed_data)
# VAD检测使用处理后的音频(经过回声消除)
|
||||
audio_buffer.append(processed_data) # 只用于 VAD,不用于 ASR
|
||||
|
||||
# VAD检测窗口
|
||||
now = time.time()
|
||||
if len(audio_buffer) * self.chunk / self.sample_rate >= window_sec:
|
||||
raw_audio = b''.join(audio_buffer)
|
||||
energy = self._calculate_energy(raw_audio)
|
||||
vad_result = self._check_activity(raw_audio)
|
||||
|
||||
self._debug_counter += 1
|
||||
if self._debug_counter >= 10:
|
||||
if self.logger:
|
||||
self.logger.info(f"[VAD调试] 能量={energy:.1f}, 阈值={self.min_energy_threshold}, VAD结果={vad_result}")
|
||||
self._debug_counter = 0
|
||||
|
||||
if vad_result:
|
||||
last_active_time = now
|
||||
|
||||
if not in_speech_segment: # 上一轮没说话,本轮开始说话
|
||||
in_speech_segment = True
|
||||
if self.on_speech_start:
|
||||
self.on_speech_start()
|
||||
|
||||
# 检测当前 TTS 是否在播放
|
||||
if self.is_playing() and self.on_new_segment:
|
||||
self.on_new_segment() # 打断 TTS的回调
|
||||
else:
|
||||
if in_speech_segment:
|
||||
# 处于语音段中,但当前帧为静音,检查静音时长
|
||||
silence_duration = now - last_active_time
|
||||
|
||||
# 动态获取静音阈值(如果提供回调函数)
|
||||
if self.get_silence_threshold:
|
||||
current_silence_ms = self.get_silence_threshold()
|
||||
current_no_speech_threshold = max(current_silence_ms / 1000.0, 0.1)
|
||||
else:
|
||||
current_no_speech_threshold = no_speech_threshold
|
||||
|
||||
# 添加调试日志
|
||||
if self.logger and silence_duration < current_no_speech_threshold:
|
||||
self.logger.debug(f"[VAD] 静音中: {silence_duration:.3f}秒 < {current_no_speech_threshold:.3f}秒阈值")
|
||||
|
||||
if silence_duration >= current_no_speech_threshold:
|
||||
if self.on_speech_end:
|
||||
if self.logger:
|
||||
self.logger.debug(f"[VAD] 触发speech_end: 静音持续时间 {silence_duration:.3f}秒 >= 阈值 {current_no_speech_threshold:.3f}秒")
|
||||
self.on_speech_end() # 通知系统用户停止说话
|
||||
in_speech_segment = False
|
||||
|
||||
if self.on_heartbeat and now - last_heartbeat_time >= self.heartbeat_interval:
|
||||
self.on_heartbeat()
|
||||
last_heartbeat_time = now
|
||||
|
||||
audio_buffer = []
|
||||
finally:
|
||||
if stream.is_active():
|
||||
stream.stop_stream()
|
||||
stream.close()
|
||||
|
||||
@staticmethod
|
||||
def _calculate_energy(audio_chunk: bytes) -> float:
|
||||
"""计算音频能量(RMS)"""
|
||||
if not audio_chunk:
|
||||
return 0.0
|
||||
# 计算样本数:音频字节数 // 2(因为是16位PCM,1个样本=2字节)
|
||||
n = len(audio_chunk) // 2
|
||||
if n <= 0:
|
||||
return 0.0
|
||||
# 把字节数据解包为16位有符号整数(小端序)
|
||||
samples = struct.unpack(f'<{n}h', audio_chunk[: n * 2])
|
||||
if not samples:
|
||||
return 0.0
|
||||
return (sum(s * s for s in samples) / len(samples)) ** 0.5
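The RMS computation above can be checked on a synthetic buffer. A minimal standalone sketch (the `rms` helper name is ours, mirroring `_calculate_energy`):

```python
import struct

def rms(audio_chunk: bytes) -> float:
    """RMS of 16-bit little-endian PCM, mirroring _calculate_energy."""
    n = len(audio_chunk) // 2
    if n <= 0:
        return 0.0
    samples = struct.unpack(f'<{n}h', audio_chunk[: n * 2])
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

# A constant-amplitude square wave has RMS equal to its amplitude
silence = struct.pack('<4h', 0, 0, 0, 0)
tone = struct.pack('<4h', 1000, -1000, 1000, -1000)
print(rms(silence))  # 0.0
print(rms(tone))     # 1000.0
```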

    def _check_activity(self, audio_data: bytes) -> bool:
        """VAD + energy detection: VAD first, with energy as a secondary check."""
        energy = self._calculate_energy(audio_data)

        rate = 0.4  # empirical ratio for continuous speech
        num = 0

        # Sample rate 16000 Hz, frame length 20 ms = 0.02 s,
        # samples per frame = 16000 x 0.02 = 320, bytes per frame = 320 x 2 = 640
        bytes_per_sample = 2  # paInt16
        frame_samples = int(self.sample_rate * 0.02)
        frame_bytes = frame_samples * bytes_per_sample

        if frame_bytes <= 0 or len(audio_data) < frame_bytes:
            return False

        total_frames = len(audio_data) // frame_bytes
        required = max(1, int(total_frames * rate))

        for i in range(0, len(audio_data), frame_bytes):
            chunk = audio_data[i:i + frame_bytes]
            if len(chunk) == frame_bytes:
                if self.vad_detector.vad.is_speech(chunk, sample_rate=self.sample_rate):
                    num += 1

        # Speech onsets are high-energy; trailing sounds (drawls, tails) decay,
        # so a low-energy window that still trips VAD is treated as a false positive
        vad_result = num >= required
        if vad_result and energy < self.min_energy_threshold * 0.5:
            return False

        return vad_result
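The decision rule in `_check_activity` — at least 40% of the 20 ms frames must be voiced, and a low-energy window is rejected even if VAD fires — can be isolated from the webrtcvad dependency. A sketch with the per-frame verdicts passed in as booleans (the `is_active` helper is hypothetical, not part of the package):

```python
def is_active(frame_flags: list, energy: float, min_energy: float, rate: float = 0.4) -> bool:
    """frame_flags: per-20ms-frame VAD verdicts; energy: RMS of the whole window."""
    if not frame_flags:
        return False
    required = max(1, int(len(frame_flags) * rate))  # e.g. 10 of 25 frames
    vad_result = sum(frame_flags) >= required
    # Energy gate: speech onsets are loud, so a quiet "voiced" window is a false positive
    if vad_result and energy < min_energy * 0.5:
        return False
    return vad_result

# 10/25 = 40% voiced frames passes the ratio test...
print(is_active([True] * 10 + [False] * 15, energy=300.0, min_energy=200.0))  # True
# ...but the same window is rejected when its energy is below half the threshold
print(is_active([True] * 10 + [False] * 15, energy=50.0, min_energy=200.0))   # False
```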

    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'audio') and self.audio:
            self.audio.terminate()
131 robot_speaker/perception/camera_client.py Normal file
@@ -0,0 +1,131 @@
"""
Camera module - RealSense camera wrapper
"""
import numpy as np
import contextlib


class CameraClient:
    def __init__(self,
                 serial_number: str | None,
                 width: int,
                 height: int,
                 fps: int,
                 format: str,
                 logger=None):
        self.serial_number = serial_number
        self.width = width
        self.height = height
        self.fps = fps
        self.format = format
        self.logger = logger

        self.pipeline = None
        self.config = None
        self._is_initialized = False
        self._rs = None

    def _log(self, level: str, msg: str):
        if self.logger:
            getattr(self.logger, level, self.logger.info)(msg)
        else:
            print(f"[Camera] {msg}")

    def initialize(self) -> bool:
        """
        Initialize and start the camera pipeline.
        """
        if self._is_initialized:
            return True

        try:
            import pyrealsense2 as rs
            self._rs = rs

            self.pipeline = rs.pipeline()
            self.config = rs.config()

            if self.serial_number:
                self.config.enable_device(self.serial_number)

            self.config.enable_stream(
                rs.stream.color,
                self.width,
                self.height,
                rs.format.rgb8 if self.format == 'RGB8' else rs.format.bgr8,
                self.fps
            )

            self.pipeline.start(self.config)
            self._is_initialized = True
            self._log("info", f"Camera started and kept running: {self.width}x{self.height}@{self.fps}fps")
            return True
        except Exception as e:
            self._log("error", f"Camera initialization failed: {e}")
            self.cleanup()
            return False

    def cleanup(self):
        """Stop the camera pipeline and release resources."""
        if self.pipeline:
            self.pipeline.stop()
            self._log("info", "Camera stopped")
        self.pipeline = None
        self.config = None
        self._is_initialized = False

    def capture_rgb(self) -> np.ndarray | None:
        """
        Capture one RGB frame from the running camera pipeline.
        """
        if not self._is_initialized:
            self._log("error", "Camera not initialized; cannot capture an image")
            return None

        try:
            frames = self.pipeline.wait_for_frames()
            color_frame = frames.get_color_frame()
            if not color_frame:
                self._log("warning", "No color frame available")
                return None

            return np.asanyarray(color_frame.get_data())
        except Exception as e:
            self._log("error", f"Image capture failed: {e}")
            return None

    @contextlib.contextmanager
    def capture_context(self):
        """
        Context manager: capture a frame and clean up automatically.
        """
        image_data = self.capture_rgb()
        try:
            yield image_data
        finally:
            if image_data is not None:
                del image_data

    def capture_multiple(self, count: int = 1) -> list[np.ndarray]:
        """
        Capture multiple images (reserved for future use).
        """
        images = []
        for i in range(count):
            img = self.capture_rgb()
            if img is not None:
                images.append(img)
            else:
                self._log("warning", f"Capture of image {i+1} failed")
        return images

    @contextlib.contextmanager
    def capture_multiple_context(self, count: int = 1):
        """
        Context manager: capture multiple images and clean up automatically.
        """
        images = self.capture_multiple(count)
        try:
            yield images
        finally:
            for img in images:
                del img
            images.clear()
98 robot_speaker/perception/echo_cancellation.py Normal file
@@ -0,0 +1,98 @@
import collections
import numpy as np


class ReferenceSignalBuffer:
    """Reference-signal buffer."""

    def __init__(self, sample_rate: int, channels: int, max_duration_ms: int | None = None,
                 buffer_seconds: float = 5.0):
        self.sample_rate = int(sample_rate)
        self.channels = int(channels)
        if max_duration_ms is not None:
            buffer_seconds = max(float(max_duration_ms) / 1000.0, 0.1)
        self.max_samples = int(self.sample_rate * buffer_seconds)
        self._buffer = collections.deque(maxlen=self.max_samples * self.channels)

    def add_reference(self, data: bytes, source_sample_rate: int, source_channels: int):
        if source_sample_rate != self.sample_rate or source_channels != self.channels:
            return
        samples = np.frombuffer(data, dtype=np.int16)
        self._buffer.extend(samples.tolist())

    def get_reference(self, num_samples: int) -> bytes:
        needed = int(num_samples) * self.channels
        if needed <= 0:
            return b""
        if len(self._buffer) < needed:
            data = list(self._buffer) + [0] * (needed - len(self._buffer))
        else:
            data = list(self._buffer)[-needed:]
        return np.array(data, dtype=np.int16).tobytes()
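The buffer's contract is worth making explicit: when fewer samples are buffered than requested it zero-pads, otherwise it returns only the most recent samples. A standalone sketch of that core (a pared-down `RefBuffer`, not the repo's class):

```python
import collections
import numpy as np

class RefBuffer:
    """Pared-down copy of ReferenceSignalBuffer's get/add contract."""

    def __init__(self, sample_rate: int, channels: int, buffer_seconds: float = 5.0):
        self.channels = channels
        self._buffer = collections.deque(maxlen=int(sample_rate * buffer_seconds) * channels)

    def add(self, data: bytes):
        self._buffer.extend(np.frombuffer(data, dtype=np.int16).tolist())

    def get(self, num_samples: int) -> bytes:
        needed = num_samples * self.channels
        if len(self._buffer) < needed:
            data = list(self._buffer) + [0] * (needed - len(self._buffer))  # zero-pad
        else:
            data = list(self._buffer)[-needed:]  # most recent samples only
        return np.array(data, dtype=np.int16).tobytes()

buf = RefBuffer(sample_rate=16000, channels=1)
buf.add(np.array([1, 2, 3], dtype=np.int16).tobytes())
# 5 samples requested, only 3 buffered -> zero-padded to [1, 2, 3, 0, 0]
print(np.frombuffer(buf.get(5), dtype=np.int16))
# 2 samples requested -> the newest two: [2, 3]
print(np.frombuffer(buf.get(2), dtype=np.int16))
```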


class EchoCanceller:
    """Echo canceller (based on aec-audio-processing)."""

    def __init__(self, sample_rate: int, frame_size: int, channels: int, ref_channels: int, logger=None):
        self.sample_rate = int(sample_rate)
        self.frame_size = int(frame_size)
        self.channels = int(channels)
        self.ref_channels = int(ref_channels)
        self.logger = logger
        self.aec = None
        self._process_reverse = None
        self._frame_bytes = int(self.sample_rate / 100) * self.channels * 2  # 10 ms, int16
        self._ref_frame_bytes = int(self.sample_rate / 100) * self.ref_channels * 2
        try:
            from aec_audio_processing import AudioProcessor
            self.aec = AudioProcessor(enable_aec=True, enable_ns=False, enable_agc=False)
            self.aec.set_stream_format(self.sample_rate, self.channels)
            if hasattr(self.aec, "set_reverse_stream_format"):
                self.aec.set_reverse_stream_format(self.sample_rate, self.ref_channels)
            if hasattr(self.aec, "set_stream_delay"):
                self.aec.set_stream_delay(0)
            if hasattr(self.aec, "process_reverse_stream"):
                self._process_reverse = self.aec.process_reverse_stream
            elif hasattr(self.aec, "process_reverse"):
                self._process_reverse = self.aec.process_reverse
        except Exception:
            self.aec = None

    def process(self, mic_data: bytes, ref_data: bytes) -> bytes:
        if not self.aec:
            return mic_data
        if not mic_data:
            return mic_data

        try:
            out_chunks = []
            total_len = len(mic_data)
            frame_bytes = self._frame_bytes
            ref_frame_bytes = self._ref_frame_bytes

            frame_count = (total_len + frame_bytes - 1) // frame_bytes
            for i in range(frame_count):
                m_start = i * frame_bytes
                m_end = m_start + frame_bytes
                mic_frame = mic_data[m_start:m_end]
                if len(mic_frame) < frame_bytes:
                    mic_frame = mic_frame + b"\x00" * (frame_bytes - len(mic_frame))

                if ref_data:
                    r_start = i * ref_frame_bytes
                    r_end = r_start + ref_frame_bytes
                    ref_frame = ref_data[r_start:r_end]
                    if len(ref_frame) < ref_frame_bytes:
                        ref_frame = ref_frame + b"\x00" * (ref_frame_bytes - len(ref_frame))
                    if self._process_reverse:
                        self._process_reverse(ref_frame)

                processed = self.aec.process_stream(mic_frame)
                out_chunks.append(processed if processed is not None else mic_frame)

            return b"".join(out_chunks)[:total_len]
        except Exception as e:
            if self.logger:
                self.logger.warning(f"Echo cancellation failed: {e}; using the raw audio")
            return mic_data
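`process` slices each chunk into 10 ms frames (sample_rate/100 samples, times 2 bytes for int16) and zero-pads the final partial frame before feeding them to the AEC. That bookkeeping can be sketched without the AEC dependency (the `split_frames` helper is hypothetical):

```python
def split_frames(data: bytes, sample_rate: int, channels: int = 1) -> list:
    """Split int16 PCM into zero-padded 10 ms frames, as EchoCanceller.process does."""
    frame_bytes = (sample_rate // 100) * channels * 2  # 10 ms of int16 samples
    frame_count = (len(data) + frame_bytes - 1) // frame_bytes  # ceiling division
    frames = []
    for i in range(frame_count):
        frame = data[i * frame_bytes:(i + 1) * frame_bytes]
        if len(frame) < frame_bytes:
            frame = frame + b"\x00" * (frame_bytes - len(frame))  # pad the last frame
        frames.append(frame)
    return frames

# 16 kHz mono: one 10 ms frame is 160 samples = 320 bytes,
# so 400 bytes of input becomes two frames with the second one padded.
frames = split_frames(b"\x01" * 400, sample_rate=16000)
print(len(frames), len(frames[0]), len(frames[1]))  # 2 320 320
```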
304 robot_speaker/perception/speaker_verifier.py Normal file
@@ -0,0 +1,304 @@
"""
Speaker verification (voiceprint) module
"""
import numpy as np
import threading
import tempfile
import os
import wave
import time
import json
from enum import Enum


class SpeakerState(Enum):
    """Speaker verification state."""
    UNKNOWN = "unknown"
    VERIFIED = "verified"
    REJECTED = "rejected"
    ERROR = "error"


class SpeakerVerificationClient:
    """Speaker verification client - non-realtime, low-frequency processing."""

    def __init__(self, model_path: str, threshold: float, speaker_db_path: str = None, logger=None):
        self.model_path = model_path
        self.threshold = threshold
        self.speaker_db_path = speaker_db_path
        self.logger = logger
        self.speaker_db = {}  # {speaker_id: {"embedding": np.ndarray, "env": str, "threshold": float, "registered_at": float}}
        self._lock = threading.Lock()

        # CPU optimization: cap the number of Torch threads to avoid the severe
        # slowdown caused by multi-thread contention
        import torch
        torch.set_num_threads(1)

        from funasr import AutoModel
        model_path = os.path.expanduser(self.model_path)
        # Disable the auto-update check so initialization never goes online
        self.model = AutoModel(model=model_path, device="cpu", disable_update=True)
        if self.logger:
            self.logger.info(f"Voiceprint model loaded: {model_path}, threshold: {self.threshold}")

        if self.speaker_db_path:
            self.load_speakers()

    def _log(self, level: str, msg: str):
        """Log a message - works around a ROS2 logger issue in multi-threaded code."""
        if self.logger:
            try:
                log_methods = {
                    "debug": self.logger.debug,
                    "info": self.logger.info,
                    "warning": self.logger.warning,
                    "error": self.logger.error,
                    "fatal": self.logger.fatal
                }
                log_method = log_methods.get(level.lower(), self.logger.info)
                log_method(msg)
            except ValueError as e:
                if "severity cannot be changed" in str(e):
                    try:
                        self.logger.info(f"[Voiceprint-{level.upper()}] {msg}")
                    except Exception:
                        print(f"[Voiceprint-{level.upper()}] {msg}")
                else:
                    raise
        else:
            print(f"[Voiceprint] {msg}")

    def _write_temp_wav(self, audio_data: np.ndarray, sample_rate: int = 16000):
        """Write a numpy audio array to a temporary wav file."""
        audio_int16 = audio_data.astype(np.int16)

        fd, temp_path = tempfile.mkstemp(suffix='.wav', prefix='sv_')
        os.close(fd)

        with wave.open(temp_path, 'wb') as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(sample_rate)
            wav_file.writeframes(audio_int16.tobytes())

        return temp_path
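`_write_temp_wav` produces a standard 16-bit mono WAV via `tempfile.mkstemp` plus the `wave` module; the caller is responsible for unlinking it (as `extract_embedding` does in its `finally` block). A standalone round-trip check of that pattern:

```python
import os
import tempfile
import wave

import numpy as np

def write_temp_wav(audio: np.ndarray, sample_rate: int = 16000) -> str:
    """Write int16 mono samples to a temp .wav and return its path."""
    fd, path = tempfile.mkstemp(suffix='.wav', prefix='sv_')
    os.close(fd)  # wave reopens the path itself
    with wave.open(path, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 2 bytes per sample = int16
        w.setframerate(sample_rate)
        w.writeframes(audio.astype(np.int16).tobytes())
    return path

samples = np.array([0, 100, -100, 32767], dtype=np.int16)
path = write_temp_wav(samples)
with wave.open(path, 'rb') as w:
    print(w.getframerate(), w.getnframes())  # 16000 4
os.unlink(path)  # caller cleans up
```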

    def extract_embedding(self, audio_data: np.ndarray, sample_rate: int = 16000):
        """
        Extract the speaker embedding (low-frequency: called once per utterance).
        """
        # Downsample to 16000 Hz if needed.
        # Cam++-style models usually support 16 kHz only; feeding 48 kHz makes the
        # internal resampling extremely slow or blows up the compute cost.
        target_sr = 16000
        if sample_rate > target_sr:
            if sample_rate % target_sr == 0:
                step = sample_rate // target_sr
                audio_data = audio_data[::step]
                sample_rate = target_sr
            else:
                # Naive non-integer decimation can cause artifacts, but for speech
                # verification 48k -> 16k is normally an integer ratio. If it is
                # not, fall back to truncated-step decimation and rely on funasr's
                # internal handling.
                step = int(sample_rate / target_sr)
                audio_data = audio_data[::step]
                sample_rate = target_sr

        if len(audio_data) < int(sample_rate * 0.5):
            return None, False

        temp_wav_path = None
        try:
            # Keep Torch single-threaded during inference to avoid the extreme CPU
            # contention and context-switch overhead when recording and recognizing
            # at the same time
            import torch
            with torch.inference_mode():
                # set_num_threads is global and was set at startup; re-confirm
                # before each call
                if torch.get_num_threads() != 1:
                    torch.set_num_threads(1)

                temp_wav_path = self._write_temp_wav(audio_data, sample_rate)
                result = self.model.generate(input=temp_wav_path)

                embedding = result[0]['spk_embedding'].detach().cpu().numpy()[0]  # shape [1, 192] -> [192]

                embedding_dim = len(embedding)
                if embedding_dim == 0:
                    return None, False

                return embedding, True
        except Exception as e:
            self._log("error", f"Embedding extraction failed: {e}")
            return None, False
        finally:
            if temp_wav_path and os.path.exists(temp_wav_path):
                try:
                    os.unlink(temp_wav_path)
                except Exception:
                    pass

    def register_speaker(self, speaker_id: str, embedding: np.ndarray,
                         env: str = "near", threshold: float = None) -> bool:
        """
        Register a speaker.
        """
        embedding_dim = len(embedding)
        if embedding_dim == 0:
            return False
        embedding_norm = np.linalg.norm(embedding)
        if embedding_norm == 0:
            self._log("error", "Registration failed: embedding norm is 0")
            return False
        embedding_normalized = embedding / embedding_norm

        speaker_threshold = threshold if threshold is not None else self.threshold

        with self._lock:
            self.speaker_db[speaker_id] = {
                "embedding": embedding_normalized,
                "env": env,  # environment tag
                "threshold": speaker_threshold,
                "registered_at": time.time()
            }
        self._log("info", f"Speaker registered: {speaker_id}, threshold: {speaker_threshold:.3f}, dim: {embedding_dim}")
        save_result = self.save_speakers()
        if not save_result:
            self._log("info", f"Saving the voiceprint database failed, but the speaker is registered in memory: {speaker_id}")
        return True

    def match_speaker(self, embedding: np.ndarray):
        """
        Match a speaker (called once per utterance).
        """
        if not self.speaker_db:
            return None, SpeakerState.UNKNOWN, 0.0, self.threshold

        embedding_dim = len(embedding)
        if embedding_dim == 0:
            return None, SpeakerState.ERROR, 0.0, self.threshold

        embedding_norm = np.linalg.norm(embedding)
        if embedding_norm == 0:
            return None, SpeakerState.ERROR, 0.0, self.threshold
        embedding_normalized = embedding / embedding_norm

        best_match = None
        best_score = -1.0
        best_threshold = self.threshold

        with self._lock:
            for speaker_id, speaker_data in self.speaker_db.items():
                ref_embedding = speaker_data["embedding"]
                score = np.dot(embedding_normalized, ref_embedding)

                if score > best_score:
                    best_score = score
                    best_match = speaker_id
                    best_threshold = speaker_data["threshold"]

        state = SpeakerState.VERIFIED if best_score >= best_threshold else SpeakerState.REJECTED
        return (best_match, state, best_score, best_threshold)
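Because every stored embedding is L2-normalized at registration, `match_speaker` reduces cosine similarity to a plain dot product and compares the best score against that speaker's own threshold. The scoring core isolated as a sketch (hypothetical `best_cosine_match` helper, simplified to a single shared threshold):

```python
import numpy as np

def best_cosine_match(query: np.ndarray, db: dict, threshold: float):
    """db maps speaker_id -> unit-norm embedding; returns (id, score, verified)."""
    q = query / np.linalg.norm(query)  # normalize once, then dot == cosine
    best_id, best_score = None, -1.0
    for speaker_id, ref in db.items():
        score = float(np.dot(q, ref))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id, best_score, best_score >= threshold

db = {
    "alice": np.array([1.0, 0.0]),  # already unit-norm, as in speaker_db
    "bob": np.array([0.0, 1.0]),
}
# A query pointing mostly along alice's axis matches alice and clears the threshold
print(best_cosine_match(np.array([2.0, 0.5]), db, threshold=0.6))
```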

    def is_available(self) -> bool:
        return self.model is not None

    def cleanup(self):
        """Release resources."""
        pass

    def get_speaker_count(self) -> int:
        with self._lock:
            return len(self.speaker_db)

    def remove_speaker(self, speaker_id: str) -> bool:
        with self._lock:
            if speaker_id not in self.speaker_db:
                return False
            del self.speaker_db[speaker_id]
        self.save_speakers()
        return True

    def load_speakers(self) -> bool:
        """
        Load registered voiceprints from file.
        """
        if not self.speaker_db_path:
            return False

        if not os.path.exists(self.speaker_db_path):
            self._log("info", f"Voiceprint database file does not exist: {self.speaker_db_path}; a new database will be created")
            return False

        try:
            with open(self.speaker_db_path, 'r', encoding='utf-8') as f:
                data = json.load(f)

            with self._lock:
                for speaker_id, speaker_data in data.items():
                    embedding_list = speaker_data["embedding"]
                    embedding_array = np.array(embedding_list, dtype=np.float32)

                    embedding_dim = len(embedding_array)
                    if embedding_dim == 0:
                        self._log("warning", f"Skipping invalid voiceprint: {speaker_id} (dimension 0)")
                        continue
                    embedding_norm = np.linalg.norm(embedding_array)
                    if embedding_norm > 0:
                        embedding_array = embedding_array / embedding_norm

                    self.speaker_db[speaker_id] = {
                        "embedding": embedding_array,
                        "env": speaker_data["env"],
                        "threshold": speaker_data["threshold"],
                        "registered_at": speaker_data["registered_at"]
                    }

            count = len(self.speaker_db)
            self._log("info", f"Loaded {count} registered speakers")
            return True
        except Exception as e:
            self._log("error", f"Failed to load the voiceprint database: {e}")
            return False

    def save_speakers(self) -> bool:
        """
        Save registered voiceprints to file.
        """
        if not self.speaker_db_path:
            self._log("warning", "Voiceprint database path not configured; cannot save to file (speaker registered in memory only)")
            return False

        try:
            db_dir = os.path.dirname(self.speaker_db_path)
            if db_dir and not os.path.exists(db_dir):
                os.makedirs(db_dir, exist_ok=True)
            json_data = {}
            with self._lock:
                for speaker_id, speaker_data in self.speaker_db.items():
                    json_data[speaker_id] = {
                        "embedding": speaker_data["embedding"].tolist(),  # numpy array -> list
                        "env": speaker_data.get("env", "near"),  # default "near" for legacy data
                        "threshold": speaker_data["threshold"],
                        "registered_at": speaker_data["registered_at"]
                    }

            temp_path = self.speaker_db_path + ".tmp"
            with open(temp_path, 'w', encoding='utf-8') as f:
                json.dump(json_data, f, indent=2, ensure_ascii=False)

            os.replace(temp_path, self.speaker_db_path)

            self._log("info", f"Saved {len(json_data)} speakers to: {self.speaker_db_path}")
            return True
        except Exception as e:
            import traceback
            self._log("error", f"Failed to save the voiceprint database: {e}")
            self._log("error", f"Save path: {self.speaker_db_path}")
            self._log("error", f"Details: {traceback.format_exc()}")
            temp_path = self.speaker_db_path + ".tmp"
            if os.path.exists(temp_path):
                try:
                    os.unlink(temp_path)
                except Exception:
                    pass
            return False
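`save_speakers` uses the write-to-temp-then-`os.replace` pattern: the rename is atomic, so a crash mid-write never leaves a truncated database behind. The pattern in isolation (hypothetical `atomic_save_json` helper):

```python
import json
import os
import tempfile

def atomic_save_json(path: str, payload: dict):
    """Write JSON to path atomically: temp file first, then os.replace."""
    tmp = path + ".tmp"
    with open(tmp, 'w', encoding='utf-8') as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)  # atomic rename; readers never see a partial file

d = tempfile.mkdtemp()
db_path = os.path.join(d, "speakers.json")
atomic_save_json(db_path, {"alice": {"threshold": 0.35}})
with open(db_path, encoding='utf-8') as f:
    print(json.load(f)["alice"]["threshold"])  # 0.35
print(os.path.exists(db_path + ".tmp"))       # False: the temp file was renamed away
```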
@@ -1,55 +0,0 @@
import rclpy
from rclpy.node import Node
from example_interfaces.msg import String
import threading
from queue import Queue
import time
import espeakng
import pyttsx3


class RobotSpeakerNode(Node):
    def __init__(self, node_name):
        super().__init__(node_name)
        self.novels_queue_ = Queue()
        self.novel_subscriber_ = self.create_subscription(
            String, 'robot_msg', self.novel_callback, 10)
        self.speech_thread_ = threading.Thread(target=self.speak_thread)
        self.speech_thread_.start()

    def novel_callback(self, msg):
        self.novels_queue_.put(msg.data)

    def speak_thread(self):
        # Initialize the engine
        engine = pyttsx3.init()
        # Tune the parameters
        engine.setProperty('rate', 150)    # speech rate (150 sounds natural)
        engine.setProperty('volume', 1.0)  # volume (0.0-1.0)

        # Pick a Chinese voice (note: use the `languages` attribute, which is a list)
        voices = engine.getProperty('voices')
        for voice in voices:
            # Check whether the voice's language list contains Chinese ('zh', 'zh-CN', ...)
            if any('zh' in lang for lang in voice.languages):
                engine.setProperty('voice', voice.id)
                self.get_logger().info(f'Selected Chinese voice: {voice.id}')
                break
        else:
            self.get_logger().warning('No Chinese voice found; using the default voice')

        while rclpy.ok():
            if self.novels_queue_.qsize() > 0:
                text = self.novels_queue_.get()
                engine.say(text)
                engine.runAndWait()  # wait until playback finishes
            else:
                time.sleep(0.5)


def main(args=None):
    rclpy.init(args=args)
    node = RobotSpeakerNode("robot_speaker_node")
    rclpy.spin(node)
    rclpy.shutdown()
5 robot_speaker/understanding/__init__.py Normal file
@@ -0,0 +1,5 @@
"""Understanding layer."""
111 robot_speaker/understanding/context_manager.py Normal file
@@ -0,0 +1,111 @@
"""
Conversation history management module
"""
from robot_speaker.core.types import LLMMessage
import threading


class ConversationHistory:
    """Conversation history manager - realtime voice."""

    def __init__(self, max_history: int, summary_trigger: int):
        self.max_history = max_history
        self.summary_trigger = summary_trigger
        self.conversation_history: list[LLMMessage] = []
        self.summary: str | None = None

        # Pending-confirmation mechanism
        self._pending_user_message: LLMMessage | None = None  # user message awaiting confirmation
        self._lock = threading.Lock()  # thread-safety lock

    def start_turn(self, user_content: str):
        """Start a new turn: stash the user message until the LLM finishes, then commit it to history."""
        with self._lock:
            self._pending_user_message = LLMMessage(role="user", content=user_content)

    def commit_turn(self, assistant_content: str) -> bool:
        """Confirm the current turn and write the user and assistant messages to history."""
        with self._lock:
            if self._pending_user_message is None:
                return False

            if not assistant_content or not assistant_content.strip():
                self._pending_user_message = None
                return False

            self.conversation_history.append(self._pending_user_message)
            self.conversation_history.append(
                LLMMessage(role="assistant", content=assistant_content.strip())
            )

            self._pending_user_message = None

            self._maybe_compress()
            return True

    def cancel_turn(self):
        """Cancel the pending turn and drop its user message; used on interruption so incomplete content never pollutes the history."""
        with self._lock:
            if self._pending_user_message is not None:
                self._pending_user_message = None

    def add_message(self, role: str, content: str):
        """Append a message directly."""
        with self._lock:
            # If a turn is pending, drop it first (inlined here rather than calling
            # cancel_turn(), which would re-acquire the non-reentrant lock)
            self._pending_user_message = None
            self.conversation_history.append(LLMMessage(role=role, content=content))
            self._maybe_compress()

    def get_messages(self) -> list[LLMMessage]:
        """Get the message list."""
        with self._lock:
            messages = []

            if self.summary:
                messages.append(LLMMessage(role="system", content=self.summary))

            if self.max_history > 0:
                messages.extend(self.conversation_history[-self.max_history * 2:])

            if self._pending_user_message is not None:
                messages.append(self._pending_user_message)

            return messages

    def has_pending_turn(self) -> bool:
        """Check whether a turn is pending confirmation."""
        with self._lock:
            return self._pending_user_message is not None

    def _maybe_compress(self):
        """Compress the conversation history."""
        if self.max_history <= 0:
            self.conversation_history.clear()
            return

        max_len = self.summary_trigger * 2
        if len(self.conversation_history) <= max_len:
            return

        old = self.conversation_history[:-max_len]
        self.conversation_history = self.conversation_history[-max_len:]

        summary_text = []
        for msg in old:
            summary_text.append(f"{msg.role}: {msg.content}")

        compressed = "Conversation summary:\n" + "\n".join(summary_text[-10:])

        if self.summary:
            self.summary += "\n" + compressed
        else:
            self.summary = compressed

    def clear(self):
        """Clear the history and any pending message."""
        with self._lock:
            self.conversation_history.clear()
            self.summary = None
            self._pending_user_message = None
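The pending-turn protocol above (`start_turn` stashes the user message, `commit_turn` writes both sides at once, `cancel_turn` drops an interrupted turn) keeps half-finished exchanges out of the history. A minimal standalone sketch, with plain `(role, content)` tuples standing in for the repo's `LLMMessage` type:

```python
class MiniHistory:
    """Pared-down pending-turn protocol; tuples stand in for LLMMessage."""

    def __init__(self):
        self.history = []
        self._pending = None

    def start_turn(self, user_content: str):
        self._pending = ("user", user_content)  # stashed, not yet in history

    def commit_turn(self, assistant_content: str) -> bool:
        if self._pending is None or not assistant_content.strip():
            self._pending = None
            return False
        # Both sides land atomically, so the history never holds an orphan question
        self.history += [self._pending, ("assistant", assistant_content.strip())]
        self._pending = None
        return True

    def cancel_turn(self):
        self._pending = None  # an interrupted turn never reaches the history

h = MiniHistory()
h.start_turn("hello")
h.cancel_turn()            # interruption: nothing is recorded
h.start_turn("hello again")
h.commit_turn("hi there")
print(h.history)           # [('user', 'hello again'), ('assistant', 'hi there')]
```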
22 setup.py
@@ -1,26 +1,36 @@
-from setuptools import find_packages, setup
+from setuptools import setup, find_packages
+import os
+from glob import glob

 package_name = 'robot_speaker'

 setup(
     name=package_name,
-    version='0.0.0',
-    packages=[package_name],
+    version='0.0.1',
+    packages=find_packages(where='.'),
+    package_dir={'': '.'},
     data_files=[
         ('share/ament_index/resource_index/packages',
             ['resource/' + package_name]),
         ('share/' + package_name, ['package.xml']),
+        (os.path.join('share', package_name, 'launch'), glob('launch/*.launch.py')),
+        (os.path.join('share', package_name, 'config'), glob('config/*.yaml') + glob('config/*.json')),
     ],
-    install_requires=['setuptools'],
+    install_requires=[
+        'setuptools',
+        'pypinyin',
+    ],
     zip_safe=True,
     maintainer='mzebra',
     maintainer_email='mzebra@foxmail.com',
-    description='TODO: Package description',
+    description='Speech recognition and synthesis ROS2 package',
     license='Apache-2.0',
     tests_require=['pytest'],
     entry_points={
         'console_scripts': [
-            'robot_speaker_node=robot_speaker.robot_speaker_node:main'
+            'robot_speaker_node = robot_speaker.core.robot_speaker_node:main',
+            'register_speaker_node = robot_speaker.core.register_speaker_node:main',
+            'skill_bridge_node = robot_speaker.bridge.skill_bridge_node:main',
         ],
     },
 )
68 view_camera.py Executable file
@@ -0,0 +1,68 @@
#!/usr/bin/env python3
"""
Simple script to view the camera feed.
Press space to save the current frame, 'q' to quit.
"""
import sys
import cv2
import numpy as np
try:
    import pyrealsense2 as rs
except ImportError:
    print("Error: pyrealsense2 is not installed; run: pip install pyrealsense2")
    sys.exit(1)


def main():
    # Configure the camera
    pipeline = rs.pipeline()
    config = rs.config()

    # Enable the color stream
    config.enable_stream(rs.stream.color, 640, 480, rs.format.rgb8, 30)

    # Start the pipeline
    pipeline.start(config)
    print("Camera started; press space to save a frame, 'q' to quit")

    frame_count = 0
    try:
        while True:
            # Wait for a frame
            frames = pipeline.wait_for_frames()
            color_frame = frames.get_color_frame()

            if not color_frame:
                continue

            # Convert to a numpy array (RGB)
            color_image = np.asanyarray(color_frame.get_data())

            # OpenCV expects BGR, so convert
            bgr_image = cv2.cvtColor(color_image, cv2.COLOR_RGB2BGR)

            # Show the image
            cv2.imshow('Camera View', bgr_image)

            # Poll for a key press
            key = cv2.waitKey(1) & 0xFF

            if key == ord('q'):
                print("Quitting...")
                break
            elif key == ord(' '):  # space saves the frame
                frame_count += 1
                filename = f'camera_frame_{frame_count:04d}.jpg'
                cv2.imwrite(filename, bgr_image)
                print(f"Saved: {filename}")

    except KeyboardInterrupt:
        print("\nInterrupted...")
    finally:
        pipeline.stop()
        cv2.destroyAllWindows()
        print("Camera closed")


if __name__ == '__main__':
    main()