What’s Happening?
Many of you are probably wondering where we are with the upcoming release of Kenzy v2.0, so I’ve decided to give a short update on where things stand.
MyCroft’s Demise
As you may be aware, the open source personal assistant known as MyCroft had to shut down earlier this year. The details on why can be found on their blog, but the shutdown put us in a rather interesting position with Kenzy. In our original release, Kenzy relied on MyCroft’s “padaos” and “padatious” libraries for intent parsing (padatious being the neural network-based one); these were central to interpreting incoming voice commands and translating them into actions. In addition, there were plans to switch from the robotic Festival (festvox) voice to MyCroft’s Mimic 3, which offers around 500 different voices that sound pretty good in my opinion.
Well, like most good things, MyCroft has come to an end. In order to push Kenzy forward I’ve decided to fork the padaos and padatious libraries from MyCroft and continue to support them for compatibility and upgrades. Who knows, they may even get some new features along the way.
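For anyone unfamiliar with how these libraries get used, here’s a minimal sketch of the padatious workflow: register a few example phrasings per intent, train, then match incoming utterances. The intent names and sample phrases below are purely illustrative, not Kenzy’s actual skill definitions.

```python
# pip install padatious
from padatious import IntentContainer

# The cache directory keeps the trained intent models between runs.
container = IntentContainer('intent_cache')

# Register a couple of illustrative intents with example phrasings.
container.add_intent('lights.on', ['Turn on the lights', 'Lights on please'])
container.add_intent('weather', ['What is the weather like', 'Is it raining'])
container.train()

# Match a transcribed voice command against the trained intents.
match = container.calc_intent('please turn the lights on')
print(match.name, match.conf)  # e.g. lights.on with a confidence near 1.0
```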
Problem #1: Solved
Coqui’s Re-focus
Kenzy 1.0 originally used Mozilla DeepSpeech as its speech-to-text transcription model. DeepSpeech has been dead for a while now, and most of the developers moved over to Coqui, where they built out a model zoo for speech-to-text that was almost a drop-in replacement for DeepSpeech.
Late last year Coqui decided to take a different direction and focus their efforts on text-to-speech instead of speech-to-text, and in doing so they stopped developing their speech-to-text models. Ugh and double ugh.
Enter OpenAI’s Whisper.
I found Whisper through Coqui, as they listed it as one of the reasons they were leaving the speech-to-text market. Well, after a bit of research and trial and error I was able to confirm that Whisper could work under specific conditions for Kenzy’s operation. The downside right now is that it does not like the Raspberry Pi very much because of the torch/torchaudio libraries it requires. For now, though, I’m going forward with it, as its accuracy is very impressive (and miles ahead of Coqui’s, which makes it easy to see why they surrendered that battle and moved on to other things).
Whisper is still open source and it does run locally (no cloud services or 3rd party integrations required… sort of).
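For the curious, local transcription with Whisper really is only a few lines of Python. The model size and audio file name below are just placeholders; smaller models like “base.en” are the realistic choice on modest hardware.

```python
# pip install openai-whisper  (pulls in torch/torchaudio, hence the Raspberry Pi pain)
import whisper

# Smaller models trade a little accuracy for a lot of speed on CPU-only boxes.
model = whisper.load_model("base.en")

# "command.wav" is a placeholder for a captured voice command.
result = model.transcribe("command.wav", fp16=False)  # fp16=False avoids a warning on CPU
print(result["text"])
```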
Problem #2: Workaround Enabled
Status Update
Kenzy 1.0 to Kenzy 2.0 has been a journey so far, but at each turn I’ve learned a lot and it’s been exciting. I’ve had to replace virtually every major library, but in the process every major feature is now more configurable with higher quality results. I also re-engineered the service processes, communication mechanisms, and interactions so that they are more robust. In the new model each service can run independently on the same or separate hardware without conflicts, so you can grow your Kenzy network however you want. It all still works 100% locally, but the hardware requirements are a bit higher at present (and that’s a problem I’m still working through).
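To make that idea concrete, here’s a purely hypothetical sketch of what “each service runs independently on the same or separate hardware” means in practice: every device runs a small standalone process that the rest of the network can reach over HTTP. The class name, port, and endpoint below are illustrative only and are not Kenzy’s actual API.

```python
# Purely illustrative: each service is a small standalone process that other
# services reach over the LAN, so it can live on whichever host you like.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class ListenerService(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a single Kenzy-style service (e.g. a listener)."""

    def do_GET(self):
        # A simple status endpoint lets a control panel poll each device independently.
        if self.path == "/status":
            body = json.dumps({"service": "listener", "active": True}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Run the same script on each mini PC; only the host/port differ per device.
    HTTPServer(("0.0.0.0", 8080), ListenerService).serve_forever()
```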
When will it be available?
Soon-ish.
I’ve actually got Kenzy running on 4x mini PCs that are watching over my house with 7x 2K network cameras. Right now she is just recording clips for security purposes when she sees a person, but she’s also looking for faces of people she knows and has the capacity to capture transcribed audio.
I’ve still got to wire up the skills from version 1.0 to 2.0 and finish out the web control panel, but those will likely be over-simplified for version 2.0.
I don’t have a release date yet as this has been a labor of love to this point. Feel free to shoot me a message if you’d like to buy me a cup of coffee or sponsor any of the features to get things to a stable state faster… or better yet just clone the repo and start submitting your merge requests.
Don’t worry, as I’m using Kenzy to build out my personal smart home it will continue to be developed (however slowly) for the foreseeable future regardless of income. I’ll try to get a post out in the coming days on that whole setup.