SPCell


My Experience with the RVC Neural Network

Hello, readers! I’m SPCell, and I’d like to share my experience working with the RVC neural network, which I used to create “AI covers”: transferring the vocal performance of one singer onto another person’s voice. I began experimenting in the summer of 2023, settled on optimal parameters by early 2024, and have since returned from time to time to fine‑tune the quality.
At the heart of it all is the training process: you have to choose argument values so that the output voice sounds as natural as possible and is free of artifacts. If the model is under‑trained, the voice sounds robotic; if it is over‑trained, the pitch begins to “jump,” though the voice itself still sounds acceptable, which I found preferable to audible glitches. Initially, harvest was considered the best pitch‑extraction method, but after rmvpe appeared I switched to it, and later, when rmvpe+ came out, I adopted that as well, since it produced a modest but noticeable improvement over the version without the “+”.
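harvest and rmvpe are far more sophisticated than anything shown here, but the core task they perform, estimating the fundamental frequency (f0) of the voice, can be sketched with a naive autocorrelation estimator. This is purely illustrative of the idea, not how RVC actually does it:

```python
import math

def estimate_f0(samples, sr, fmin=80.0, fmax=1000.0):
    """Naive autocorrelation pitch estimator (illustrative only)."""
    lag_min = int(sr / fmax)   # shortest period we consider
    lag_max = int(sr / fmin)   # longest period we consider
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # a periodic signal correlates strongly with itself one period later
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(2048)]
f0 = estimate_f0(tone, sr)  # comes out within a few Hz of 220
```

Real extractors add windowing, interpolation between lags, and (in rmvpe’s case) a neural model, which is why they stay stable on noisy vocals where a sketch like this would not.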
Other training arguments I tweaked included:
* bitrate (depends on the sample rate of your dataset files),
* hop length (controls how strictly the pitch matches the original; lower values force a tighter match, higher values allow more flexibility),
* thread count (likely tied to how many GPU threads are used, affecting training strength),
* batch size (simultaneous file processing to speed up training; I set it to the maximum my GPU could handle),
* the total number of epochs (and saving checkpoints at intervals),
* the number of GPUs used.
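Hop length in particular is easy to picture: the pitch extractor produces roughly one f0 estimate per hop, so a smaller hop yields a denser pitch curve that follows the original more tightly. A back-of-the-envelope sketch (the exact frame count depends on windowing and the specific implementation; 40 kHz is one common RVC training sample rate):

```python
def n_pitch_frames(num_samples, hop_length):
    # roughly one pitch estimate every hop_length samples
    return 1 + num_samples // hop_length

sr = 40000            # assumed training sample rate
one_second = sr
print(n_pitch_frames(one_second, 64))   # 626 estimates/s: tight pitch tracking
print(n_pitch_frames(one_second, 512))  # 79 estimates/s: looser tracking
```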
I trained my models on Kaggle, where I could employ two GPUs, but I found that using a single GPU produced a cleaner final voice. To separate vocals from instrumentals I used Ultimate Vocal Remover, then cleaned any remaining artifacts in Adobe Audition, RX Pro Audio Editor, and SpectraLayers.
When it came time to generate covers, I always specified rmvpe or rmvpe+ as an argument, testing pitch adjustments separately so that the voice would match my dataset. In songs where the original singer performed at unusually high or low pitches, I’d raise or lower the generation pitch relative to the song’s normal sections (where the singer stays on a single tone) to keep the character of the voice aligned with the dataset.
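The pitch adjustments I’m describing follow the standard equal‑temperament rule: shifting by n semitones multiplies the frequency by 2^(n/12). For example:

```python
def shift_f0(f0_hz, semitones):
    # equal temperament: each semitone is a factor of 2**(1/12)
    return f0_hz * 2 ** (semitones / 12)

print(shift_f0(220.0, 12))            # 440.0 -- one octave up
print(round(shift_f0(220.0, -5), 1))  # 164.8 -- five semitones down
```

This is why a +12 setting sounds like the same performance an octave higher, while small shifts of a few semitones are enough to pull an unusually high or low passage back into the range the dataset voice was trained on.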
Finally, I want to talk about building the dataset. I’ve placed this section at the end because it consumed most of my time. In total, I aimed for about 42–50 minutes of clean audio (I don’t recall the exact figure). Even when working with game‑dialogue datasets, where characters speak a lot, you still have to listen through all the lines, cut out the unwanted ones, and stay within the time limit. The hardest datasets were recordings of real people, since you’d have to chop up their speech and spend hours cleaning background noise with the programs mentioned above. Those huge investments of time and energy, combined with zero donations, ultimately demotivated me from continuing neurocovers (though I might return someday, as I still have ideas).
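Tracking how much clean audio you’ve accumulated against a 42–50 minute target is the one part of dataset building that’s trivial to automate. A small stdlib‑only sketch (the folder path is hypothetical):

```python
import wave
from pathlib import Path

def dataset_minutes(folder):
    """Sum the duration of every .wav file in a folder, in minutes."""
    total_seconds = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            # frames / frame-rate gives the clip's length in seconds
            total_seconds += wf.getnframes() / wf.getframerate()
    return total_seconds / 60.0

# e.g. dataset_minutes("dataset/") should land in the 42-50 minute range
```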
The final stage of creating each cover involved syncing the character’s vocal track with the instrumental, adding reverb to the vocal, creating or sourcing a thematic cover image, inserting it into my Photoshop template, setting up simple animations in Adobe After Effects, and rendering the final video.
That’s the story of my time with RVC. It once amazed me and was a lot of fun, but the time investment and lack of audience feedback dampened my enthusiasm. My current priority is Stable Diffusion, which I’ve also been exploring since summer 2023. I’ll probably break my experience with it into a series of posts since there’s far more material to cover, so stay tuned for plenty of interesting content!
Subscription levels

Benefactor

$5.7 per month
Simply a thank‑you for supporting me. If more donors come along, I’ll consider introducing additional subscription levels with special perks.