My Experience with the RVC Neural Network
Hello, readers! I’m SPCell, and I’d like to share my experience working with the RVC neural network, which I used to create “AI covers”: taking the vocal performance from a song and re‑rendering it in another person’s voice. I began experimenting in summer 2023, had settled on my preferred parameters by early 2024, and have returned from time to time since then to fine‑tune the quality.
At the heart of it all is the training process: you have to choose argument values so that the output voice sounds as natural as possible and is free of artifacts. If the model is under‑trained, the voice sounds robotic; if it’s over‑trained, the pitch begins to “jump,” although the voice itself still sounds acceptable. Given the choice, I prefer slight over‑training to audible glitches. When I started, harvest was considered the best pitch extraction method, but after rmvpe appeared I switched to it, and when rmvpe+ came out I adopted that as well, since it gave a modest but noticeable improvement over the version without the “+.”
Other training arguments I tweaked included:
*bitrate (in practice the target sample rate; it should match the sample rate of your dataset files),
*hop length (controls how strictly the pitch matches the original; lower values force a tighter match, higher values allow more flexibility),
*thread count (as far as I can tell, the number of parallel processes used for preprocessing and pitch extraction, so it mainly affects speed rather than the result),
*batch size (how many samples are processed in one training step; larger values speed up training, so I set it to the maximum my GPU could handle),
*total number of epochs (with checkpoints saved at regular intervals),
*the number of GPUs used.
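To make a few of these settings more concrete, here is a small Python sketch of the arithmetic behind them. It is not RVC's own code or configuration format: the class, the field names, and the default values are all made up for illustration, and it assumes hop length is measured in audio samples between pitch estimates.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainSettings:
    # Hypothetical names and example values; not RVC's real option names.
    sample_rate: int = 40000   # Hz, should match the dataset files
    hop_length: int = 128      # assumed: samples between pitch estimates
    batch_size: int = 8        # samples per training step
    total_epochs: int = 300
    save_every: int = 50       # checkpoint interval, in epochs


def hop_ms(s: TrainSettings) -> float:
    """Time between pitch estimates in milliseconds (smaller = tighter match)."""
    return 1000.0 * s.hop_length / s.sample_rate


def steps_per_epoch(s: TrainSettings, n_clips: int) -> int:
    """Training steps needed to see every dataset clip once."""
    return math.ceil(n_clips / s.batch_size)


def checkpoint_epochs(s: TrainSettings) -> list[int]:
    """Epochs at which a checkpoint would be saved."""
    return list(range(s.save_every, s.total_epochs + 1, s.save_every))


if __name__ == "__main__":
    settings = TrainSettings()
    print(hop_ms(settings))
    print(steps_per_epoch(settings, n_clips=100))
    print(checkpoint_epochs(settings))
```

Saving checkpoints at intervals is what lets you compare, say, an earlier and a later epoch by ear and pick the one that is neither robotic nor "jumpy."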
I trained my models on Kaggle, where I could use two GPUs, but I found that training on a single GPU gave a cleaner final voice. To separate vocals from instrumentals I used Ultimate Vocal Remover, then cleaned up any remaining artifacts in Adobe Audition, RX Pro Audio Editor, and SpectraLayers.
When it came time to generate covers, I always specified rmvpe or rmvpe+ as the pitch extraction argument and tested the pitch adjustment separately for each song, so that the converted voice would sit in the same range as my dataset. In songs where the original singer goes unusually high or low, I’d raise or lower the generation pitch for those passages relative to the setting I used for the song’s normal sections (where the singer stays in their usual range), to keep the character of the voice aligned with the dataset.
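The pitch adjustment can be estimated before you start testing by ear. The sketch below is only an illustration of the idea, not part of RVC: it assumes the pitch setting is expressed in semitones, that you already have per‑frame pitch values in Hz for both the song and the dataset (with 0 for unvoiced frames), and the function names are my own.

```python
import math
from statistics import median


def semitone_shift(source_hz: float, target_hz: float) -> int:
    """Whole-semitone shift that moves source_hz closest to target_hz.

    One semitone is a frequency ratio of 2 ** (1 / 12).
    """
    return round(12 * math.log2(target_hz / source_hz))


def suggest_transpose(song_f0: list[float], dataset_f0: list[float]) -> int:
    """Suggest a starting transpose from the median pitch of each recording.

    Frames with a value of 0 are treated as unvoiced and ignored.
    """
    voiced_song = [f for f in song_f0 if f > 0]
    voiced_data = [f for f in dataset_f0 if f > 0]
    return semitone_shift(median(voiced_song), median(voiced_data))


if __name__ == "__main__":
    # A song centred around 220 Hz, a dataset voice centred around 110 Hz.
    print(suggest_transpose([0, 200, 220, 240], [0, 110, 110]))
```

A number like this is only a starting point; the final value still has to be chosen by listening, especially for passages far from the singer's usual range.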