The same utterance animated from four discrete speech representations, each paired with both decoder architectures.