Recognition of emotion in speech usually relies on acoustic models that ignore the spoken content. Likewise, one general model per emotion is trained independently of the phonetic structure. Given sufficient data, this approach seems to work well enough. This paper addresses the question of whether acoustic emotion recognition strongly depends on phonetic content, and whether models tailored to the spoken unit can lead to higher accuracy. We therefore investigate phoneme- and word-level models using a large prosodic, spectral, and voice-quality feature space and Support Vector Machines (SVM). The experiments also account for the need for automatic speech recognition (ASR) to select the appropriate unit models. Speaker-independent test runs on the well-known EMO-DB database demonstrate the superiority of word-level emotion models over today's common general models, provided that sufficient occurrences exist in the training corpus.
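As a rough illustration of the unit-specific modelling idea summarized above (a minimal sketch, not the authors' implementation), the snippet below trains one SVM emotion classifier per word and falls back to a general model for words with too few training occurrences. The feature extraction, the word labels (e.g. from an ASR hypothesis), and the threshold `min_count` are all hypothetical placeholders.

```python
# Sketch: per-word SVM emotion models with a general-model fallback.
# All names (features, words, emotions, min_count) are illustrative only.
from collections import defaultdict
import numpy as np
from sklearn.svm import SVC

def train_unit_models(features, words, emotions, min_count=20):
    """Train one SVM per word unit; words with too few samples share a general model."""
    general = SVC(kernel="linear").fit(features, emotions)  # general (unit-independent) model
    by_word = defaultdict(list)
    for i, w in enumerate(words):
        by_word[w].append(i)
    unit_models = {}
    for w, idx in by_word.items():
        # Only train a dedicated model if the word occurs often enough
        # and covers more than one emotion class.
        if len(idx) >= min_count and len({emotions[i] for i in idx}) > 1:
            unit_models[w] = SVC(kernel="linear").fit(
                features[idx], [emotions[i] for i in idx]
            )
    return general, unit_models

def predict(general, unit_models, feat_vec, word):
    """Use the word-specific model if available, otherwise the general one."""
    model = unit_models.get(word, general)
    return model.predict(np.asarray(feat_vec).reshape(1, -1))[0]
```

The design choice mirrored here is the fallback: unit-specific models are only preferable when the training corpus contains enough occurrences of that unit, which is the condition the abstract attaches to the reported improvement.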