Description
Despite the importance of music in multimedia across many cultures and for the better part of human history (from Gamelan performances, to 19th-century Italian opera, through to today), it remains a mystery why humans prefer one music-video pairing over another. We present a novel dataset of human-annotated music videos. Our hope is that this dataset can serve as a springboard for a new vein of research into audio-visual correspondences in the context of music and video, in which no assumptions are made from the outset about which audio-visual features are implicated in human cross-modal judgments. We also sketch some approaches to learning these correspondences directly from the data in an end-to-end manner using contemporary machine learning methods, and present preliminary results. In particular, we describe a model, a three-stream audio-visual convolutional network, that predicts these human judgments. Our primary contribution is the dataset itself: videos paired with a variety of music samples, for which we obtained human aesthetic judgments (ratings of the degree of “fit” between the music and video).
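
As a rough illustration of the kind of end-to-end model mentioned above, the sketch below shows a minimal three-stream audio-visual convolutional network in PyTorch. The description does not specify the inputs to the three streams, their exact architectures, or the fusion scheme, so everything in the sketch is an assumption: the streams are taken, hypothetically, to encode RGB frames, a motion representation (e.g. stacked frame differences), and a log-mel spectrogram, fused by concatenation into a single predicted "fit" rating.

# Minimal sketch of a three-stream audio-visual ConvNet (assumptions noted above).
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Conv -> BatchNorm -> ReLU -> 2x downsample, shared by all three streams.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class Stream(nn.Module):
    # A small 2-D convolutional encoder producing a fixed-size embedding.
    def __init__(self, in_channels, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 32),
            conv_block(32, 64),
            conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (B, 128, 1, 1)
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))


class ThreeStreamFitModel(nn.Module):
    # Fuses RGB, motion, and audio embeddings into one scalar "fit" rating.
    def __init__(self, embed_dim=128):
        super().__init__()
        self.rgb_stream = Stream(in_channels=3, embed_dim=embed_dim)     # RGB frames (assumed)
        self.motion_stream = Stream(in_channels=3, embed_dim=embed_dim)  # motion input (assumed)
        self.audio_stream = Stream(in_channels=1, embed_dim=embed_dim)   # log-mel spectrogram (assumed)
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 1),  # predicted fit rating
        )

    def forward(self, rgb, motion, audio):
        z = torch.cat(
            [self.rgb_stream(rgb), self.motion_stream(motion), self.audio_stream(audio)],
            dim=1,
        )
        return self.head(z).squeeze(1)


if __name__ == "__main__":
    # Toy shapes: a batch of 2 clips with 112x112 frames and 96x256 spectrogram patches.
    model = ThreeStreamFitModel()
    rgb = torch.randn(2, 3, 112, 112)
    motion = torch.randn(2, 3, 112, 112)
    audio = torch.randn(2, 1, 96, 256)
    print(model(rgb, motion, audio).shape)  # torch.Size([2])

Such a model could be trained with a standard regression loss (e.g. mean squared error) against the human fit ratings in the dataset; the actual training setup used for the preliminary results is not described here.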