Tim Bush – Data Scientist at Lynchpin.
Obviously, the best Christmas song ever written is Fairytale of New York. It contains despair, heartbreak and profanity, a bit like secret Santa gift giving at the Lynchpin office.
However, as far as I know, Shane MacGowan wrote this without the help of analytics or machine learning algorithms, so I think I can probably do better.
To begin with, I am going to use the interesting ‘Million Song Dataset’, which is a 280 GB database of song data hosted by Columbia University.
Million Song Dataset | scaling MIR research
This dataset is large and therefore complicated to access. However, an SQL database of some of the metadata of each track is more manageable. As a naive starting point, we can pull out tracks that have ‘Christmas’, ‘Santa Claus’, ‘Xmas’, ‘Reindeer’, ‘Jingle Bells’ or ‘Ho Ho Ho’ in the title. This gives us about 3136 songs to play with.
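As a rough sketch of that title filter: the table name `songs`, the column names and the tiny in-memory sample below are assumptions standing in for the real metadata file, so adjust them to whatever your copy of the database actually contains.

```python
import sqlite3

# Keywords from the text. SQLite's LIKE is case-insensitive for ASCII,
# so 'xmas' and 'XMAS' both match 'Xmas'.
KEYWORDS = ["Christmas", "Santa Claus", "Xmas", "Reindeer", "Jingle Bells", "Ho Ho Ho"]


def find_christmas_tracks(conn):
    """Return (title, artist_name, year) rows whose title contains any keyword."""
    where = " OR ".join("title LIKE ?" for _ in KEYWORDS)
    params = [f"%{kw}%" for kw in KEYWORDS]
    query = f"SELECT title, artist_name, year FROM songs WHERE {where} ORDER BY title"
    return conn.execute(query, params).fetchall()


def build_sample_db():
    """Create a small in-memory stand-in for the real metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (title TEXT, artist_name TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [
            ("White Christmas", "Bing Crosby", 1942),
            ("Last Christmas", "Wham!", 1984),
            ("Run Rudolph Run", "Chuck Berry", 1958),
            ("Merry Xmas Everybody", "Slade", 1973),
            ("Get Lucky", "Daft Punk", 2013),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in find_christmas_tracks(build_sample_db()):
        print(row)
```

Note that ‘Run Rudolph Run’ slips through the net, which is exactly why this is only a naive starting point.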

Christmas songs over time
If we look at the fraction of total songs that are Christmas songs over the past few decades, the results are a bit depressing. The fraction of total pop songs that are Christmas songs has been decreasing since the 1940s (the earliest Christmas song in the dataset is ‘White Christmas’ by Bing Crosby in 1942), and has basically been rubbish since the 1950s.
Clearly the time is ripe for some Data Science innovation in the Christmas genre.
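The decade-by-decade fraction behind that trend could be computed along these lines. This is only a sketch: the `songs` table, its `title` and `year` columns, the convention that a year of 0 means ‘unknown’, and the placeholder rows are all assumptions, and it uses plain SQL rather than pandas to keep it short.

```python
import sqlite3

# Same naive title filter as in the text, written as a SQL condition.
CHRISTMAS_FILTER = " OR ".join(
    f"title LIKE '%{kw}%'"
    for kw in ["Christmas", "Santa Claus", "Xmas", "Reindeer", "Jingle Bells", "Ho Ho Ho"]
)


def christmas_fraction_by_decade(conn):
    """Return {decade: fraction of that decade's songs with a Christmas title}."""
    query = f"""
        SELECT (year / 10) * 10 AS decade,
               AVG(CASE WHEN {CHRISTMAS_FILTER} THEN 1.0 ELSE 0.0 END) AS fraction
        FROM songs
        WHERE year > 0          -- assumption: 0 marks a missing year
        GROUP BY decade
        ORDER BY decade
    """
    return dict(conn.execute(query).fetchall())


def build_sample_db():
    """In-memory toy data; the 'Placeholder' rows are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?)",
        [
            ("White Christmas", 1942),
            ("Placeholder A", 1947),
            ("Last Christmas", 1984),
            ("Placeholder B", 1985),
            ("Placeholder C", 1986),
            ("Placeholder D", 1988),
            ("Unknown Xmas Tune", 0),  # dropped: no usable year
        ],
    )
    return conn


if __name__ == "__main__":
    for decade, fraction in christmas_fraction_by_decade(build_sample_db()).items():
        print(decade, fraction)
```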

Which artist should record a Christmas song?
Looking at the ‘hotttnesss’ score of each artist (a popularity measure supplied with the dataset by The Echo Nest), we can see that artists who have released Christmas songs are more popular than the average artist (we can only speculate whether this is correlation or causation).
In terms of artist selection, the most popular artist by this measure who has NOT yet released a Christmas song is Daft Punk*, so this is surely the only logical choice for our artist.
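A minimal sketch of that artist selection, assuming a `songs` table with `title`, `artist_name` and `artist_hotttnesss` columns. The scores and placeholder rows in the sample are invented purely so the query has something to run against.

```python
import sqlite3

KEYWORDS = ["Christmas", "Santa Claus", "Xmas", "Reindeer", "Jingle Bells", "Ho Ho Ho"]


def hottest_artist_without_christmas_song(conn):
    """Return (artist_name, hotttnesss) for the top artist with no Christmas title."""
    title_filter = " OR ".join("title LIKE ?" for _ in KEYWORDS)
    query = f"""
        SELECT artist_name, MAX(artist_hotttnesss) AS hotttnesss
        FROM songs
        WHERE artist_name NOT IN (
            SELECT artist_name FROM songs WHERE {title_filter}
        )
        GROUP BY artist_name
        ORDER BY hotttnesss DESC
        LIMIT 1
    """
    return conn.execute(query, [f"%{kw}%" for kw in KEYWORDS]).fetchone()


def build_sample_db():
    """In-memory toy data; the hotttnesss values are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs (title TEXT, artist_name TEXT, artist_hotttnesss REAL)"
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [
            ("Last Christmas", "Wham!", 0.95),
            ("Club Tropicana", "Wham!", 0.95),
            ("Get Lucky", "Daft Punk", 0.90),
            ("Placeholder Song", "Placeholder Artist", 0.40),
        ],
    )
    return conn


if __name__ == "__main__":
    print(hottest_artist_without_christmas_song(build_sample_db()))
```

The subquery removes every artist with at least one Christmas-titled track, so a hotter artist who has already gone festive is skipped in favour of the next one down.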

How long should a Christmas song be?
The honour of the longest Christmas-related song goes to Charlie Daniels for ‘A Carolina Christmas Carol’, which is an astonishing 16.5 minutes of festive banter. In terms of very short songs, a notable number are by ‘The Wiggles’ from their 26-song album, ‘A Wiggly Wiggly Christmas’. Looking at the mean song duration, it appears that Christmas songs are shorter than average.

What about the lyrics?
Clearly the most important part. I am going to use the Natural Language Toolkit (nltk) in Python to see if we can get an idea of the lyrical content of Christmas songs. I took the lyrics from the 50 best Christmas songs according to Time Out magazine (http://www.timeout.com/london/music/the-50-best-christmas-songs). For comparison, I also included lyrics from notable Christmas carols (Traditional Christmas Carol Lyrics).
The graph below shows the similarity of these different song types by lyrical content; each datapoint represents a song, and the closer two points are, the more lyrically similar the songs are (this is known as a multidimensional scaling, or MDS, plot).
The technical details are described more fully at the end. I used a clustering algorithm that we use at Lynchpin for customer segmentation; in that case each datapoint could represent the combination of products purchased by a customer, and the clustering would be based on some attribute of that customer (it could even be based on the results of performing a similar text-mining analysis on the results of a customer survey).
[Figure: christmas-song-clusters]
There are three clear clusters here. Christmas Pop songs (red and yellow stars) and Christmas Carols (purple stars) can be clearly distinguished by lyrical content. Interestingly, ‘The Wassailing Song’ by Blur is close to being classified as a Christmas Carol (which makes sense because it is based on a traditional Christmas song). The top words in each of these categories are as follows:

Christmas Carols: King, born, Christ, come, angels, night, heavenly, singing, Lord, God, peacefully, little, Earth, love
Santa Songs: Santa, Oh, come, got, like, know, long, little, Christmas, let, just, snow, trees, night, good
NO Santa Songs: Christmas, time, baby, just, years, like, snow, got, let, Oh, happy, day, love, singing, trees

Basically, Christmas carols are more religious, and there are two types of Christmas pop song:

  • Santa songs: Songs that mention Santa (such as ‘Santa Claus Is Coming To Town’ and ‘Santa Claus Go Straight To The Ghetto’)
  • No Santa songs: Songs that do not mention Santa (such as ‘I wish it could be Christmas every day’ and ‘Last Christmas’).

I am just going to generate random sequences of words from the NO Santa list until it spits out something that I decide sounds nice. What comes out gives a basic indication of a recipe for success, but that recipe needs refining:

Happy Snow Time 
Christmas singing years
Trees snow Christmas
happy years Oh years baby years
baby let time Christmas singing
(*Repeat ad infinitum)
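Random word sequences like the ones above could be produced with something as simple as the following. The function name, line count and line lengths are my own illustrative choices; the word list is the NO Santa list from earlier.

```python
import random

# Top words for the NO Santa cluster, as listed earlier in the post.
NO_SANTA_WORDS = [
    "Christmas", "time", "baby", "just", "years", "like", "snow", "got",
    "let", "Oh", "happy", "day", "love", "singing", "trees",
]


def random_lines(words, n_lines=5, min_words=3, max_words=6, seed=None):
    """Build n_lines lines, each a random-length run of words drawn with replacement."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_lines):
        length = rng.randint(min_words, max_words)
        lines.append(" ".join(rng.choice(words) for _ in range(length)))
    return lines


if __name__ == "__main__":
    # Re-run (or change the seed) until something sounds nice.
    for line in random_lines(NO_SANTA_WORDS):
        print(line)
```

Words are drawn with replacement, which is how a line like ‘happy years Oh years baby years’ can come out.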

So how do we form a song out of this? Well, at Lynchpin, we understand that the results of analytics need to be interpreted based on your Business Objectives before they make any sense. A list of random words generated by an algorithm does not make a song, just like a list of numbers without any interpretation is not a business outcome.

So with that in mind, we proudly present:

Happy Snow Time – It’s Christmas Baby (A NO Santa Christmas Song by Tim Bush and performed by Daft Punk)
The trees are singing
Just like all the years

When they got covered with snow
On Christmas Day oh
Happy snow time
It’s Christmas baby
I just want to let you know
Before I met you
I was a naked tree
Without a covering of snow
With nothing to sing for
Baby never let me go
Happy snow time
It’s Christmas baby
Happy snow time
It’s Christmas baby

Next year
The Million Song Dataset also includes information on the musical content of each song, such as tempo and time signature. It would be interesting to do a Data Science project about the musical structure of Christmas songs using this information.

Technical Details of Song Lyric Clustering
Here is a basic qualitative description of how I clustered songs by their lyrical content and then generated the plot that you saw earlier. This was all done in Python:

  • Load txt files of song titles and song lyrics as lists.
  • Use TfidfVectorizer from scikit-learn, which transforms text into feature vectors that can then be used as inputs to numerical algorithms. This step can also remove words which are known as ‘stopwords’ (basically unimportant and boring words such as ‘our’ and ‘from’). Importantly, the words are also stemmed, meaning that variant forms of a word are reduced to their root (e.g. loved, loving and love will all be reduced to love). TfidfVectorizer does not stem on its own, so this needs a custom tokenizer, for example one built on an nltk stemmer.
  • Run a k-means clustering algorithm on the results.
  • Generate an MDS plot using some similarity measure.
  • Colour the datapoints in the MDS plot by the cluster they were assigned to.
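The steps above can be sketched roughly as follows. This is not the exact code behind the plot: the choice of cosine distance as the similarity measure, three clusters, the function name and the toy ‘lyrics’ (just strings of the top words from each category) are all assumptions, and stemming is left out to keep the example short.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances


def cluster_lyrics(lyrics, n_clusters=3, seed=0):
    """Return (cluster labels, 2-D MDS coordinates) for a list of lyric strings."""
    # Text -> TF-IDF feature vectors, dropping English stopwords.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(lyrics)

    # k-means on the TF-IDF vectors.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(tfidf)

    # Pairwise cosine distances (an assumed similarity measure), symmetrised
    # to guard against floating-point asymmetry before handing them to MDS.
    dist = cosine_distances(tfidf)
    dist = (dist + dist.T) / 2

    # MDS places each song in 2-D so that distances are roughly preserved.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    coords = mds.fit_transform(dist)
    return labels, coords


# Toy stand-ins for real lyrics, built from the top words of each category.
TOY_LYRICS = [
    "King born Christ angels night heavenly",
    "Lord God angels singing heavenly King",
    "Christ born Lord Earth peacefully night",
    "Santa snow trees Santa night long",
    "Santa know Santa come snow good",
    "Santa little trees Santa long night",
    "Christmas baby happy years time",
    "baby love Christmas time happy day",
    "years time baby Christmas love singing",
]

if __name__ == "__main__":
    labels, coords = cluster_lyrics(TOY_LYRICS)
    for text, label, (x, y) in zip(TOY_LYRICS, labels, coords):
        print(label, round(float(x), 2), round(float(y), 2), text)
```

To colour the plot by cluster, the labels can be passed straight to matplotlib, e.g. `plt.scatter(coords[:, 0], coords[:, 1], c=labels)`.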

Technical Details of The Analysis of Christmas Songs Over Time
This was all done using the sqlite3 library within Python. I basically wrote SQL queries to pull out Christmas songs, put the results in a pandas dataframe, and then plotted the results using matplotlib.