An eclectic Spotify library, visualized with Chartify

I’ve been collecting songs in my Spotify library for a good few years now, so it’s fairly representative of my more recent tastes. After looking for tools to get detailed information on my library for analysis purposes, I ran across this Medium post by Dimitris Spathis on visualizing music tastes with data gathered from Sort Your Music. Just connect your Spotify account, choose a playlist, and receive for each song its:

Title and Artist
Release date
BPM (beats per minute) - speed/tempo
Energy level
Danceability
Loudness - in decibels
Valence - positive mood/positivity
Length - duration
Acousticness
Popularity

(Actual table column names bolded.)

I created a playlist of my entire library, copied and saved the table as a CSV, and voilà! Ready for cleaning and visualization. Unfortunately, in this process I did lose some title/artist names that were not in Latin script; however, these don’t affect anything for the current project.

Even without the overwhelming number of anime and alt rock songs from my younger years, the genres represented in these 756 songs are pretty eclectic, and encompass everything from jazzhop to indie pop, from classical to hard rock, from folk music to electronica. Though language isn’t reported by Sort Your Music, I know my music is also multilingual, including Brazilian Portuguese, Turkish, Japanese, French, and Russian on top of English. It would be nice to know what, out of all of this variety, I gravitate to most.

Coincidentally, as I was preparing to work on this project, Spotify released a new wrapper for Bokeh called Chartify that is meant to make visualization a breeze. I figured I’d give both Spathis’s approach and Chartify’s new functionality a go at once, using my own dataset.

Data prep

Data cleaning for this project wasn’t terribly difficult, considering much of the dataframe is made up of simple integers. I reduced the release date to the release year, and converted song length into duration in seconds. After seeing an anomalous year in the data (1900??), I corrected it to the release year for that recording. Danceability, valence/positivity, acousticness, and popularity were all measured on a 0-100 scale, making quick comparisons possible even in the dataframe. On to exploring the data!
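A minimal sketch of what this cleaning step could look like in pandas. The column names (Title, Release, Length), the date and "m:ss" length formats, and the corrected year are all assumptions made for illustration; the real export may differ.

```python
import pandas as pd

# Tiny stand-in for the exported table. Column names and formats are assumed.
raw = pd.DataFrame({
    "Title": ["Song A", "Song B", "Song C"],
    "Release": ["2005-03-14", "1900-01-01", "2015-10-02"],
    "Length": ["4:02", "3:15", "10:41"],
})


def length_to_seconds(text: str) -> int:
    """Convert an 'm:ss' string into a whole number of seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


songs = raw.copy()
songs["year"] = pd.to_datetime(songs["Release"]).dt.year
songs["duration_s"] = songs["Length"].map(length_to_seconds)

# Patch placeholder years by hand. The title and year here are hypothetical.
corrections = {"Song B": 1987}
for title, year in corrections.items():
    songs.loc[songs["Title"] == title, "year"] = year

print(songs[["Title", "year", "duration_s"]])
```

Keeping the manual fixes in a small dictionary makes it easy to see later which values were overridden and why.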

Exploration

Next I turned to the trusty histogram for looking at distributions of our song variables, and Chartify makes combining them with KDE plots quite easy. Functionality for Bokeh’s gridplots and layouts isn’t supported in Chartify yet, so to display multiple plots at once you can manually save and organize plots together in a photo editor (e.g. Illustrator) or use this workaround with print screen in the meantime. It’s currently unclear how to format years on the x-axis, so they’ll contain commas until further documentation is released. Additionally, the KDE overlay seems to experience a minor bug, becoming a solid color when opacity should be lower.

years_plot.png
other_vars1.png
Means marked with a black vertical line.


My music tastes are largely a few years behind the times, with peaks forming around 2005 and 2015, both time periods containing full albums rather than my usual ‘pick and choose’ approach to songs. The full time range begins with John Coltrane’s “Naima” and runs to 2018, mostly due to jazzhop playlists. The tempo of my music is middling, as are danceability and positivity. My songs average 242 seconds (4 minutes 2 seconds), generally a bit shorter than the average current radio song length, and they’re quite loud, somewhat acoustic, and not very popular. Sounds about right.

It would also be nice to know which variables are related to one another within my library. Below are the highest correlations in hexplot form.

corr_plots.png

These correlations make pretty intuitive sense — energy and loudness go together while they both oppose acousticness, and danceability is related to positive mood.
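For anyone wanting to rank the pairs before plotting them, here is a rough sketch using pandas. The data is synthetic and the column names are assumptions; only the pattern (correlation matrix, upper triangle, sort by absolute value) is the point.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, built so that energy tracks loudness and opposes
# acousticness. Column names are assumed, not taken from the real export.
rng = np.random.default_rng(0)
n = 200
energy = rng.uniform(0, 100, n)
features = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 0.15 * energy + rng.normal(0, 1.5, n),
    "acousticness": 100 - energy + rng.normal(0, 15, n),
    "bpm": rng.uniform(60, 180, n),
})

corr = features.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value matters here because a strong negative relationship, such as energy against acousticness, is just as interesting as a strong positive one.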

Modelling

Now to the meat of it: what might a recommender system base its decisions on when picking new songs for me?

To find this out, we need to look at this data ‘cloud’ made up of many dimensions. No need to break our brains attempting to understand that many dimensions: using a well-established method called PCA (Principal Component Analysis), we can flatten this cloud into 2 dimensions (an x and y) that we can plot.

Clickthrough for zoomable version.


The dimensionality reduction left us with a tangle of song titles, but now we can see which songs are closer together and therefore more similar. Outliers are also obvious now.
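To make the flattening step concrete, here is a small PCA sketch written with plain NumPy. It runs on a made-up feature matrix and is not the code behind the plot above; scikit-learn's PCA class gives the same kind of projection with less typing.

```python
import numpy as np

# Stand-in feature matrix: 100 songs x 5 audio features (made-up values).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # PCA works on centered data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # directions of greatest variance
    coords = Xc @ components.T                   # new x/y position of each song
    explained = (S ** 2) / np.sum(S ** 2)        # share of variance per component
    return coords, components, explained[:n_components]


coords, components, explained = pca_project(X)
print(coords.shape)   # one (x, y) pair per song
print(explained)      # how much of the spread each axis keeps
```

The explained-variance values are worth checking before reading too much into the 2D picture: if the two axes keep only a small share of the spread, songs that look close on the plot may still be far apart in the full feature space.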

We can use an SVM (Support Vector Machine) classifier on this dataset, combined with a contour plot, to show which songs are most influential in determining new recommendations.

Each point is a song, and songs become increasingly influential (brighter) until the gold standard, the recommendation area, is found in the magenta spaces.
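A plot like this can be produced by fitting an SVM on the 2D coordinates and evaluating its decision function over a grid. The sketch below assumes a one-class SVM from scikit-learn, since a library of liked songs has only one class; the post does not say which SVM variant was actually used, and the data, gamma, and nu values are made up.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in for the 2-D PCA coordinates: one dense cluster plus a few outliers.
rng = np.random.default_rng(7)
dense = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(-8, 8, size=(10, 2))
coords = np.vstack([dense, outliers])

# Assumed model and hyperparameters, chosen only for illustration.
model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(coords)

# Score every point on a grid; these values are what a contour plot would draw.
xx, yy = np.meshgrid(np.linspace(-9, 9, 60), np.linspace(-9, 9, 60))
grid = np.column_stack([xx.ravel(), yy.ravel()])
scores = model.decision_function(grid).reshape(xx.shape)

# Support vectors are the songs that shape the boundary.
support_points = coords[model.support_]
print(scores.shape, len(support_points))
```

Higher scores mark regions the model considers typical of the library, so a new song landing there would be a candidate recommendation. The `scores` array can be handed to any contour plotting function.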


Our outliers on the left have a surprising amount of influence. It turns out that the left cluster is made up of short songs, and the small groups to the right are incredibly long songs, notably Irish folk songs and Coheed & Cambria. This suggests that our new x-axis from PCA is driven largely by song duration.
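One plausible explanation, assuming the features were not standardized before PCA (the post does not say either way), is that duration in seconds simply has a much larger spread than the 0-100 features. The sketch below uses made-up numbers to show how an unscaled column can take over the first component.

```python
import numpy as np

# Made-up features: duration in seconds varies far more than the 0-100 scores.
rng = np.random.default_rng(1)
n = 300
duration_s = rng.normal(240, 90, n)
danceability = rng.uniform(0, 100, n)
energy = rng.uniform(0, 100, n)
names = ["duration_s", "danceability", "energy"]
X = np.column_stack([duration_s, danceability, energy])


def first_component(X):
    """Return the loadings (unit vector) of the first principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


raw_pc1 = first_component(X)                      # features left on their own scales
scaled_pc1 = first_component(X / X.std(axis=0))   # every feature given unit variance

dominant_raw = names[int(np.argmax(np.abs(raw_pc1)))]
print("raw loadings:   ", dict(zip(names, raw_pc1.round(2))))
print("scaled loadings:", dict(zip(names, scaled_pc1.round(2))))
print("dominant feature without scaling:", dominant_raw)
```

If duration dominating the axis is not what you want, dividing each column by its standard deviation before PCA puts all features on an equal footing.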

The very dense cluster toward the bottom of the magenta recommendation space is made up of 90s pop, indie music, and alternative rock favorites. This model would create fantastic recommendations for me, if shown new songs.


Initial review of Chartify

I always enjoy learning new tools, especially when they’ll make plotting in Python more intuitive. Chartify certainly accomplishes that, getting rid of many of the esoteric parts of working with Matplotlib and simplifying Bokeh. Creating charts is relatively straightforward once you get the grammar of it; the method chaining is nice too, and the defaults for each figure/line type are aesthetically pleasing.

That said, the tool is new and very much in active development, so a lot of features found in more mature libraries are not yet available. It took a lot of staring at their examples notebook and even sorting through source code to figure out what I could and couldn’t do within the library itself, and what I needed to try and access through Bokeh. Manually changing the opacity of a figure in a plot is a lost cause, for the time being.

I’m definitely looking forward to seeing where this library takes visualization in Python.


Full code available on Github.