I have been toying around with Kaggle’s Million Song Dataset Challenge recently because I have some interest in collaborative filtering (using matrix factorization). I haven’t made much progress with the competition (all 3 of my submissions are below the baseline), but I have learned a few things about dealing with large amounts of data.

The goal of the competition is to predict the 500 most likely songs each of 110,000 users will listen to next. As the name implies, there are 1,000,000 songs in the full dataset. To simplify things, I decided to concentrate on the most popular songs. I created a 110,000 x 2,000 matrix of 0’s and 1’s. Row i, column j is 1 if user i had listened to song j (the jth most popular song) and 0 if user i had not. As you can imagine, there are a lot more 0’s than 1’s in this matrix. The first few rows and columns look like this:

This matrix was about 430 Mb and took a while to load into MATLAB. So I wisened up and created a sparse matrix. A sparse matrix realizes that most of the values are 0’s and does not record them. Instead, it lists the locations of the non-zero elements and what the value is. For example, this is what the first few rows of the sparse matrix looks like:

The first column is the row number, the second is the column number, and the third is the value at that location. In this application, all the values are 1. For this matrix, I used the 50,000 most popular songs (instead of just 2,000), and the size was much smaller – just 17 Mb.

It is easy to load the sparse matrix into MATLAB with the spconvert command, and many of MATLAB’s functions (like singular value decomposition) are optimized for sparse matrices.

The goal of the competition is to predict the 500 most likely songs each of 110,000 users will listen to next. As the name implies, there are 1,000,000 songs in the full dataset. To simplify things, I decided to concentrate on the most popular songs. I created a 110,000 x 2,000 matrix of 0’s and 1’s. Row i, column j is 1 if user i had listened to song j (the jth most popular song) and 0 if user i had not. As you can imagine, there are a lot more 0’s than 1’s in this matrix. The first few rows and columns look like this:

0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 …

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

…

This matrix was about 430 Mb and took a while to load into MATLAB. So I wisened up and created a sparse matrix. A sparse matrix realizes that most of the values are 0’s and does not record them. Instead, it lists the locations of the non-zero elements and what the value is. For example, this is what the first few rows of the sparse matrix looks like:

1 3 1

1 7 1

1 10 1

1 13 1

1 82 1

1 717 1

2 1111 1

2 2972 1

2 3516 1

1 7 1

1 10 1

1 13 1

1 82 1

1 717 1

2 1111 1

2 2972 1

2 3516 1

…

The first column is the row number, the second is the column number, and the third is the value at that location. In this application, all the values are 1. For this matrix, I used the 50,000 most popular songs (instead of just 2,000), and the size was much smaller – just 17 Mb.

It is easy to load the sparse matrix into MATLAB with the spconvert command, and many of MATLAB’s functions (like singular value decomposition) are optimized for sparse matrices.