Craigslove

Love in the time of Craigslist

Blogpost #1

This week, we used Python's scikit-learn (sklearn) library to assign TF-IDF weights to all of the Craigslist personal ads we've scraped, and then ran k-means clustering on them. So far, we've clustered only the Providence posts, because the combined data from all the cities was too large to process (for now). The results of 2-means clustering (i.e., the top terms for each of the 2 clusters) are as follows. The first cluster appears to correspond to ads for low-commitment encounters, and the second to ads for deeper relationships:


top terms for cluster 0:

top terms for cluster 1:


Below are the results from another run, this time with 5-means clustering, along with a corresponding word-cloud visualisation. There is moderate overlap between clusters, and the distinctions between them are less well-defined than in the 2-cluster case:


top terms for cluster 0:

top terms for cluster 1:

top terms for cluster 2:

top terms for cluster 3:

top terms for cluster 4:



We've encountered some challenges working with the entirety of our Craigslist data. When we attempted to cluster the full data set, the machine running our code didn't have enough RAM. After consulting our mentor Vinh, we are considering options such as a sparse matrix representation, which should need far less memory and computation given how sparse our data set is, or principal component analysis to reduce its dimensionality. More simply, we're also going to see whether switching to a machine with more RAM is sufficient for our needs.
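
To make the options above concrete, here is a rough sketch of how a sparse representation and dimensionality reduction might fit together in scikit-learn. Several things in it are assumptions rather than decisions we've made: it uses `TruncatedSVD` as a stand-in for PCA because it accepts sparse input directly, it swaps in `MiniBatchKMeans` as a lower-memory variant of k-means, and the `posts` list, `max_features` cap, and component count are placeholders chosen for the example.

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-in posts; the real input would be ads from every city.
posts = [
    "looking for casual fun tonight no strings",
    "casual fun tonight, discreet, no strings attached",
    "fun tonight, casual and discreet",
    "no strings, just casual fun",
    "seeking a long term relationship and someone to love",
    "want a real relationship, love, and long term commitment",
    "hoping to find love and a lasting relationship",
    "long term commitment and love wanted",
]

# TfidfVectorizer returns a SciPy sparse matrix, so only non-zero weights are stored.
# max_features caps the vocabulary size (an assumed value, not a tuned one).
X = TfidfVectorizer(stop_words="english", max_features=50000).fit_transform(posts)

# Compare the memory a dense array would need with what the sparse matrix uses.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# TruncatedSVD reduces dimensionality without densifying the input.
# Re-normalising rows afterwards keeps distances comparable between posts.
reducer = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_reduced = reducer.fit_transform(X)

# MiniBatchKMeans processes the data in small batches rather than all at once.
km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit(X_reduced)
print(km.labels_)
```

On a toy data set like this the memory savings are negligible; the gap between the dense and sparse figures grows with the number of posts and the size of the vocabulary.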

Looking ahead, we plan to try different clustering algorithms to see whether some perform better than others. We also hope to run those algorithms on larger data sets, as it would be interesting to compare clusters across different geographic regions. Ultimately, we'd like to build a simple front end for our project through which to display our findings.
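
As a rough idea of what comparing clustering algorithms could look like, the sketch below scores two scikit-learn algorithms on the same TF-IDF matrix. This is only one possible approach and everything specific in it is an assumption: we haven't settled on which algorithms to try or how to judge them, the silhouette score is just one commonly used internal measure, and the `posts` list is made-up stand-in text.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in posts; the real input would be the scraped ad texts.
posts = [
    "looking for casual fun tonight no strings",
    "casual fun tonight, discreet, no strings attached",
    "no strings, just casual fun",
    "seeking a long term relationship and someone to love",
    "want a real relationship, love, and long term commitment",
    "hoping to find love and a lasting relationship",
]

# AgglomerativeClustering needs a dense array, which is fine for a toy example
# but would not scale to the full data set.
X = TfidfVectorizer(stop_words="english").fit_transform(posts).toarray()

candidates = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[name] = silhouette_score(X, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

A score like this only measures how compact and well separated the clusters are; whether the clusters are meaningful would still need a look at the top terms for each one.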