Craigslove by april1452

Blogpost #1

This week, we used Python's sklearn library to assign TF-IDF weights to all of the craigslist personal ads we've scraped, and then ran k-means clustering on the weighted documents. So far, we've only clustered the Providence posts, because the combined data from all the cities was too big to process (for now). The results of 2-means clustering (i.e., the top terms for each of the 2 clusters) are as follows. The first cluster appears to correspond to ads for low-commitment encounters, and the second to ads for deeper relationships:

top terms for cluster 0:

host
clean
cock
stats
suck
discreet
free
nice

top terms for cluster 1:

love
fun
time
interested
meet
nice
email
real

Below are the results from another run, this time of 5-means clustering, along with a corresponding word-cloud visualisation. There is moderate overlap from one cluster to another (for example, "host" appears among the top terms of all five clusters), and the distinctions between them are less well-defined than in the 2-means run:

top terms for cluster 0:

discreet
clean
host
masculine
safe
ddf

top terms for cluster 1:

top
bottom
host
stats
fuck
oral

top terms for cluster 2:

cock
suck
stats
big
nice
host

top terms for cluster 3:

dick
suck
big
love
host
ass

top terms for cluster 4:

host
stats
fun
ddf
clean
white
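
One rough way to put a number on the overlap between these five lists is the Jaccard index (shared terms divided by total distinct terms) for each pair of clusters. The sketch below uses the top terms exactly as listed above. The choice of Jaccard and the helper name `jaccard` are assumptions for illustration, not something the project itself computed.

```python
from itertools import combinations

# Top terms per cluster, copied from the 5-means lists above.
top_terms = {
    0: {"discreet", "clean", "host", "masculine", "safe", "ddf"},
    1: {"top", "bottom", "host", "stats", "fuck", "oral"},
    2: {"cock", "suck", "stats", "big", "nice", "host"},
    3: {"dick", "suck", "big", "love", "host", "ass"},
    4: {"host", "stats", "fun", "ddf", "clean", "white"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


# Terms that appear in every cluster's list.
shared_by_all = set.intersection(*top_terms.values())
print("in every cluster:", sorted(shared_by_all))

# Pairwise overlap, highest first.
pairs = sorted(
    combinations(top_terms, 2),
    key=lambda p: jaccard(top_terms[p[0]], top_terms[p[1]]),
    reverse=True,
)
for i, j in pairs:
    shared = sorted(top_terms[i] & top_terms[j])
    score = jaccard(top_terms[i], top_terms[j])
    print(f"clusters {i} and {j}: {score:.2f} {shared}")
```

This only compares the short top-term lists and ignores the TF-IDF weights themselves, so it is a coarse indicator at best.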

We've run into some challenges working with the entirety of our craigslist data. When we attempted to cluster the full data set, the machine running our code didn't have enough RAM for operations of that magnitude. After consulting our mentor Vinh, we are considering options such as a sparse matrix implementation, which should need less memory and computation given the high sparsity of our data set, or principal component analysis to reduce its dimensionality. More simply, we're also going to see whether switching to a machine with more RAM is sufficient for our needs.
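
To make the two options above concrete, here is a hedged sketch that keeps the TF-IDF matrix sparse and reduces its dimensionality before clustering. Note the substitution: it uses scikit-learn's `TruncatedSVD` rather than PCA proper, because classic PCA centers the data, which can turn a sparse matrix dense, while `TruncatedSVD` works on sparse input directly. The function name `cluster_reduced`, the component count, and the toy documents are assumptions for illustration only.

```python
from scipy.sparse import issparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_reduced(docs, n_clusters=2, n_components=2, seed=0):
    """TF-IDF (sparse) -> TruncatedSVD -> k-means on the reduced vectors."""
    # TfidfVectorizer returns a SciPy sparse matrix, which stores only the
    # non-zero weights instead of a full documents-by-terms array.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Project onto a few components, then re-normalise each row to unit
    # length so Euclidean k-means behaves more like cosine similarity.
    reducer = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = reducer.fit_transform(tfidf)

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(reduced)
    return tfidf, reduced, labels


# Toy stand-in documents (not real ads).
docs = [
    "looking for love and a real relationship",
    "want to meet someone real for love",
    "hiking and camping this weekend",
    "camping trip and hiking partner wanted",
]
tfidf, reduced, labels = cluster_reduced(docs)
rows, cols = tfidf.shape
print("sparse input:", issparse(tfidf))
print("fraction of non-zero entries:", tfidf.nnz / (rows * cols))
print("reduced shape:", reduced.shape)
```

On a real corpus the number of components would be much larger than 2 (a few hundred is a common starting point for text), and the right value would need to be found by experiment.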

Looking ahead, we plan to try different clustering algorithms to see whether some perform better than others. We also hope to run those algorithms on larger datasets, as it would be interesting to compare clusters across different geographic regions. Ultimately, we'd also like to build a simple front end for our project through which to display our findings.