Craigslove

Love in the time of Craigslist


Midterm Report

For the past four weeks, our group has been scraping posts from the Personals section of Craigslist in fourteen U.S. cities: New York, San Francisco, Los Angeles, Las Vegas, D.C., Denver, Seattle, Chicago, Minneapolis, Providence, Dallas, Oklahoma City, Miami, and Jacksonville. For each post, we collect the following data:

Using this data, we want to answer: What words do people use to indicate the types of relationships they want? What words do they use to describe themselves, and how do these compare to the words they use to describe ideal partners? What kinds of sentiments can we find in these posts, and can we find factors correlated with sentiment? By clustering posts, can we create general profiles of posters? What are the outliers in this case? How do profiles differ city to city, region to region, or by popular cultural perceptions of the city?

Visualization

Word Frequency in Craigslist Personal Ads

Machine Learning

As a first step toward building profiles, we try to predict the marital status of posters based on their posting type, the length of the title (in characters), the length of the text (in characters), age, body type, and height. Our training data is self-reported, so we assume that posters report their true marital status.

Data

We had 87,789 samples with 6 variables: 4 numerical and 2 categorical. The 2 categorical variables had 31 and 9 possible values, so we expanded them into 31 and 9 binary nominal (indicator) variables; together with the 4 numerical variables, this gave 44 variables in total. Our 6 variables are as follows:


Using these variables, we try to predict:

Method

We first clean and standardize the data so that the result is not heavily skewed by variables with larger scales. We then run weighted k-nearest neighbors (KNN) with 10 neighbors and Euclidean distance, weighting each neighbor's vote by the inverse of its squared distance. We use 5-fold cross-validation. The model's accuracy was 87.6%. We also tried other machine learning algorithms, such as decision trees and SVMs, but they reached only ~70% accuracy, likely because of the large number of nominal variables, which could have skewed some of the calculations in SVMs.
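The procedure described above can be sketched in plain Python. This is a minimal illustration, not the code we actually ran: the function names, the toy two-cluster data, and the placeholder labels "single" and "married" are all assumptions made for the example. It standardizes each column, predicts with a 10-neighbor vote weighted by inverse squared Euclidean distance, and estimates accuracy with 5-fold cross-validation.

```python
import math
import random
from collections import defaultdict


def standardize(rows):
    """Rescale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, x, k=10):
    """Weighted vote of the k nearest training points (weight = 1 / distance^2)."""
    nearest = sorted((math.dist(p, x), label)
                     for p, label in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / (dist * dist + 1e-12)  # epsilon guards exact matches
    return max(votes, key=votes.get)


def cross_val_accuracy(X, y, folds=5, k=10, seed=0):
    """Fraction of samples predicted correctly across `folds` held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]            # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in test)
    return correct / len(X)


# Toy data (hypothetical): two well-separated clusters with placeholder labels.
rng = random.Random(1)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
     + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(40)])
y = ["single"] * 40 + ["married"] * 40

accuracy = cross_val_accuracy(standardize(X), y)
print(f"5-fold accuracy on toy data: {accuracy:.3f}")
```

On the real data, the feature rows would be the 44 columns described in the Data section rather than two synthetic ones. Because the distance is computed over all columns at once, standardizing first keeps a wide-ranging variable such as text length from swamping the others.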

Challenges

One challenge of using this model is that not everyone posts an age, body type, or height. We could work around this by finding a way to accommodate the missing values. More generally, a challenge of using machine learning is picking something interesting to predict. We thought about predicting whether or not a post is looking for a relationship based on the title and text, but hand-classifying training data is tedious and, more importantly, unreliable, since intentions are often ambiguous or cannot be ascertained from the post (even by a human).

Discussion

The greatest difficulty in our project so far has come not in finding answers, but in finding the right questions to ask. We have a great wealth of data in the personal ads, but our efforts to find deeper meaning, a signal in the noise, are limited by the lack of causative information. That is, what we would have loved to have, and what is entirely absent from our data because of the confidential nature of Craigslist, is information on the interactions that result from each posting. Thus, it has been by no means a simple task to find questions of sufficient complexity and interest, like those we’ve ultimately chosen to pursue.

Our word frequency visualization allows us to draw some preliminary insights about who is using which words in their Craigslist personal ads.

Firstly, it's clear that posters in "man seeking man" use the words "guys", "clean", "host", and "cock" far more frequently than posters in any other category. It may be that the term "host" indicates that the poster is looking for an immediate hook-up, such as "I can host you at my place."

As for the term "love," female posters use it most often, both women seeking men and women seeking women. They are followed closely by male posters seeking women. From our visualization so far, we are not willing to draw any conclusions about who is truly seeking love on Craigslist, but we look forward to investigating further.
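A comparison like the one above depends on normalizing counts, since categories with more posts would otherwise dominate. The sketch below shows one way to do that, as occurrences per 1,000 words within each category. It is an illustration under assumptions: the function name `word_rates`, the category codes, the tokenizing rule, and the three made-up posts are not taken from our data or our visualization code.

```python
import re
from collections import Counter


def word_rates(posts):
    """Return {category: {word: occurrences per 1,000 words in that category}}."""
    counts = {}
    totals = Counter()
    for category, text in posts:
        # Lowercase and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.setdefault(category, Counter()).update(words)
        totals[category] += len(words)
    return {
        category: {word: 1000 * n / totals[category]
                   for word, n in counter.items()}
        for category, counter in counts.items()
    }


# Made-up example posts (hypothetical), as (category, text) pairs.
sample_posts = [
    ("w4m", "Looking for love and someone who will love to travel"),
    ("m4w", "Looking for someone to share love and laughter"),
    ("m4m", "Clean guys only, I can host tonight"),
]

rates = word_rates(sample_posts)
for category in sorted(rates):
    print(category, round(rates[category].get("love", 0.0), 1))
```

Rates computed this way can be compared directly across posting categories, whatever the number of posts in each.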

Please see Visualization and Machine Learning sections above.

We are still occasionally getting blocked by Craigslist servers, so the posts we have collected are not evenly sampled over time. This means that we will not be able to draw accurate conclusions about when people post, e.g. time of day or day of the week.

With a little over a month left, we still want to 1) cluster posts to create profiles, 2) answer how people describe themselves compared to how they describe their ideal partners, 3) perform sentiment analysis, and 4) make visualizations! We believe the timeline is feasible.

Emphatically yes.