Craigslove by april1452

Midterm Report

For the past four weeks, our group has been scraping posts from the Personals section of Craigslist in fourteen U.S. cities: New York, San Francisco, Los Angeles, Las Vegas, D.C., Denver, Seattle, Chicago, Minneapolis, Providence, Dallas, Oklahoma City, Miami, and Jacksonville. For each post, we collect the following data:

url
city
subcity
datetime created
datetime updated
category
type
title
text
age
body
body art
diet
dislikes
drinks
drugs
education
ethnicity
eye color
facial hair
fears
hair
height
hiv/hsv/hpv
interests
kids, have
kids, want
likes
native language
occupation
personality
pets
politics
religion
resembles
smokes
status
weight
zodiac

Using this data, we want to answer: What words do people use to indicate the types of relationships they want? What words do they use to describe themselves, and how do these compare to the words they use to describe ideal partners? What kinds of sentiments can we find in these posts, and which factors correlate with sentiment? By clustering posts, can we create general profiles of posters? What are the outliers in this case? How do profiles differ city to city, region to region, or by popular cultural perceptions of the city?

Visualization

Word Frequency in Craigslist Personal Ads

Machine Learning

As a first step toward building profiles, we try to predict the marital status of posters based on their posting type, the length of the title (in characters), the length of the text (in characters), age, body type, and height. Our training data is self-reported, so we assume that people are posting their true marital status.

Data

We had 87,789 samples with 6 variables: 4 numerical and 2 categorical. The 2 categorical variables had 31 and 9 possible values, so we expanded them into 31 and 9 binary indicator (nominal) variables; combined with the 4 numerical variables, this gave 44 variables in total. Our 6 original variables, with the integer code assigned to each category, are as follows:

type: {'w4t': 14, 't4m': 4, 'w4w': 3, 'mm4ww': 16, 'w4m': 0, 'ww4ww': 26, 'ww4mm': 23, 'mm4mm': 29, 'w4mw': 10, 'm4w': 1, 'm4t': 6, 'ww4m': 27, 'w4mm': 25, 'ww4w': 30, 'm4m': 2, 'mw4m': 24, 't4mw': 21, 't4w': 11, 't4t': 5, 't4mm': 18, 'mw4t': 22, 't4ww': 28, 'mw4w': 15, 'mm4mw': 17, 'mm4w': 20, 'm4mw': 7, 'mw4mw': 9, 'm4mm': 13, 'mm4m': 19, 'm4ww': 8, 'w4ww': 12}
title length: int
text length: int
age: int
body: {'heavy': 3, 'fit': 6, 'athletic': 5, 'big': 0, 'average': 1, 'curvy': 2, 'thin': 4, 'skinny': 7, 'hwp': 8}
height: int (in cm)

Using these variables, we try to predict:

status: {'separated': 2, 'widowed': 3, 'never': 6, 'divorced': 1, 'married': 4, 'partnered': 5, 'single': 0}
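The expansion from 6 variables to 44 columns described above can be sketched as follows. This is a minimal illustration rather than our actual preprocessing code: the category codes are the ones listed above, while the function names, the column order, and the sample post are assumptions made for the example.

```python
# Category codes copied from the variable list above.
TYPE_CODES = {'w4t': 14, 't4m': 4, 'w4w': 3, 'mm4ww': 16, 'w4m': 0, 'ww4ww': 26,
              'ww4mm': 23, 'mm4mm': 29, 'w4mw': 10, 'm4w': 1, 'm4t': 6, 'ww4m': 27,
              'w4mm': 25, 'ww4w': 30, 'm4m': 2, 'mw4m': 24, 't4mw': 21, 't4w': 11,
              't4t': 5, 't4mm': 18, 'mw4t': 22, 't4ww': 28, 'mw4w': 15, 'mm4mw': 17,
              'mm4w': 20, 'm4mw': 7, 'mw4mw': 9, 'm4mm': 13, 'mm4m': 19, 'm4ww': 8,
              'w4ww': 12}
BODY_CODES = {'heavy': 3, 'fit': 6, 'athletic': 5, 'big': 0, 'average': 1,
              'curvy': 2, 'thin': 4, 'skinny': 7, 'hwp': 8}


def one_hot(code, size):
    """Return a list of `size` zeros with a single 1 at index `code`."""
    vec = [0] * size
    vec[code] = 1
    return vec


def encode_post(post):
    """Flatten one post into 4 numeric + 31 type + 9 body = 44 columns.

    The column order is an assumption made for this sketch.
    """
    numeric = [post['title_length'], post['text_length'],
               post['age'], post['height']]
    return (numeric
            + one_hot(TYPE_CODES[post['type']], len(TYPE_CODES))
            + one_hot(BODY_CODES[post['body']], len(BODY_CODES)))


# Hypothetical post, made up for illustration (height in cm).
post = {'type': 'm4w', 'title_length': 32, 'text_length': 540,
        'age': 29, 'body': 'average', 'height': 180}
vector = encode_post(post)
print(len(vector))
```

Each categorical value becomes its own 0/1 column, so no artificial ordering is implied between, say, code 3 and code 4.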

Method

We first clean and standardize the data so that the result is not dominated by variables with larger numeric ranges. We then perform weighted KNN with 10 neighbors using Euclidean distance, weighting each neighbor's vote by the inverse of its squared distance. We use 5-fold cross-validation. The accuracy of the model was 87.6%. We also tried other machine learning algorithms such as decision trees and SVMs, but they produced ~70% accuracy, possibly because of the large number of nominal variables, which could have skewed some of the calculations in SVMs.
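The method described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code that produced the 87.6% figure: the function names are hypothetical, the toy data is synthetic, and standardizing the whole dataset before splitting is a simplification.

```python
import math
import random
from collections import defaultdict


def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def knn_predict(train_x, train_y, query, k=10):
    """Vote among the k nearest neighbors, each weighted by 1 / distance^2."""
    nearest = sorted((math.dist(x, query), y)
                     for x, y in zip(train_x, train_y))[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / (dist * dist + 1e-12)  # epsilon avoids division by zero
    return max(votes, key=votes.get)


def cross_validate(xs, ys, folds=5, k=10, seed=0):
    """Return the mean accuracy over `folds` shuffled, non-overlapping test folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    accuracies = []
    for f in range(folds):
        test = set(idx[f::folds])
        train_x = [xs[i] for i in idx if i not in test]
        train_y = [ys[i] for i in idx if i not in test]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                      for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds


# Synthetic toy data (two well-separated clusters), not Craigslist data.
rng = random.Random(1)
xs = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
      + [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(50)])
ys = [0] * 50 + [1] * 50
accuracy = cross_validate(standardize(xs), ys)
print(f"mean 5-fold accuracy: {accuracy:.3f}")
```

With inverse-squared weighting, a single very close neighbor can outvote several distant ones, which is the main difference from plain majority-vote KNN.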

Challenges

One challenge of using this model is that not everyone posts an age, body type, or height. We could work around this by finding a way to accommodate the missing values. More generally, a challenge of using machine learning is picking something interesting to predict. We thought about predicting whether or not a post is looking for a relationship based on the title and text, but hand-classifying training data is tedious and, more importantly, somewhat unreliable, since intentions are often ambiguous or cannot be ascertained from the post (even by a human).
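For the missing-value problem mentioned above, one possible approach is median imputation with an extra "was missing" flag per field. The sketch below is only an assumption about how this could be done, not something the project has adopted; the function name, field names, and sample posts are hypothetical.

```python
from statistics import median


def impute_numeric(posts, fields=('age', 'height')):
    """Fill missing numeric fields with the column median and flag them.

    For every field `f`, an extra `f_missing` key records whether the value
    was absent, so a model can still use "did not report" as a signal.
    """
    medians = {f: median(p[f] for p in posts if p.get(f) is not None)
               for f in fields}
    filled = []
    for p in posts:
        row = dict(p)  # copy so the original post is left untouched
        for f in fields:
            missing = p.get(f) is None
            row[f + '_missing'] = int(missing)
            if missing:
                row[f] = medians[f]
        filled.append(row)
    return filled


# Hypothetical posts, made up for illustration (height in cm).
posts = [
    {'age': 25, 'height': 170},
    {'age': None, 'height': 180},
    {'age': 35, 'height': None},
]
filled = impute_numeric(posts)
print(filled[1])
```

A missing body type could be handled in the same spirit by adding one more category (for example "unreported") before one-hot encoding.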

Discussion

What is the hardest part of the project that you've encountered so far?

The greatest difficulty in our project so far has come not in finding answers, but in finding the right questions to ask. We have a great wealth of data in the personal ads, but our efforts to find deeper meaning (a signal in the noise) are limited by the lack of causative information. That is, what we would have loved to have, and what is entirely absent from our data because of the confidential nature of Craigslist, is information on the interactions that result from each posting. Thus, it has been by no means a simple task to find questions of sufficient complexity and interest, like those we’ve ultimately chosen to pursue.

What are your initial insights?

Our word frequency visualization allows us to draw some preliminary insights about who is using which words in their Craigslist personal ads.

Firstly, it's clear that posters in "man seeking man" use the words "guys", "clean", "host", and "cock" far more frequently than posters in any other category. It may be that the term "host" indicates that the poster is looking for an immediate hook-up, such as "I can host you at my place."

As for the term "love," female posters use it most often, both women seeking men and women seeking women. They are followed closely by male posters seeking women. From our visualization so far, we are not willing to draw any conclusions about who is truly seeking love on Craigslist, but we look forward to investigating further.
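Comparisons like the ones above depend on normalizing word counts by how much text each category contains. A minimal sketch of that computation follows; the function name, the simple tokenizer, and the sample posts are assumptions made for illustration and are not taken from our dataset.

```python
import re
from collections import Counter, defaultdict


def relative_frequencies(posts):
    """Map category -> {word: share of that category's total word count}."""
    counts = defaultdict(Counter)
    for category, text in posts:
        # Lowercase and keep runs of letters/apostrophes as words.
        counts[category].update(re.findall(r"[a-z']+", text.lower()))
    return {cat: {word: n / sum(c.values()) for word, n in c.items()}
            for cat, c in counts.items()}


# Made-up example posts as (category, text) pairs.
posts = [
    ('w4m', 'Looking for love and someone to love hiking with'),
    ('m4w', 'I would love to meet for coffee'),
    ('m4m', 'Can host tonight'),
]
freqs = relative_frequencies(posts)
for category in ('w4m', 'm4w', 'm4m'):
    print(category, round(freqs[category].get('love', 0), 3))
```

Using shares rather than raw counts keeps categories with many posts from appearing to use every word more often.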

Are there any concrete results you can show at this point? If not, why not?

Please see Visualization and Machine Learning sections above.

Going forward, what are the current biggest problems you're facing?

We are still occasionally getting blocked by Craigslist servers, so the posts we have collected are not sampled consistently over time. This means that we will not be able to draw reliable conclusions about when people post, e.g. time of day or day of the week.

Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

With a little over a month left, we still want to 1) cluster posts to create profiles, 2) answer how people describe themselves compared to how they describe their ideal partners, 3) perform sentiment analysis, and 4) make visualizations! We believe the timeline is feasible.

Given your initial exploration of the data, is it worth proceeding with your project?

Emphatically yes.