NYC 311 Dataset
What is (are!) the data? All 311 service requests in the past 9 years. Each request carries several descriptors, including the agency it was directed to, a description of the complaint, and the location of the incident.
What format is the data in? The data can be exported in any major format (CSV, JSON, RDF, RSS, TSV, XML).
What are the dimensions of the data (rows and columns)? Each row is a single complaint, and each column is a descriptor of that complaint. There are 41 of those descriptors and roughly 21,500,000(!!) rows, one per complaint.
What are the "variables" (also known as "data items"). In a CSV these would be the column headings. Do you recognize the data types (numbers, strings, images, etc.)? With 41 variables, there is quite a variety in what information is provided. It can, however, be generally sorted into request identifiers (ID, date), the agency directed to, a description of the request, and the location of the request. Most fields are strings; the identifiers and geographic coordinates are numeric.
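A quick way to see these variables is to parse a row of the CSV export. The column names below follow the dataset's real headers, but the sample rows themselves are invented for illustration:

```python
import csv
import io

# A tiny made-up sample in the dataset's CSV export format.
# Column names mirror the real 311 export; the row values are invented.
sample = """Unique Key,Created Date,Agency,Complaint Type,Borough,Latitude,Longitude
10000001,01/02/2020 08:15:00 AM,NYPD,Noise - Residential,BROOKLYN,40.678,-73.944
10000002,01/02/2020 09:30:00 AM,DSNY,Missed Collection,QUEENS,40.728,-73.794
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for name, value in rows[0].items():
    # Every field arrives as a string; numeric fields need explicit conversion.
    print(f"{name}: {value!r}")

lat = float(rows[0]["Latitude"])  # coordinates parse cleanly to floats
```

Note that CSV gives you strings for everything, so dates and coordinates have to be converted before any analysis.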
Is there missing, incorrect, or otherwise problematic data? Location information is sometimes spotty, most notably the location type field. There are also a few columns meant for specific types of requests (like Taxi Pick Up Location) that have very few entries in them.
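One way to measure that spottiness is to count empty cells per column. Blank cells in the CSV export come through as empty strings; the sample rows here are invented:

```python
import csv
import io

# Invented sample illustrating spotty location fields.
sample = """Unique Key,Complaint Type,Location Type,Taxi Pick Up Location
1,Noise - Residential,Residential Building,
2,Illegal Parking,,
3,Taxi Complaint,Street,JFK Airport
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Count blank (empty-string) cells in each column.
missing = {col: sum(1 for r in rows if not r[col]) for col in rows[0]}
print(missing)
```

Run against the full export, a tally like this quickly shows which of the 41 columns are worth keeping for a given analysis.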
How and why was this data collected? This data was collected for governmental record keeping, most likely to monitor whether the 311 program was effective in and of itself and which city services received the most requests and complaints.
For whom is this data accurate or useful? What is this data unrepresentative of? (Who is missing and left out of the data?) This data is most useful to the government itself, as it is their services and their hotline that this data deals with. However, it could also be useful to citizens looking to pressure changes in governmental policy. Any problems that are not directed to 311 are left out of the data set, which excludes a variety of in-person and other non-telephone-based communications.
Knowing what you know now about machine learning, what will a model trained on this data help you do? A model trained on this dataset could predict whether a certain request will be fulfilled/closed based on input parameters. This could be used as a pre-screen before filling out a request, or as a project that demonstrates whether there are biases within the system.
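Before training anything fancy, any model for this closed/open prediction should beat a simple baseline: estimate the historical closure rate per complaint type and predict the majority outcome. A minimal sketch, with invented toy records standing in for the dataset's Complaint Type and Status columns:

```python
from collections import defaultdict

# Toy (complaint_type, status) records; real rows would come from the
# dataset's Complaint Type and Status columns.
history = [
    ("Noise - Residential", "Closed"),
    ("Noise - Residential", "Closed"),
    ("Noise - Residential", "Open"),
    ("Heat/Hot Water", "Closed"),
    ("Heat/Hot Water", "Open"),
    ("Heat/Hot Water", "Open"),
]

counts = defaultdict(lambda: [0, 0])  # [closed, total] per complaint type
for ctype, status in history:
    counts[ctype][1] += 1
    if status == "Closed":
        counts[ctype][0] += 1

def predict(ctype):
    closed, total = counts[ctype]
    # Predict "Closed" when the historical closure rate is at least 50%.
    return "Closed" if total and closed / total >= 0.5 else "Open"

print(predict("Noise - Residential"))  # Closed (2 of 3 closed historically)
print(predict("Heat/Hot Water"))       # Open (1 of 3 closed historically)
```

A real model would fold in more of the 41 columns (agency, borough, date), but comparing it against this one-feature baseline is what reveals whether it has actually learned anything, including possible biases.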
Are there alternative (non-machine learning) methods you could use instead? With this much data, there are a lot of interesting statistics that one could tease out of the dataset. These statistics could be presented in either an interactive or static project that shows trends in both the reporting of and response to complaints.
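The non-ML route can be as simple as tallying requests by month and by complaint type. A sketch with invented (month, complaint type) pairs standing in for parsed Created Date and Complaint Type values:

```python
from collections import Counter

# Invented pairs standing in for parsed Created Date / Complaint Type columns.
requests = [
    ("2020-01", "Noise - Residential"),
    ("2020-01", "Noise - Residential"),
    ("2020-01", "Heat/Hot Water"),
    ("2020-02", "Noise - Residential"),
    ("2020-02", "Heat/Hot Water"),
]

# Volume per month shows reporting trends over time.
by_month = Counter(month for month, _ in requests)

# Most frequent complaint types show where requests concentrate.
top_complaints = Counter(ctype for _, ctype in requests)

print(by_month.most_common())        # [('2020-01', 3), ('2020-02', 2)]
print(top_complaints.most_common(1)) # [('Noise - Residential', 3)]
```

Tallies like these feed directly into static charts or an interactive dashboard without any model training at all.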