Data Science Course IFT6758

Project Description

The goal of this project is to build a system for automatic recognition of the age, gender, and personality of social media users. Given a user's generated content (e.g., text, images, and relations) as input, the system should return the age, gender, and personality trait scores of that user.
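To make the expected input and output concrete, here is a minimal sketch in Python of what such a system's interface could look like. It is an illustration only, not the course's reference implementation. The names Profile and MajorityBaseline, the age labels, and the trait names are all hypothetical. The predictor shown ignores the user's content entirely and returns the most frequent age and gender plus the mean trait scores seen during training; this is one common way to define a trivial baseline, and it may differ from the baseline used on the scoreboard.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Profile:
    """Output of the system for one user (field names are hypothetical)."""
    age: str                       # assumed to be a label such as "18-24"
    gender: str
    personality: Dict[str, float]  # trait name -> score


class MajorityBaseline:
    """Trivial predictor: ignores the input and returns training-set statistics."""

    def fit(self, profiles: List[Profile]) -> "MajorityBaseline":
        if not profiles:
            raise ValueError("need at least one training profile")
        # Most frequent age and gender labels in the training data.
        self.age = Counter(p.age for p in profiles).most_common(1)[0][0]
        self.gender = Counter(p.gender for p in profiles).most_common(1)[0][0]
        # Mean score per trait across all training users.
        traits = profiles[0].personality.keys()
        self.personality = {
            t: mean(p.personality[t] for p in profiles) for t in traits
        }
        return self

    def predict(self, text: str = "", image=None, relations=()) -> Profile:
        # A real system would use text, image, and relations here;
        # this sketch deliberately does not.
        return Profile(self.age, self.gender, dict(self.personality))


# Toy training data with made-up labels and trait names.
train = [
    Profile("18-24", "female", {"trait_a": 3.0, "trait_b": 4.0}),
    Profile("25-34", "female", {"trait_a": 4.0, "trait_b": 2.0}),
    Profile("18-24", "male", {"trait_a": 5.0, "trait_b": 3.0}),
]
baseline = MajorityBaseline().fit(train)
prediction = baseline.predict(text="hello world")
print(prediction)
```

A model that actually uses the text, image, and relation data would keep the same predict signature and replace the body, which makes it straightforward to compare against a trivial predictor of this kind.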

Setup

For brief instructions on accessing the server, click here.

Need more information to build your software for the project? Check here.

Scoreboard

To see the score of your team, check the scoreboard.

You can find papers here that describe how other people have approached the same or very similar problems. To get a better understanding of the problem domain, it is highly recommended that you read one or more of these papers.

Project Grading

The project counts for 35% of your final grade. The project will be graded out of a total of 35 points as follows:

Results on the scoreboard in week 5 (September 30): 2 points. If your results are at least as good as the baseline for all prediction tasks, your team gets full credit for this part. [english, french]
In-class progress update in week 10 (November 5): 5 points. This is a group presentation, with slides. All team members should present. Students who do not present cannot get credit for this part. [english, french]

Requirements: Give an overview of the prediction task, and present a statistical analysis of the data using visualization tools. Present the method that you use for each data source/task. Provide an overview of the prediction results you obtained by applying the machine learning methods to the public training set. Note that your score on the scoreboard (Evaluation #5: November 4) should beat the baseline for at least one of the tasks.

Final in-class presentation in week 15 (December 3): 10 points. This is a group presentation, with slides. All team members should present. Students who do not present cannot get credit for this part. At the beginning of the lecture in the last week, we will decide at random which teams will present.

Requirements: Give an overview of the prediction task, and present a statistical analysis of the data using visualization tools. Present the method that you use for each data source/task. You need to use all three sources in your software. Provide an overview of the prediction results you obtained by applying the machine learning methods to the public training set. Note that your score on the scoreboard (Evaluation #10: December 16) should at least beat the baseline for all three tasks.

Code and documentation in week 15 (December 23): 8 points. This part is evaluated based on the readability of the code. All team members need to answer coding questions about their software at the final presentation to get credit for this part.
Report on week 15 (December 23): 10 points (= 8 points group report + 2 points individual report)

Your grade for the progress updates is based on a 7-minute presentation in class (approx. 1-2 slides per team member) and the results of your software so far.

Deliverables

Group report

(1 upload per team) Provide a write-up of your research in the form of an academic paper that could be submitted to a conference on data mining/machine learning. Your paper should be self-contained: anyone who has read the assigned reading materials from the course should be able to read and understand it. That means you can be brief about machine learning methods that are described in the assigned readings, but you need to provide sufficient detail about the problem domain, the dataset, and any machine learning methods you used that were not covered in class. The reasons for this are: (1) a description of the problem domain and the dataset makes it possible to share your paper with interested parties who have not taken the course but who have general knowledge of machine learning; (2) a description of machine learning methods not covered in class makes it possible to evaluate whether you truly understood those methods instead of treating them as a black box. Your paper can, for instance, be divided into sections as follows (but if another structure works better for you, don’t feel restricted to the one below):

Introduction: a description of the problem (profiling of Facebook users), what the goals of the study are, and a very brief description of the results.
Methodology: a brief description of the machine learning methods used.
Dataset and metrics: a description of the datasets and the evaluation measures used.
Results: an overview of the results you obtained by applying the methods from section 2 to the dataset from section 3 using the metrics from section 3. In addition to reporting numbers, your analysis should contain your insights into the results, i.e., why a particular method did or did not work well.
Conclusion and future work: briefly summarize your results and list opportunities for future research that seem promising to you but for which you did not find the time within this quarter.

Formatting guidelines: up to 8 pages, double column, ACM Proceedings format. If you need more than 8 pages, consider splitting your material into a main paper and an appendix.

Individual report

(1 upload per student) You will also submit a brief individual report (at most one page), which will:

Describe the parts of the project you worked on (which machine learning methods you applied, which preprocessing steps you performed on the data, which parts of the term paper you wrote, who you worked with on what parts, etc.) and what parts of the project your teammates worked on.
Summarize what you learned from the project.

The purpose of the individual report is to facilitate fair grading and to help the instructor understand what you learned from the project.