Tarique Anwar

Data Mining - CS524

Objective The objective of this course is to set up the foundational concepts and techniques in databases and data mining with particular emphasis on analyzing different kinds of data, and thus extracting meaningful information.
Instructor Tarique Anwar
Office: Room 205, CSE Block, IIT Ropar
Email: tarique@iitrpr.ac.in
Teaching Assistants Aroof Aimen
Email: 2018csz0001@iitrpr.ac.in

Rakesh Kumar Meena
Email: 2018csm1017@iitrpr.ac.in

Harsimar Singh
Email: 2018csm1010@iitrpr.ac.in
Class Schedule Monday, 11:00am-11:50am, Lecture
Tuesday, 12:00pm-12:50pm, Lecture
Wednesday, 09:00am-09:50am, Lecture
Venue: M4, Lecture Hall Complex
Lab Schedule Monday, 02:00pm-03:45pm, Lab
Venue: L2, CSE Block
Credits 4 (3 Lectures + 0 Tutorials + 2 Labs + 7 Self-study hours, weekly)
Who can take this course Pre-requisites: CS201 (DS) / CS506 (DSA)
Students who have already completed CS503 (ML) with a good grade (A- or A) are recommended to take the Advanced Data Mining course (CS724), which is being offered in parallel.

Syllabus ♣

Serial No. Topics References Lecture Notes/Slides
1. Introduction to Data Mining and Analysis, Attribute Types, Kernel Methods, Statistical Descriptions of Data, Measuring Data Similarity and Dissimilarity, Data Preprocessing, Graph Data, High Dimensional Data Chapters* 1-3 [1] [2] [3]
2. Frequent Pattern Mining: Mining and Summarizing Itemsets, Graph Pattern Mining, Pattern and Rule Assessment, Correlation Analysis, Pattern Mining in Multilevel and Multidimensional Data, Mining High Dimensional Data and Colossal Patterns, Mining Compressed or Approximate Patterns Chapters* 6-7 [6] [7]
3. Cluster Analysis: Representative-based Clustering, Hierarchical Clustering, Probabilistic Hierarchical Clustering, Density-based Clustering, Grid-based Clustering, Evaluation of Clustering, Probabilistic Model-based Clustering, Fuzzy Clustering, Clustering High Dimensional Data, Biclustering, Spectral and Graph Clustering, Clustering with Constraints Chapters* 10-11 [10]
4. Outlier Detection: Supervised, Semi-supervised, and Unsupervised Methods for Outlier Detection, Statistical Approaches, Proximity-based Approaches, Clustering-based Approaches, Classification-based Approaches, Mining Contextual and Collective Outliers, Outlier Detection in High Dimensional Data Chapter* 12 Download
5. Classification: Probabilistic Classification, Rule-based Classification, Decision Tree Classifier, Classification by Backpropagation, Support Vector Machines, Classification using Frequent Patterns, Lazy Learners, Genetic Algorithms, Classification Assessment, Techniques to Improve Classification Accuracy, Random Forests Chapters* 8-9 Download
6. Data Mining Research Frontiers: Mining Complex Data Types, Data Mining Applications, Data Mining and Society Chapter* 13 Download

* Refer to the chapters of the book "Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition".

Assessment Policy ♣

Lab Report Weightage: 10%
Hands-on Project Weightage: 15%
Quiz ♦ Weightage: 20%
Two Quizzes of 10% weightage each.
Mid-Semester Examination ♦ Weightage: 25%
Syllabus: Topics covered till the last class.
End-Semester Examination ♦ Weightage: 30%
Syllabus: Entire syllabus.
Attendance Policy All are expected to have 100% attendance.
> 75% attendance => Eligible for 1% bonus marks.
< 50% attendance => Grade will be lowered by 1 point.
< 30% attendance => Grade "F".
Grading Policy A combination of absolute and relative grading will be followed.

♣ Tentative
♦ Some Quizzes and Exams will be open-book/notes. The exact format will be announced one day before the scheduled date. Keep checking the announcements at the bottom of this page.

Textbooks

1. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2011
2. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005
3. Mohammad J Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014
4. Charu C. Aggarwal, Data Mining: The Textbook, Springer, 2015

Lab Assignments

Number Assignment Due Date
Lab Assignment 1 Statistical Data Analysis (Download from here )
Additional Note on Task - T 7
06/02/2019
Lab Assignment 2 Frequent Pattern Mining (Download from here )
Dataset: Right click here and select "save link as" to download
Small Dataset: Right click here and select "save link as" to download
Dataset in PDF (Download from here)

Note: If the dataset in xml format does not open on your computer because of its large size, there is no need to open it. Just use it as input for your program to read from, and perform the tasks. To understand the contents of the xml file, see the Dataset in PDF (dblp50000.pdf). The contents of both these files are exactly the same.
21/02/2019 (extended from 20/02/2019) and 03/03/2019

Update: 21/02/2019 (extended from 20/02/2019) is the due date for submitting any two of the three algorithms in task T1, plus all the remaining tasks using these implementations.

03/03/2019 is the due date for submitting the remaining implementation of T1 and the remaining part in T2.
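
Regarding the note above on reading the large xml file from a program instead of opening it: one possible approach is incremental (streaming) parsing, so that the whole file is never held in memory. The following is only a minimal sketch using Python's standard xml.etree.ElementTree.iterparse. The element names (dblp, article, inproceedings, author, title), the helper names and the tiny in-memory sample are assumptions made for illustration, not the confirmed structure of the assignment dataset; check the Dataset in PDF for the actual layout. If the file relies on character entities declared in an external DTD, the standard parser may raise an error, and that case is not handled here.

```python
import io
import xml.etree.ElementTree as ET


def _text(elem):
    """Return all text inside an element, including text of nested tags."""
    return "".join(elem.itertext()).strip()


def iter_records(source, record_tags=("article", "inproceedings")):
    """Yield (tag, title, authors) one record at a time.

    `source` may be a file path or a binary file object.
    `record_tags` are assumed record element names (hypothetical).
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in record_tags:
            title_elem = elem.find("title")
            title = _text(title_elem) if title_elem is not None else ""
            authors = [_text(a) for a in elem.findall("author")]
            yield elem.tag, title, authors
            # Drop already-processed records so memory use stays bounded.
            root.clear()


# Tiny made-up sample standing in for the real dataset file.
SAMPLE = b"""<dblp>
  <article key="a1"><author>A. One</author><author>B. Two</author><title>First Paper</title></article>
  <inproceedings key="p1"><author>C. Three</author><title>Second Paper</title></inproceedings>
</dblp>"""

records = list(iter_records(io.BytesIO(SAMPLE)))
for tag, title, authors in records:
    print(tag, "|", title, "|", ", ".join(authors))
```

To run this on the actual dataset, a file path can be passed in place of the in-memory sample, for example iter_records("dblp20888.xml"), assuming the file is in the working directory.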

Announcements

22/02/2019: The Mid-Semester Examination will be closed book/notes. Its syllabus includes all the topics covered till the last class, which are Chapters 1-3 (excluding Section 2.3), 6-7, and 10 (only up to and including Section 10.4.2) of the reference book (Data Mining: Concepts and Techniques, Han et al.).

20/02/2019: Assignment 2: I came to know that the computer systems in the Lab are not able to deal with the given dataset because of its large size. Some students may be facing the same issue on their personal computers as well. The first task is independent of the dataset, so this should not be a problem there. For all the remaining tasks, you can alternatively download the small dataset of 20,888 records (dblp20888.xml) from the Lab Assignments section above, and work directly on this xml file, instead of the dblp5000.xml file.

Considering some requests, the due date for the first submission of Assignment 2 has been extended by 1 day. So the first part has to be submitted by the end of tomorrow (21/02/2019).

13/02/2019: Quiz 1 has been scheduled at 2pm on Monday (18/02/2019) in M6, Lecture Hall Complex. It will be closed book/notes, with a duration of 1 hour. Its syllabus includes all the topics covered till today's class, which are Chapters 1-3 (excluding Section 2.3), 6-7, and 10 (only up to and including Section 10.3.2) of the reference book (Data Mining: Concepts and Techniques, Han et al.).

Further details about Lab Assignment 2 have been added. Also the submission requirements and due date have been updated. Check the Lab Assignments section above.

10/02/2019: Lab Assignment 2 has been released. Download a copy from the Lab Assignments section above. For those who haven't submitted Assignment 1 yet, today is the last date for submission with penalty. The submission link for Assignment 1 will close tonight.

05/02/2019: The assignment submission link is open on Moodle now. Join the Data Mining course on Moodle, if not done yet, with enrolment key cs524_201820192.

29/01/2019: An additional note on Lab Assignment 1 has been released. Check the Lab Assignments section above.

27/01/2019: Lab Assignment 1 has been released. Download a copy from the Lab Assignments section above.

20/01/2019: Those who are taking the course but have not submitted the ADD request on CRP need to submit the request at the earliest (CRP has listed the course code as CS5XX), and then send me an email mentioning the Grade obtained in Machine Learning (if taken previously). Those who have already submitted their request but have not got approval yet need to send me an email mentioning their Grade in Machine Learning (if taken previously). ADD requests will be approved only after I know the Grade in Machine Learning.

Also, please join the Data Mining course on Moodle, with enrolment key cs524_201820192.

13/01/2019: I will be on leave tomorrow. So there will be no Data Mining classes tomorrow (Monday, 14th January 2019). Any urgent communications can be directed through email.

08/01/2019: The confusion about the timing of this course has been clarified and confirmed. Please see the details above. Classes are going to start from tomorrow. The first class is going to take place at 09am tomorrow in M4, Lecture Hall Complex.

07/01/2019: Welcome to the course CS524 - Data Mining. It will be taught by Dr Tarique Anwar (myself). The timing and venue will be announced here shortly.



Note: This page will be updated regularly with all the helpful information and announcements. Students are recommended to keep checking the updates here.