Tarique Anwar

Data Mining - CS524

Objective The objective of this course is to introduce the foundational concepts and techniques in databases and data mining, with particular emphasis on analyzing different kinds of data and extracting meaningful information from them.
Instructor Tarique Anwar
Office: Room 205, CSE Block, IIT Ropar
Email: tarique@iitrpr.ac.in
Teaching Assistants Aroof Aimen
Email: 2018csz0001@iitrpr.ac.in

Rakesh Kumar Meena
Email: 2018csm1017@iitrpr.ac.in

Harsimar Singh
Email: 2018csm1010@iitrpr.ac.in
Class Schedule Monday, 11:00am-11:50am, Lecture
Tuesday, 12:00pm-12:50pm, Lecture
Wednesday, 09:00am-09:50am, Lecture
Venue: M4, Lecture Hall Complex
Lab Schedule Monday, 02:00pm-03:45pm, Lab
Venue: L2, CSE Block
Credits 4 (3 Lectures + 0 Tutorials + 2 Labs + 7 Self-study hours, weekly)
Who can take this course Pre-requisites: CS201 (DS) / CS506 (DSA)
Students who have already completed CS503 (ML) with a good grade (A- or A) are recommended to take the Advanced Data Mining course (CS724), which is being offered in parallel.

Syllabus ♣

Serial No. Topics References Lecture Notes/Slides
1. Introduction to Data Mining and Analysis, Attribute Types, Kernel Methods, Statistical Descriptions of Data, Measuring Data Similarity and Dissimilarity, Data Preprocessing, Graph Data, High Dimensional Data Chapters* 1-3 [1] [2] [3]
2. Frequent Pattern Mining: Mining and Summarizing Itemsets, Graph Pattern Mining, Pattern and Rule Assessment, Correlation Analysis, Pattern Mining in Multilevel and Multidimensional Data, Mining High Dimensional Data and Colossal Patterns, Mining Compressed or Approximate Patterns Chapters* 6-7 [6] [7]
3. Cluster Analysis: Representative-based Clustering, Hierarchical Clustering, Probabilistic Hierarchical Clustering, Density-based Clustering, Grid-based Clustering, Evaluation of Clustering, Probabilistic Model-based Clustering, Fuzzy Clustering, Clustering High Dimensional Data, Biclustering, Spectral and Graph Clustering, Clustering with Constraints Chapters* 10-11 [10][11]
4. Outlier Detection: Supervised, Semi-supervised, and Unsupervised Methods for Outlier Detection, Statistical Approaches, Proximity-based Approaches, Clustering-based Approaches, Classification-based Approaches, Mining Contextual and Collective Outliers, Outlier Detection in High Dimensional Data Chapter* 12 [12]
5. Classification: Probabilistic Classification, Rule-based Classification, Decision Tree Classifier, Classification Assessment, Techniques to Improve Classification Accuracy Chapter* 8 [8]

* Refer to the chapters of the book "Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Third Edition".

Assessment Policy ♣

Lab Report Weightage: 10%
Hands-on Project Weightage: 15%
Quiz ♦ Weightage: 20%
Two Quizzes of 10% weightage each.
Mid-Semester Examination ♦ Weightage: 25%
Syllabus: Topics covered till the last class.
End-Semester Examination ♦ Weightage: 30%
Syllabus: Entire syllabus.
Attendance Policy All are expected to have 100% attendance.
> 75% attendance => Eligible for 1% bonus marks.
< 50% attendance => Grade will be lowered by 1 point.
< 30% attendance => Grade "F".
Grading Policy A combination of absolute and relative grading will be followed.

♣ Tentative
♦ Some Quizzes and Exams will be open-book/notes. The exact format will be announced one day before the scheduled date. Keep checking the announcements at the bottom of this page.

Textbooks

1. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2011
2. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005
3. Mohammad J Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014
4. Charu C. Aggarwal, Data Mining: The Textbook, Springer, 2015

Lab Assignments

Number Assignment Due Date
Lab Assignment 1 Statistical Data Analysis (Download from here )
Additional Note on Task - T 7
06/02/2019
Lab Assignment 2 Frequent Pattern Mining (Download from here )
Dataset: Right click here and select "save link as" to download
Small Dataset: Right click here and select "save link as" to download
Dataset in PDF (Download from here)

Note: If the dataset in XML format does not open on your computer because of its large size, you do not need to open it. Just use it as an input file for your program to read from, and perform the tasks. To understand the contents of the XML file, see the Dataset in PDF (dblp50000.pdf). The contents of both files are exactly the same.
21/02/2019 (extended from 20/02/2019) and 03/03/2019

Update: 21/02/2019 (extended from 20/02/2019) is the due date for submitting any two of the three algorithms in task T1, plus all the remaining tasks using these implementations.

03/03/2019 is the due date for submitting the remaining implementation of T1 and the remaining part in T2.
Lab Assignment 3 Cluster Analysis (Download from here )
Dataset: Right click here and select "save link as" to download

Note: If the "Brighkite_totalCheckins.txt" file does not open on your computer because of its large size, you do not need to open it. Just use it as an input file for your program to read from, and perform the tasks.
24/03/2019
Lab Assignment 4 Outlier Detection (Download from here )
Dataset: Right click here and select "save link as" to download

Note: If the "Brighkite_totalCheckins.txt" file does not open on your computer because of its large size, you do not need to open it. Just use it as an input file for your program to read from, and perform the tasks.
22/04/2019
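
The notes above suggest reading a large dataset file from within a program rather than opening it in a viewer. A minimal sketch of one way to do this for XML input is shown below, using incremental parsing from Python's standard library. The tag names ("article", "author"), the in-memory sample, and the helper names are hypothetical and used only for illustration; they are not taken from the assignment handout, and the real dataset may need additional handling.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield one record element at a time from an XML file or file object."""
    # iterparse reads the input incrementally instead of loading it all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield elem
            # Once the caller has used the record, drop its children and text
            # so that memory use stays low on large inputs.
            elem.clear()


def fields_per_record(source, record_tag="article", field_tag="author"):
    """Collect the text of every field_tag child for each record (hypothetical tags)."""
    return [
        [field.text for field in record.findall(field_tag)]
        for record in iter_records(source, record_tag)
    ]


# Tiny in-memory stand-in for a large file (assumed structure, for illustration only).
SAMPLE = b"""<records>
  <article><author>A</author><author>B</author><title>T1</title></article>
  <article><author>B</author><title>T2</title></article>
</records>"""

transactions = fields_per_record(io.BytesIO(SAMPLE))
print(transactions)
```

In practice, a file path can be passed in place of the `io.BytesIO` object. Each inner list can then serve as one transaction for a frequent pattern mining implementation.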

Project

Project Dataset Due Date
BrightKite (Download from here) Download from here Proposal: 25/03/2019 (extended from 24/03/2019)
Progress Report: 10/04/2019 (extended from 07/04/2019)
Final Report: 21/04/2019
Demonstration: 22/04/2019

Announcements

23/04/2019: Quiz 2 has been scheduled at 6pm on Thursday (25/04/2019). It will be closed book/notes and of a duration of 1 hour. Its syllabus includes all the topics covered after the Mid-Sem, which are Chapters 10 (from Section 10.4.3 to the end), 11 (excluding the topics optimization using the delta-cluster algorithm, and enumerating all biclusters using MaPle in Section 11.2.3, and Sections 11.2.4 and 11.4), 12 (excluding Section 12.8), and 8 (excluding Sections 8.2.4, 8.2.5, 8.5.5, and 8.5.6) of the reference book (Data Mining: Concepts and Techniques, Han et al.).

19/04/2019: The feedback on the progress report of the project has been updated on Moodle. There will be no extension in the submission due date of the final report. The final report should have the following information and structure.

----------------------
Project Title:
Group Members [Name, Entry No., Contribution percentage] (Note: Ordering of the names has no significance.)
Section 1: Introduction and Background
Section 2: Aim
Section 3: Approach and Methodology
Section 3.1: Task 1
Section 3.2: Task 2
.
.
.
Section 4: Experimental Results and Evaluation
Section 4.1: Experimental Settings
Section 4.2: Evaluation Measures
Section 4.3: Experimental Results
Section 4.4: Discussion on Results
Section 5: Conclusion
----------------------

17/04/2019: The attendance record has been updated. Check it from here, and email me if you find any discrepancy.

12/04/2019: Lab Assignment 4 has been released. Download a copy from the Lab Assignments section above.

06/04/2019: The attendance record from 15/01/2019 to 13/03/2019 can be checked from here. Email me if you find any discrepancy.

05/04/2019: Tomorrow is a Monday schedule day. Therefore we will have our regular Monday lecture in the morning, and an extra lecture at 2pm in CS1.

The marks and feedback on the Project Proposal have been uploaded on Moodle. Please check the comments on Moodle and discuss with me if required.

The due date for the submission of the progress report is being extended to 09/04/2019 (Tuesday). The progress report has to contain the following basic information. You may follow the specimen below for its format. The content of the proposal may be reused, with significant improvement based on the proposal feedback. All the sections in this submission are expected to contain more detailed, clear, and specific information. The tasks already completed should be described with full details of the technical method. The report should be 2-3 pages long in the ACM SIG Proceedings style.

----------------------
Project Title:
Group Members [Name, Entry No.] (Note: Ordering of the names has no significance. All are expected to have equal participation.)
Section 1: Introduction and Background
Section 2: Aim
Section 3: Approach and Methodology
Section 3.1: Task 1
Section 3.2: Task 2
.
.
.
Section 4: Work done so far
Section 5: Future plan
Section 6: Expected Outcomes and Benefits (including potential commercial value, if any)
----------------------

24/03/2019: This announcement is regarding the project proposal. The due date of the proposal is being extended by 1 day. So the new due date is 25/03/2019.

The proposal has to contain the following basic information in brief (may be a paragraph for each section/subsection). You may follow the specimen below for its format.

----------------------
Project Title:
Group Members [Name, Entry No.] (Note: Ordering of the names has no significance. All are expected to have equal participation.)
Section 1: Introduction and Background
Section 2: Aim
Section 3: Approach and Methodology
Section 3.1: Task 1
Section 3.2: Task 2
.
.
.
Section 4: Expected Outcomes and Benefits (including potential commercial value, if any)
----------------------

Re-submissions are allowed on Moodle as many times as you want. The last submission will be considered for marking, and the late penalty will apply based on the date of the last submission.

17/03/2019: There will be no Data Mining classes from 18/03/2019 to 29/03/2019. The next class will take place at 11am on 01/04/2019 (usual Monday lecture) in M4, followed by two extra classes from 2pm to 3:45pm on the same day in CS1.

10/03/2019: This announcement is regarding Quiz 1, MidSem, Assignment 3, and the Project.

The answer sheets of Quiz 1 and the MidSem examination will be shown tomorrow during the lab time (2pm) in the lab (L2, CSE block). The solution sketch of the MidSem question paper can be downloaded from here. Students who have any doubt regarding the deduction of marks are advised to first check the solution sketch before asking for clarification. If doubts remain, we will be happy to clarify them.

Lab Assignment 3 has been released. Download a copy from the Lab Assignments section above.

The Project details have been released. Download a copy from the Project section above and start working.

26/02/2019: The solution sketch of Quiz 1 can be downloaded from here.

22/02/2019: The Mid-Semester Examination will be closed book/notes. Its syllabus includes all the topics covered till the last class, which are Chapters 1-3 (excluding Section 2.3), 6-7, and 10 (only up to and including Section 10.4.2) of the reference book (Data Mining: Concepts and Techniques, Han et al.).

20/02/2019: Assignment 2: I have learned that the computer systems in the Lab cannot handle the given dataset because of its large size. Some students may be facing the same issue on their personal computers as well. The first task is independent of the dataset, so it is not affected. For all the remaining tasks, you can alternatively download the small dataset of 20,888 records (dblp20888.xml) from the Lab Assignments section above, and work directly on this xml file, instead of the dblp5000.xml file.

Considering some requests, the due date for the first submission of Assignment 2 has been extended by 1 day. So the first part has to be submitted by the end of tomorrow (21/02/2019).

13/02/2019: Quiz 1 has been scheduled at 2pm on Monday (18/02/2019) in M6, Lecture Hall Complex. It will be closed book/notes and of a duration of 1 hour. Its syllabus includes all the topics covered till today's class, which are Chapters 1-3 (excluding Section 2.3), 6-7, and 10 (only up to and including Section 10.3.2) of the reference book (Data Mining: Concepts and Techniques, Han et al.).

Further details about Lab Assignment 2 have been added. Also the submission requirements and due date have been updated. Check the Lab Assignments section above.

10/02/2019: Lab Assignment 2 has been released. Download a copy from the Lab Assignments section above. For those who haven't submitted Assignment 1 yet, today is the last date for submission with penalty. The submission link for Assignment 1 will close tonight.

05/02/2019: The assignment submission link is open on Moodle now. Join the Data Mining course on Moodle, if not done yet, with enrolment key cs524_201820192.

29/01/2019: An additional note on Lab Assignment 1 has been released. Check the Lab Assignments section above.

27/01/2019: Lab Assignment 1 has been released. Download a copy from the Lab Assignments section above.

20/01/2019: Those who are taking the course but have not submitted the ADD request on CRP need to submit the request at the earliest (CRP has listed the course code as CS5XX), and then send me an email mentioning the Grade obtained in Machine Learning (if taken previously). Those who have already submitted their request but have not received approval yet need to send me an email mentioning their Grade in Machine Learning (if taken previously). ADD requests will be approved only after I know the Grade in Machine Learning.

Also, please join the Data Mining course on Moodle, with enrolment key cs524_201820192.

13/01/2019: I will be on leave tomorrow. So there will be no Data Mining classes tomorrow (Monday, 14th January 2019). Any urgent communications can be directed through email.

08/01/2019: The confusion about the timing of this course has been clarified and confirmed. Please see the details above. Classes are going to start from tomorrow. The first class is going to take place at 09am tomorrow in M4, Lecture Hall Complex.

07/01/2019: Welcome to the course CS524 - Data Mining. It will be taught by Dr Tarique Anwar (myself). The timing and venue will be announced here shortly.



Note: This page will be updated regularly with all the helpful information and announcements. Students are recommended to keep checking the updates here.