R: 
C: 
INSTRUCTIONS: Write your answers on the designated lines. Your answers must be well-organized and well-written. Write your name on all pages. Use PENCIL ONLY.
QUESTIONS
 Assume the data provided below are discretized into categorical values. The categories of each feature have the same significance. The distance between data points Pi and Pj is d(i,j) = 1 – (m/N), where m and N denote the number of matches and the number of features, respectively. The features are Age (A), Salary (S), Number of movies watched per month (N), and Size of family (F).
 Obtain the contingency tables and calculate the dissimilarity DisSim(i,j) and the similarity Sim(i,j) between data points Pi and Pj.
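The distance d(i,j) = 1 – (m/N) can be sketched in a few lines of Python. The two records below are hypothetical placeholders, not the exam's data set, and taking Sim(i,j) as 1 – DisSim(i,j) is an assumption made for the illustration.

```python
def matching_dissimilarity(p, q):
    """d(i, j) = 1 - m/N for two categorical records of equal length."""
    if len(p) != len(q):
        raise ValueError("records must have the same number of features")
    matches = sum(1 for a, b in zip(p, q) if a == b)  # m
    return 1 - matches / len(p)                       # N = len(p)

# Hypothetical records over the features (A, S, N, F); not the exam's data.
p_i = ("A1", "S2", "N1", "F3")
p_j = ("A1", "S1", "N1", "F2")

dissim = matching_dissimilarity(p_i, p_j)  # A and N match: m = 2, N = 4
sim = 1 - dissim                           # assumed definition of Sim(i, j)
```

Every feature contributes equally to m, which mirrors the statement that all categories have the same significance.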




 Calculate the proximity Prx(Pi, Pj) between data points based on the supremum distance measure for the same data set, assuming the values are numeric (not categorical). Which data points are the most similar? Why? Show your calculations.
Prx(Pi, Pj)  P1  P2  P3 
P1  
P2  
P3 
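As a rough illustration of the supremum (Chebyshev) distance, the sketch below takes the largest per-feature absolute difference between two numeric records. The three records are hypothetical stand-ins, not the exam's data.

```python
from itertools import combinations

def supremum_distance(p, q):
    """Largest absolute difference over all features (L-infinity norm)."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical numeric records over (A, S, N, F); not the exam's data.
points = {
    "P1": (1, 2, 3, 1),
    "P2": (2, 2, 1, 3),
    "P3": (1, 3, 3, 2),
}

proximity = {
    (i, j): supremum_distance(points[i], points[j])
    for i, j in combinations(points, 2)
}
# The most similar pair is the one with the smallest distance.
most_similar = min(proximity, key=proximity.get)
```

With these placeholder values, P1 and P3 differ by at most 1 in any feature, so they come out as the closest pair.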
 Assume that each data point {P1, P2, P3} represents a transaction. Convert these transactions, which are in the horizontal data format, into the vertical data format. For an “attribute=value” pair such as “A=1”, use the format “A1” in the vertical data format. Add columns as needed.
ITEM  
TID_SET 
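One possible way to picture the conversion is the sketch below, which turns each “attribute=value” pair into an item such as “A1” and collects the transaction IDs that contain it. The transactions are hypothetical, not the exam's data, and `to_vertical` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical horizontal-format transactions; not the exam's data.
transactions = {
    "P1": {"A": 1, "S": 2, "N": 1, "F": 3},
    "P2": {"A": 2, "S": 2, "N": 1, "F": 1},
    "P3": {"A": 1, "S": 1, "N": 2, "F": 3},
}

def to_vertical(horizontal):
    """Map each item (e.g. "A1") to the set of transaction IDs containing it."""
    vertical = defaultdict(set)
    for tid, record in horizontal.items():
        for attribute, value in record.items():
            vertical[f"{attribute}{value}"].add(tid)
    return dict(vertical)

vertical = to_vertical(transactions)
# e.g. vertical["A1"] is the TID_SET of the item "A1"
```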
 Given samples below,
 Find clusters using the K-medoids algorithm (closest to the mean). Initial seed points are P1 for cluster1 (“o”) and P10 for cluster2 (“”). Use Manhattan distance. Use ceiling for decimal values. In case of equal distance, keep the point in its current cluster. Show your calculations only for the first iteration. Plot the points with their clusters (use a curvy line) and mark the centroid point with “*” at each iteration; use the charts provided. Start clustering on the first chart below.
    P1  P2  P3  P4  P5  P6  P7  P8  P9  P10
X   5   15  20  25  30  30  35  40  40  60
Y   30  30  20  40  35  50  25  30  45  50
[Chart: scatter plot of points P1 to P10]
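The first iteration can be sketched as follows, under one reading of "closest to the mean": assign each point to the nearer seed by Manhattan distance, take the ceiling of each cluster mean, then pick the member nearest to that mean as the new medoid. The exam's tie rule (keep the point in its current cluster) is not modelled; the helper names and the first-in-order tie-break are assumptions.

```python
import math

# Coordinates copied from the X/Y sample table.
POINTS = {
    "P1": (5, 30), "P2": (15, 30), "P3": (20, 20), "P4": (25, 40),
    "P5": (30, 35), "P6": (30, 50), "P7": (35, 25), "P8": (40, 30),
    "P9": (40, 45), "P10": (60, 50),
}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def assign(seed1, seed2):
    """Assign every point to the nearer seed (a tie would go to cluster 1)."""
    clusters = {1: [], 2: []}
    for name, xy in POINTS.items():
        d1 = manhattan(xy, POINTS[seed1])
        d2 = manhattan(xy, POINTS[seed2])
        clusters[1 if d1 <= d2 else 2].append(name)
    return clusters

def ceiled_mean(names):
    n = len(names)
    return (math.ceil(sum(POINTS[m][0] for m in names) / n),
            math.ceil(sum(POINTS[m][1] for m in names) / n))

def nearest_member(names, center):
    # min() keeps the first of several equally close points (assumed tie rule).
    return min(names, key=lambda m: manhattan(POINTS[m], center))

clusters = assign("P1", "P10")
means = {k: ceiled_mean(v) for k, v in clusters.items()}
medoids = {k: nearest_member(clusters[k], means[k]) for k in clusters}
```

In this sketch several members of cluster 1 are equally close to the ceiled mean, so the medoid chosen there depends on the tie-break.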
 Find clusters using the hierarchical algorithm. Use Complete Link (MAX) as the inter-cluster proximity measure and Manhattan distance. Use ceiling for decimal values. Show the clusters with a curvy line at each iteration on the charts below; name them with “K” for cluster designation, such as “K1” for cluster1. Calculate the proximity matrix values for each iteration and enter them into the table below; when a cluster is in equal proximity to others, merge it with the one of larger size. Start clustering on the first chart below.
[Chart: scatter plot of points P1 to P10]
MAX(i,j) 
MAX(i,j) 
MAX(i,j) 
MAX(i,j) 
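A minimal sketch of the Complete Link (MAX) proximity with Manhattan distance is given below: the distance between two clusters is the largest pairwise distance between their members. The helpers `complete_link` and `closest_pair` are illustrative names, and the exam's tie rule (merge with the larger cluster) is not modelled.

```python
from itertools import combinations

# Coordinates copied from the X/Y sample table.
POINTS = {
    "P1": (5, 30), "P2": (15, 30), "P3": (20, 20), "P4": (25, 40),
    "P5": (30, 35), "P6": (30, 50), "P7": (35, 25), "P8": (40, 30),
    "P9": (40, 45), "P10": (60, 50),
}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def complete_link(c1, c2):
    """MAX proximity: the farthest pair of members across two clusters."""
    return max(manhattan(POINTS[a], POINTS[b]) for a in c1 for b in c2)

def closest_pair(clusters):
    """Pair of clusters with the smallest complete-link distance.
    Ties are resolved by list order here, not by cluster size."""
    return min(combinations(clusters, 2), key=lambda pair: complete_link(*pair))

singletons = [(name,) for name in POINTS]
first_pair = closest_pair(singletons)
first_distance = complete_link(*first_pair)
```

Each entry MAX(i,j) of a proximity matrix corresponds to one call of `complete_link` on the two clusters i and j.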
3) Assume that you have a data set represented by three (3) attributes A, B, C. The value categories for each attribute are V(A) = {a1, a2}, V(B) = {b1, b2, b3}, and V(C) = {c1, c2}; note that an item (attribute=value), e.g. “A = 1”, is represented by “A1”. The distribution of the transactions is given in Table 1.
 Fill in the NOT-shaded empty cells in Table 1. Show how to calculate each value.
Table 1: The distribution of the transactions
A      Count(ai)     P(ai)     B   Count(ai,bj)     P(ai,bj)   C   Count(ai,bj,ck)   P(ai,bj,ck)
a1     Count(a1)= _  P(a1)= _  b1  Count(a1,b1)= _  P(a1,b1)=  c1  Count(a1,b1,c1)=  0.05
a1                             b1                              c2                    0.10
a1                             b2  Count(a1,b2)= _  P(a1,b2)=  c1                    0.20
a1                             b2                              c2                    0.05
a1                             b3  Count(a1,b3)= _  P(a1,b3)=  c1                    0.30
a1                             b3                              c2                    0.10
a2     Count(a2)= _  P(a2)= _  b1  Count(a2,b1)= _  P(a2,b1)=  c1                    0.02
a2                             b1                              c2                    0.03
a2                             b2  Count(a2,b2)= _  P(a2,b2)=  c1                    0.01
a2                             b2                              c2                    0.04
a2                             b3  Count(a2,b3)= _  P(a2,b3)=  c1                    0.05
a2                             b3                              c2                    0.05
TOTAL                                                              200               1.00
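The empty cells can be derived by marginalisation, as the sketch below suggests: P(ai,bj) is the sum of P(ai,bj,ck) over ck, P(ai) is the sum over bj and ck, and each count is the probability times the total. Reading the TOTAL of 200 as the number of transactions is an assumption.

```python
from collections import defaultdict

TOTAL = 200  # assumed to be the total number of transactions

# Joint probabilities P(ai, bj, ck) copied from the table.
JOINT = {
    ("a1", "b1", "c1"): 0.05, ("a1", "b1", "c2"): 0.10,
    ("a1", "b2", "c1"): 0.20, ("a1", "b2", "c2"): 0.05,
    ("a1", "b3", "c1"): 0.30, ("a1", "b3", "c2"): 0.10,
    ("a2", "b1", "c1"): 0.02, ("a2", "b1", "c2"): 0.03,
    ("a2", "b2", "c1"): 0.01, ("a2", "b2", "c2"): 0.04,
    ("a2", "b3", "c1"): 0.05, ("a2", "b3", "c2"): 0.05,
}

p_ab = defaultdict(float)  # P(ai, bj) = sum over ck
p_a = defaultdict(float)   # P(ai)     = sum over bj and ck
for (a, b, _c), p in JOINT.items():
    p_ab[(a, b)] += p
    p_a[a] += p

# Counts follow from probability * TOTAL; round() absorbs float noise.
count_ab = {key: round(p * TOTAL) for key, p in p_ab.items()}
count_a = {key: round(p * TOTAL) for key, p in p_a.items()}
```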
 Calculate the support of each item. Then, provide the results in the table below. Add column(s) as you need.
ITEM  
Support(ITEM) 
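Item supports can be read off the joint distribution by summing every cell that contains the item. The sketch below assumes that support is expressed as a fraction of all transactions and uses the “A1”-style item names described above.

```python
from collections import defaultdict

# Joint probabilities from Table 1, keyed by items in the "A1" notation.
JOINT = {
    ("A1", "B1", "C1"): 0.05, ("A1", "B1", "C2"): 0.10,
    ("A1", "B2", "C1"): 0.20, ("A1", "B2", "C2"): 0.05,
    ("A1", "B3", "C1"): 0.30, ("A1", "B3", "C2"): 0.10,
    ("A2", "B1", "C1"): 0.02, ("A2", "B1", "C2"): 0.03,
    ("A2", "B2", "C1"): 0.01, ("A2", "B2", "C2"): 0.04,
    ("A2", "B3", "C1"): 0.05, ("A2", "B3", "C2"): 0.05,
}

item_support = defaultdict(float)
for cell, p in JOINT.items():
    for item in cell:
        item_support[item] += p  # each cell counts toward its three items

# Rounded to two decimals to hide floating-point noise.
item_support = {item: round(s, 2) for item, s in sorted(item_support.items())}
```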
 Given a minimum support of 23%, find all frequent itemsets using the closure property. Draw a lattice tree to show the generated combinations; use the lattice tree format given. The nodes must include the item and the corresponding support value S(*).
Item1 S(item1) 
Item2 S(item1, item2)
… 
ItemK S(item1, itemK) 
… 
… 
Item2 S(item2) 
… 
Item3 S(item2, item3) 
… 
ItemK S(item2, itemK) 
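A level-wise (Apriori-style) search is one way to apply the closure property: a k-itemset is only tested if all of its (k-1)-subsets are already frequent. The sketch below is an assumption about how the search could be coded, using the Table 1 probabilities as supports; `support` and `apriori` are illustrative names.

```python
from itertools import combinations

# Joint probabilities from Table 1, keyed by items in the "A1" notation.
JOINT = {
    ("A1", "B1", "C1"): 0.05, ("A1", "B1", "C2"): 0.10,
    ("A1", "B2", "C1"): 0.20, ("A1", "B2", "C2"): 0.05,
    ("A1", "B3", "C1"): 0.30, ("A1", "B3", "C2"): 0.10,
    ("A2", "B1", "C1"): 0.02, ("A2", "B1", "C2"): 0.03,
    ("A2", "B2", "C1"): 0.01, ("A2", "B2", "C2"): 0.04,
    ("A2", "B3", "C1"): 0.05, ("A2", "B3", "C2"): 0.05,
}

def support(itemset):
    """Sum the probability of every cell that contains all items."""
    return sum(p for cell, p in JOINT.items() if set(itemset) <= set(cell))

def apriori(min_support):
    items = sorted({item for cell in JOINT for item in cell})
    level = [frozenset([i]) for i in items if support([i]) >= min_support]
    frequent, k = {}, 1
    while level:
        for itemset in level:
            frequent[itemset] = round(support(itemset), 2)
        k += 1
        previous = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in previous for s in combinations(c, k - 1))
                 and support(c) >= min_support]
    return frequent

frequent = apriori(0.23)
```

Itemsets such as {B2, B3} survive the subset check but have zero support, since one transaction cannot hold two values of the same attribute.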
3.4) Assume S = {A2, B3, C2} is a frequent 3-itemset. Find the frequent 2-itemsets from the given set S.
 What would be the largest minimum-support value, considering rules with 2- and 3-itemsets only?
 Given the rule “C1 → A2, B3”,
 What is the local frequency of observing the consequent given the antecedent is observed?
 Compare this local frequency against the global frequency of the consequent. What do you think about the degree of association between the antecedent and the consequent?
 Discuss the interestingness of the rule based on:
3.6)3.1. Lift(C1 → A2, B3)
3.6)3.2. Lift(C1 → NOT {A2, B3})
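The two lift values can be sketched from the Table 1 probabilities, assuming the usual definitions: confidence(X → Y) = P(X,Y)/P(X) and lift(X → Y) = confidence(X → Y)/P(Y). The helper `prob` is an illustrative name.

```python
# Joint probabilities from Table 1, keyed by items in the "A1" notation.
JOINT = {
    ("A1", "B1", "C1"): 0.05, ("A1", "B1", "C2"): 0.10,
    ("A1", "B2", "C1"): 0.20, ("A1", "B2", "C2"): 0.05,
    ("A1", "B3", "C1"): 0.30, ("A1", "B3", "C2"): 0.10,
    ("A2", "B1", "C1"): 0.02, ("A2", "B1", "C2"): 0.03,
    ("A2", "B2", "C1"): 0.01, ("A2", "B2", "C2"): 0.04,
    ("A2", "B3", "C1"): 0.05, ("A2", "B3", "C2"): 0.05,
}

def prob(*items):
    """Probability that a transaction contains all the given items."""
    return sum(p for cell, p in JOINT.items() if set(items) <= set(cell))

p_antecedent = prob("C1")        # global frequency of C1
p_consequent = prob("A2", "B3")  # global frequency of {A2, B3}

# Local frequency of the consequent given the antecedent (confidence).
confidence = prob("C1", "A2", "B3") / p_antecedent

lift = confidence / p_consequent                  # Lift(C1 -> A2, B3)
lift_not = (1 - confidence) / (1 - p_consequent)  # Lift(C1 -> NOT {A2, B3})
```

A lift below 1 suggests the consequent is seen less often once the antecedent is observed than it is overall; a lift above 1 suggests the opposite.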
ANSWERS