Sunday 18 December 2016

Decision_Tree_Programming_Collective_Intelligence

YouTube Video here :- https://www.youtube.com/watch?v=n1bYyz2WqsU

Inspiration for Code and Approach for Decision Tree - Programming Collective Intelligence

Author Blog - https://kiwitobes.com/

Programming Collective Intelligence Book - O'Reilly http://shop.oreilly.com/product/9780596529321.do#tab_04_2
Own observations and motivation :- Ensure, as much as possible, this is NOT a BLACK BOX activity
  • Not sure why a List is taken as the initial data structure
  • Need to do similar processing with the IBM HR Sample DataSet and other DF's / Numpy Arrays
  • Three SO Questions ref. below made me think - implement this with a Pandas DF or Numpy Arrays
In [53]:
# Source for Code - https://github.com/arthur-e/Programming-Collective-Intelligence/blob/master/chapter7/treepredict.py


my_data=[['slashdot','USA','yes',18,'None'],
        ['google','France','yes',23,'Premium'],
        ['digg','USA','yes',24,'Basic'],
        ['kiwitobes','France','yes',23,'Basic'],
        ['google','UK','no',21,'Premium'],
        ['(direct)','New Zealand','no',12,'None'],
        ['(direct)','UK','no',21,'Basic'],
        ['google','USA','no',24,'Premium'],
        ['slashdot','France','yes',19,'None'],
        ['digg','USA','no',18,'None'],
        ['google','UK','no',18,'None'],
        ['kiwitobes','UK','no',19,'None'],
        ['digg','New Zealand','yes',12,'Basic'],
        ['slashdot','UK','no',21,'None'],
        ['google','UK','yes',18,'Basic'],
        ['kiwitobes','France','yes',19,'Basic']]

print type(my_data)
<type 'list'>

__init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making them optional to the caller.

Refer URL --

In the decisionnode CLASS below, __init__ is the init METHOD; col, value and results are its ARGUMENTS; tb is the True Branch and fb the False Branch.
In [54]:
class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None): # __init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making them optional to the caller.
    self.col=col
    self.value=value
    self.results=results
    self.tb=tb    # tb= True Branch / Right Branch 
    self.fb=fb    # fb = False Branch / Left B 

# Divides a set on a specific column. Can handle numeric or nominal values
def divideset(rows,column,value):
    # Make a function that tells us if a row is in 
    # the first group (true) or the second group (false)
   split_function=None
   if isinstance(value,int) or isinstance(value,float):
      split_function=lambda row:row[column]>=value
   else:
      split_function=lambda row:row[column]==value
   
   # Divide the rows into two sets and return them
   set1=[row for row in rows if split_function(row)]
   set2=[row for row in rows if not split_function(row)]
   print set1               # Dhankar Code 
   print type(set1)         # Dhankar Code 
   print "_"*60 
   print set2               # Dhankar Code 
   print type(set2)         # Dhankar Code 

   return (set1,set2)
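
As a quick aside - a minimal sketch of the decisionnode type itself (a hypothetical leaf plus one split node; nothing beyond the class above is assumed):

leaf = decisionnode(results={'Basic': 4})       # a leaf node only carries result counts
node = decisionnode(col=3, value=20, tb=leaf,   # an internal node: Column 3 >= 20 ?
                    fb=decisionnode(results={'None': 3}))
print node.tb.results                           # {'Basic': 4}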

As seen below, this Function - divideset(my_data,2,'yes') - takes THREE INPUT PARAMETERS or ARGUMENTS.

  • 1st Parameter - ROWS , the DataSet's rows - basically the RAW DataSet, of Type LIST.
  • 2nd Parameter - COLUMN , the numeric index of the Column of the DataSet on which the DECISION to SPLIT is based.
  • 3rd Parameter - VALUE , the value to be compared with. For Numeric values (int or float) a >= comparison is done; for Non Numeric values in the Column a Boolean Equality check is done.

Also, as seen in the output of the code cell below, we obtain 2 Lists or "Child Sets". In the 1st List or Child Set all values for Column 2 are == 'yes' , and in the 2nd List they are all == 'no'.

The Target Variable or Feature desired to be Predicted - the last value in each row of the Raw DataSet List - is 'None' , 'Premium' or 'Basic'. In these initial child sets there is a fairly even mix of this Target Feature, and thus we need more splits.

In [55]:
# Using Function -def divideset(rows,column,value):

divideset(my_data,2,'yes')
[['slashdot', 'USA', 'yes', 18, 'None'], ['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]
<type 'list'>
____________________________________________________________
[['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None']]
<type 'list'>
Out[55]:
([['slashdot', 'USA', 'yes', 18, 'None'],
  ['google', 'France', 'yes', 23, 'Premium'],
  ['digg', 'USA', 'yes', 24, 'Basic'],
  ['kiwitobes', 'France', 'yes', 23, 'Basic'],
  ['slashdot', 'France', 'yes', 19, 'None'],
  ['digg', 'New Zealand', 'yes', 12, 'Basic'],
  ['google', 'UK', 'yes', 18, 'Basic'],
  ['kiwitobes', 'France', 'yes', 19, 'Basic']],
 [['google', 'UK', 'no', 21, 'Premium'],
  ['(direct)', 'New Zealand', 'no', 12, 'None'],
  ['(direct)', 'UK', 'no', 21, 'Basic'],
  ['google', 'USA', 'no', 24, 'Premium'],
  ['digg', 'USA', 'no', 18, 'None'],
  ['google', 'UK', 'no', 18, 'None'],
  ['kiwitobes', 'UK', 'no', 19, 'None'],
  ['slashdot', 'UK', 'no', 21, 'None']])

As seen below - divideset(my_data,3,15). We again obtain 2 Lists or "Child Sets".

1st Child Set - rows whose value in Column 3 is >= 15. Again, all 3 Target Variable or Feature values are present, so more splits are needed.

In [57]:
# 2nd run of Function - divideset(my_data,3,15)

divideset(my_data,3,15)
[['slashdot', 'USA', 'yes', 18, 'None'], ['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]
<type 'list'>
____________________________________________________________
[['(direct)', 'New Zealand', 'no', 12, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic']]
<type 'list'>
Out[57]:
([['slashdot', 'USA', 'yes', 18, 'None'],
  ['google', 'France', 'yes', 23, 'Premium'],
  ['digg', 'USA', 'yes', 24, 'Basic'],
  ['kiwitobes', 'France', 'yes', 23, 'Basic'],
  ['google', 'UK', 'no', 21, 'Premium'],
  ['(direct)', 'UK', 'no', 21, 'Basic'],
  ['google', 'USA', 'no', 24, 'Premium'],
  ['slashdot', 'France', 'yes', 19, 'None'],
  ['digg', 'USA', 'no', 18, 'None'],
  ['google', 'UK', 'no', 18, 'None'],
  ['kiwitobes', 'UK', 'no', 19, 'None'],
  ['slashdot', 'UK', 'no', 21, 'None'],
  ['google', 'UK', 'yes', 18, 'Basic'],
  ['kiwitobes', 'France', 'yes', 19, 'Basic']],
 [['(direct)', 'New Zealand', 'no', 12, 'None'],
  ['digg', 'New Zealand', 'yes', 12, 'Basic']])
In [58]:
# Create counts of possible results (the last column of each row is the result)
def uniquecounts(rows):
   results={}
   for row in rows:
      # The result is the last column
      rr=row[len(row)-1]
      print rr                                   # Dhankar - r Changed to rr 
      if rr not in results: results[rr]=0
      results[rr]+=1
   return results
In [59]:
# Get Unique Count of Classes within the Target Feature 
print uniquecounts(my_data)
None
Premium
Basic
Basic
Premium
None
Basic
Premium
None
None
None
None
Basic
None
Basic
Basic
{'None': 7, 'Premium': 3, 'Basic': 6}

Gini Impurity

It's the expected Error rate for a SET of Obs - the Error that would occur in randomly predicting a class for an Obs.
  • If say Total = 12 Obs , of which all 12 belong to 1 Class only , then Gini-Impurity = 0
  • If say Total = 12 Obs , of which 3 belong to each of 4 Classes , then Gini-Impurity = 0.75
  • If say Total = 12 Obs , of which 6 belong to each of 2 Classes , then Gini-Impurity = 0.50
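A quick numeric check of the three bullets above (a minimal sketch - the gini helper here is hypothetical, using the equivalent 1 - sum(p^2) form of the same measure):

def gini(counts):
    total=float(sum(counts))
    return 1-sum((c/total)**2 for c in counts)

print gini([12])         # 0.0  -- all 12 Obs in 1 Class only
print gini([3,3,3,3])    # 0.75 -- 4 Classes with 3 Obs each
print gini([6,6])        # 0.5  -- 2 Classes with 6 Obs each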
In [60]:
# Probability that a randomly placed item will be in the wrong category
# Wiki --- Used by the CART (classification and regression tree) algorithm, Gini impurity 
# is a ---- "measure of how often"------ a randomly chosen element from the set would be incorrectly labeled 
# if it was randomly labeled according to the distribution of labels in the subset.

def giniimpurity(rows):
  total=len(rows)
  counts=uniquecounts(rows)
  imp=0
  for k1 in counts:
    p1=float(counts[k1])/total
    for k2 in counts:
      if k1==k2: continue
      p2=float(counts[k2])/total
      imp+=p1*p2
  return imp
In [61]:
# Check for Gini-Impurity 

set1,set2=divideset(my_data,3,20)

print "_"*60
print "Gini Impurity for Set 1 :-",giniimpurity(set1)
print "Gini Impurity for Set 2 :-",giniimpurity(set2)
print "_"*60
[['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['slashdot', 'UK', 'no', 21, 'None']]
<type 'list'>
____________________________________________________________
[['slashdot', 'USA', 'yes', 18, 'None'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]
<type 'list'>
____________________________________________________________
Gini Impurity for Set 1 :- Premium
Basic
Basic
Premium
Basic
Premium
None
0.612244897959
Gini Impurity for Set 2 :- None
None
None
None
None
None
Basic
Basic
Basic
0.444444444444
____________________________________________________________

Entropy is the negative sum of p(x)log2(p(x)) across all the different possible results. In layman's terms - Entropy is Inversely Proportional to the Purity or Homogeneity of the SET.

If All Observations in the SET are of 1 CLASS only - say PREMIUM - then the SET is Totally PURE or HOMOGENEOUS . In such a case ENTROPY is Lowest (zero).
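A minimal numeric sketch of that claim (the h helper is hypothetical - plain math only): a totally pure set scores 0, while the 7 'None' / 6 'Basic' / 3 'Premium' mix of my_data scores about 1.505 - the Current Score printed in the sandbox cell further below.

from math import log

def h(ps):
    # entropy = negative sum of p * log2(p) over the class probabilities
    ent=0.0
    for p in ps:
        ent-=p*log(p,2)
    return ent

print h([1.0])                     # 0.0 -- a totally PURE set
print h([7/16.0,6/16.0,3/16.0])    # 1.50524... -- the my_data class mix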
In [62]:
#
def entropy(rows):
   from math import log
   log2=lambda x:log(x)/log(2)  
   results=uniquecounts(rows)
   # Now calculate the entropy
   ent=0.0
   for r in results.keys():
      p=float(results[r])/len(rows)
      ent=ent-p*log2(p)
   return ent
In [63]:
# Check for Entropy

set1,set2=divideset(my_data,3,20)

print "_"*60
print "Entropy for Set 1 :-",entropy(set1)
print "Entropy for Set 2 :-",entropy(set2)
print "_"*60
print " As seen with Values of Class Labels printed out - SET which is PURE has LOWER ENTROPY "
[['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['slashdot', 'UK', 'no', 21, 'None']]
<type 'list'>
____________________________________________________________
[['slashdot', 'USA', 'yes', 18, 'None'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]
<type 'list'>
____________________________________________________________
Entropy for Set 1 :- Premium
Basic
Basic
Premium
Basic
Premium
None
1.44881563573
Entropy for Set 2 :- None
None
None
None
None
None
Basic
Basic
Basic
0.918295834054
____________________________________________________________
 As seen with the Class Label values printed out - the SET which is more PURE has LOWER ENTROPY 
In [64]:
def buildtree(rows,scoref=entropy):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None
  
  column_count=len(rows[0])-1       ### Commented below in Markdown Cell 
  for col in range(0,column_count): ###
    # Generate the list of different values in this column
    column_values={}                ###   
    for row in rows:                ###   
       column_values[row[col]]=1
    # Now try dividing the rows up for each value in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)
      print  "Col Val Keys :---",column_values.keys()  ##### OWN CODE for Print ............
      
      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)
  # Create the sub branches   
  if best_gain>0:
    trueBranch=buildtree(best_sets[0]) # Right Branch 
    falseBranch=buildtree(best_sets[1]) # Left Branch 
    return decisionnode(col=best_criteria[0],value=best_criteria[1],
                        tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
To determine which attribute is the best to divide on, the information gain is calculated. Information gain is the difference between the current entropy and the weighted-average entropy of the two new groups.
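Plugging in the numbers already printed above for the Column 3 / value 20 split (a worked check - just the gain line rewritten with the known entropies):

# gain = parent entropy - weighted average of the two child entropies
gain = 1.50524081494 - (7/16.0)*1.44881563573 - (9/16.0)*0.918295834054
print gain    # ~0.35484 -- matching the sandbox output below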
In [65]:
# SandBox - Info_Gain
#
scoref=entropy
rows=my_data
p=float(len(set1))/len(rows)                        ### As-Is from above 
current_score=scoref(rows)                          ### As-Is from above 

# gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
# gain = is difference between current entropy and weighted-average entropy of the two sets. 

print "Length of set1:--------------" ,float(len(set1))
print "Length of rows or my_data:--- ",len(my_data)
print "Current Score :--------------", scoref(my_data) # 1.50524081494
print "Value of Gain :--------------" , current_score-p*scoref(set1)-(1-p)*scoref(set2)

best_sets=(set1,set2)
print "_"*90
print "Best Sets is a :--",type(best_sets)
print "_"*90
print (best_sets[0])
print "_"*90
print (best_sets[1])

#buildtree(best_sets[0])
None
Premium
Basic
Basic
Premium
None
Basic
Premium
None
None
None
None
Basic
None
Basic
Basic
Length of set1:-------------- 7.0
Length of rows or my_data:---  16
Current Score :-------------- None
Premium
Basic
Basic
Premium
None
Basic
Premium
None
None
None
None
Basic
None
Basic
Basic
1.50524081494
Value of Gain :-------------- Premium
Basic
Basic
Premium
Basic
Premium
None
None
None
None
None
None
None
Basic
Basic
Basic
0.354842567659
__________________________________________________________________________________________
Best Sets is a :-- <type 'tuple'>
__________________________________________________________________________________________
[['google', 'France', 'yes', 23, 'Premium'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['google', 'UK', 'no', 21, 'Premium'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['google', 'USA', 'no', 24, 'Premium'], ['slashdot', 'UK', 'no', 21, 'None']]
__________________________________________________________________________________________
[['slashdot', 'USA', 'yes', 18, 'None'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'USA', 'no', 18, 'None'], ['google', 'UK', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['google', 'UK', 'yes', 18, 'Basic'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]

column_count=len(rows[0])-1

Count the COLUMNS - the number of values in the first ROW [0] of my_data, minus 1 (the Target_var in the last position).

for col in range(0,column_count):

for loop - iterate over the column indices in range 0 to column_count (the Target column is excluded).

column_values={}

Create an Empty DICT to hold the distinct values seen in the current column.

for row in rows:

column_values[row[col]]=1

The nested For Loop iterates through ALL ROWS of my_data, storing each row's value in the current column as a DICT key - duplicate values collapse into a single key. Together, these loops provide the FUNCTION - divideset(rows,col,value) - the required THREE PARAMETERS
  • rows == the full list of rows passed into buildtree
  • col == As "for col in range(0,column_count):" iterates through all candidate columns
  • value == As "for value in column_values.keys():" iterates over all distinct values in the current column
We PRINT out the -- for value in column_values.keys(): in the Code Cell below and see the PRINT Output change from
  • Col Val Keys :--- ['(direct)', 'digg', 'google', 'slashdot', 'kiwitobes'] #### to
  • Col Val Keys :--- ['New Zealand', 'UK', 'USA', 'France'] #### after FIVE Iterations
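A minimal standalone sketch of that dict-based dedup (only my_data from above is assumed):

column_values={}
for row in my_data:
    column_values[row[0]]=1    # repeated values simply overwrite the same key
print column_values.keys()     # the five distinct referrers from column 0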
In [67]:
tree_1=buildtree(my_data) # Builds the tree used below - the debug prints inside buildtree make the OutPut very large, so it is not reproduced here
In [68]:
#
def printtree(tree,indent=''):
   # Is this a leaf node?
   if tree.results!=None:
      print str(tree.results)
   else:
      # Print the criteria
      print str(tree.col)+':'+str(tree.value)+'? '

      # Print the branches
      print indent+'T->',
      printtree(tree.tb,indent+'  ')
      print indent+'F->',
      printtree(tree.fb,indent+'  ')
In [69]:
# Function - printtree()

printtree(tree_1)
0:google? 
T-> 3:21? 
  T-> {'Premium': 3}
  F-> 2:yes? 
    T-> {'Basic': 1}
    F-> {'None': 1}
F-> 0:slashdot? 
  T-> {'None': 3}
  F-> 2:yes? 
    T-> {'Basic': 4}
    F-> 3:21? 
      T-> {'Basic': 1}
      F-> {'None': 3}
In [72]:
def getwidth(tree):
  if tree.tb==None and tree.fb==None: return 1 # 1 is Default WIDTH of TREE if NO BRANCHES 
  return getwidth(tree.tb)+getwidth(tree.fb)

def getdepth(tree):
  if tree.tb==None and tree.fb==None: return 0
  return max(getdepth(tree.tb),getdepth(tree.fb))+1


from PIL import Image,ImageDraw

def drawtree(tree,jpeg='tree.jpg'):
  w=getwidth(tree)*100
  h=getdepth(tree)*100+120

  img=Image.new('RGB',(w,h),(255,255,255))
  draw=ImageDraw.Draw(img)

  drawnode(draw,tree,w/2,20)
  img.save(jpeg,'JPEG')
  
def drawnode(draw,tree,x,y):
  if tree.results==None:
    # Get the width of each branch
    w1=getwidth(tree.fb)*100
    w2=getwidth(tree.tb)*100

    # Determine the total space required by this node
    left=x-(w1+w2)/2
    right=x+(w1+w2)/2

    # Draw the condition string
    draw.text((x-20,y-10),str(tree.col)+':'+str(tree.value),(0,0,0))

    # Draw links to the branches
    draw.line((x,y,left+w1/2,y+100),fill=(255,0,0))
    draw.line((x,y,right-w2/2,y+100),fill=(255,0,0))
    
    # Draw the branch nodes
    drawnode(draw,tree.fb,left+w1/2,y+100)
    drawnode(draw,tree.tb,right-w2/2,y+100)
  else:
    txt=' \n'.join(['%s:%d'%v for v in tree.results.items()])
    draw.text((x-20,y),txt,(0,0,0))
In [73]:
drawtree(tree_1,jpeg='treeview2.jpg')
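A quick check of the sizing inputs drawtree uses (assumes the tree_1 built above; the expected values follow from the tree printed earlier - 7 leaves, and 4 decision nodes on the longest path):

print getwidth(tree_1)    # 7 -- one width unit per leaf of the tree printed above
print getdepth(tree_1)    # 4 -- decision nodes on the longest root-to-leaf path
# => the saved image is 7*100 = 700 px wide and 4*100+120 = 520 px high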
In [74]:
def classify(observation,tree): # OBS = 1 Row of Data all Features besides Target. # Tree = tree created above. 
  if tree.results!=None:
    return tree.results
  else:
    v=observation[tree.col]
    branch=None
    if isinstance(v,int) or isinstance(v,float):
      if v>=tree.value: branch=tree.tb
      else: branch=tree.fb
    else:
      if v==tree.value: branch=tree.tb
      else: branch=tree.fb
    return classify(observation,branch)
In [79]:
print classify(['slashdot','USA','yes',18],tree_1)
print classify(['google','USA','yes',6],tree_1) # Change Ref Site to GOOGLE , age 6 (below 21) - we get Basic.
{'None': 3}
{'Basic': 1}
In [ ]:
def prune(tree,mingain):
  # If the branches aren't leaves, then prune them
  if tree.tb.results==None:
    prune(tree.tb,mingain)
  if tree.fb.results==None:
    prune(tree.fb,mingain)
    
  # If both the subbranches are now leaves, see if they
  # should be merged
  if tree.tb.results!=None and tree.fb.results!=None:
    # Build a combined dataset
    tb,fb=[],[]
    for v,c in tree.tb.results.items():
      tb+=[[v]]*c
    for v,c in tree.fb.results.items():
      fb+=[[v]]*c
    
    # Test the reduction in entropy
    delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2

    if delta<mingain:
      # Merge the branches
      tree.tb,tree.fb=None,None
      tree.results=uniquecounts(tb+fb)
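A usage sketch, assuming the tree_1 built above (note the entropy calls will also echo class labels, since uniquecounts prints them):

prune(tree_1,1.0)    # with mingain=1.0 , low-gain splits get merged back into leaves
printtree(tree_1)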
In [14]:
def mdclassify(observation,tree):
  if tree.results!=None:
    return tree.results
  else:
    v=observation[tree.col]
    if v==None:
      tr,fr=mdclassify(observation,tree.tb),mdclassify(observation,tree.fb)
      tcount=sum(tr.values())
      fcount=sum(fr.values())
      tw=float(tcount)/(tcount+fcount)
      fw=float(fcount)/(tcount+fcount)
      result={}
      for k,v in tr.items(): result[k]=v*tw
      for k,v in fr.items(): result[k]=v*fw      
      return result
    else:
      if isinstance(v,int) or isinstance(v,float):
        if v>=tree.value: branch=tree.tb
        else: branch=tree.fb
      else:
        if v==tree.value: branch=tree.tb
        else: branch=tree.fb
      return mdclassify(observation,branch)

# Variance of the NUMERIC values in the last column - usable as the scoref
# in buildtree when the Target column is numeric (regression-style trees)
def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance
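A usage sketch for missing data, assuming the tree_1 built above - None marks a missing feature; both branches are followed at any split on a missing feature, and the counts returned are weighted by branch size:

print mdclassify(['google',None,'yes',None],tree_1)
# returns fractional, branch-weighted counts rather than a single leaf's counts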
In [43]:
###SANDBOX IMPORTANT --- Refer URL -- https://docs.python.org/2/tutorial/classes.html


class MyClass:
    """A simple example class - No _init_"""
    # Everything defined below is an ATTRIBUTE in the Classes NAMESPACE
    # They can all be called upon as CLASS_DOT_ATTRIBUTE
    
    i = 12345
    g = "Cheers Mate"
    
    
    def f(self):
        return 'hello world'
    
print MyClass.i
print MyClass.g    

print MyClass.f
12345
Cheers Mate
<unbound method MyClass.f>
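Instantiation follows the same pattern (a minimal follow-on sketch) - calling the class creates an instance, and f becomes a bound method:

x = MyClass()    # no __init__ , so no constructor arguments needed
print x.f()      # hello world -- f is now bound to the instance x
print x.i        # 12345 -- class attributes are reachable via the instance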