Datamining SJTU

Posted by chinaljr on April 11, 2018

Data-intro by wnzhang


  • Database
    • rows -> data objects
    • columns -> attributes

Fundamental Data Mining Algorithms


Frequent patterns and association rule mining


  • Apriori
  • FP-Growth algorithm
  • Frequent pattern: an itemset that occurs frequently in the data
  • Association rule: a rule X -> Y relating two itemsets
  • absolute support / support count: the number of transactions containing itemset X
  • relative support: the fraction of transactions containing X
  • minsup: the minimum support threshold
  • probability && conditional probability == support && confidence
  • closed patterns: X is closed if no proper superset of X has the same support
  • max-patterns: X is maximal if no proper superset of X is frequent
  • Downward Closure Property: every subset of a frequent itemset must itself be frequent
  • scalable mining methods




  • generate all frequent 1-itemsets
  • derive the candidate (k+1)-itemsets from the frequent k-itemsets: self-join
  • test and discard invalid candidates: hash tree
  • terminate when no further frequent itemsets can be generated
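The four steps above can be sketched as a toy Apriori implementation (a minimal sketch: the hash-tree optimization is omitted and supports are counted by scanning all transactions):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Toy Apriori: return all itemsets with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # step 1: generate all frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # step 2: self-join frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # step 3: prune candidates with an infrequent k-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # count supports and keep the candidates meeting minsup
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= minsup}
        frequent |= Lk  # step 4: loop ends when no new frequent itemsets appear
    return frequent
```

For example, on the transactions {a,b}, {a,c}, {a,b,c}, {b,c} with minsup = 2, every pair is frequent but {a,b,c} (support 1) is not.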

generate candidates (2):

  • self-joining: join pairs of k-itemsets in Lk that share k-1 items, giving (k+1)-element candidates for Lk+1
  • pruning

count supports of candidates (3)

  • candidate itemsets stored in a hash-tree (I don't fully get this yet)
    • leaf node is a list of itemsets and counts
    • interior node contains a hash table
    • subset function: finds all candidates contained in a given transaction
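The hash tree only speeds up the lookup; the counting it accelerates can be sketched without the tree (a hypothetical helper, assuming the k-itemset candidates are stored as frozensets):

```python
from itertools import combinations

def count_supports(candidates, transactions):
    """For each transaction, enumerate its k-subsets and look each one up
    among the candidates -- the lookup a hash tree makes fast."""
    k = len(next(iter(candidates)))
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for s in combinations(sorted(t), k):
            fs = frozenset(s)
            if fs in counts:
                counts[fs] += 1
    return counts
```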

FP-Growth algorithm

Neighborhood methods


  • method
    • for each input instance x, find the k closest instances
    • prediction based on their average
    • for classification, choose the most probable class (majority vote)
  • generalized version
    • define a similarity function
    • similarity-weighted prediction
  • Non-Parametric KNN
    • each instance is a parameter
    • effectively N/k parameters: roughly N/k neighborhoods, each contributing one parameter
  • Hyperparameter k
    • obviously k cannot be chosen by minimizing the training sum of squared errors (k = 1 would always fit best)
    • choose k using a held-out test set
  • efficiency concerns
    • worst case O(N) per query
    • build an inverted index
    • parallelized computing
    • pre-calculation
    • triangle inequality pruning
    • learning hash codes ????
    • approximation methods
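The basic method above, as a minimal sketch (brute-force O(N) search with Euclidean distance; function and variable names are illustrative):

```python
import math
from collections import Counter

def knn_predict(train, query, k, classify=True):
    """train: list of (x, y) pairs with x a tuple of floats.
    Find the k closest instances, then vote (classification)
    or average (regression)."""
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    labels = [y for _, y in neighbors]
    if classify:
        return Counter(labels).most_common(1)[0][0]  # most probable class
    return sum(labels) / k  # average over the k neighbors
```

The generalized version would replace the sort key with any similarity function and weight each neighbor's label by its similarity to the query.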

Supervised Learning

Introduction to Machine Learning

Machine Learning Categories

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Learning objective

  • Make the prediction close to the corresponding label, i.e. minimize the total loss over all examples
  • Loss function measures the error between the label and prediction
  • the definition of the loss function depends on the data and task
  • the most common loss function is the squared loss, L(y, f(x)) = 1/2 (y - f(x))^2
    • the further the prediction is from the label, the larger the loss
    • small errors are tolerated
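The squared loss, written out as a trivial sketch:

```python
def squared_loss(y, y_hat):
    # L(y, yhat) = 1/2 * (y - yhat)^2: the penalty grows quadratically,
    # so distant predictions cost much more while small errors are tolerated
    return 0.5 * (y - y_hat) ** 2
```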

Model Selection

underfitting vs. overfitting

Regularization (Penalty term)

Among competing hypotheses, the one with the fewest assumptions should be selected (Occam's razor)


Hyperparameters define higher-level concepts about the model, such as its complexity or capacity to learn.

They cannot be learned directly from the data in the standard model training process and need to be predefined.

They can be decided by setting different values, training different models, and choosing the values that perform best on held-out test data.

1. Typical regularization



Sparse model learning uses q not higher than 1; q > 2 is seldom used.
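The Lq penalty family can be sketched as follows (assuming the penalty is added to the training loss; λ and q are hyperparameters):

```python
import numpy as np

def lq_penalty(w, lam, q):
    """Lq regularization term: lam * sum_i |w_i|^q.
    q = 2 gives ridge (weight decay); q = 1 gives lasso, whose corners
    drive some weights exactly to zero (sparse models); q > 2 is seldom used."""
    return lam * np.sum(np.abs(w) ** q)
```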

E. Generalization Ability


Generalization Error: R(f) = ∫ L(y, f(x)) p(x, y) dx dy, i.e. the expected loss over the true joint distribution p(x, y)

Linear Models

  • discriminative model 判别模型
  • Deterministic
  • Probabilistic

linear regression


Batch Gradient Descent


Stochastic Gradient Descent



Stochastic GD: faster per update, but with uncertainty or fluctuation in the parameter updates

Mini-Batch Gradient Descent

a combination of batch GD and stochastic GD: update on small random subsets of the data
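The three variants can be covered by one minimal sketch for linear regression with squared loss (batch = n recovers batch GD, batch = 1 recovers stochastic GD; lr and epochs are illustrative choices):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch=2, epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression.
    Each step uses the gradient of the mean squared loss on one batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)  # shuffle each epoch
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # gradient of (1/2) * mean (x_i^T w - y_i)^2 over the batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w
```

The learning rate matters here: too large and the updates diverge, too small and convergence is slow.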

Choosing Learning Rate

Algebra Perspective

Matrix Form


XᵀX might be singular

solution: introduce regularization
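With regularization the normal equations become w = (XᵀX + λI)⁻¹ Xᵀy; a minimal sketch:

```python
import numpy as np

def ridge_closed_form(X, y, lam=1e-3):
    """Solve (X^T X + lam*I) w = X^T y. Adding lam*I shifts every
    eigenvalue up by lam, so the system is solvable even when X^T X
    is singular (e.g. linearly dependent columns)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```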

Linear regression with Gaussian noise model

Linear Classification

p function

loss function

Logistic Regression
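Logistic regression models p(y = 1 | x) = sigmoid(wᵀx) and is trained by gradient descent on the log loss; a minimal sketch (lr and epochs are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.5, epochs=500):
    """Gradient descent on the log loss; the gradient has the same form
    as in linear regression, with sigmoid(Xw) as the prediction."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w
```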

Evaluation Measures

  • Accuracy: the ratio of cases where prediction = label
  • Precision: among the cases predicted as class 1, the fraction that are truly class 1
  • Recall: among all true class-1 cases, the fraction predicted as class 1
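The three measures for binary labels in {0, 1}, as a minimal sketch (assumes at least one predicted positive and one true positive so the ratios are defined):

```python
def evaluate(labels, preds):
    """Accuracy, precision, recall with class 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(y == 1 and p == 1 for y, p in pairs)  # predicted 1, truly 1
    fp = sum(y == 0 and p == 1 for y, p in pairs)  # predicted 1, truly 0
    fn = sum(y == 1 and p == 0 for y, p in pairs)  # predicted 0, truly 1
    accuracy = sum(y == p for y, p in pairs) / len(pairs)
    precision = tp / (tp + fp)  # fraction of predicted 1s that are correct
    recall = tp / (tp + fn)     # fraction of true 1s that are found
    return accuracy, precision, recall
```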

Multi-Class Classification