TF-IDF
TF: Term Frequency
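TF(t, d) = (occurrences of t in d) / (number of terms in d)
- Numerator: the number of times term t occurs in document d
- Denominator: the total number of terms in document d
This value is the raw term count normalized by document length.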
Example
d: a b c d a a b
Number of terms in d = 7
TF("a", d) = 3 / 7
TF("b", d) = 2 / 7
TF("c", d) = TF("d", d) = 1 / 7
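The TF example above can be reproduced with a short Python sketch. It assumes whitespace tokenization, which the notes do not specify, and the function name `term_frequency` is chosen here only for illustration. `Fraction` is used so the results print in the same form as the example.

```python
from collections import Counter
from fractions import Fraction


def term_frequency(term, document):
    """Occurrences of `term` in `document` divided by the number of tokens."""
    tokens = document.split()  # assumption: terms are separated by whitespace
    if not tokens:
        raise ValueError("document has no terms")
    counts = Counter(tokens)  # a missing term counts as 0
    return Fraction(counts[term], len(tokens))


d = "a b c d a a b"
for term in ["a", "b", "c", "d"]:
    print(term, term_frequency(term, d))
# a 3/7
# b 2/7
# c 1/7
# d 1/7
```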
IDF: Inverse Document Frequency
IDF(t) = log(D / df_t)
D: Total number of documents
df_t: Number of documents containing term t
Example
d_1 = a
d_2 = a b
d_3 = a b c
d_4 = a b c
D = 4
IDF("a") = log(4 / 4)
IDF("b") = log(4 / 3)
IDF("c") = log(4 / 2)
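The IDF example above can be checked with a similar sketch. The notes do not state the base of the logarithm, so the natural log is assumed here. Whitespace tokenization and the function name `inverse_document_frequency` are also assumptions made for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """log(D / df_t), where D is the number of documents and
    df_t is the number of documents that contain `term`."""
    token_sets = [set(doc.split()) for doc in documents]  # assumption: whitespace tokens
    total_docs = len(token_sets)  # D
    doc_freq = sum(1 for tokens in token_sets if term in tokens)  # df_t
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not appear in any document")
    return math.log(total_docs / doc_freq)  # assumption: natural logarithm


docs = ["a", "a b", "a b c", "a b c"]
for term in ["a", "b", "c"]:
    print(term, inverse_document_frequency(term, docs))
```

A term that appears in every document, such as "a" here, gets log(4 / 4) = 0, while rarer terms get larger values.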