TfIdfDataSketchesUtil (pap4j-boot3

cn.net.pap.common.datasketches.util.TfIdfDataSketchesUtil

public class TfIdfDataSketchesUtil extends Object

Approximate TF-IDF implemented with Apache DataSketches and a Count-Min Sketch.
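
For readers unfamiliar with the data structure, the following is a minimal, hand-rolled Count-Min Sketch. It is only a sketch of the general technique, not the implementation used by this class: the name `MiniCountMin`, its hash function, and its methods are hypothetical, and the `width`, `depth`, and `seed` parameters are assumed to play roles similar to `cmsWidth`, `cmsDepth`, and `seed` in the constructor below.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;

/** Illustrative Count-Min Sketch (hypothetical class, not part of pap4j-boot3). */
final class MiniCountMin {
    private final int width;       // counters per row
    private final int depth;       // number of rows, one hash function per row
    private final long[][] table;  // depth x width counter matrix
    private final int[] rowSeeds;  // per-row hash seeds derived from the seed

    MiniCountMin(int width, int depth, int seed) {
        if (width <= 0 || depth <= 0) {
            throw new IllegalArgumentException("width and depth must be positive");
        }
        this.width = width;
        this.depth = depth;
        this.table = new long[depth][width];
        this.rowSeeds = new int[depth];
        Random rng = new Random(seed);
        for (int r = 0; r < depth; r++) {
            rowSeeds[r] = rng.nextInt();
        }
    }

    /** Maps an item to a column of the given row with a simple seeded hash. */
    private int bucket(String item, int row) {
        int h = rowSeeds[row];
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        h ^= h >>> 16;   // final mixing so the seed also influences the low bits
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    /** Adds a non-negative count for the item to one counter in every row. */
    void add(String item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        for (int r = 0; r < depth; r++) {
            table[r][bucket(item, r)] += count;
        }
    }

    /** Returns the minimum counter across rows; never below the true count. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][bucket(item, r)]);
        }
        return min;
    }

    /** Adds another sketch cell by cell; both must share dimensions and seeds. */
    void merge(MiniCountMin other) {
        if (width != other.width || depth != other.depth
                || !Arrays.equals(rowSeeds, other.rowSeeds)) {
            throw new IllegalArgumentException("sketches are not compatible");
        }
        for (int r = 0; r < depth; r++) {
            for (int c = 0; c < width; c++) {
                table[r][c] += other.table[r][c];
            }
        }
    }
}
```

Because hash collisions can only add to a counter, the estimate may overshoot but never undershoot the true count. A wider table reduces collisions, and more rows make it less likely that every row collides for the same item.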

Constructor Details
- TfIdfDataSketchesUtil
  
  public TfIdfDataSketchesUtil(int mapSize, int cmsWidth, int cmsDepth, int seed)
Method Details
- processDocuments
  
  public void processDocuments(String filePath) throws IOException
  
  Processes a stream of documents and builds the TF and DF statistics.
  
  Throws:
  
  IOException
- processDocument
  
  public void processDocument(String documentText)
  
  Processes a single document (one line).
- calculateTfIdf
  
  public double calculateTfIdf(String word, int documentIndex)
  
  Computes the TF-IDF value of a single word.
- getDocumentTfIdfScores
  
  public Map<String,Double> getDocumentTfIdfScores(int documentIndex, int topCandidateWords)
  
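  Returns the TF-IDF scores of the words in a document. Note: a CMS cannot directly enumerate the words it has counted, so the frequent items of dfSketch are the suggested source of candidate words.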
- getTfIdfVector
  
  public double[] getTfIdfVector(String word)
  
  Returns the TF-IDF vector of a word across all documents.
- getFrequentWords
  
  public List<String> getFrequentWords(int topN)
  
  Returns the list of high-frequency words.
- merge
  
  public void merge(TfIdfDataSketchesUtil other)
  
  Merges multiple sketches by combining the sketches of other into this instance.
- getTotalDocuments
  
  public long getTotalDocuments()
- getVocabularySize
  
  public int getVocabularySize()
- getMaximumError
  
  public double getMaximumError()
- printStatistics
  
  public void printStatistics()
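
The note under getDocumentTfIdfScores says a CMS cannot list the words it has counted, so scoring has to start from a candidate list. The sketch below illustrates that idea under stated assumptions: the class `TfIdfCandidateSketch` and its methods are hypothetical, exact `Map` lookups stand in for the sketch estimates, and the formula tf × ln(N / df) is one common TF-IDF variant that may differ from what calculateTfIdf actually computes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative candidate-based TF-IDF scoring (hypothetical, not the class's code). */
final class TfIdfCandidateSketch {

    /**
     * Assumed formula: (termCount / docLength) * ln(totalDocs / docFreq).
     * Returns 0 when any input is non-positive, e.g. for an unseen word.
     */
    static double tfIdf(long termCount, long docLength, long docFreq, long totalDocs) {
        if (termCount <= 0 || docLength <= 0 || docFreq <= 0 || totalDocs <= 0) {
            return 0.0;
        }
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docFreq);
        return tf * idf;
    }

    /**
     * Scores only the supplied candidate words for one document.
     * docCounts and docFreqs are exact maps here; with sketches they would be
     * replaced by per-word frequency estimates.
     */
    static Map<String, Double> scoreCandidates(List<String> candidates,
                                               Map<String, Long> docCounts,
                                               Map<String, Long> docFreqs,
                                               long totalDocs) {
        long docLength = 0;
        for (long c : docCounts.values()) {
            docLength += c;
        }
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String word : candidates) {
            long count = docCounts.getOrDefault(word, 0L);
            if (count > 0) {  // skip candidates that do not occur in this document
                long df = docFreqs.getOrDefault(word, 0L);
                scores.put(word, tfIdf(count, docLength, df, totalDocs));
            }
        }
        return scores;
    }
}
```

Under this formula a word that appears in every document gets an IDF of ln(1) = 0, so frequent but uninformative candidates drop to a score of zero while rarer words rank higher.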
