Sentiment Analysis

1. requirements

constraints

  • latency: how long a single request takes to complete
  • throughput: how many requests can be handled in a given amount of time
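
To make the two constraints concrete, here is a minimal sketch of how they could be measured. The `predict` function is a hypothetical stand-in for the real model call, and the request list is made up for illustration.

```python
import time
from statistics import quantiles


def predict(text: str) -> str:
    """Hypothetical stand-in for a real sentiment model call."""
    return "positive" if "good" in text.lower() else "negative"


def measure(requests: list[str]) -> dict[str, float]:
    """Time each request one by one and report latency percentiles and throughput."""
    latencies = []
    start = time.perf_counter()
    for text in requests:
        t0 = time.perf_counter()
        predict(text)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = quantiles(latencies, n=100)
    return {
        "p50_latency_s": cuts[49],
        "p99_latency_s": cuts[98],
        "throughput_rps": len(requests) / elapsed,
    }


stats = measure(["good movie", "bad plot"] * 500)
print(stats)
```

Latency is usually reported as a tail percentile (p95 or p99) rather than an average, because a few slow requests can hide behind a good mean.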

2. ML task & pipeline

3. data collection

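  • Collect data
    • GDPR (privacy), data masking (de-identification), data encryption
  • Analyze the data. Consider the label distribution
  • Consider whether the features are text only, or whether there are also numeric and nominal features. How should missing data be handled?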
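
The data-collection steps above can be sketched as follows. This assumes a simple regex-based masking pass and a made-up `records` list; the patterns are illustrative only and do not amount to GDPR compliance on their own.

```python
import re
from collections import Counter

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


def label_distribution(labels) -> dict[str, float]:
    """Return the share of each label, to spot class imbalance early."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical raw records; the last one has missing text.
records = [
    {"text": "Great phone, email me at a.b@example.com", "label": "positive"},
    {"text": "Terrible battery", "label": "negative"},
    {"text": "Love it", "label": "positive"},
    {"text": None, "label": "negative"},
]

# Drop rows with missing text (one possible policy), then mask what remains.
clean = [{**r, "text": mask_pii(r["text"])} for r in records if r["text"]]
dist = label_distribution(r["label"] for r in clean)
print(clean[0]["text"])
print(dist)
```

Dropping rows is only one way to handle missing text; whether it is acceptable depends on how many rows are affected and whether the missingness is correlated with the label.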

4. feature

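  • How to generate embeddings for text features, and the pros and cons of each option (word embeddings, fastText, BERT)
  • For numeric features: how to handle missing data and how to normalize
  • In practice, each ML team has its own embedding set, and teams reuse each other's embedding sets. How to pre-train, fine-tune, and combine features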
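
As one possible answer to the numeric-feature and feature-combination questions above, the sketch below imputes missing values with the median, standardizes with a z-score, and concatenates the result with a text embedding. The embedding values and the `review_lengths` feature are made up for illustration.

```python
from statistics import mean, median, pstdev


def impute_and_standardize(values: list) -> list[float]:
    """Fill None with the median of observed values, then apply z-score scaling."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    mu = mean(filled)
    sigma = pstdev(filled) or 1.0  # avoid division by zero for constant columns
    return [(v - mu) / sigma for v in filled]


def combine(text_embedding: list[float], numeric_features: list[float]) -> list[float]:
    """Combine features by simple concatenation into one input vector."""
    return list(text_embedding) + list(numeric_features)


# Hypothetical numeric feature with one missing value.
review_lengths = [10, None, 30, 20]
scaled = impute_and_standardize(review_lengths)

# Hypothetical 3-dimensional text embedding for the first review.
embedding = [0.12, -0.40, 0.33]
features = combine(embedding, [scaled[0]])
print(features)
```

In a real pipeline, the median, mean, and standard deviation would be computed on the training split only and reused at serving time, so that training and inference see the same transformation.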

5. model

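  • Model selection: traditional models or neural networks
  • Consider system constraints such as prediction latency and memory. How to reasonably trade away some model quality in exchange for gains on those constraints
  • Model distillation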
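
For the distillation bullet, a minimal sketch of a commonly used loss is shown below: a temperature-softened KL term between teacher and student, mixed with the usual cross-entropy on the true label. The logits, `temperature`, and `alpha` values are assumptions chosen for illustration, not values from these notes.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature: float = 2.0, alpha: float = 0.5) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student is from the teacher's distribution.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-target term: ordinary cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Classes: index 0 = positive, index 1 = negative (hypothetical logits).
teacher = [2.0, 0.5]
print(distillation_loss([2.0, 0.5], teacher, true_label=0))  # student agrees
print(distillation_loss([0.5, 2.0], teacher, true_label=0))  # student disagrees
```

The smaller student trades some accuracy for lower latency and memory, which is the same trade-off the constraints bullet above asks about.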

6. evaluation

  • Split the data into train, validation, and test sets
  • Evaluation metrics
  • How to run an A/B test for a feature
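
The offline and online sides of evaluation can be sketched as follows: precision, recall, and F1 for the offline test set, and a two-proportion z-test as one simple way to read an A/B test. The labels and the conversion counts are hypothetical.

```python
import math


def precision_recall_f1(y_true, y_pred, positive="positive"):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in success rate between groups A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "positive", "negative"]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)

# Hypothetical A/B result: control (A) vs. model with the new feature (B).
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(z, p_value)
```

With imbalanced sentiment labels, accuracy alone can be misleading, which is why per-class precision and recall are usually reported alongside it.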

7. deploy & serving

  • GPU or CPU
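  • Single machine with multiple processes, or Spark + Broadcast, KFServing
  • Dynamic batching
  • Dynamic model input (length of the input data)
  • Quantization (cast)
  • Distillation or a smaller model
  • ONNX
    • Compatible with different hardware and inference engines
    • Further optimization: operator fusion, memory optimization, and hardware acceleration
  • Cache responses to reduce the number of requests that reach the model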
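
Two of the serving ideas above, dynamic input length and response caching, can be sketched in plain Python. This is a simplified illustration: `PAD_ID` and `model_predict` are hypothetical, and a real dynamic batcher would also collect requests over a short time window and map results back to the original request order.

```python
from functools import lru_cache

PAD_ID = 0  # hypothetical padding token id
CALLS = {"model": 0}  # counts how often the (hypothetical) model is invoked


def pad_batch(token_id_batch: list[list[int]]) -> list[list[int]]:
    """Pad only to the longest sequence in this batch, not to a global maximum."""
    max_len = max(len(ids) for ids in token_id_batch)
    return [ids + [PAD_ID] * (max_len - len(ids)) for ids in token_id_batch]


def make_batches(requests: list[list[int]], max_batch_size: int):
    """Group similar-length inputs together so little compute is spent on padding."""
    ordered = sorted(requests, key=len)
    return [pad_batch(ordered[i:i + max_batch_size])
            for i in range(0, len(ordered), max_batch_size)]


def model_predict(text: str) -> str:
    """Hypothetical stand-in for an expensive model call."""
    CALLS["model"] += 1
    return "positive" if "good" in text else "negative"


@lru_cache(maxsize=10_000)
def cached_predict(normalized_text: str) -> str:
    """Identical normalized inputs are answered from the cache."""
    return model_predict(normalized_text)


def predict(text: str) -> str:
    # Normalize case and whitespace so trivially different inputs share a cache key.
    return cached_predict(" ".join(text.lower().split()))


batches = make_batches([[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]], max_batch_size=2)
print(batches)
predict("Good  movie")
predict("good movie")
print(CALLS["model"])
```

Caching only pays off when the same texts recur often, for example popular product reviews or templated messages.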

8. monitoring & maintenance

  • hardware usage
  • serving usage: QPS
  • model performance
  • business objectives
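
For the serving-usage metric, here is a minimal sketch of a sliding-window QPS counter. The class name and window size are assumptions; timestamps are passed in explicitly so the behavior is easy to test, while a live service would pass `time.monotonic()`.

```python
from collections import deque


class QpsMonitor:
    """Track queries per second over a sliding time window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop requests that fell out of the window (now - window, now].
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

    def record(self, now: float) -> None:
        """Register one request that arrived at time `now` (in seconds)."""
        self.timestamps.append(now)
        self._evict(now)

    def qps(self, now: float) -> float:
        """Average requests per second over the most recent window."""
        self._evict(now)
        return len(self.timestamps) / self.window


monitor = QpsMonitor(window_seconds=10.0)
for t in range(20):          # one request per second for 20 seconds
    monitor.record(float(t))
print(monitor.qps(now=19.0))  # 1.0
```

The same pattern extends to model-performance signals, such as the rolling share of positive predictions, which can be compared against the share seen at training time.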

9. optimization and Q&A

  • What to do when the train/test data distribution differs from the production distribution
  • What to do when the data distribution changes over time
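
Before deciding how to respond to either kind of shift, it helps to detect it. One common heuristic is the Population Stability Index (PSI), sketched here over label or prediction shares. The sample data is made up, and the 0.1 and 0.25 thresholds in the comments are a widely quoted rule of thumb rather than a guarantee.

```python
import math
from collections import Counter


def distribution(labels: list[str], categories: list[str],
                 eps: float = 1e-6) -> list[float]:
    """Share of each category, floored at eps so the log below stays finite."""
    counts = Counter(labels)
    total = len(labels)
    return [max(counts[c] / total, eps) for c in categories]


def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over categories."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


categories = ["positive", "negative"]
train = ["positive"] * 50 + ["negative"] * 50        # hypothetical training labels
production = ["positive"] * 80 + ["negative"] * 20   # hypothetical live predictions

score = psi(distribution(train, categories), distribution(production, categories))
# Rule of thumb: < 0.1 stable, 0.1 to 0.25 moderate shift, > 0.25 large shift.
print(score)
```

When the score stays high, typical responses include retraining on recent data, reweighting training examples toward the production distribution, or scheduling periodic retraining.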
