Unsupervised Document Classification integrating Web Scraping, One-Class SVM and LDA Topic Modelling
Posted: 2020-11-27
  • Speaker: Christoph WEISSER
  • Time: Thursday, 17 December 2020, 16:00 (Beijing time)
  • Venue: DingTalk group, ID: 35996110


ABSTRACT

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on how imbalanced the data set is, this approach either requires human labelling of all the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification is achieved by integrating out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.

*Joint work with Anton Thielmann, Astrid Krenz, and Benjamin Säfken from the University of Göttingen
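
As a rough illustration of the kind of pipeline the abstract describes, the following Python sketch fits a one-class SVM on text that stands in for web-scraped, out-of-domain training data, applies it to an unlabelled target corpus, and then fits an LDA topic model on that corpus. It is not the speaker's implementation: the use of scikit-learn, the function name classify_documents, the toy documents, the TF-IDF features, the linear kernel and all parameter values are assumptions made for illustration, and the abstract does not specify how the SVM and LDA outputs are combined into the final classification rule.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM


def classify_documents(scraped_docs, target_docs, nu=0.1, n_topics=2, seed=0):
    """Hypothetical helper: one-class SVM labels plus LDA topic weights."""
    # Step 1: fit a one-class SVM using only the scraped (out-of-domain)
    # documents, which are assumed to describe the class of interest.
    tfidf = TfidfVectorizer(stop_words="english")
    svm = OneClassSVM(kernel="linear", nu=nu)
    svm.fit(tfidf.fit_transform(scraped_docs))

    # Step 2: label the target documents; +1 means "resembles the scraped
    # class", -1 means "outlier". No manual labels are used.
    svm_labels = svm.predict(tfidf.transform(target_docs))

    # Step 3: fit LDA on word counts of the target corpus to obtain a
    # topic distribution per document (each row sums to 1).
    counts = CountVectorizer(stop_words="english")
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(counts.fit_transform(target_docs))

    return svm_labels, doc_topics


# Toy data, invented for this example only.
scraped_docs = [
    "robots automate welding and assembly in the factory",
    "industrial robots increase factory automation",
    "the factory installed new assembly robots",
    "automation with robots changes manufacturing plants",
]
target_docs = [
    "our plant uses robots for automated assembly",
    "the bakery sells fresh bread and cakes every morning",
    "manufacturing automation relies on industrial robots",
    "fresh cakes and bread are baked daily",
]

svm_labels, doc_topics = classify_documents(scraped_docs, target_docs)
for doc, label, topics in zip(target_docs, svm_labels, doc_topics):
    print(label, topics.round(2), doc)
```

In a sketch like this, the topic weights could be used to inspect or refine the documents the SVM flags as belonging to the target class; whether and how the proposed method does so is not stated in the abstract.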


BIOGRAPHY

Christoph WEISSER holds degrees from the University of Oxford and the University of St. Andrews, United Kingdom, and Georg-August-University Göttingen, Germany. He has been a scholar of the German National Academic Foundation, which is the most selective and prestigious scholarship programme in Germany. After graduating from Oxford University, Christoph started his career in London and worked internationally as an investment banker and quantitative portfolio manager. In 2019 he returned to Göttingen University to pursue a PhD in Economics and Applied Statistics with a focus on Machine Learning. He recently published a software package and edited a book on Deep Learning Algorithms published by the Göttingen University Press. He is also a Senior Data Scientist at an IT consulting firm that advises the United Nations on a Natural Language Processing project. Christoph has taught graduate courses on Spatial Statistics and Deep Learning Algorithms at the University of Göttingen, courses on Biostatistics, Python and R at the Max-Delbrück-Center for Molecular Medicine Berlin, and Machine Learning at a summer school on Artificial Intelligence at the University of Cambridge.


Contact

Andre Python (apython@zju.edu.cn)