Abstract
This chapter presents an overview of text mining approaches that can be used in science and technology studies that rely on assessing the similarity between patent documents and/or scientific publications. We outline the rationale behind vector space models, latent semantic analysis, and probabilistic topic models. In addition, we present several validation studies pertaining to patent documents and publications. These studies reveal that choices of algorithm, pre-processing, and calculation options have non-trivial consequences for outcomes and their validity. Scholars should therefore pay attention to these technical details when engaging in text mining, so that outcomes are relevant and informative.