\documentclass[a4paper]{article} \usepackage[margin=2cm]{geometry} \usepackage[utf8]{inputenc} \usepackage[round]{natbib} \usepackage{url} \newcommand{\acronym}[1]{\textsc{#1}} \newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\code}[1]{\mbox{\texttt{#1}}} \newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\proglang}[1]{\textsf{#1}} %% \VignetteIndexEntry{Introduction to the tm Package} %% \VignetteDepends{SnowballC} \begin{document} <>= library("tm") data("crude") @ \title{Introduction to the \pkg{tm} Package\\Text Mining in \proglang{R}} \author{Ingo Feinerer} \maketitle \section*{Introduction} This vignette gives a short introduction to text mining in \proglang{R} utilizing the text mining framework provided by the \pkg{tm} package. We present methods for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining in \proglang{R}---an in-depth description of the text mining infrastructure offered by \pkg{tm} was published in the \emph{Journal of Statistical Software}~\citep{Feinerer_etal_2008}. An introductory article on text mining in \proglang{R} was published in \emph{R News}~\citep{Rnews:Feinerer:2008}. \section*{Data Import} The main structure for managing documents in \pkg{tm} is a so-called \class{Corpus}, representing a collection of text documents. A corpus is an abstract concept, and there can exist several implementations in parallel. The default implementation is the so-called \class{VCorpus} (short for \emph{Volatile Corpus}) which realizes a semantics as known from most \proglang{R} objects: corpora are \proglang{R} objects held fully in memory. We denote this as volatile since once the \proglang{R} object is destroyed, the whole corpus is gone. Such a volatile corpus can be created via the constructor \code{VCorpus(x, readerControl)}. Another implementation is the \class{PCorpus} which implements a \emph{Permanent Corpus} semantics, i.e., the documents are physically stored outside of \proglang{R} (e.g., in a database), corresponding \proglang{R} objects are basically only pointers to external structures, and changes to the underlying corpus are reflected to all \proglang{R} objects associated with it. Compared to the volatile corpus the corpus encapsulated by a permanent corpus object is not destroyed if the corresponding \proglang{R} object is released. Within the corpus constructor, \code{x} must be a \class{Source} object which abstracts the input location. \pkg{tm} provides a set of predefined sources, e.g., \class{DirSource}, \class{VectorSource}, or \class{DataframeSource}, which handle a directory, a vector interpreting each component as document, or data frame like structures (like \acronym{CSV} files), respectively. Except \class{DirSource}, which is designed solely for directories on a file system, and \class{VectorSource}, which only accepts (character) vectors, most other implemented sources can take connections as input (a character string is interpreted as file path). \code{getSources()} lists available sources, and users can create their own sources. The second argument \code{readerControl} of the corpus constructor has to be a list with the named components \code{reader} and \code{language}. The first component \code{reader} constructs a text document from elements delivered by a source. The \pkg{tm} package ships with several readers (e.g., \code{readPlain()}, \code{readPDF()}, \code{readDOC()}, \ldots). See \code{getReaders()} for an up-to-date list of available readers. Each source has a default reader which can be overridden. E.g., for \code{DirSource} the default just reads in the input files and interprets their content as text. Finally, the second component \code{language} sets the texts' language (preferably using \acronym{ISO} 639-2 codes). In case of a permanent corpus, a third argument \code{dbControl} has to be a list with the named components \code{dbName} giving the filename holding the sourced out objects (i.e., the database), and \code{dbType} holding a valid database type as supported by package \pkg{filehash}. Activated database support reduces the memory demand, however, access gets slower since each operation is limited by the hard disk's read and write capabilities. So e.g., plain text files in the directory \code{txt} containing Latin (\code{lat}) texts by the Roman poet \emph{Ovid} can be read in with following code: <>= txt <- system.file("texts", "txt", package = "tm") (ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl = list(language = "lat"))) @ For simple examples \code{VectorSource} is quite useful, as it can create a corpus from character vectors, e.g.: <>= docs <- c("This is a text.", "This another one.") VCorpus(VectorSource(docs)) @ Finally we create a corpus for some Reuters documents as example for later use: <>= reut21578 <- system.file("texts", "crude", package = "tm") reuters <- VCorpus(DirSource(reut21578, mode = "binary"), readerControl = list(reader = readReut21578XMLasPlain)) @ \section*{Data Export} For the case you have created a corpus via manipulating other objects in \proglang{R}, thus do not have the texts already stored on a hard disk, and want to save the text documents to disk, you can simply use \code{writeCorpus()} <>= writeCorpus(ovid) @ which writes a character representation of the documents in a corpus to multiple files on disk. \section*{Inspecting Corpora} Custom \code{print()} methods are available which hide the raw amount of information (consider a corpus could consist of several thousand documents, like a database). \code{print()} gives a concise overview whereas more details are displayed with \code{inspect()}. <<>>= inspect(ovid[1:2]) @ Individual documents can be accessed via \code{[[}, either via the position in the corpus, or via their identifier. <>= meta(ovid[[2]], "id") identical(ovid[[2]], ovid[["ovid_2.txt"]]) @ A character representation of a document is available via \code{as.character()} which is also used when inspecting a document: <>= inspect(ovid[[2]]) lapply(ovid[1:2], as.character) @ \section*{Transformations} Once we have a corpus we typically want to modify the documents in it, e.g., stemming, stopword removal, et cetera. In \pkg{tm}, all this functionality is subsumed into the concept of a \emph{transformation}. Transformations are done via the \code{tm\_map()} function which applies (maps) a function to all elements of the corpus. Basically, all transformations work on single text documents and \code{tm\_map()} just applies them to all documents in a corpus. \subsection*{Eliminating Extra Whitespace} Extra whitespace is eliminated by: <<>>= reuters <- tm_map(reuters, stripWhitespace) @ \subsection*{Convert to Lower Case} Conversion to lower case by: <<>>= reuters <- tm_map(reuters, content_transformer(tolower)) @ We can use arbitrary character processing functions as transformations as long as the function returns a text document. In this case we use \code{content\_transformer()} which provides a convenience wrapper to access and set the content of a document. Consequently most text manipulation functions from base \proglang{R} can directly be used with this wrapper. This works for \code{tolower()} as used here but also e.g.\ for \code{gsub()} which comes quite handy for a broad range of text manipulation tasks. \subsection*{Remove Stopwords} Removal of stopwords by: <>= reuters <- tm_map(reuters, removeWords, stopwords("english")) @ \subsection*{Stemming} Stemming is done by: <>= tm_map(reuters, stemDocument) @ \section*{Filters} Often it is of special interest to filter out documents satisfying given properties. For this purpose the function \code{tm\_filter} is designed. It is possible to write custom filter functions which get applied to each document in the corpus. Alternatively, we can create indices based on selections and subset the corpus with them. E.g., the following statement filters out those documents having an \code{ID} equal to \code{"237"} and the string \code{"INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE"} as their heading. <<>>= idx <- meta(reuters, "id") == '237' & meta(reuters, "heading") == 'INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE' reuters[idx] @ \section*{Metadata Management} Metadata is used to annotate text documents or whole corpora with additional information. The easiest way to accomplish this with \pkg{tm} is to use the \code{meta()} function. A text document has a few predefined attributes like \code{author} but can be extended with an arbitrary number of additional user-defined metadata tags. These additional metadata tags are individually attached to a single text document. From a corpus perspective these metadata attachments are locally stored together with each individual text document. Alternatively to \code{meta()} the function \code{DublinCore()} provides a full mapping between Simple Dublin Core metadata and \pkg{tm} metadata structures and can be similarly used to get and set metadata information for text documents, e.g.: <>= DublinCore(crude[[1]], "Creator") <- "Ano Nymous" meta(crude[[1]]) @ For corpora the story is a bit more sophisticated. Corpora in \pkg{tm} have two types of metadata: one is the metadata on the corpus level (\code{corpus}), the other is the metadata related to the individual documents (\code{indexed}) in form of a data frame. The latter is often done for performance reasons (hence the named \code{indexed} for indexing) or because the metadata has an own entity but still relates directly to individual text documents, e.g., a classification result; the classifications directly relate to the documents but the set of classification levels forms an own entity. Both cases can be handled with \code{meta()}: <<>>= meta(crude, tag = "test", type = "corpus") <- "test meta" meta(crude, type = "corpus") meta(crude, "foo") <- letters[1:20] meta(crude) @ \section*{Standard Operators and Functions} Many standard operators and functions (\code{[}, \code{[<-}, \code{[[}, \code{[[<-}, \code{c()}, \code{lapply()}) are available for corpora with semantics similar to standard \proglang{R} routines. E.g., \code{c()} concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged). \section*{Creating Term-Document Matrices} A common approach in text mining is to create a term-document matrix from a corpus. In the \pkg{tm} package the classes \class{TermDocumentMatrix} and \class{DocumentTermMatrix} (depending on whether you want terms as rows and documents as columns, or vice versa) employ sparse matrices for corpora. Inspecting a term-document matrix displays a sample, whereas \code{as.matrix()} yields the full matrix in dense format (which can be very memory consuming for large matrices). <<>>= dtm <- DocumentTermMatrix(reuters) inspect(dtm) @ \section*{Operations on Term-Document Matrices} Besides the fact that on this matrix a huge amount of \proglang{R} functions (like clustering, classifications, etc.) can be applied, this package brings some shortcuts. Imagine we want to find those terms that occur at least five times, then we can use the \code{findFreqTerms()} function: <<>>= findFreqTerms(dtm, 5) @ Or we want to find associations (i.e., terms which correlate) with at least $0.8$ correlation for the term \code{opec}, then we use \code{findAssocs()}: <<>>= findAssocs(dtm, "opec", 0.8) @ Term-document matrices tend to get very big already for normal sized data sets. Therefore we provide a method to remove \emph{sparse} terms, i.e., terms occurring only in very few documents. Normally, this reduces the matrix dramatically without losing significant relations inherent to the matrix: <<>>= inspect(removeSparseTerms(dtm, 0.4)) @ This function call removes those terms which have at least a 40 percentage of sparse (i.e., terms occurring 0 times in a document) elements. \section*{Dictionary} A dictionary is a (multi-)set of strings. It is often used to denote relevant terms in text mining. We represent a dictionary with a character vector which may be passed to the \code{DocumentTermMatrix()} constructor as a control argument. Then the created matrix is tabulated against the dictionary, i.e., only terms from the dictionary appear in the matrix. This allows to restrict the dimension of the matrix a priori and to focus on specific terms for distinct text mining contexts, e.g., <<>>= inspect(DocumentTermMatrix(reuters, list(dictionary = c("prices", "crude", "oil")))) @ \section*{Performance} Often you do not need all the generality, modularity and full range of features offered by \pkg{tm} as this sometimes comes at the price of performance. \class{SimpleCorpus} provides a corpus which is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in \proglang{R}, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. The aim is to boost performance and minimize memory pressure. It loads all documents into memory, and is designed for medium-sized to large data sets. However, it operates only under the following contraints: \begin{itemize} \item only \code{DirSource} and \code{VectorSource} are supported, \item no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one), \item transformations applied via \code{tm\_map} must be able to process strings and return strings, \item no lazy transformations in \code{tm\_map}, \item no meta data for individual documents (i.e., no \code{"local"} in \code{meta()}). \end{itemize} \bibliographystyle{abbrvnat} \bibliography{references} \end{document}