Read CH 1 – 4 of Text Mining with R (https://www.tidytextmining.com/index.html) as an overview of how R is used for sentiment analysis and learn the basic idea of tokenization. Write a brief report (2–3 pages) on how to do sentiment analysis in R.
Text Mining
Sentiment analysis is the process of analyzing text in a tidy format to understand the emotional intent of its words: whether a section of text is positive or negative, or whether it carries some more nuanced emotion such as disgust or surprise. Sentiment analysis can be conducted through several approaches.
One approach to sentiment analysis uses the tidytext package, which provides access to three general-purpose lexicons based on single words (unigrams). The lexicons assign English words scores for positive or negative sentiment, and some also tag words with emotions such as anger, sadness, and joy. The NRC lexicon categorizes words in a binary fashion ("yes"/"no") into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust (Silge & Robinson, 2019). The AFINN lexicon assigns each word an integer score between -5 and 5, where a negative score indicates negative sentiment and a positive score indicates positive sentiment. The bing lexicon simply categorizes words in a binary fashion as positive or negative.
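As a brief sketch, the three lexicons can be pulled into tidy tibbles with tidytext's get_sentiments() function (the AFINN and NRC lexicons are distributed through the textdata package, which prompts for a one-time download on first use):

```r
library(tidytext)
library(textdata)  # supplies the AFINN and NRC lexicons on first use

# Each call returns a tibble with one word per row
get_sentiments("bing")   # word + positive/negative label
get_sentiments("afinn")  # word + integer score from -5 to 5
get_sentiments("nrc")    # word + one of ten sentiment/emotion categories
```

Because each lexicon comes back as an ordinary data frame, it can be joined directly to tokenized text, which is the basis of the inner-join method described next.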
Another way to perform sentiment analysis is with an inner join. The text is first converted to the tidy text format using unnest_tokens(), which yields one word per row, and stop words are removed; the resulting data frame can then be joined to a sentiment lexicon with inner_join() to perform the analysis (Silge & Robinson, 2019). In a data frame containing both positive and negative words, count() can be applied to word and sentiment together to see how much each word contributed to each sentiment, and the top positive and negative contributors can then be visualized with ggplot2. Sentiment analysis need not stop at single words: text can also be tokenized into sentences, so that the sentiment of each sentence as a whole is examined.
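The inner-join workflow can be sketched as follows, assuming the janeaustenr package (used throughout Silge & Robinson) is installed as the example text source:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(ggplot2)

# Convert the novels to tidy format: one word per row, stop words removed
tidy_books <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Inner join with the bing lexicon, then count word/sentiment pairs
word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Visualize the top contributors to positive and negative sentiment
word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  ggplot(aes(n, reorder(word, n), fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)
```

Only words that appear in both the text and the lexicon survive the inner join, which is why removing stop words first keeps the counts focused on meaningful terms.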
The relationship of individual words to sentiment within a text can also be explored with different tokenizers. Basic tokenizers are the most commonly used; they tokenize text into words, sentences, lines, characters, or paragraphs. Basic tokenizers can be combined to create at most two levels of tokenization, whereby a text is first split into sentences or paragraphs and then into word tokens (Silge & Robinson, 2019). Another way to obtain word tokens is chunk_text(), which breaks a text into smaller segments that each contain the same number of words; this allows a long document such as a novel to be treated as a set of smaller documents, making sentiment analysis easier. The n-gram tokenizer functions can also be used: they tokenize input into n-grams of a specified length, with each element of the input character vector tokenized separately. A single-character tokenizer, by contrast, returns individual characters rather than words, and it lets the user choose whether non-alphanumeric characters such as punctuation are discarded or retained.
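The tokenizer variants above can be illustrated with the tokenizers package; the sample sentence here is made up for demonstration:

```r
library(tokenizers)

text <- "Sentiment analysis is useful. Tokenizers split text in many ways!"

tokenize_sentences(text)   # basic tokenizer: split into sentences
tokenize_words(text)       # basic tokenizer: split into word tokens
tokenize_ngrams(text, n = 2)  # bigrams: overlapping two-word units

# Single-character tokenizer; strip_non_alphanum controls whether
# punctuation and spaces are discarded or retained
tokenize_characters(text, strip_non_alphanum = TRUE)

# Break the text into chunks of (at most) five words each, so a long
# document can be treated as a set of smaller documents
chunk_text(text, chunk_size = 5)
```

Each function returns a list with one element per input document, which keeps the output compatible with the one-token-per-row tidy workflow.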
References
Silge, J., & Robinson, D. (2019, September 2). Text Mining with R: A Tidy Approach. Retrieved from https://www.tidytextmining.com/