Hooked on Feelings

Read the Submission Directions

Please submit a PDF rendered from an R Markdown file for this task1 both to eCampus and Slack.

You can use a chapter book of your choice or get one using the process below.

Getting Text

In addition to the packages listed in the example, you can use the gutenbergr package, which lets you download and process public domain works from the Project Gutenberg collection. Follow the mini walkthrough below to see how to get a text of your choice2.

library(tidyverse)
library(gutenbergr)
library(tidytext)
library(flextable)
## 
## Attaching package: 'flextable'
## The following object is masked from 'package:purrr':
## 
##     compose
## The following objects are masked from 'package:kableExtra':
## 
##     as_image, footnote
gutenberg_metadata
## # A tibble: 51,997 x 8
##    gutenberg_id title  author  gutenberg_autho… language gutenberg_books… rights
##           <int> <chr>  <chr>              <int> <chr>    <chr>            <chr> 
##  1            0  <NA>  <NA>                  NA en       <NA>             Publi…
##  2            1 "The … Jeffer…             1638 en       United States L… Publi…
##  3            2 "The … United…                1 en       American Revolu… Publi…
##  4            3 "John… Kenned…             1666 en       <NA>             Publi…
##  5            4 "Linc… Lincol…                3 en       US Civil War     Publi…
##  6            5 "The … United…                1 en       American Revolu… Publi…
##  7            6 "Give… Henry,…                4 en       American Revolu… Publi…
##  8            7 "The … <NA>                  NA en       <NA>             Publi…
##  9            8 "Abra… Lincol…                3 en       US Civil War     Publi…
## 10            9 "Abra… Lincol…                3 en       US Civil War     Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>
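Rather than scanning the full catalog, you can filter it before downloading. gutenberg_works() restricts the catalog to works that actually have text available and accepts dplyr-style filter expressions. A quick sketch (the author string below is an assumption; check the catalog for the exact "Last, First" spelling):

```r
library(tidyverse)
library(gutenbergr)

# English works whose title mentions "Worlds"; gutenberg_works()
# already keeps only rows where text is available for download.
gutenberg_works(language == "en",
                str_detect(title, "Worlds"))

# Or filter by author, using the catalog's "Last, First" form
gutenberg_works(author == "Wells, H. G. (Herbert George)")
```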

Let’s say we want The War of the Worlds. We can get it by running the following:

war <- gutenberg_works() %>%
            filter(title == "The War of the Worlds")

war
## # A tibble: 1 x 8
##   gutenberg_id title  author  gutenberg_author… language gutenberg_books… rights
##          <int> <chr>  <chr>               <int> <chr>    <chr>            <chr> 
## 1           36 The W… Wells,…                30 en       Movie Books/Sci… Publi…
## # … with 1 more variable: has_text <lgl>

Well that’s a bit difficult to read. Instead of using the DT package, let’s give flextable a go.

war_flex <- flextable(war) %>%
                  autofit() %>%
                  theme_booktabs()

war_flex

That helps! If you’re interested, more customization options can be found on the package site.
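As a small illustration of those options (the styling choices here are just assumptions, not package defaults), a few common flextable tweaks look like this:

```r
library(flextable)

war_flex_styled <- flextable(war) %>%
  autofit() %>%
  theme_booktabs() %>%
  bold(part = "header") %>%             # bold the column labels
  fontsize(size = 9, part = "all") %>%  # shrink the text a little
  set_caption("Catalog entry for The War of the Worlds")

war_flex_styled
```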

Anyway, what we actually need is the number, or index, given by the ‘gutenberg_id’ column, because so far we haven’t done anything beyond grabbing the catalog. In our case that index number is 36. We can use that identifier along with the gutenberg_download() function to get the entire text.

war_get <- gutenberg_works() %>%
                filter(title == "The War of the Worlds")

war_get
## # A tibble: 1 x 8
##   gutenberg_id title  author  gutenberg_author… language gutenberg_books… rights
##          <int> <chr>  <chr>               <int> <chr>    <chr>            <chr> 
## 1           36 The W… Wells,…                30 en       Movie Books/Sci… Publi…
## # … with 1 more variable: has_text <lgl>
war_text <- gutenberg_download(36)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
war_flex_text <- war_text %>%
                      head(n = 15) %>%
                      flextable() %>%
                      autofit() %>%
                      theme_booktabs()

war_flex_text

But recall that what we need is at least the title as a metadata field; we’ll derive chapter information from the text itself next.

war_text_tc <- gutenberg_download(36,
                                    meta_fields = "title")

war_flex_text_tc <- war_text_tc %>%
                          head(n = 15) %>%
                          flextable() %>%
                          autofit() %>%
                          theme_booktabs()

war_flex_text_tc

Now divide the text into documents, each representing one chapter. Note that this assumes the text column marks chapter headings with the term chapter. So you can either amend the mutate() below or look at the data set to make sure it includes a term that indicates the different chapters.

war_chapters <- war_text_tc %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

war_chapters %>%
  head(n = 15) %>%
  flextable() %>%
  autofit() %>%
  theme_booktabs()
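If your book labels chapters differently, say with bare Roman numerals or an all-caps heading, you can swap in a different pattern. A hedged sketch (the patterns and the my_text_tc object are illustrative; inspect your own text first):

```r
library(tidyverse)

# Chapters marked like "I.", "II.", "XIV." at the start of a line
roman_pattern <- regex("^[IVXLC]+\\.", ignore_case = FALSE)

# Chapters marked like "CHAPTER ONE" or "Chapter 1"
word_pattern <- regex("^chapter\\s+\\w+", ignore_case = TRUE)

# my_text_tc stands in for your downloaded text with a title column
my_chapters <- my_text_tc %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, roman_pattern))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
```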

Then split the text into words

war_chapters_word <- war_chapters %>%
                      unnest_tokens(word, text)

and compute document-word counts

war_word_counts <- war_chapters_word %>%
                    anti_join(stop_words) %>%
                    count(document, word, sort = TRUE) %>%
                    ungroup()
## Joining, by = "word"
war_word_counts
## # A tibble: 15,367 x 3
##    document                 word         n
##    <chr>                    <chr>    <int>
##  1 The War of the Worlds_16 brother     50
##  2 The War of the Worlds_25 ulla        28
##  3 The War of the Worlds_14 brother     26
##  4 The War of the Worlds_16 road        25
##  5 The War of the Worlds_14 people      24
##  6 The War of the Worlds_16 people      24
##  7 The War of the Worlds_12 people      20
##  8 The War of the Worlds_16 lane        20
##  9 The War of the Worlds_12 water       19
## 10 The War of the Worlds_19 martians    19
## # … with 15,357 more rows

or better yet

war_word_counts %>%
  head(n = 15) %>%
  flextable() %>%
  autofit() %>%
  theme_booktabs()

Now that is a form you should be able to work with!

Tasks

Perform the following

  1. a chapterwise or bookwise sentiment analysis of your text using AFINN, Bing, and NRC, with a visual of each.

  2. BONUS3: Follow the approach given in Text Mining with R to construct a topic model of the work with a visual of the top 10 topics by prevalence.
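To get started on the sentiment task, here is a minimal sketch using the Bing lexicon, which ships with tidytext (AFINN and NRC are fetched via the textdata package the first time you call get_sentiments()). Object names follow the walkthrough above, and the plot choices are just one reasonable option:

```r
library(tidyverse)
library(tidytext)

# Net Bing sentiment per chapter-document
war_bing <- war_chapters_word %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(document, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

ggplot(war_bing, aes(x = document, y = net)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Net Bing sentiment (positive - negative)")
```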



  1. If you use an external file for your data, please submit that as well.↩︎

  2. As long as it exists in the public domain.↩︎

  3. Will replace the 2016 general election map activity.↩︎