Extracting Emotions from Personal Diary Entries

Tue, Nov 10, 2020

In this post, I focus on extracting individual emotions from my personal diary entries using the NRC lexicon. This is the second part of a series that describes my experiments with sentiment analysis in R.

Extracting Emotions from Diary Entries

The NRC Emotion Lexicon is a dictionary that links English words to eight basic human emotions (anger, anticipation, fear, trust, surprise, sadness, joy and disgust) in addition to two sentiments (positive and negative). Just like the AFINN and Bing dictionaries, NRC can be loaded through the tidytext package. For the purposes of this analysis, I extract only the emotions:

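library(tidyverse); library(tidytext); library(lubridate)
# Keep only the eight emotions; drop the positive/negative sentiment labels
nrcemo <- diary %>% inner_join(get_sentiments("nrc"), by = "word") %>%
    filter(!sentiment %in% c("positive", "negative"))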

First, let’s take a look at the distribution of the eight emotions across the months:

# Count emotion words per month and draw one panel per month.
# The plot gets its own name so the nrcemo data frame is not overwritten.
nrcplot <- nrcemo %>% mutate(month = month(dmy, label = TRUE)) %>%
    group_by(month, sentiment) %>%
    summarise(word_count = n(), .groups = "drop") %>%
    mutate(sentiment = reorder(sentiment, word_count)) %>%
    ggplot(aes(sentiment, word_count, fill = -word_count)) +
    geom_col() +
    guides(fill = "none") +  # hide the legend (fill = FALSE is deprecated in recent ggplot2)
    theme_minimal() +
    labs(x = NULL, y = "Word-based Instances") +
    ggtitle("Emotions Detected in Daily Diary Entries") +
    coord_flip() +
    facet_wrap(~month, ncol = 3)
print(nrcplot)

Figure: Monthly distributions of emotions extracted from the personal diary

Vague, “generic” emotions like anticipation and trust dominate the entries. The other six emotions are more informative. For instance, October turned out to be less joyful and sadder.

Validating Emotions against Lifelog Metrics

Ideally, these emotions should be picked up by my lifelog indicators. To check that, I extract the emotions within each daily entry, normalize and detrend the instances, and merge them with the daily log:

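# Split the raw diary into lines; lines that start with a month abbreviation are date headers
# (unnest_tokens lowercases text by default, hence the lowercase prefixes)
diary <- indiary %>% unnest_tokens(sentence, text, token = "lines") %>%
    mutate(date = ifelse(substr(sentence, 1, 4) %in%
                             c("jan ", "feb ", "mar ", "apr ", "may ", "jun ", "jul ", "aug ", "sep ",
                               "oct ", "nov ", "dec "), sentence, NA)) %>%
    fill(date) %>% filter(sentence != date) %>%  # carry each date down, then drop the header lines
    unnest_tokens(word, sentence, token = "words") %>%
    anti_join(stop_words, by = "word") %>%
    mutate(date = paste("2020", date)) %>% mutate(dmy = ymd(date)) %>%
    select(dmy, word)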

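# Count emotion words per day, then normalize, center, and detrend each series
# (normalize() and detrend() are helper functions that are not defined in this post)
nrc <- diary %>% inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(dmy) %>% count(sentiment) %>% spread(sentiment, n) %>%
    replace(is.na(.), 0) %>% ungroup() %>%
    mutate_at(vars(anger:trust), normalize) %>%
    mutate_at(vars(anger:trust), ~scale(., center = TRUE, scale = FALSE)) %>%
    mutate_at(vars(anger:trust), detrend) %>% drop_na()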
    
nrclog<-inner_join(nrc,lifelog,by="dmy") %>%  
    select(anger, anticipation,disgust,fear,joy,sadness,surprise,trust,healthdt,energydt,stressdt,happinessdt,flowdt) %>% 
    rename(anticip=anticipation,health=healthdt,energy=energydt,stress=stressdt,happ=happinessdt,flow=flowdt) %>% 
    unnest(anger:flow)
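
The normalize and detrend steps rely on helpers that this post does not show. As a rough illustration only, here is a minimal base R sketch of what such helpers could look like, assuming min-max scaling and removal of a straight-line trend. The names normalize_minmax and detrend_linear are hypothetical, and the functions actually used for the analysis may behave differently.

```r
# Hypothetical helper: rescale a numeric vector to the [0, 1] range (min-max scaling).
# Assumption: a constant series is mapped to all zeros to avoid division by zero.
normalize_minmax <- function(x) {
    rng <- range(x, na.rm = TRUE)
    if (rng[1] == rng[2]) return(rep(0, length(x)))
    (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical helper: remove a linear trend by keeping the residuals of a
# least-squares fit against the observation index.
# Assumption: x has no missing values and is ordered by date.
detrend_linear <- function(x) {
    t <- seq_along(x)
    as.numeric(residuals(lm(x ~ t)))
}

# Made-up daily counts of one emotion, for illustration only
daily_counts <- c(3, 5, 4, 8, 7, 10)
scaled <- normalize_minmax(daily_counts)
detrended <- detrend_linear(scaled)
```

Detrending this way leaves a series that sums to roughly zero, so what remains are the day-to-day deviations from the overall drift, which is what gets compared against the lifelog metrics.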

Now we can look at the correlations between eight emotions and five major daily metrics:

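# p.mat supplies the p-values that sig.level and insig need in order to blank insignificant cells
corrplot.mixed(cor(nrclog), p.mat = cor.mtest(nrclog)$p, sig.level = 0.05, insig = "blank", upper = "ellipse")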

Figure: Correlations between emotions and daily life metrics

Only the daily Happiness score picks up any emotions, and just a few of them, all negative: sadness, anger, disgust and maybe fear. Surprisingly, Joy is not correlated with any life metric, and the Stress scores do not reflect any of the emotions. I had to dig deeper to find out why.

Well, it turns out that the NRC dictionary is extremely limited. For example, it does not assign any emotions to the word “stress”. Furthermore, because the join matches exact word forms, NRC only picks up root words (e.g., “stress”), but not necessarily their derivatives (stressful, stressed, etc.).
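
One way to see how much of the diary a lexicon misses is to list the words that find no match in it. The sketch below is only an illustration: the diary_words and toy_lexicon tables are made-up stand-ins, and the emotion labels in the toy lexicon are invented rather than taken from NRC.

```r
library(dplyr)

# Made-up stand-ins for the tokenized diary and for a lexicon (illustration only)
diary_words <- tibble(word = c("stressful", "stressed", "stress",
                               "happy", "deadline", "stressed"))
toy_lexicon <- tibble(word = c("happy", "deadline"),
                      sentiment = c("joy", "fear"))

# Diary words without any lexicon entry, most frequent first
unmatched <- diary_words %>%
    anti_join(toy_lexicon, by = "word") %>%
    count(word, sort = TRUE)

# Share of diary tokens that the lexicon covers
coverage <- mean(diary_words$word %in% toy_lexicon$word)

print(unmatched)
print(coverage)
```

With the real data, toy_lexicon would be replaced by get_sentiments("nrc") and diary_words by the diary table. The most frequent unmatched words are natural candidates when deciding how to extend the dictionary.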

Just to make sure the diary entries actually pick up the emotional context, I compare the days whose entries contained any “stress” words with those that did not:

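# Label each diary word by whether it contains "stress" (this also matches stressful, stressed, etc.)
# Note: rows are words, so each day's metrics are weighted by that day's word count
stresslog <- inner_join(diary, lifelog, by = "dmy") %>%
    mutate(diary_entries = ifelse(str_detect(word, "stress"), "with stress words", "no stress words")) %>%
    select(diary_entries, stressdt, happinessdt) %>%
    group_by(diary_entries) %>%
    summarize(stress = mean(stressdt), happiness = mean(happinessdt), .groups = "drop")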

strlong <- gather(stresslog,"lifemetrics","average",-diary_entries) 

ggplot(strlong, aes(diary_entries, average, fill=lifemetrics)) + 
    geom_bar(position="dodge", stat="identity")+
    theme_minimal()+
    ggtitle("Presence of Stress-Words in Diary Entries vs Lifemetrics")

I am relieved to see clearly pronounced differences:

Figure: Daily life metrics for diary entries with and without stress words
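
The comparison above labels individual words, so longer entries carry more weight in the averages. A possible day-level variant, sketched here with made-up stand-ins for the diary and lifelog tables (all values are invented for illustration), first flags each day and only then averages the metrics:

```r
library(dplyr)
library(stringr)

# Made-up stand-ins for the post's diary and lifelog tables (illustration only)
diary <- tibble(
    dmy  = as.Date(c("2020-10-01", "2020-10-01", "2020-10-01",
                     "2020-10-02", "2020-10-02",
                     "2020-10-03", "2020-10-03")),
    word = c("stressful", "deadline", "meeting", "walk", "park", "stressed", "tired"))
lifelog <- tibble(
    dmy         = as.Date(c("2020-10-01", "2020-10-02", "2020-10-03")),
    stressdt    = c(0.8, -0.4, 0.6),
    happinessdt = c(-0.5, 0.6, -0.1))

# One row per day: does any word of that day's entry contain "stress"?
daily_flags <- diary %>%
    group_by(dmy) %>%
    summarise(has_stress = any(str_detect(word, "stress")), .groups = "drop")

# Average the metrics over days, so every day counts once
stress_by_day <- inner_join(daily_flags, lifelog, by = "dmy") %>%
    mutate(diary_entries = if_else(has_stress, "with stress words", "no stress words")) %>%
    group_by(diary_entries) %>%
    summarise(stress = mean(stressdt), happiness = mean(happinessdt), .groups = "drop")

print(stress_by_day)
```

Whether the word-level or the day-level view is more appropriate depends on the question; the day-level version simply matches the wording "days with entries that contained any stress words" more literally.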

For now, it looks like I would need to expand the NRC dictionary, or perhaps even build a custom lexicon with custom emotions, to get more useful and reliable insights from my diary.