Sentiment Analysis of Personal Diary

Tue, Sep 22, 2020

In this post, I describe a sentiment analysis of my diary entries using the tidytext and sentimentr packages in R.

I am always looking for new ways to validate my lifelog indicators against external measures (e.g., see Health vs symptoms, and Energy vs Oura ring metrics). In this analysis, I validate average daily lifelog metrics against the sentiment scores of my diary entries.

Diary

I keep my diary in Notes on my iPhone. Each entry is a brief summary of events on that day. For obvious reasons, I won’t share the entire text file, but here is a sample entry from February 5, 2020:

“Feb 5. Slept very well, woke up around 7:30. Paltry hotel breakfast. Checked out at 11, took Uber to the airport. Funny Greek driver. Realized I can change the flight using check-in machine, changed to 3:35 flight, so left Ft Lauderdale 5 hours earlier! Beautiful weather here - 79 today, going back to 40s sucks. The flight was fast, landed after 6:30. Got home without any adventures.”

This analysis uses diary entries from this year (2020).

Sentiment Scores

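You can think of sentiment scores as a numeric representation of how positive, negative or neutral a text is. Here I use two approaches to calculate these scores. The first uses the tidytext package and the AFINN dictionary, which rates individual words on an integer scale from -5 (most negative) to +5 (most positive). Each word is scored separately, and the daily score is the mean over the matched words: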

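> # Packages used below (get_sentiments("afinn") also needs textdata installed)
> library(dplyr)      # %>%, mutate, filter, joins, summarize
> library(tidyr)      # fill
> library(tibble)     # tibble
> library(tidytext)   # unnest_tokens, stop_words, get_sentiments
> library(lubridate)  # ymd
> library(tictoc)     # tic, toc
> 
> rawdiary<-readLines("./DATAIN/diarysample.txt")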
> indiary<-tibble(text=rawdiary)
> 
> tic("method 1: afinn")
> diary<-indiary %>% unnest_tokens(sentence,text,token="lines") %>% 
+     mutate(date=ifelse(substr(sentence,1,4) %in% 
+     c("jan ","feb ","mar ","apr ","may ","jun ","jul ","aug ","sep ",
+     "oct ","nov ","dec "),sentence,NA))%>% fill(date) %>% filter(sentence != date) %>% 
+     unnest_tokens(word,sentence,token="words") %>%  
+     anti_join(stop_words) %>% 
+     mutate(date=paste("2020",date)) %>% mutate(dmy=ymd(date)) %>% 
+     select(dmy,word)
Joining, by = "word"
> 
> afinn<-diary %>% inner_join(get_sentiments("afinn")) %>% group_by(dmy) %>% 
+     summarize(afinnscore=mean(value),.groups='drop')
Joining, by = "word"
> toc()
method 1: afinn: 0.7 sec elapsed
> 
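
To make the word-level idea concrete, here is a minimal base R sketch. The lexicon and its scores are made up for illustration (they are not taken from the real AFINN list), and `score_entry` is a hypothetical helper, not a tidytext function.

```r
# Toy lexicon: hypothetical words and scores, not the real AFINN list
toy_lexicon <- c(funny = 4, beautiful = 3, paltry = -1, sucks = -3)

# Score one entry: lowercase it, split into words, and average the
# scores of the words found in the lexicon (NA if none match)
score_entry <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  hits <- lexicon[words[words %in% names(lexicon)]]
  if (length(hits) == 0) return(NA_real_)
  mean(hits)
}

entry <- "Paltry hotel breakfast. Funny Greek driver. Beautiful weather here, going back to 40s sucks."
score_entry(entry, toy_lexicon)  # (-1 + 4 + 3 - 3) / 4 = 0.75
```

Words that are missing from the lexicon (hotel, driver, weather) contribute nothing, so a short entry can end up with a score that rests on only one or two words.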

In the second approach, I use the sentimentr package, which uses a more advanced scoring model that accounts for valence shifters such as negators (“not”) and amplifiers (“very”). The scores are assigned to whole sentences, so I keep the stop words.

> library(sentimentr)  # get_sentences, sentiment
> tic("method 2: sentimentr")
> diary<-indiary %>% unnest_tokens(sentence,text,token="lines") %>% 
+     mutate(date=ifelse(substr(sentence,1,4) %in% 
+                            c("jan ","feb ","mar ","apr ","may ","jun ","jul ","aug ","sep ",
+                              "oct ","nov ","dec "),sentence,NA))%>% fill(date) %>% filter(sentence != date) %>% 
+     unnest_tokens(word,sentence,token="sentences") %>%  
+     mutate(date=paste("2020",date)) %>% mutate(dmy=ymd(date)) %>% 
+     select(dmy,word)
> sentmtr<- diary %>% get_sentences() %>% sentiment() %>% group_by(dmy)%>%
+     summarise(sentrscore=mean(sentiment),.groups='drop')
> toc()
method 2: sentimentr: 25.52 sec elapsed
> 
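
The effect of a valence shifter is easiest to see with negation. The sketch below is a deliberately simplified toy in base R, not sentimentr's actual algorithm: it flips the sign of a word's score when the preceding word is a negator. The lexicon, the negator list and `score_sentence` are all hypothetical.

```r
# Hypothetical lexicon and negator list, for illustration only
toy_lexicon <- c(happy = 3, good = 3, bad = -3)
negators <- c("not", "never", "no")

# Sum the scores of lexicon words in a sentence, flipping the sign
# of a word whose immediate predecessor is a negator
score_sentence <- function(sentence, lexicon, negators) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[words != ""]
  total <- 0
  for (i in seq_along(words)) {
    w <- words[i]
    if (w %in% names(lexicon)) {
      value <- lexicon[[w]]
      if (i > 1 && words[i - 1] %in% negators) value <- -value
      total <- total + value
    }
  }
  total
}

score_sentence("I am happy", toy_lexicon, negators)      # 3
score_sentence("I am not happy", toy_lexicon, negators)  # -3
```

A pure word lookup would give both sentences the same score, because “not” carries no score of its own. This is also why stop words have to stay in the text for a sentence-level method.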

This is how both scores look after normalizing and detrending:

[Image: AFINN vs sentimentr scores of my personal diary]
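
For readers who want to reproduce this step, here is one common way to detrend and normalize a daily series in base R. It is a sketch under two assumptions that may not match the exact processing behind the plot: the trend is linear in time, and normalizing means z-scoring. The data are made up.

```r
# Hypothetical daily scores: an upward trend plus an alternating pattern
days <- 1:10
score <- 0.5 * days + rep(c(1, -1), 5)

# Detrend: keep the residuals of a linear fit against time
detrended <- residuals(lm(score ~ days))

# Normalize: centre to mean 0 and scale to standard deviation 1
normalized <- as.numeric(scale(detrended))

round(normalized, 2)
```

After these two steps the series has no linear relationship with time left, so the two sentiment measures can be compared on the same scale even though their raw ranges differ.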

Diary Sentiment vs Lifelog Scores

We can now look at the correlations between sentiment scores and lifelog metrics:

[Image: correlation between sentiment and lifelog scores]
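
A correlation table like this can be computed with base R's `cor()`. The sketch below uses a small made-up data frame; the column names `afinnscore` and `sentrscore` follow the code above, while `happiness` and `stress` and all the values are hypothetical stand-ins for the lifelog data.

```r
# Hypothetical daily data: two sentiment scores and two lifelog metrics
daily <- data.frame(
  afinnscore = c(0.2, 0.8, -0.5, 1.5, 0.9, 0.1, 1.2, -0.8),
  sentrscore = c(0.05, 0.10, -0.02, 0.20, 0.08, 0.00, 0.15, -0.05),
  happiness  = c(3, 4, 2, 5, 4, 3, 5, 2),
  stress     = c(4, 2, 5, 1, 2, 4, 1, 5)
)

# Rows are sentiment scores, columns are lifelog metrics;
# pairwise.complete.obs skips days where either value is missing
cm <- cor(daily[, c("afinnscore", "sentrscore")],
          daily[, c("happiness", "stress")],
          use = "pairwise.complete.obs")
round(cm, 2)
```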

Stress and Happiness are the only variables for which both sentiment metrics agree on the correlations. When all lifelog metrics are fed into a regression model, only Happiness remains as the main driver of sentiment, overpowering everything else. The model only works for AFINN-based sentiment:

[Image: lifelog metrics as predictors of diary entry sentiment (AFINN)]
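
A model of this kind can be fitted with `lm()`. This is only a sketch on made-up numbers, assuming an ordinary linear regression with the AFINN score as the outcome; the predictor names and values are hypothetical and the real model may include more metrics.

```r
# Hypothetical daily data: AFINN sentiment plus three lifelog metrics
daily <- data.frame(
  afinnscore = c(0.2, 0.8, -0.5, 1.5, 0.9, 0.1, 1.2, -0.8),
  happiness  = c(3, 4, 2, 5, 4, 3, 5, 2),
  stress     = c(4, 2, 5, 1, 2, 4, 1, 5),
  energy     = c(3, 3, 2, 4, 5, 2, 4, 3)
)

# Regress sentiment on all lifelog metrics at once
fit <- lm(afinnscore ~ happiness + stress + energy, data = daily)

# Estimates, standard errors, t values and p-values per predictor
summary(fit)$coefficients
```

Because the predictors enter the model together, each coefficient shows what a metric adds once the others are accounted for. That is how a variable that correlates with sentiment on its own can still drop out of the regression.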

In the next post, I will look at extracting categorical measures of sentiment and emotions from the diary text. Hopefully, we will find much stronger and more interesting patterns there. Stay tuned!