Happy Birthday To Me!

Today is my birthday. To celebrate I decided to look at what was in the news on January 27 every year since I was born. Mainly I want to see if the news were positive or negative.

Getting the data

I start buy creating a list of dates. I feel like there is probably a more straightforward way of doing this but I am currently obsessed with map so this is how I did it.

dates <-map(list(0:30), ~ymd("2018-01-27") - years(.x)) %>% 
  map(~as.list(.x)) %>% 
  flatten()

The next step is to actually get the data. I used the get_guardian command from the guardianR package to get all articles. You need to register at http://open-platform.theguardian.com/access to get the API but this only took a few minutes. I get a few warnings about “Unequal factor levels” but I ignore them because this is most likely because I can’t find any articles for a few of the earlier years.

The data need some cleaning.

glimpse(theguardian_jan27)
## Observations: 5,536
## Variables: 27
## $ id                   <chr> "uk-news/2018/jan/27/diners-rescued-in-bo...
## $ sectionId            <chr> "uk-news", "australia-news", "lifeandstyl...
## $ sectionName          <chr> "UK news", "Australia news", "Life and st...
## $ webPublicationDate   <chr> "2018-01-27T00:43:29Z", "2018-01-27T04:30...
## $ webTitle             <chr> "West London flooded as burst pipe forces...
## $ webUrl               <chr> "https://www.theguardian.com/uk-news/2018...
## $ apiUrl               <chr> "https://content.guardianapis.com/uk-news...
## $ newspaperPageNumber  <chr> NA, NA, "72", "6", "65", "55", "85", "83"...
## $ trailText            <chr> "Fire crews deploy rescue boats in Hammer...
## $ headline             <chr> "West London flooded as burst pipe forces...
## $ showInRelatedContent <chr> "true", "true", "true", "true", "true", "...
## $ lastModified         <chr> "2018-01-27T00:51:54Z", "2018-01-27T04:33...
## $ hasStoryPackage      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ score                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ standfirst           <chr> "<p>Fire crews deploy rescue boats in Ham...
## $ shortUrl             <chr> "https://gu.com/p/823jc", "https://gu.com...
## $ wordcount            <chr> "188", "470", "1517", "727", "417", "692"...
## $ commentable          <chr> "false", NA, "false", "true", "true", "fa...
## $ allowUgc             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ isPremoderated       <chr> "false", "false", "false", "false", "fals...
## $ byline               <chr> "Patrick Greenfield", "Christopher Knaus"...
## $ publication          <chr> "theguardian.com", "theguardian.com", "Th...
## $ newspaperEditionDate <chr> NA, NA, "2018-01-27T00:00:00Z", "2018-01-...
## $ shouldHideAdverts    <chr> "false", "false", "false", "false", "fals...
## $ liveBloggingNow      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ commentCloseDate     <chr> "2018-01-30T00:45:00Z", NA, NA, "2018-01-...
## $ body                 <chr> "<p>People have been evacuated from their...

First of all, I want to change the variable names so that they have underscores instead of capital letters.

df_names <- names(theguardian_jan27)
snake_names <- to_snake_case(df_names)
names(theguardian_jan27)<- snake_names
names(theguardian_jan27)
##  [1] "id"                      "section_id"             
##  [3] "section_name"            "web_publication_date"   
##  [5] "web_title"               "web_url"                
##  [7] "api_url"                 "newspaper_page_number"  
##  [9] "trail_text"              "headline"               
## [11] "show_in_related_content" "last_modified"          
## [13] "has_story_package"       "score"                  
## [15] "standfirst"              "short_url"              
## [17] "wordcount"               "commentable"            
## [19] "allow_ugc"               "is_premoderated"        
## [21] "byline"                  "publication"            
## [23] "newspaper_edition_date"  "should_hide_adverts"    
## [25] "live_blogging_now"       "comment_close_date"     
## [27] "body"

I am also going to get rid of some variables that I don’t need.

theguardian_jan27 <- theguardian_jan27 %>% 
  select_if(function(x) !all(is.na(x))) %>% # Drop columns with all missing values
  select_if(function(x) !(is.factor(x) && nlevels(x)==1)) # Drop factors with only one level

Time to look at the titles using tidytext. I want my analysis to be done by year so I create a new variable with the year of web publication.

theguardian_jan27 <- theguardian_jan27  %>% 
  mutate(web_publication_year=year(web_publication_date))

tidy_theguardian_jan27<- theguardian_jan27 %>%
  unnest_tokens(output=word, web_title, to_lower = TRUE)

cleaned_theguardian_jan27 <- tidy_theguardian_jan27 %>% 
  anti_join(stop_words, by="word") # remove stop words (words that are very common)

cleaned_theguardian_jan27 %>%
  count(word, sort = TRUE) %>% 
  head(10)
## # A tibble: 10 x 2
##    word        n
##    <chr>   <int>
##  1 â         469
##  2 review    212
##  3 uk         89
##  4 cup        80
##  5 media      75
##  6 video      73
##  7 letters    72
##  8 day        70
##  9 london     64
## 10 david      63

As you can see the most common word is just the letter a with a hat. Since it is a not a real word I am going to remove it.

cleaned_theguardian_jan27 <- cleaned_theguardian_jan27 %>% 
  filter(word!="â")

cleaned_theguardian_jan27 %>%
  count(word, sort = TRUE) %>% 
  head(10)
## # A tibble: 10 x 2
##    word        n
##    <chr>   <int>
##  1 review    212
##  2 uk         89
##  3 cup        80
##  4 media      75
##  5 video      73
##  6 letters    72
##  7 day        70
##  8 london     64
##  9 david      63
## 10 world      62

I also checked for other short words but there were no other suspiciously short words.

cleaned_theguardian_jan27 <- cleaned_theguardian_jan27 %>% 
mutate(word_length=nchar(word, type = "chars", allowNA = FALSE, keepNA = NA)) 

 table(cleaned_theguardian_jan27$word_length)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  159  706 1971 4474 5453 5154 4686 3000 1842 1139  499  242  117   48   13 
##   16   20   21 
##    4    1    1

I want to assign a sentiment to each word to see if the words used in the article titles were positive or negative. (I realise that the “hat a that”" I removed would have disappeared when doing the inner join but I think that there is value in doing some cleaning.)

bing <- get_sentiments("bing")
cleaned_theguardian_jan27_sentiment<- cleaned_theguardian_jan27 %>%
  inner_join(bing, by="word") 

Now each word has been classified as either negative or positive and we can look at the distribution over time.

cleaned_theguardian_jan27_sentiment %>% 
  select(web_publication_year, sentiment) %>% 
  group_by(web_publication_year, sentiment) %>% 
  mutate(sentiment_count=n()) %>% 
  distinct() %>% 
ggplot(data=., aes(x=web_publication_year, y=sentiment_count, fill=sentiment)) + geom_col() + ylab("Count") + xlab("Year of publication") + ggtitle("Sentiment of words in titles")

Unfortunately it looks like most of the words used in the titles on my birthday over the years have been negative. Causality? I am not going to let that stop me from having a good time today.