Introduction

I love watches! For many, it is a dying (or dead) art. We can tell time using phones, computers, and smart watches, and they each perform many more functions making the traditional watch all but obsolete. But, to me, mechanical watches are the perfect example of form and function. I find it amazing how a selection of tiny gears and a long spring can measure something like time with very reasonable accuracy.

As a result, I enjoy reading and watching horological content and, well, entering the occasional contest. So when WorldTempus had their annual advent calendar contest, I entered whenever I saw something of interest. Sometime after Christmas, they announced the winners on their web site and I was immediately struck by how many winners were from France. I wondered what else I could glean from the small amount of data on that page so I set off to learn how to use rvest to scrape the data and regular expressions to clean and parse it. I also had the idea to visualize how some features are related and finally learned how to construct a simple Sankey diagram.

Setup

First step is to load in some libraries and set up some aesthetics.

library(tidyverse)
library(lubridate)
library(rvest)
library(networkD3)
library(ghibli)

Scraping the web

Below I use rvest’s read_html function to load the url. The content is parsed using html_nodes that I identified using the SelectorGadget browser plugin.

url <- "http://en.worldtempus.com/article/events/arts-and-culture/advent-calendar-here-are-the-winners-26936.html?utm_source=Worldtempus+Concours+EN&utm_campaign=d835784c43-EMAIL_CAMPAIGN_2018_avent_gagnant_generale_EN&utm_medium=email&utm_term=0_1b496ba624-d835784c43-119672677"

df <- read_html(url) %>%
  html_nodes(".theContent , .theContent p , .theContent li") %>%
  html_text() 

# drop first two rows
df <- df[3:48] 

# view the top 10 rows
knitr::kable(head(df,10), format = "html")
x
01.12.2018 – A Bulgari "Thé Bleu" gift set from Bulgari
Jacques Fontaine from Boussu (Belgium)
02.12.2018 – A leather vanity kit and card holder offered by Bovet 1822
Christine Billon from Nancy (France)
03.12.2018 – A set with a matryoshka and a City Guide Moscow from Louis Vuitton offered by Hublot
Dominique Albrecq from Jemeppe-sur-Sambre (Belgium)
04.12.2018 – A leather travel wallet from Ulysse Nardin
Sandor Szucs from Budapest (Hungary)
05.12.2018 – A travel notepad from Louis Vuitton
Julien Klingenmeyer from Montbeliard (France)

Note that df is a single column with date and prize in one row and the winner and location in an alternating row. I’ll deal with that in the next step.

Data Cleaning and Munging

Cleaning the data and pulling out features (like ‘Country’ and ‘Brand’) required brushing up on regular expressions. It took some trial and error, but it was very good practice.

# split single column of data into two columns
col1 <- character(length(df)/2)
col2 <- character(length(df)/2)
x <- 1
for (i in 1:length(df)) {
  if (i %% 2 == 0) {
    col1[x] <- df[i-1]
    col2[x] <- df[i]
    x <- x + 1
  }
}

data <- data.frame(col1,col2)

# create basic structure including the prize, winner's name, and date
data <- data %>% 
  separate(col1, c("Date", "Prize"), sep = "–", extra = "merge") %>%
  mutate(Prize = str_trim(Prize),
         Date = dmy(Date)) %>%
  rename(Winner = col2)

# helper function that splits out brand name from prize description
splitby <- function(x) {
  case_when(str_detect(x, " from ") ~ str_sub(x ,str_locate(x ," from ")[2]+1), 
            str_detect(x, " by ") ~ str_sub(x ,str_locate(x ," by ")[2]+1))
}

# create features from from data 
data <- data %>%
  mutate(City = str_match(Winner, "from ([a-zA-Z\\-\\é]+)")[,2],
         Country = str_match(Winner, "\\(([a-zA-Z\\-\\é ]+)\\)")[,2],
         Winner = str_match(Winner, "(.*) from")[,2],
         Brand = sapply(Prize, splitby))

# manual cleanup that general functions didn't catch
data$Brand[3] = "Louis Vuitton"
data$Brand[11] = "Ralph Lauren"

# reorder variables and drop Winner since I won't be using it
data <- data %>% select(Date, Prize, Brand, City, Country)

A quick view of the data to make sure it looks ok.

knitr::kable(head(data,10), format = "html")
Date Prize Brand City Country
2018-12-01 A Bulgari "Thé Bleu" gift set from Bulgari Bulgari Boussu Belgium
2018-12-02 A leather vanity kit and card holder offered by Bovet 1822 Bovet 1822 Nancy France
2018-12-03 A set with a matryoshka and a City Guide Moscow from Louis Vuitton offered by Hublot Louis Vuitton Jemeppe-sur-Sambre Belgium
2018-12-04 A leather travel wallet from Ulysse Nardin Ulysse Nardin Budapest Hungary
2018-12-05 A travel notepad from Louis Vuitton Louis Vuitton Montbeliard France
2018-12-06 The book "From Seamaster to Seamaster : the first 70 years" from Omega Omega Benfeld France
2018-12-08 A Ladies’ purse from Seiko Seiko Saint-Front-de-Pradoux France
2018-12-09 A leather folder with a notepad from Carl F. Bucherer Carl F. Bucherer Saint-Athan UK
2018-12-10 A desk clock from Montblanc Montblanc Moncton Canada
2018-12-11 A memory game from Hermès Hermès Athis-Mons France



Which countries had the most contest winners? And what brands were most commonly offered as prizes?

data %>%
  add_count(Country, sort = T) %>%
  mutate(Country = fct_reorder(Country,n)) %>%
  ggplot(aes(Country, fill=Brand)) +
  geom_histogram(stat="count") +
  coord_flip() +
  labs(x="",y="") +
  scale_fill_manual(values = ghibli_palette(18, name = "MononokeMedium", type = "continuous")) +
  theme(axis.text = element_text(size = 14),
        legend.position = c(0.75,0.3),
        legend.text = element_text(size=6)) +
  guides(fill = guide_legend(ncol = 2, keywidth = 1, keyheight = 1))

From the histogram, we see that nearly half of the winners are from France and about 20% are from the UK. However, with 18 brands, it’s very difficult to answer my second question above. I’m still curious about how the brands are distributed among countries so I’ll experiment with network analysis in the next section.

What is the relationship between countries and brands?

network_data <- data %>% select(Country, Brand) 

simpleNetwork(network_data, 
              fontSize = 14, fontFamily = my_font, 
              opacity = 0.8, linkDistance = 100, 
              charge = -10, zoom = F)

This diagram certainly helps identify connections between countries and brands, but it’s not ideal.

Can I build a cleaner visualization of this relationship?

I think a Sankey diagram would work well here, but I’ve never successfully built one before.

# building dataframes to help construct the Sankey diagram
nodes <- data %>%
  count(Country, sort = T) %>%
  mutate(node = seq(0, n_distinct(Country) - 1)) # this basically numbers each row 0 to 7

targets <- data %>%
  count(Brand, sort=T) %>%
  mutate(node = seq(n_distinct(data$Country), n_distinct(data$Country) + n_distinct(Brand) - 1)) # numbers rows 8 to 25

links <- data %>%
  select(Country, Brand) %>%
  left_join(nodes, by = "Country") %>%
  left_join(targets, by = "Brand") %>%
  select(source = node.x, target = node.y ) %>%
  mutate(value = 1) %>%
  arrange(source)

names <- data.frame(node = seq(0, n_distinct(data$Country) + n_distinct(data$Brand) - 1)) %>%
  left_join(nodes, by = "node") %>%
  select(node, Country) %>%
  left_join(targets, by = "node") %>%
  mutate(name = if_else(is.na(Country),Brand,Country)) %>%
  select(node, name) 

# create sankey chart using NetworkD3
sankeyNetwork(Links = links, Nodes = names, Source = "source",
              Target = "target", Value = "value", NodeID = "name",
              colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
              fontSize = 12, fontFamily = my_font, nodeWidth = 50, nodePadding	= 2)

I’m glad I put the extra effort into learning how to create this chart. I definitely learned a lot!

I think the simple Sankey diagram above does a very nice job of clearly visualizing the distribution of winners by country and brand. We clearly see that most of the winners were from France, the UK, and Belgium. We also see that products from Louis Vuitton, Omega, Carl F Bucherer, and Bovet were among the more popular brands offered as prizes. Additionally, brands that were offered more than once were distributed equally among France and the UK, with the exception of Louis Vuitton, which was distributed among France, the UK, and Belgium.