This is how you create a one-year-long newspaper

Jan 2, 2021

An idea of a UX Designer and a Data Scientist [This post is co-written with Mine Çetinkaya-Rundel (my sister).]

I am a night over-thinker. Sometimes, like really pretty rarely, that serves a purpose. This time, I was, unfortunately, thinking about everything that happened in 2020 and had this idea of a drawing — a word cloud. Then I started to expand it… Could the words be organized by month? Could this all be based on real data? Could I get data from newspapers all over the world, then identify the most commonly used words in headlines? It might be interesting… Eventually, I fell asleep thinking that I should check in with my sister about this in the morning since data is her thing.

Why am I writing about this?

This small project was important for me for 3 reasons:

1. It was a process of finding alternative solutions under constraints. For example, since The New York Times data was easily available via their Archive API, we started to mock up the visualization with this data. After some research and a community plea for alternative data sources, we realized that this was the most feasible data we had to work with, so we had to shrink our scope. This is a very common problem I deal with at work as a UX Designer. Although the design methodology I follow is well defined, constraints such as time, resources, and technology always show up. I need to be creative and proactive to come up with an alternative solution.
2. I watched the coding process live, and even helped solve some problems by googling 😁 or sometimes just by being the third eye. As a UX designer, it’s important to know how to work with developers. A design idea may not always be feasible.
3. Just before New Year’s Eve, I was stuck at home, so it was either going to be a virtual Tombola or this. Especially during these times, it’s important to find distractions that keep your mental health in balance.

Step 1 — Define the purpose, access data

Yes, a lot to think about as a first step. Even after clarifying your objective and starting to imagine the layout of the final work, there might be constraints that interfere along the way. I believe that, at this stage, it’s important to answer the question “what is the story we want to tell?” along with accessing the relevant data.

Since our story wasn’t serving a serious journalism purpose but was only meant to entertain people (ourselves), we took a step back to redefine our main purpose. In the beginning, we wanted to map news from a more global point of view by collecting data from several newspapers around the world. But we couldn’t easily reach the data we would need (some sources were not free, and others were unfamiliar and would have required more time than we had to ramp up), so the main objective ended up being:

Visualize the whole year with words that topped The New York Times headlines.
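For readers who want to try the data access step, here is a minimal sketch in Python (an illustration only, not the R code behind this project). The URL pattern and the field names (`response.docs`, `headline.main`, `abstract`, `pub_date`) reflect our understanding of the Archive API and should be checked against the official documentation. The `api_key` argument is a placeholder for your own key.

```python
import json
from urllib.request import urlopen

# Assumed URL pattern for the NYT Archive API; verify against the official docs.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"


def archive_url(year, month, api_key):
    """Build the request URL for one month of articles."""
    return ARCHIVE_URL.format(year=year, month=month, key=api_key)


def extract_headlines(payload):
    """Return (pub_date, title + abstract) pairs from one month's response.

    Title and abstract are joined because both are treated as the
    "headline" in this project. Field names are assumptions.
    """
    rows = []
    for doc in payload.get("response", {}).get("docs", []):
        title = (doc.get("headline") or {}).get("main") or ""
        abstract = doc.get("abstract") or ""
        rows.append((doc.get("pub_date", ""), f"{title} {abstract}".strip()))
    return rows


def fetch_month(year, month, api_key):
    """Download and parse one month (needs network access and a valid key)."""
    with urlopen(archive_url(year, month, api_key)) as resp:
        return extract_headlines(json.load(resp))
```

Calling `fetch_month(2020, m, api_key)` for `m` from 1 to 12 would cover the whole year, one request per month.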

Step 2 — Process data

In his book Visualize This, Nathan Yau says that we should be on the lookout for patterns and relationships in the story we want to tell: patterns, as in what is changing over time and how it’s changing, and relationships between different datasets.

When we started to process the data and got our first set of results, they didn’t tell an interesting story. Unsurprisingly, words like “president” and “trump” showed up every month. The change in headlines from month to month was not visible, or interesting. Note that when we refer to “headlines” throughout this project, we mean the title as well as the abstract of the article. We made this decision early on in the process because we felt that the titles alone didn’t reflect the actual contents of the articles.

Words that topped The New York Times headlines in 2020

We also considered the proportion of headlines where a particular word appeared. For example, “trump” topped the headlines every month but the proportion of headlines in which it appeared might have been different.
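The counting described above can be sketched roughly as follows. This is a Python illustration, not the R code behind the project. The tiny stop-word list, the tokenizer, and the choice to count a word at most once per headline are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on", "is"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def top_words_by_month(headlines, n=10):
    """headlines: iterable of (month, text) pairs.

    Returns {month: [(word, headline_count, share_of_headlines), ...]}
    with the n most frequent words per month.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for month, text in headlines:
        totals[month] += 1
        # set() counts a word once per headline, so the share below is
        # the proportion of headlines in which the word appears.
        counts[month].update(set(tokenize(text)))
    return {
        month: [(w, c, c / totals[month]) for w, c in counter.most_common(n)]
        for month, counter in counts.items()
    }
```

With this shape, a word can rank first in two different months while its share of headlines differs between them, which is the comparison the proportions view is after.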

Words that topped The New York Times headlines in 2020 (with appearance proportions)

Ultimately, we iterated and restated our main objective as:

Visualize the whole year with words that first topped The New York Times headlines.
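One possible reading of “first topped” is the first month in which a word made it into the monthly top list. That interpretation is an assumption for the sake of illustration. Under it, the lookup could be sketched like this in Python (not the project's R code), using a few words taken from the monthly lists in this post:

```python
def first_appearances(top_by_month, month_order):
    """Map each word to the first month in which it made the top list."""
    first = {}
    for month in month_order:
        for word in top_by_month.get(month, []):
            # setdefault keeps the earliest month and ignores later repeats
            first.setdefault(word, month)
    return first


# Abbreviated monthly lists, for illustration only
top_by_month = {
    "January": ["iran", "impeachment", "trump"],
    "February": ["china", "coronavirus", "trump"],
    "March": ["pandemic", "coronavirus", "trump"],
}
first = first_appearances(top_by_month, ["January", "February", "March"])
print(first["coronavirus"])  # February
```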

Step 3 — Design the layout

Although the proportions graph told an interesting story, we decided not to go down this path because we had been going for simplicity from the very beginning. I was imagining the project as a simple reflection of a newspaper. So, although we brainstormed and sketched a lot of different ideas while processing the data, many of them didn’t fit the “simplicity” bill and we quickly abandoned them.

This is why our final design has only black and white with shades of grey, and a secondary color, yellow.

Step 4 — Checking the results

It’s worth pausing at this stage to verify the results; call it a human touch. Several kinds of data entries might hijack the end story, such as a typo or just a different usage of a particular word.

In our case, we found the word “19” sitting all by itself, and “ruth”, “bader”, and “ginsburg” as 3 different words, although they all referred to “Ruth Bader Ginsburg”. So we grouped words that always appear together to tell a clearer story. We also grouped together words that appeared to be about the same news stories, e.g. “protests” and “protesters”, “vaccine” and “vaccines”, etc. You can find a full list of the data cleaning steps that address this issue here.
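The grouping idea can be sketched as a small normalization step applied before counting. This Python example is an illustration, not the project's actual cleaning code. The two lookup tables are hypothetical and only mirror the examples mentioned above, and the `covid_19` token is inferred from the final word list.

```python
import re

# Hypothetical rules modeled on the examples in the text
PHRASES = {
    "ruth bader ginsburg": "ruth_bader_ginsburg",
    "covid 19": "covid_19",
}
VARIANTS = {"protesters": "protests", "vaccines": "vaccine"}


def normalize(text):
    """Lowercase, strip punctuation, merge known phrases, unify variants."""
    # Replace every run of non-alphanumeric characters with one space,
    # so "Covid-19" becomes "covid 19" before phrase matching.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    for phrase, token in PHRASES.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", token, text)
    return [VARIANTS.get(word, word) for word in text.split()]
```

Merging phrases before splitting into words is what keeps “19” from being counted on its own.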

Our final list of words per month is as follows.

January   american . u.s . week . york . trial . president . iran . 2020 . impeachment . trump                       
February  democratic . iowa . people . week . york . china . president . 2020 . coronavirus . trump                  
March     world . u.s . time . york . people . home . virus . pandemic . trump . coronavirus                         
April     covid_19 . health . world . york . people . home . virus . trump . pandemic . coronavirus                  
May       president . city . time . york . people . virus . home . trump . pandemic . coronavirus                    
June      u.s . primary . election . people . black . protests . pandemic . police . trump . coronavirus             
July      time . week . city . u.s . people . president . virus . pandemic . coronavirus . trump                     
August    people . night . democratic . national . election . president . pandemic . coronavirus . convention . trump
September time . people . biden . u.s . court . president . election . coronavirus . pandemic . trump                
October   court . debate . virus . biden . house . president . pandemic . election . coronavirus . trump             
November  pandemic . biden . coronavirus . president . congressional . district . maps . trump . results . election  
December  york . biden . president . covid_19 . people . 2020 . pandemic . vaccine . coronavirus . trump

However, we had a small debate about whether to remove words like “week”, “day”, or “time”. On one hand, they don’t necessarily tell a particular story. On the other hand, removing them seemed arbitrary. This brings us to our final step.

Step 5 — Give it a meaning

Rather than removing such words, we decided to provide context along with them, and so we built our data visualization as an interactive Shiny app. Below is a screenshot of our final app, which you can also view and interact with here.

Let’s talk a little bit about some of the technical choices we made when creating this app, which might be of interest to those who build apps with R and Shiny.

Building the reactivity: Each of the words in the app is an actionLink(). When a user clicks on a word, the app creates a table of all headlines in that month that mentioned that word. With 10 words per month that all trigger the same type of reactivity, writing these out as individual action links and observers would make the app code very repetitive. Instead, we used map() and friends to create them programmatically. You can see how the action links on the sidebar are created here and how the observers that generate the tables are created here.
Making the table: We used the gt package to make the table because of the styling options it offers. In particular, we were interested in grouping the articles by day, and this is as easy as passing a grouped tibble to gt(). Additionally, the package is so well documented that even I (someone who doesn't code in R) was able to figure out the functions we needed for a particular look.
Styling: We used a custom CSS file to style the app. Neither of us is a CSS expert, but I came up with the overall look and color scheme, and we both googled our way around and landed on a look we liked by trial and error. Other styling features included an icons8 💩 icon (in the body of the app and in the browser tab) and a custom background with 2020 in block letters that I made in Sketch.
Data size: Our dataset contains all words in all 2020 New York Times titles and abstracts, minus the stop words, which amounts to over 850K words. We didn’t want to load the entire dataset every time the app was launched, so we first subsetted the data to only the words that appeared in the list of top ten words per month. This substantially reduced load time for the app.
Mobile handicap: We are aware that we don’t offer a very nice experience for mobile users but our (not legitimate) excuse is that the mobile version looks as shitty as 2020. But this is certainly not the worst thing that happened in 2020…
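Two of the choices above, shrinking the data up front and generating one handler per word in a loop, can be sketched in a language-agnostic way. The Python below is an illustration of the pattern only, not the app's R/Shiny code. The row layout `(month, word, headline)`, the function names, and the per-month filtering rule are assumptions.

```python
from functools import partial


def subset_to_top_words(rows, top_by_month):
    """Keep only rows whose word is in that month's top list.

    rows: iterable of (month, word, headline) tuples.
    """
    keep = {(month, word) for month, words in top_by_month.items() for word in words}
    return [row for row in rows if (row[0], row[1]) in keep]


def headlines_for(rows, month, word):
    """All headlines in a month that mention a word (what a click would show)."""
    return [headline for m, w, headline in rows if m == month and w == word]


def build_handlers(rows, top_by_month):
    """Create one click handler per (month, word) pair instead of writing each by hand.

    partial() binds month and word at creation time, which avoids the
    late-binding surprise of defining closures inside a loop.
    """
    return {
        (month, word): partial(headlines_for, rows, month, word)
        for month, words in top_by_month.items()
        for word in words
    }
```

The subset would be computed once, ahead of time, and saved, so that the app only loads the small table at startup. Each handler then filters that small table when its word is clicked.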

https://cdn-images-1.medium.com/max/800/1*0PzhaUzFsIodg19LkgvdYg.gif

Bonus Step — EmbelliSHIT

Yes, pun intended. The app takes a little bit of time to load and we probably could have addressed this by tuning the app to make it fast, but we instead decided to embellish it with a custom loading page. We used the waiter package in R to implement the functionality and I made the GIF in Sketch using the same iconography.

What’s next?

We learned a lot during this process, and one of our main takeaways was to plan early on for obstacles in locating source data. Our plan for next year is to periodically collect data from news sources we would like to have access to (e.g. international news agencies, newspapers from other countries, etc.). While data collection is easy to automate, making sense of vast amounts of data in a simple representation like this one is not. There may also be challenges around combining news sources in different languages.

You can find the source code for data prep, preliminary visualizations, and the Shiny app on GitHub. If you have suggestions for improving this visualization and app and/or future directions, we’d love to hear from you!