My First Data Analytics Project, Successful?
Google Data Analytics Professional Certificate Case Study Capstone Project
Introduction
This post will guide you through the process of my Google Data Analytics Professional Certificate case study: Cyclistic Bike-Share Analysis. This project will analyze the trends, behaviors, and characteristics of the rides and riders in order to draw some insights. The purpose of this project is to provide insights in order to make data-driven decisions for marketing strategies that would convert casual riders into members. This project will follow the 6 Data Analytics stages: Ask, Prepare, Process, Analyze, Share, Act.
I used PostgreSQL for data processing, data cleaning, validation, and exploration, and I also used Tableau Desktop for data visualization.
Ask
Identifying Business Task:
The company is seeking marketing strategies that will effectively convert casual riders into annual members. In order to do this, they require data-based insights into the characteristics and behaviors of their casual riders and annual members. So, the questions to answer are,
- What are the findings of the two types of riders?
- Why do casual riders purchase an annual membership?
- What could motivate and convince casual riders to become annual members?
Identifying Stakeholders:
- Lily Moreno — manager and director of marketing, responsible for the development of campaigns and initiatives to promote the bike-share program.
- Cyclistic executive team — the executive team will decide whether to approve the recommended marketing program.
- Cyclistic marketing analytics team — a team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.
Prepare
Data Source:
For this analysis, Cyclistic’s historical trip data will be used. This is public data that has been made available by Motivate International Inc. under the license.
Note: Data privacy issues prohibit me from using riders’ personally identifiable information. This means that I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
Data organization & information:
I downloaded 12 ‘.cvs’ files for each month from September/2020 to September/2021 and organized them into a folder with consistent names, for example, 2020090_trip_data.
I stored these files on my laptop and I duplicated the data for backup purposes.
Each file consists of the same 13 columns with a total of 5 million rows.
Process
Before analyzing, the data had to be cleaned. I skimmed through a few files first to get an idea of the data formatting and quality on Microsoft Excel. But, for further process, I decided to use PostgreSQL for data processing, cleaning, and exploration as SQL is capable of handling large datasets while Microsoft Excel cannot process 5 million rows of data simultaneously.
Data Importation:
For the importation of the ‘.csv’ files, I created tables in PostgreSQL. Since the ‘ride_id’ column is the primary key in this dataset, I did not allow NULL value for the ‘ride_id’ column — thankfully, there were no NULL values in the ‘ride_id’ column.
I created VIEW to combine the 12 months of data.
Data Cleaning:
I checked to see if there are significant numbers of NULL values in the relevant columns for our analysis. ‘Started_at’ and ‘end_at’ (Time) and ‘member_casual’ columns did not have NULL values which is great news since they are important values for this analysis. There were around 5,000 NULL values for ‘latitude’ and ‘longitude’ columns, but this fact was not overwhelmingly damaging to the validity of the dataset considering the total 5 million rows.
There were noticeable numbers of NULL values in columns such as ‘start_station_name’, ‘start_station_id’, ‘end_station_name’, and ‘end_station_id’. Since I did not see the significance of these columns at the time, I decided not to include those columns in my analysis. (While I still did not use the ‘station_id’ columns, I later decided to include the ‘station_name’ column in my analysis. This will be addressed further below)
Data Exploration:
To get a general understanding of the dataset and to see if the data is well-cleaned and corresponding, I wrote a few queries. There was no errors or suspicion of poor data — the data was ready to be used for analysis.
Data Transformation:
Using WITH clause, I made ‘final_data’ table that consists of a number of columns that I needed for my anaylsis, including ‘ride_id’, ‘started_at’, ‘ended_at’, ‘start_lat’, ‘start_lng’, ‘end_lat’, ‘end_lng’, ‘member_casual’, ‘ride_minutes’, ‘started_on_month’, ‘started_on_day’, ‘started_on_hour’.
I attempted to save the result in one ‘.csv’ file, which was failed as the file was too large. The result was saved into two files divided by membership status.
Analyze & Share
I combined those two files into one dataset in Tableau by using union tables. It was succesfully and thus ready to be anaylzed. I analzed the data and created data visualization in Tableau. I was able to find a number of trends, behaviours, charateristics, and insights that seem helpful to make data-driven decisions with.
As follows:
Over the 12 months period, there were more member rides by 10%. However, the total ride time of casual riders was almost double the time than the total ride time of the members.
The casuals’ and members’ bike usage trends, by month, are similar. As the usage goes up, the total ride time follows without any unexpected changes. Moreover, the usage rate increases significantly from April to September, with an exceptional drop in July. The peak season is from August to September.
It is noticeable that the casual riders’ bike usage rapidly increases during the weekend, starting from Friday. It is also significant to note that members' bike usage barely changes over the week.
The casuals’ and members’ bike usage trends, by the hour, hold similarities, and yet also demonstrate similar slight variances. While the members’ usage rapidly rises twice daily — from 6 am to 7 am and again at 3 pm to 5 pm, the casual usage gradually increases from 5 am until 5 pm.
Problem / Problem Solving / Change in the Process
As addressed above, the first dataset I used for analysis above did not include ‘station_name’ and ‘station_id’. I attempted to analyze the data on locations only with latitude and longtitude, which did not work out as I could not group latitude and longtitude by station without station name.
I went back to SQL to create another table that contains station names together with latitude and longtitude.
I imported this data into Tableau and “Left Joined” to the union of two files that I first imported.
Back to Analysis
I, then, was able to find and analyze the trends and behaviours of riders with regards to the locations.
This map shows the number of causal riders’ check-ins and check-outs at various bike stations across the Chicago area. It is noticeable that the density is significantly higher at the coastal locations than further inland.
This map shows the number of causal riders’ check-ins and check-outs at various bike stations across the Chicago area. It is observed that the density markers are more evenly distributed across the coastline and inland areas of Chicago.
These two ranking tables show which starting stations and which ending stations are the most visited stations for casual riders.
Act
I have performed my analysis, and gained some insights into the data, created visualization. It is time to use these insights to make data-driven recommandations for marketing stretagies.
Again, the goal is to convert casual riders into annual members
- The ride time for members are comparatively shorter than casual riders. So, members tend to ride more frequetly for reletively short distance whereas casual riders ride longer distance but less requetly than members. This is perhaps because many members ride for commute to work daily whereas casual riders ride for tourism purposes or lesuire activities along the coastline. This conclusion is also supported by the fact that there are significantly more casual rides during weekend whilst there is little or no change in number of members’ ride over a the whole week, and that causal riders ride along turistic locations along the coastline while members seem to ride inland as much as they ride coastal areas.
- Since the significant bike usage begins from April until September, April can be a good time to launch the campaign which builds up until September, perhaps with additional focus in the most peak season between August and September.
- However, plan needs to have the execptional drop in July in mind so as to modify and adjust the marketing plan. It would be necessary to execute a survey to find out why there is a significant drop in July in order to develop marketing plan accordingly.
- The best days of the week to reach casual riders are Saturday, Sunday, and Friday respectively.
- The best time of the day to reach casual riders is 12pm to 6pm.
- Streeter Dr & Grand Ave, Millennlum Park, Michgan Ave & Oak St are the top 3 stations casual riders visited most, and thus locations where campaign can reach most casual riders.
- In order to attract and motivate casual riders to purchase an annual membership, company can consider launching a new membership program that rewards/incentivizes the riders for their usage. For example, riders will collect points depend on the ride length which they can exchange for rewards. Another instance would be creating ranking system or rider’s community where only members can participate and enjoy excluded and unique features. These program and system not only attract new members but also make existing members to continue their membership.
- Since casual riders ride a lot more during the weekend compared to weekday, company could consider introducing different types of membership such as weekend membership where causal riders who only ride during weekend can be easily attracted.