CSCI 526 – Database Systems
Project 1
Due on 10/07/2022, 11:55PM
Requirement
• This project requires students’ independent effort. No collaborative work is needed. But
you can share the ideas of solving the questions.
• Submit your solutions to D2L in time, no late submission will be accepted.
Context
You are starting a new position as Data Analytic Engineer in an E-commerce company. The
company is a online retailer who has just launched their first product.
• Responsibilities:
– As a member of the startup team, you will work with the CEO, the Head of Marketing,
and the Website Manager to help steer the business.
– You will analyze and optimize marketing channels, measure and test websperfor mance, and use data to understand the impact of new product launches.
• Goal: to use SQL to:
– Access and explore the Maven Fuzzy Factory database • Become the data expert
for the company, and the go-to person for mission critical analyses
– Analyze and optimize the business’ marketing channels, website, and product port folio
use the mysql workbench / terminal / sqlectron to connect a database. The database URL and
the password will be provided in an email. You are required:
• connect the database on AWS using the provided credentials
• check through the schema of database tables, and understand what is relation between
different tables
• DO NOT modify the data inside the database, because your classmates are using the
same database and tables for their project.
HINTS: when the question is complex, you could try to use WITH clause to create temporary
relation, we have discussed WITH clause. Also, you could use TEMPORARY table as well.
Temporary table will be accessible before you exit the current mysql session. After you ter minate the current session and re-login mysql with a new session, the temporary table will
disappear. Here is the example:
CREATE TEMPORARY TABLE test
SELECT * FROM products;
we are creating a temporary table and namomg it as test, test table contains all data in products
table.
1
Here is the Database which stores the online activities and order information: There are 6
tables in the database:
website_sessions, and website_pageview
Data about the website activities provided by Urchin Tracking Module (UTM)
orders, order_items, and order_item_refunds
2
product
Once you understand the database and tables, try to analyze the data inside the database and
help to run the online business more efficiently.
Urchin Tracking Module (UTM)
Urchin Tracking Module (UTM) parameters are five variants of URL parameters used by mar keters to track the effectiveness of online marketing campaigns across traffic sources and
publishing media. They were introduced by Google Analytics’ predecessor Urchin and, con sequently, are supported out-of-the-box by Google Analytics. The UTM parameters in a URL
identify the campaign that refers traffic to a specific website, and attributes the browser’s web site session and the sessions after that until the campaign attribution window expires to it. The
parameters can be parsed by analytics tools and used to populate reports.
The businesses run paid marketing campaign using UTM, and keep track of the activities. The
UTM parameters (utm_source, utm_campaign, utm_content) in website_sessions table are
associated with the paid traffic. The UTM parameters can answer the following questions:
• Where is the traffic coming from?
• How is it getting to me?
• Why is it coming to me?
Here are some more information about the UTM parameter: The Complete Guide to UTM
Codes: How to Track Every Link and All the Traffic From Social Media.
In the table, utm_source is not null, meaning the web sessions are driven by paid cam paigns.
Tasks
1. Your Data Analytics Manager comes to you, and wants to know all types of utm_content
and the number of sessions associated with all different kinds of utm_content. Give
your manager a clear picture by ordering the results in descending order on number of
session.
2. One of the stakeholders in marketing department needs to know the number of orders
are driven by all kinds of utm_content, then she can have idea on the effectiveness of
different utm_content.
3. 2014-12-31 is set as milestone, and a manager from the marketing team wants to know
where the bulk of sessions were from through 2014-12-31. She wants to see a break down by UTM source, campaign, and referring domain.
3
4. Looks like “gsearch” and “nonbrand” is the major traffic source, the manager in the mar keting team wants to know the conversion rate from sessions to orders, meaning, you
need to provide the number of sessions, number of orders, and the conversion rate be fore 2014-12-31 on “gsearch” and “nonbrand”.
5. The market manager wants to check the trend of the session on “gsearch” and “non brand” through 2014-12-31. In the results, you will need to show the year, month, and
the number of distinct sessions associated with the that time frame. You may need to use
YEAR(), MONTH() functions in your SQL statement.
6. A manager from the website management team want to know the most popular landing
zone, meaning that, the most popular pageview_url that a website session starts. This
would help the clickstream team to identify which channel is most effective. You are
required to return a page which contains 2 columns (landing_page, num_of_session).
The landing page is the page which a session starts from.
7. The website manager wants to check the bounce rate of each landing page. Bounce rate
is defined as: the number of web sessions ONLY stay at the landing page and didn’t go
further, divided by the number of total sessions. When you return the result to the man ager, you need to include the columns, like: lnading_page, sessions, bounced_sessions,
bounce_rate
8. Consider the steps from browsing products through finishing a transaction, the following
urls can be considered as sequential steps to finish an transaction:
(a) /products
(b) /the-original-mr-fuzzy
(c) /cart
(d) /shipping
(e) /billing
(f) /thank-you-for-your-order
You are requested to calculate the conversion rate for these steps through 2014-12-31.
The returned results should be contains columns like: num_of_sessions, rate_to_products,
rate_to_mrfuzzy, rate_to_cart, rate_to_shipping, rate_to_billing, rate_to_thankyou.
• rate_to_products: the conversion rate from num_of_sessions to (a)
• rate_to_mrfuzzy: the conversion rate from /products to /the-original-mr-fuzzy
• rate_to_cart: the conversion rate from /the-original-mr-fuzzy to /cart
9. A new product “/the-forever-love-bear” was added to the website on 2013-01-06, and the
marketing manager wants to pull out the website activities from 2012-10-06 to 2013-01-
06, and 2013-01-06 to 2013-04-06, after the users hit “/product” page and see where they
went next. Also, she wants to check the impact of clickstreams on the new product and
the old product (“/the-original-mr-fuzzy”) after the new product was added. The pulled
out data should include 2 phases:
• 2012-10-06 to 2013-01-06: you could name it as pre-product
• 2013-01-06 to 2013-04-06: post-product period
for each of these two period, there should be the following columns:
4
• number of sessions hitting to “/product” in the period
• number of sessions have next page from “/product”
• the proceeding rate from “/product”
• number of session went to “/the-original-mr-fuzzy” from “/product”
• the conversion rate to “/the-original-mr-fuzzy”
• number of session went to “/the-forever-love-bear” from “/product”
10. The marketing team added a recommended product at the “/cart” page to suggest users
to buy the products together (cross-sell). This function was launched at 2013-09-25. The
director wanted to pull out the order/product information from 2013-08-25 to 2013-09-25,
and 2013-09-25 to 2013-10-25, to check the impact of the cross-sell function in these two
periods. The data that you pull out should include:
• time periods: from 2013-08-25 to 2013-09-25, and 2013-09-25 to 2013-10-25
• number of cart sessions: the number sessions reach to “/cart” in the two periods
• number of clickthrough: the number of sessions proceed through “/cart” page.
• clickthrough_rate:
• number of placed order for these two periods.
• number of products for these two periods.
• products per order
• revenue in these two periods: sum of the “price_usd”
• average of order value:
• revenue per cart:
HINTs: in the question, you will need to use tables like: orders, order_items, web site_pageviews, website_sessions
5

Published by
Essays
View all posts