Costco always has great deals, and I am a sucker for their sales items. Whenever I shop, I check their online flyer (or head to the end of each aisle where the sales items are) and end up purchasing things I definitely did not need. So my lovely husband suggested that a great project for me would be to build a web scraper and alert system for the things I want but don’t want to pay full price for. He told me it would be a great learning experience, but I think he just wanted me to stop buying random stuff we don’t need and filling our house with junk… Who’s to say what his true motives were, but away I went and started down this path. As with most beginner coding projects, I hit roadblocks and fell down research rabbit holes, so I’ve written up the lessons learned while keeping it much shorter than the rabbit holes I fell into. In this post, we will grab data from the flyer section on costco.ca and then create an alert when specific items appear on that list.
I have broken up the project into two parts:
Part 1: Pulling Costco Deal Item Details
Part 2: Comparing Costco Deals to Real Canadian Superstore Prices
For Part 2, I created a program that inputted each item one by one into the Real Canadian Superstore (RCS) search bar and scraped the first item returned. I chose to pull just the first item because, theoretically, it should be the closest match. Costco has only one bin that holds both the brand name and the item name, so the name cannot be separated into two sections, whereas RCS has one bin for the brand and one for the item name, which meant I was able to store the brand name and the item name separately in the dataframe.
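To give a rough idea of what that step looked like, here is a minimal sketch of searching RCS and grabbing the first result. The ids and class names (search-input, product-tile, and so on) are placeholders for illustration only, not the actual attributes on the RCS site.
from selenium.webdriver.common.keys import Keys

def first_rcs_match(driver, costco_item_name):
    # Placeholder selectors: the real RCS page uses different ids/class names
    search_box = driver.find_element_by_id("search-input")
    search_box.clear()
    search_box.send_keys(costco_item_name)
    search_box.send_keys(Keys.RETURN)

    # Take the first product returned since, in theory, it should be the closest match
    first_result = driver.find_elements_by_class_name("product-tile")[0]
    brand = first_result.find_element_by_class_name("product-brand").text
    name = first_result.find_element_by_class_name("product-name").text
    return brand, name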
Initially, I ran the full name of the item through the search bar and stored the matched brand and item name in a data frame called “rcs_costco”.
Next, I manually separated the brand from the item name for the Costco items and then ran a search through RCS with just the item name. I stored the brand name and the item name of both the Costco and RCS items in the “stripped_rcs_costco” data frame.
I wanted to know which data frame would be better to use based on which had more and better matches.
Right away, the searches without a brand name returned almost twice as many results (106 versus 54 items). Then it was important to see how closely related the items from the two different stores actually were.
My first approach was to use NLTK. I compared the percentage similarities both with and without stopwords included. When I analyzed the items that came back as very high matches versus very low matches, there was no clear indication that the similarity percentage actually reflected how close the items were, so I abandoned this method.
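For anyone curious, the similarity check looked roughly like this; it is a sketch of the idea (tokenize both names, optionally drop stopwords, and measure word overlap as a percentage), not my exact scoring code.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def name_similarity(name_a, name_b, remove_stopwords=True):
    # Tokenize and lowercase both product names
    tokens_a = set(word_tokenize(name_a.lower()))
    tokens_b = set(word_tokenize(name_b.lower()))
    if remove_stopwords:
        tokens_a -= stop_words
        tokens_b -= stop_words
    if not tokens_a or not tokens_b:
        return 0.0
    # Percentage of shared words between the two names
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b) * 100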
Since I did not have many items (a maximum of 106), I decided to label them by hand, putting 0 for not matched and 1 for matched, and then removed the non-matches from the data frame (“matched_rcs_costco”).
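As a rough sketch of that filtering step (assuming the hand-labelled flag lives in a column I will call "Matched" for illustration):
# Keep only the rows hand-labelled as a match (1) and drop the non-matches (0);
# "Matched" is an assumed column name for illustration
matched_rcs_costco = stripped_rcs_costco[stripped_rcs_costco["Matched"] == 1].copy()
matched_rcs_costco = matched_rcs_costco.drop(["Matched"], axis=1)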
Then I was able to start analyzing the data to answer my question: are Costco deals really better?
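The comparison itself boils down to something like the sketch below; the "Costco Price" and "RCS Price" column names are assumptions for illustration, and the scraped prices need to be converted from text to numbers first.
# Strip "$" and commas from the price strings and convert to floats
# (column names here are assumptions, not necessarily what the real data frame uses)
for col in ["Costco Price", "RCS Price"]:
    matched_rcs_costco[col] = (
        matched_rcs_costco[col].str.replace(r"[$,]", "", regex=True).astype(float)
    )

# Flag the rows where the Costco deal actually beats the RCS price
matched_rcs_costco["Costco Cheaper"] = (
    matched_rcs_costco["Costco Price"] < matched_rcs_costco["RCS Price"]
)
print(matched_rcs_costco["Costco Cheaper"].value_counts())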
Now for Part 1. As before, we must import our libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Then we specify the URL we will be using and access it with requests:
URL = 'https://www.costco.ca/coupons.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
However, you will notice that this does not work. It fails because costco.ca relies on information saved in cookies; specifically, it needs a location before it will show the Costco catalogue, so we must supply the location cookies in order to access the flyer savings. The general workaround of passing that information in via cookies (as described under my Fun Things Learnt → Cookies) does not work in this case, so I had to explore a new method. This is how I came to look at Selenium (how to set it up can be found under Libraries → Selenium).
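For reference, the general cookie workaround looks something like the snippet below: pass the site's location cookie along with the request. The cookie name and value here are made up for illustration; figuring out Costco's real location cookie is exactly what did not pan out for me.
import requests
from bs4 import BeautifulSoup

# Made-up cookie name/value for illustration; Costco's real location cookie differs
cookies = {"warehouse_location": "ON"}
page = requests.get("https://www.costco.ca/coupons.html", cookies=cookies)
soup = BeautifulSoup(page.content, "html.parser")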
As before, we must import our libraries. I will list each of the packages I will be working with here at the start, along with a quick description of what they do.
When we first open the Costco webpage, a pop-up asks for our location and we must select which province we live in to get the correct flyers. So we import Selenium, which allows us to interact with HTML elements, for example clicking on parts of a page through automation.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
As seen in the Selenium set-up, we must specify where the web driver is located and create a variable that calls the web driver for Chrome (or whatever browser you have chosen to use).
path = "/Applications/chromedriver-92"
driver = webdriver.Chrome(path)
url = "https://www.costco.ca/"
For this section, I got a lot of help from a specific YouTube programmer.
To access/open the webpage:
driver.get(url)
Here is a clearer picture of the source code for the location radio buttons:
Finding elements with Selenium can be done in many different ways. Here we will use XPath, because the attributes we would normally search by (id, name, or class) are not unique in this case and so are not very useful. This tutorial really helped. Since the provinces are radio buttons, we must use the .click() function. We will be selecting Ontario for the purpose of this post.
We want to find the container/attribute that differs between the buttons. In this case, the value attribute of the input container changes with the province/territory, so we can specify our path as //input[@value = 'ON']. The // lets us skip straight to the input containers, and [@value = 'ON'] selects the one with the specific attribute we want. (Side note: there is an easier way to get an XPath, which I go into in the pulling-data section, but I think it is important to understand the process before taking shortcuts, hence why I wanted to show this.) So our code would look like this:
select_region = driver.find_element_by_xpath("//input[@value = 'ON']").click()
Next we have to “click” the button that actually sets language and region:
set_region = driver.find_element_by_xpath("//input[@value = 'Set Language and Region']").click()
Now we are in!
We still need to get to the “More Savings at your local warehouse” page. I figured I could just type “coupons” into the search bar, and sure enough that brought me to the correct page on the Costco website. To find the search field in the page source, I did cmd-F for type="search" and found four results; of those four, the one I wanted had placeholder="Search Costco", so I knew it was the one I needed. Since it also has a unique id = "search-field", I was able to find the element by its id:
coupon_search = driver.find_element_by_id("search-field")
coupon_search.send_keys("coupon")
coupon_search.send_keys(Keys.RETURN)
When I clicked through to the page manually, it brought me to this website: https://www.costco.ca/coupons.html. Then I realized I didn’t need to go through the search bar to look for coupons at all, because I already had the webpage for it. Lesson learned? Think before doing more. So back to my code I went: I edited the url and deleted the steps I just described in the last paragraph. Here is the code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
path = "/Applications/chromedriver-92"
driver = webdriver.Chrome(path)
url = "https://www.costco.ca/coupons.html"
driver.get(url)
select_region = driver.find_element_by_xpath("//input[@value = 'ON']").click()
set_region = driver.find_element_by_xpath("//input[@value = 'Set Language and Region']").click()
We need to make sure the page is fully loaded before we try to pull data from it, so we need to use an explicit wait.
First, we need to find the ID container:
I pulled this code almost directly from the Selenium explicit wait example and only changed two things:
try:
    all_coupons = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "coupons-lp"))
    )
    # CODE TO BE ADDED
finally:
    driver.quit()
Before we pull the data, we must create lists to store the information we want: product name, price, discount, start and end dates.
names = []
prices = []
savings = []
start_dates = []
end_dates = []
All of the following code, which pulls the data from the container using various attributes, will be written within the try section.
We must find a container that contains all five of those details:
This “CLpbulkcoup” class contains the information we need for each individual coupon, so we want to find all elements (plural) that match this path, again using XPath. In Chrome, you can copy an element’s XPath and just paste it. This can cause errors depending on which path you have copied, so I always use it only as a starting point and then modify it based on what I need.
coupons_detail = all_coupons.find_elements_by_xpath("//div[@class='CLpbulkcoup']")
We will be using a for-loop to cycle through each of the coupons. I want to enumerate each item so I can track whether I have collected all the coupons and whether they correspond to the correct location. You do not need this for the code to work properly!
for i, coupon_detail in enumerate(coupons_detail):
Now, let’s break down how we will .find_element for each part we need:
coupon_name = coupon_detail.find_element_by_class_name("sl1")
coupon_savings = coupon_detail.find_element_by_class_name("price")
For the new price extraction, I ran into issues, primarily because the second coupon pulled did not have a price, only a % amount saved, so there was a lot of trial and error. First, I followed similar logic to what I had above:
coupon_price = coupon_detail.find_element_by_class_name("eco-priceTable")
What we can do now is import that exception and handle it. That would look like this:
from selenium.common.exceptions import NoSuchElementException
try:
    coupon_price = coupon_detail.find_element_by_class_name("eco-priceTable")
    print(coupon_price.text)
except NoSuchElementException:
    coupon_price = None
We have to make sure we print coupon_price.text inside the try section, because you cannot call .text on None. The output I got was:
1) $39.99
3) $24.99
4) $12.69
5) $16.99
6) $16.99
7) $9.99
8) $11.49
9) $14.99
10) $11.99
...
Now I learned something fancy (which in hindsight is really not that fancy, but I was still very excited by it): you can find elements and specify which index of element you want in one line. How, you ask? Well, first we want to find all elements with the class “eco-priceTable”:
coupon_detail.find_elements_by_class_name("eco-priceTable")
Then we add the index of the element we want on the same line:
coupon_detail.find_elements_by_class_name("eco-priceTable")[2]
However, we run into an error because at least one of our coupons does not have a price table, so we must account for that by returning None when there is no such element. First, we will store all the elements we have collected with the correct class name in the price_rows variable:
price_rows = coupon_detail.find_elements_by_class_name("eco-priceTable")
Then we can check whether price_rows has data stored: if it does, we store the third container’s text as coupon_price; otherwise we set coupon_price to None:
if price_rows:
    coupon_price = coupon_detail.find_elements_by_class_name("eco-priceTable")[2].text
else:
    coupon_price = None
start_date = coupon_detail.find_element_by_xpath(".//span[@class='CLP-validdates']/time[1]")
end_date = coupon_detail.find_element_by_xpath(".//span[@class='CLP-validdates']/time[2]")
We must not only pull the data from each coupon, but also append the text of each entry to the correct list. Now that we have found all of the paths and ways to access the data we want, this is the code we will have:
coupon_name = coupon_detail.find_element_by_class_name("sl1").text
names.append(coupon_name)
coupon_savings = coupon_detail.find_element_by_class_name("price").text
savings.append(coupon_savings)
price_rows = coupon_detail.find_elements_by_class_name("eco-priceTable")
if price_rows:
    coupon_price = coupon_detail.find_elements_by_class_name("eco-priceTable")[2].text
else:
    coupon_price = None
prices.append(coupon_price)
start_date = coupon_detail.find_element_by_xpath(".//span[@class='CLP-validdates']/time[1]").text
start_dates.append(start_date)
end_date = coupon_detail.find_element_by_xpath(".//span[@class='CLP-validdates']/time[2]").text
end_dates.append(end_date)
Now we just need to add the lists to a pandas dataframe:
coupons = pd.DataFrame({'Start Date' : start_dates,
'End Date' : end_dates,
'Product' : names,
'New Price' : prices,
'Savings' : savings,
})
Here is what our data frame looks like:
Now that we have all the data collected, it is time to search for our keyword. First, let's set what our keyword is (for testing purposes we will pick chicken, as there are two occurrences of it in the data):
key_word = "chicken"
Pandas has a string finder for Series, for which I found very helpful guidance on GeeksforGeeks. We will be adding a column ("Search") to the data that gives us a number when a keyword is found. The number in the column is the character position where the keyword starts; a result of -1 means the word was not found.
coupons["Search"] = coupons["Product"].str.find(key_word)
If the keyword I'm looking for is chicken, this is the result I would get:
Next, we need to get the index of the row(s); specifically, we are looking for any row that has a value greater than -1 in the "Search" column. .loc[] in pandas allows us to find rows matching specific criteria. We specify the data frame we are working with (in this case coupons), the label we are looking at (in this case the column ‘Search’), and that we only want values greater than -1.
relevant_coupons = coupons.loc[coupons['Search'] > -1].copy()
When printing the data, I do not want the "Search" column displayed (for aesthetic reasons), so I had to drop a column. The .drop() method I am using takes two parameters: the label of the column/row you want to drop, and whether you are dropping a row (axis = 0) or a column (axis = 1). So our code would look like this:
relevant_coupons = relevant_coupons.drop(['Search'], axis = 1)
Before we print the relevant_coupons data, we have to consider the possibility that .loc[] finds nothing, so we want an if/else statement. relevant_coupons.empty will be False if a key term was found, so if it is False we can 1) drop the ‘Search’ column and 2) print out the data (I add another print line just for aesthetics).
if not relevant_coupons.empty:
    relevant_coupons = relevant_coupons.drop(['Search'], axis = 1)
    print(relevant_coupons)
    print("\n ------------------------------------------------------------------------------------------ \n")
else:
    print("No coupons found")
    print("\n ------------------------------------------------------------------------------------------ \n")
Here is the csv file output you would get for searching the word chicken:
You can get the completed code on my GitHub page.