This blog was written for #JoelKallmanDay
I loved being around Joel, and I know he liked a nice Beer 🍺. So how about I take the immense power of Oracle Cloud and the amazing latest features of Oracle APEX, and ... um... scrape the web for the best Beer deals? - why not!?
Discount Beer? Yes, I'm sick of missing the best deals and watching them sell out quickly. To counter this... I'm using an Oracle Cloud Compute instance running the Scrapy Python framework to crawl a brewery website and retrieve all of its products in JSON format. If a new product appears, or a sudden price reduction takes an existing product below the magical 0.5 pence per ml, then APEX will send me a Push Notification on my phone... oh, and I'll create a tidy APEX application to browse my beers as well.
What's Scrapy?
I researched the best web scraping tools around... and the resounding answer was https://scrapy.org/, with which you can build and run your own web scraping spiders.
I took a quick tutorial, and now my entire web scraping knowledge comes from this 23-minute video, which I highly recommend you watch.
You'll see my Scrapy code for the brewery website I'm using. I won't explain how it works, because that video explains it far better than I could in this short blog.
Install Scrapy For Oracle Linux or WSL
I installed it on both, as I could develop the Scrapy code in VS Code under WSL on my Windows box, and then deploy to "production" on my micro (Always FREE) Oracle Linux Compute box.
Find your processor by using
uname -m
Mine is
x86_64
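If yours reports aarch64 instead (for example, on an Ampere A1 Always Free shape), download the matching installer in the next step:
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh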
so I will run the code below to download Miniconda, a tool that will install Scrapy for us.
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
Press ENTER
Exit Reading with CTRL+C
Type yes to accept licence terms
Press ENTER to confirm the location
Type yes to perform a conda init
Type exit to exit Terminal
Start another Terminal Session
Type the following to confirm the conda version
conda --version
Type the following to install Scrapy
conda install -c conda-forge scrapy
Type y to proceed
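Before moving on, you can double-check the install worked by printing the Scrapy version:
scrapy version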
Make sure nano is installed too
sudo yum install nano -y
Brewery Project
Start a scrapy project
scrapy startproject Thornbridge
This has created a set of folders
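It's the standard Scrapy project skeleton, which should look roughly like this:
Thornbridge/
    scrapy.cfg
    Thornbridge/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py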
Create a spider
nano Thornbridge/Thornbridge/spiders/thornbridge_spider.py
Paste in the following. If you want to know how I wrote this scrapy code, you have to watch the video I posted at the top of this blog.
import scrapy

class ThornbridgeSpider(scrapy.Spider):
    name = "thornbridge"

    start_urls = ["https://thornbridgebrewery.co.uk/collections/beercases",
                  "https://thornbridgebrewery.co.uk/collections/flash-sales",
                  "https://thornbridgebrewery.co.uk/collections/mixed-cases"]

    def parse(self, response):
        products = response.css("li.grid__item")

        for product in products:
            product_name = product.css("a.full-unstyled-link::text").get()
            price = product.css("span.price-item.price-item--regular::text").get()
            price_regular = product.css("div.price--on-sale div.price__sale s.price-item.price-item--regular::text").get()
            price_sale = product.css("span.price-item.price-item--sale::text").get()
            sold_out = product.css("div.price.price--sold-out::text").get()
            image = product.css("div.card__media div.media img").attrib['src']
            link = product.css("a.full-unstyled-link").attrib['href']

            # normalise the sold-out flag to a boolean
            if sold_out:
                sold_out = True
            else:
                sold_out = False

            # strip the currency formatting from the prices
            price = price.replace("£", "").replace(" GBP", "")

            if price_regular is None:
                price_regular = price
            else:
                price_regular = price_regular.replace("£", "").replace(" GBP", "")

            yield {
                "Name": product_name.strip(),
                "Image": "https:" + image,
                "Link": "https://thornbridgebrewery.co.uk" + link,
                "Price": price.strip(),
                "Regular": price_regular.strip(),
                "Sold_Out": sold_out
            }

        # Follow pagination links if available (assuming they use the "Next" button)
        next_page = response.css("a.pagination__item--prev::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
CTRL+X, then Y and ENTER, to save and exit nano
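At this point you can test the spider from the project folder. The -O flag writes (and overwrites) the results to a JSON file, which the shell script later in this blog reuses:
cd /home/opc/Thornbridge
scrapy crawl thornbridge -O beer.json
Each product lands in beer.json as an object shaped like this (illustrative values, not real prices):
[
  {
    "Name": "Jaipur IPA 12x330ml",
    "Image": "https://...",
    "Link": "https://thornbridgebrewery.co.uk/products/...",
    "Price": "24.00",
    "Regular": "26.50",
    "Sold_Out": false
  }
]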
Create a Schema
Create a beer_feed table to store the incoming feeds
CREATE TABLE "BEER_FEED" ( "ID" NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY MINVALUE 1 MAXVALUE 9999999999999999999999999999 INCREMENT BY 1 START WITH 1 CACHE 20 NOORDER NOCYCLE NOKEEP NOSCALE NOT NULL ENABLE, "FEED_DATE" DATE DEFAULT ON NULL SYSDATE NOT NULL ENABLE, "FEED_JSON" CLOB, "STATUS" VARCHAR2(60 CHAR) DEFAULT ON NULL 'NEW' NOT NULL ENABLE, CHECK (feed_json is json) ENABLE, CONSTRAINT "BEER_FEED_ID_PK" PRIMARY KEY ("ID") USING INDEX ENABLE )
Create a beer table to hold the beers, linked back to the feed.
-- create tables
create table beer (
    id            number generated by default on null as identity
                  constraint beer_id_pk primary key,
    beer_feed_id  number
                  constraint beer_beer_feed_id_fk
                  references beer_feed on delete cascade,
    name          varchar2(256 char),
    quantity      number,
    volume        number,
    vessel        varchar2(32 char),
    image         varchar2(4000 char),
    link          varchar2(4000 char),
    price         number,
    regular       number,
    sold_out      varchar2(5 char),
    price_per_ml  number
);

-- table index
create index beer_i1 on beer (beer_feed_id);
Create a trigger to calculate the price per ml
create or replace trigger "BEER_T"
before insert on "BEER"
referencing new as new old as old
for each row
begin
    :new.price_per_ml := round( :new.price / (:new.quantity * :new.volume), 5 );
end;
/
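As a sanity check of the arithmetic (hypothetical numbers): a 12 x 440ml case at £24.00 works out at 24 / (12 * 440) = 0.00455 pounds per ml, i.e. about 0.45p per ml, which squeaks under the 0.5p target used below.
select round( 24 / (12 * 440), 5 ) as price_per_ml from dual;
-- => .00455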
Create a Procedure to process the feed & send notifications if the price per millilitre is favourable for reduced or new products.
YOU MUST CHANGE THE APP_ID CONSTANT AT THE TOP OF THE FILE
create or replace PROCEDURE P_PROCESS_BEER_FEED(p_feed_id beer_feed.id%TYPE)
AS
    l_app_id_c            CONSTANT apex_applications.application_id%TYPE DEFAULT 109;
    l_target_price_per_ml CONSTANT NUMBER DEFAULT 0.005;
BEGIN
    -- Insert new records from JSON into the beer table
    INSERT INTO beer (beer_feed_id, name, quantity, volume, vessel, image, link, price, regular, sold_out)
    SELECT
        p_feed_id,
        jt.Name AS Name,
        -- pull the pack size (e.g. "12x" or "12 Beers") out of the product name
        CASE
            WHEN REGEXP_SUBSTR(jt.Name, '(\d+)x') IS NOT NULL THEN REPLACE( REGEXP_SUBSTR(jt.Name, '(\d+)x'), 'x' )
            WHEN REGEXP_SUBSTR(jt.Name, '(\d+)\s*x') IS NOT NULL THEN REGEXP_SUBSTR(jt.Name, '(\d+)\s')
            WHEN REGEXP_SUBSTR(jt.Name, '(\d+)\s+Beers') IS NOT NULL THEN REGEXP_SUBSTR(jt.Name, '(\d+)\s')
            ELSE REGEXP_SUBSTR(jt.Name, '(\d+)\s*ml')
        END AS quantity,
        -- pull the volume (e.g. "440ml") out of the product name
        REPLACE(
            CASE
                WHEN REGEXP_SUBSTR(name, '(\d+)\s*ml') IS NOT NULL THEN REGEXP_SUBSTR(name, '(\d+)\s*ml')
                ELSE NULL
            END, 'ml', '') AS volume,
        CASE
            WHEN REGEXP_SUBSTR(name, 'cans|bottles|Bottles|Cans') IS NOT NULL THEN LOWER(REGEXP_SUBSTR(name, 'cans|bottles|Bottles|Cans'))
            ELSE NULL
        END AS vessel,
        jt.Image AS Image,
        jt.Link AS Link,
        TO_NUMBER(jt.Price) AS Price,
        TO_NUMBER(jt.Regular) AS Regular,
        jt.Sold_Out AS Sold_Out
    FROM
        beer_feed,
        JSON_TABLE(feed_json, '$[*]'
            COLUMNS (
                Name     VARCHAR2(255) PATH '$.Name',
                Image    VARCHAR2(255) PATH '$.Image',
                Link     VARCHAR2(255) PATH '$.Link',
                Price    VARCHAR2(20)  PATH '$.Price',
                Regular  VARCHAR2(20)  PATH '$.Regular',
                Sold_Out VARCHAR2(5)   PATH '$.Sold_Out'
            )
        ) jt
    WHERE beer_feed.id = p_feed_id;

    -- Delete duplicate records based on the link, keeping one row per link in this feed
    DELETE FROM beer
    WHERE beer_feed_id = p_feed_id
    AND ROWID NOT IN (
        SELECT MAX(q.ROWID)
        FROM beer q
        WHERE q.beer_feed_id = p_feed_id
        GROUP BY q.link
    );

    -- Start sending out notifications
    FOR x IN (
        WITH
        latest AS (SELECT b.* FROM beer b, beer_feed f WHERE b.beer_feed_id = f.id AND status = 'LATEST'),
        previous AS (SELECT b.* FROM beer b, beer_feed f WHERE b.beer_feed_id = f.id AND status = 'PREVIOUS'),
        interesting AS (
            -- existing products that have dropped in price and are in stock
            SELECT l.*
            FROM latest l, previous p
            WHERE l.link = p.link
            AND l.price < p.price
            AND l.sold_out = 'false'
            UNION
            -- brand-new products (only when a previous feed exists to compare against)
            SELECT l.*
            FROM latest l
            WHERE l.link NOT IN (SELECT p.link FROM previous p)
            AND EXISTS ( SELECT p.link FROM previous p )
        )
        SELECT *
        FROM interesting
        WHERE price_per_ml <= l_target_price_per_ml
    )
    LOOP
        FOR l_subscription IN (
            SELECT DISTINCT user_name
            FROM apex_appl_push_subscriptions
            WHERE application_id = l_app_id_c )
        LOOP
            IF V('SESSION') IS NULL
            THEN
                apex_session.create_session (
                    p_app_id   => l_app_id_c,
                    p_page_id  => 1,
                    p_username => 'EXAMPLE USER' );
            END IF;

            apex_pwa.send_push_notification (
                p_application_id => l_app_id_c,
                p_user_name      => l_subscription.user_name,
                p_title          => 'Thornbridge Discount',
                p_body           => x.name,
                p_target_url     => x.link);
        END LOOP;
    END LOOP;
END P_PROCESS_BEER_FEED;
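If you want to exercise the procedure by hand before wiring up the REST endpoint (assuming a beer_feed row already exists), you can call it with that feed's ID:
begin
    p_process_beer_feed( p_feed_id => 1 );  -- hypothetical feed ID
end;
/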
REST Endpoint
In APEX, go to RESTful Services
Create a Module called scrapy with a Base Path of /scrapy/
Create a Resource Template called feed
Create a Handler of type POST with the following PL/SQL source
DECLARE
    l_feed_id beer_feed.id%TYPE DEFAULT NULL;
BEGIN
    UPDATE beer_feed SET status = 'HISTORIC' WHERE status = 'PREVIOUS';
    UPDATE beer_feed SET status = 'PREVIOUS' WHERE status = 'LATEST';

    INSERT INTO beer_feed( feed_json, status )
    VALUES ( :BODY, 'LATEST' )
    RETURNING id INTO l_feed_id;

    p_process_beer_feed(l_feed_id);
END;
Create a Privilege called scrapy and protect the scrapy module by adding it to the privilege's Protected Modules list
Click Create Privilege
Create a Client
BEGIN
    OAUTH.create_client(
        p_name            => 'scrapy_client',
        p_grant_type      => 'client_credentials',
        p_owner           => 'Myself',
        p_description     => 'A client for Scrapy',
        p_support_email   => 'me@example.com',
        p_privilege_names => 'scrapy' );
    COMMIT;
END;
/
Fetch the Client ID and Client Secret and save them in Notepad
SELECT client_id, client_secret FROM user_ords_clients WHERE name = 'scrapy_client'
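With those credentials you can sanity-check the OAuth setup from your Compute box before scripting it, using your own ORDS base URL in place of mine:
curl -s -k --user "CLIENT_ID:CLIENT_SECRET" \
  --data "grant_type=client_credentials" \
  https://g4-lufatp.adb.eu-frankfurt-1.oraclecloudapps.com/ords/tb/oauth/token
This should return a JSON document containing an access_token; the script below extracts it with grep and cut.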
Create a shell script to push to the endpoint
Create a file
nano /home/opc/scrape_and_post.sh
Paste in the contents below. Make sure you change the CLIENT_ID, CLIENT_SECRET and the URL (this appears twice).
#!/bin/bash

# Enter project folder
cd /home/opc/Thornbridge

# Scrape to beer.json
scrapy crawl thornbridge -O beer.json

# Define your client ID and client secret
CLIENT_ID="uEw.."
CLIENT_SECRET="pRw.."

# Get the access_token and store it in a variable
ACCESS_TOKEN=$(curl -s -k --user "$CLIENT_ID:$CLIENT_SECRET" --data "grant_type=client_credentials" https://g4-lufatp.adb.eu-frankfurt-1.oraclecloudapps.com/ords/tb/oauth/token | grep -o '"access_token":"[^"]*' | cut -d'"' -f4)

# Use the token in the second curl command
curl \
  --location 'https://g4-lufatp.adb.eu-frankfurt-1.oraclecloudapps.com/ords/tb/scrapy/feed' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $ACCESS_TOKEN" \
  --data @beer.json
CTRL+X, then Y and ENTER, to save the file and exit nano
Make it executable
chmod +x scrape_and_post.sh
Run it once like this; you can rerun it as often as you wish with the same command.
./scrape_and_post.sh
Schedule it
export EDITOR=nano; crontab -e
Paste this on the first line to run the scrape daily
0 0 * * * /bin/bash /home/opc/scrape_and_post.sh
CTRL+X, then Y and ENTER, to save the file and exit nano
Check to see if it is scheduled
crontab -l
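For reference, the five crontab fields are minute, hour, day of month, month and day of week, so 0 0 * * * runs at midnight every night. To scrape at 7am instead, for example:
0 7 * * * /bin/bash /home/opc/scrape_and_post.sh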
APEX Application
In Object Browser, create an Application from the beer table
Select Install Progressive Web App and Push Notifications and click Create Application
Run the application. I've modified mine a little; however, you should see an Interactive Report
Click Your Username > Settings and turn Push Notifications on
(Optional) Install the app on your device and also turn Push Notifications on
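Before waiting for a real bargain, you can (optionally) send yourself a test push from SQL Workshop. This is a minimal sketch using the same APEX_PWA call as the procedure above; it assumes the app ID of 109 and the EXAMPLE USER username from earlier:
begin
    if v('SESSION') is null then
        apex_session.create_session (
            p_app_id   => 109,
            p_page_id  => 1,
            p_username => 'EXAMPLE USER' );
    end if;

    apex_pwa.send_push_notification (
        p_application_id => 109,
        p_user_name      => 'EXAMPLE USER',  -- must have a push subscription
        p_title          => 'Test',
        p_body           => 'Push notifications are working!' );

    apex_pwa.push_queue;  -- send immediately rather than waiting for the background job
end;
/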
Wait for beer discounts to appear on your phone!
What's the picture? It's Crimple Valley Viaduct. Visit Yorkshire!