Tokenize Strings in PL/SQL

Tokenize Strings in PL/SQL

Several ways of pulling strings apart

When pulling delimited strings apart in PL/SQL (or SQL statements in these examples) you'd usually use something like this:

WITH test_data AS (SELECT 'Diego Javier Llorente Ríos' as full_name FROM dual)
SELECT SUBSTR(full_name, 1, INSTR(full_name, ' ') - 1) as word_1, 
       SUBSTR(full_name, INSTR(full_name, ' ')+1, INSTR(full_name, ' ', 1, 2) - INSTR(full_name, ' ') - 1) as word_2,
       SUBSTR(full_name, INSTR(full_name, ' ', 1, 2)+1, INSTR(full_name, ' ', 1, 3) - INSTR(full_name, ' ', 1, 2) - 1) as word_3,
       SUBSTR(full_name, INSTR(full_name, ' ', 1, 3)+1) as word_4
FROM test_data;

This would produce:

It's standard practice to use something like the above.

The SUBSTR function is used to extract a specified number of characters from a string, starting from a given position.

The INSTR function, on the other hand, returns the position of a specified character within a string.

By using these two functions in combination, it is possible to extract specific information from a string, such as a first name or last name.

However, I always despair when I see how much code it produced. Just look at it! you can see, the number of lines of code required to extract all 4 words has increased and it becomes harder to read and understand.

Alternatively, the REGEXP_SUBSTR function can be used to extract specific information from a string using a regular expression pattern.

For example, to extract all 4 words from the string 'Diego Javier Llorente Ríos' and store them in separate columns, the following SQL statement can be used:

WITH test_data AS (SELECT 'Diego Javier Llorente Ríos' as full_name FROM dual)
SELECT REGEXP_SUBSTR(full_name, '[^ ]+', 1, 1) AS word_1, 
       REGEXP_SUBSTR(full_name, '[^ ]+', 1, 2) AS word_2,
       REGEXP_SUBSTR(full_name, '[^ ]+', 1, 3) AS word_3,
       REGEXP_SUBSTR(full_name, '[^ ]+', 1, 4) AS word_4
FROM test_data;

As you can see, the regular expression method is much more concise and easy to understand. It only requires one line of code to extract all 4 words.

It would be great if Oracle could provide a built-in function that makes this task even easier and more efficient.

In the meantime, I have built a handy PLSQL function to do just this. Which can also be used in SQL statements.

 FUNCTION f_split (
    p_str        IN VARCHAR2,
    p_sep        IN VARCHAR2 DEFAULT apex_application.LF,
    p_occurance  IN VARCHAR2 DEFAULT 1,
    p_limit      IN PLS_INTEGER DEFAULT NULL )
  RETURN VARCHAR2 DETERMINISTIC
  IS
    -- Variables
    l_tb            apex_t_varchar2;
    l_return        VARCHAR2(32767) DEFAULT NULL;
  BEGIN

    BEGIN
      l_tb := apex_string.split (
                p_str   => p_str,
                p_sep   => p_sep,
                p_limit => p_limit );
    EXCEPTION
      WHEN OTHERS 
      THEN
        -- Log something if you wish
        NULL;    
    END;

    IF l_tb.exists(p_occurance)
    THEN
        l_return := l_tb(p_occurance);
    END IF;
    RETURN l_return;

  END f_split;

To be honest, apex_string.split can be used directly in PL/SQL. However, because it uses a PLSQL table object type it cannot be used in SQL without a wrapper such as the above.

Let's give it a go:

WITH test_data AS (SELECT 'Diego Javier Llorente Ríos' as full_name FROM dual)
SELECT f_split(full_name, ' ', 1) AS word_1, 
       f_split(full_name, ' ', 2) AS word_2,
       f_split(full_name, ' ', 3) AS word_3,
       f_split(full_name, ' ', 4) AS word_4
FROM test_data;

Enjoy