Tutorial

Symbolic sequences arise in very different contexts: abstract symbolic dynamics, the encoding of behavioral sequences, information theory, discrete Markov processes, and more. Here we take an applied perspective, in the spirit of a tutorial for a broad audience.

The Sequence objects

The basic objects of scyseq are symbolic sequences (instances of the Sequence class). A sequence represents an ordered series of symbolic states. Each state is encoded internally as an integer, and those integers are interpreted through a finite set of allowed symbols called an alphabet (an instance of the Alphabet class).

Each element of an alphabet is a Symbol. A Symbol has a readable string label (sval) and, once it belongs to an alphabet, an integer code (ival). This means a Sequence can be read in two complementary ways: as integer values (ivals) or as string labels (svals).

ScySeq object hierarchy

scyseq
├── Symbol
│   └── represents one possible state, with:
│       ├── an integer code assigned by an Alphabet (ival)
│       └── a readable string label (sval)
│
├── Alphabet
│   └── defines the finite set of allowed Symbols
│
└── Sequence
    ├── represents an ordered symbolic time series
    ├── stores integer codes internally (ivals)
    ├── uses one Alphabet to interpret those codes
    └── can be read in two ways:
        ├── integer representation: ivals
        └── string representation: svals
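
To make the two readings concrete, here is a minimal sketch in plain Python that does not use scyseq. A tuple of labels stands in for the alphabet, and the names labels, codes, and label_to_code are illustrative assumptions, not part of the package.

```python
# Plain-Python stand-in for an alphabet: position in the tuple = integer code.
labels = ("sitting", "standing")

# A sequence stored as integer codes (the "ivals" reading).
codes = [0, 1, 0, 1, 1]

# Decoding: look each code up in the alphabet (the "svals" reading).
svals = tuple(labels[c] for c in codes)

# Encoding goes the other way, from label back to integer code.
label_to_code = {label: code for code, label in enumerate(labels)}
roundtrip = [label_to_code[s] for s in svals]

print(svals)
print(roundtrip == codes)
```

Because the mapping between codes and labels is one-to-one, both readings carry the same information; storing integers simply keeps the data compact and easy to compute on.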

Example

Let’s take the example of a person observed while waiting for her plane at the airport. We can encode her behavior as either sitting or standing.

Symbols

[2]:
import scyseq as sq

st = sq.Symbol('sitting')
print(f"A single Symbol i.e. outside an alphabet: {st}") # a single symbol does not have integer representation
sd = sq.Symbol('standing')
behav = sq.Alphabet([st,sd]) # symbols in an alphabet get integer representation
print(f"Symbols inserted in an alphabet: {behav}")
A single Symbol i.e. outside an alphabet: (- | sitting)
Symbols inserted in an alphabet: ((0 | sitting), (1 | standing))

Alphabets

To avoid mistakes, access to the symbols of an alphabet is deliberately restricted:

  • You cannot replace a Symbol:

[3]:
try:
    behav[0] = sq.Symbol("test") # raises an exception
except sq.AlphabetAccessError as e:
    print(f"Raises an AlphabetAccessError: {e}.")
Raises an AlphabetAccessError: 'Alphabet' object does not support item assignment.
  • For each Symbol, ival is read-only:

[4]:
try:
    behav[0].ival = 99 # raises an exception
except sq.SymbolAccessError as e:
    print(f"Raises a SymbolAccessError: {e}.")
Raises a SymbolAccessError: ival is read-only.
  • But you can rename a symbol, i.e. change its sval:

[5]:
behav[0].sval = "assis"
behav
[5]:
Alphabet(Symbol(0 | assis), Symbol(1 | standing))

or rename symbols explicitly using a dictionary that maps each ival to its new sval:

[6]:
behav.rename({0: "sitting", 1: "debout"})
behav
[6]:
Alphabet(Symbol(0 | sitting), Symbol(1 | debout))

When the sval you give is already the sval of that same symbol, a warning is issued:

[7]:
behav.rename({0: "assis", 1: "debout"})
/home/zarpe/Nextcloud/Projets/development/scyseq/src/scyseq/sequence.py:388: UserWarning: 'debout' is already the sval of Symbol 1.
  return _rename(self, replacement)

But rename raises an exception if the sval already belongs to another symbol of the Alphabet:

[8]:
try:
    behav.rename({1: "assis"})
except sq.AlphabetAccessError as e:
    print(f"Raises an AlphabetAccessError: {e}.")
Raises an AlphabetAccessError: Symbol 'assis' already exists in alphabet.
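
The renaming rules above can be summarized in a small plain-Python sketch. This is not scyseq's implementation: rename_labels is a hypothetical helper that works on a tuple of labels, uses ValueError where scyseq raises AlphabetAccessError, and applies the renames one at a time in dictionary order (so swapping two labels in a single call would be rejected here).

```python
import warnings

def rename_labels(labels, replacement):
    """Return a new label tuple with the renames applied (hypothetical helper)."""
    new_labels = list(labels)
    for code, new_label in replacement.items():
        if new_labels[code] == new_label:
            # Same label on the same code: harmless, so only warn.
            warnings.warn(f"'{new_label}' is already the label of code {code}.")
            continue
        if new_label in new_labels:
            # Same label on a different code: labels must stay unique.
            raise ValueError(f"Label '{new_label}' already exists in alphabet.")
        new_labels[code] = new_label
    return tuple(new_labels)

labels = ("sitting", "debout")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels = rename_labels(labels, {0: "assis", 1: "debout"})

try:
    rename_labels(labels, {1: "assis"})
    clash_detected = False
except ValueError:
    clash_detected = True

print(labels, len(caught), clash_detected)
```

The point of the two different reactions is that a redundant rename leaves the alphabet valid, whereas a duplicate label would make the string reading of a sequence ambiguous.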

Sequences

Let’s define a sequence:

[9]:
behav = sq.Alphabet([st,sd]) # back to English :-)
bseq = sq.Sequence([0,1,0,1,1], behav)
print(f"Sequence as a string: {bseq}") # __str__(self)
print("\nSequence representation:")
bseq # __repr__(self)
Sequence as a string: [0 1 0 1 1]

Sequence representation:
[9]:
Sequence: [0 1 0 1 1]
Alphabet(Symbol(0 | sitting), Symbol(1 | standing))
N = 5 ; k = 2

The alphabet can also be given more simply, as an ordered sequence of strings:

[25]:
lseq = sq.Sequence([0,1,0,1,1], ('sitting', 'standing'))
all(lseq == bseq)
[25]:
True

The values of a sequence can be read as integer codes (ivals) or as string labels (svals):

[10]:
bseq.ivals # returns a **read-only** numpy.array
[10]:
array([0, 1, 0, 1, 1], dtype=uint8)
[12]:
bseq.svals # returns a tuple
[12]:
('sitting', 'standing', 'sitting', 'standing', 'standing')

Integer values are read-only:

[27]:
try:
    bseq.ivals[1] = 22 # raises an exception
except ValueError as e:
    print(f"Raises a ValueError: {e}.")
Raises a ValueError: assignment destination is read-only.

A first workflow with symbolic sequences

Now that we have seen the three basic objects, we can combine them in a small workflow. The usual pattern is:

  1. load or generate symbolic data,

  2. attach an alphabet, if needed, so the integer codes have meaning (loaded sequences usually come with their own alphabet),

  3. transform the sequence when needed,

  4. summarize it with simple counts or frequencies.

The examples below stay small on purpose. They are meant to show the package mechanics before moving to richer analyses.

Generating a symbolic sequence

For examples, simulations, or tests, scyseq.generator can create sequences for us. uniform_sequence draws integer symbols uniformly between 0 and k - 1; below, uniform_sequence(12, 3) draws 12 symbols from {0, 1, 2}. To get readable labels, we can reuse those integer values with a named Alphabet.

[16]:
import numpy as np
from scyseq.generator import generate, uniform_sequence

np.random.seed(123)

behaviour_alphabet = sq.Alphabet(["sitting", "standing", "walking"])
generated_codes = uniform_sequence(12, 3)
generated_behaviour = sq.Sequence(generated_codes.ivals, behaviour_alphabet)

generated_behaviour
[16]:
Sequence: [2 1 2 2 0 2 2 1 2 1 2 1]
Alphabet(Symbol(0 | sitting), Symbol(1 | standing), Symbol(2 | walking))
N = 12 ; k = 3

The same generator can also be called through generate. This is useful when the method name is chosen from a configuration, a script argument, or another part of a workflow.

[17]:
np.random.seed(123)
generated_with_dispatch = generate("uniform", 12, 3)
sq.Sequence(generated_with_dispatch.ivals, behaviour_alphabet).svals
[17]:
('walking',
 'standing',
 'walking',
 'walking',
 'sitting',
 'walking',
 'walking',
 'standing',
 'walking',
 'standing',
 'walking',
 'standing')
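
The idea of choosing a generator by name can be sketched with a plain dictionary used as a registry. The code below is an illustration only and does not use scyseq: uniform_codes, generate_codes, and GENERATORS are hypothetical names, and the standard-library random module is used instead of NumPy, so the drawn values will differ from the outputs shown above.

```python
import random

def uniform_codes(n, k, rng):
    """Draw n integer codes uniformly from 0..k-1 (hypothetical helper)."""
    return [rng.randrange(k) for _ in range(n)]

# Registry mapping a method name to a function, so the method can come
# from a configuration file or a command-line argument.
GENERATORS = {"uniform": uniform_codes}

def generate_codes(method, n, k, seed=None):
    """Look up the generator by name and call it (hypothetical helper)."""
    try:
        generator = GENERATORS[method]
    except KeyError:
        raise ValueError(f"unknown method: {method!r}") from None
    return generator(n, k, random.Random(seed))

# Direct call and dispatched call with the same seed give the same codes.
direct = uniform_codes(12, 3, random.Random(123))
dispatched = generate_codes("uniform", 12, 3, seed=123)
print(direct == dispatched)
```

Passing an explicit random.Random instance rather than seeding a global generator is a design choice of this sketch: it keeps each call reproducible without affecting other code.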

Discretizing continuous observations

Real observations are often continuous numbers rather than symbols. The partition function assigns each value of a numeric series to a bin, then stores the bin indices as a Sequence. Here a simple linearly increasing signal is divided into three symbolic levels.

[18]:
from scyseq.discretize import partition

waiting_signal = np.linspace(0, 10, 11)
level_sequence = partition(waiting_signal, method="histogram", nbin=3)

print(waiting_signal)
level_sequence
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[18]:
Sequence: [0 0 0 0 1 1 1 2 2 2 2]
Alphabet(Symbol(0 | 0), Symbol(1 | 1), Symbol(2 | 2))
N = 11 ; k = 3

The integer labels 0, 1, and 2 are valid symbols, but names make the example easier to read. We can attach a small alphabet describing the three levels.

[19]:
level_alphabet = sq.Alphabet(["low", "medium", "high"])
named_levels = sq.Sequence(level_sequence.ivals, level_alphabet)
named_levels.svals
[19]:
('low',
 'low',
 'low',
 'low',
 'medium',
 'medium',
 'medium',
 'high',
 'high',
 'high',
 'high')
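
The discretization above is consistent with equal-width binning over the range of the signal. The following plain-Python sketch reproduces it under that assumption; equal_width_bins is a hypothetical helper, not scyseq's partition, and it assumes a non-constant signal (otherwise the bin width would be zero).

```python
def equal_width_bins(values, nbin):
    """Assign each value to one of nbin equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbin
    # The maximum falls exactly on the last edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), nbin - 1) for v in values]

signal = [float(v) for v in range(11)]  # 0.0, 1.0, ..., 10.0
bin_codes = equal_width_bins(signal, 3)

# Attach readable names to the three levels.
level_names = ("low", "medium", "high")
named = [level_names[c] for c in bin_codes]

print(bin_codes)  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
print(named)
```

With bin edges at 0, 10/3, 20/3, and 10, the values 0 to 3 fall in the first bin, 4 to 6 in the second, and 7 to 10 in the third, which matches the sequence shown above.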

Transforming sequences

Many operations return a new Sequence and leave the original one unchanged. For example, roll shifts the sequence circularly, reverse reads it backwards, and reduce collapses each run of consecutive repeated symbols into a single symbol.

[20]:
rolled = generated_behaviour.roll(2)
reversed_sequence = generated_behaviour.reverse()
reduced = generated_behaviour.reduce()

print("original:", generated_behaviour.svals)
print("rolled:  ", rolled.svals)
print("reversed:", reversed_sequence.svals)
print("reduced: ", reduced.svals)
original: ('walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking', 'standing', 'walking', 'standing')
rolled:   ('walking', 'standing', 'walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking', 'standing')
reversed: ('standing', 'walking', 'standing', 'walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking')
reduced:  ('walking', 'standing', 'walking', 'sitting', 'walking', 'standing', 'walking', 'standing', 'walking', 'standing')
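
For readers who want to see what these three operations do to the integer codes, here is a plain-Python sketch on the same sequence. The helper names roll_codes, reverse_codes, and reduce_runs are hypothetical and the code does not call scyseq; it only mirrors the behaviour visible in the outputs above.

```python
from itertools import groupby

codes = [2, 1, 2, 2, 0, 2, 2, 1, 2, 1, 2, 1]

def roll_codes(seq, shift):
    """Circularly shift seq to the right by shift positions."""
    cut = len(seq) - shift % len(seq)
    return seq[cut:] + seq[:cut]

def reverse_codes(seq):
    """Return seq read backwards."""
    return seq[::-1]

def reduce_runs(seq):
    """Keep one symbol per run of consecutive repeats."""
    return [key for key, _ in groupby(seq)]

rolled = roll_codes(codes, 2)
reversed_codes = reverse_codes(codes)
reduced = reduce_runs(codes)

print(rolled)          # [2, 1, 2, 1, 2, 2, 0, 2, 2, 1, 2, 1]
print(reversed_codes)  # [1, 2, 1, 2, 1, 2, 2, 0, 2, 2, 1, 2]
print(reduced)         # [2, 1, 2, 0, 2, 1, 2, 1, 2, 1]
```

Each helper builds a new list, so codes itself is left untouched, in the same spirit as the Sequence methods.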

The words function creates a new sequence from overlapping blocks. With word length 2, each new symbol represents a pair of consecutive states, so a sequence of length N yields N - 1 words (11 here). This is a common first step before block entropy or other pattern analyses.

[21]:
two_state_words = sq.words(generated_behaviour, 2)

print(two_state_words.ivals)
two_state_words.alphabet
[7 5 8 6 2 8 7 5 7 5 7]
[21]:
Alphabet(Symbol(0 | 0), Symbol(1 | 1), Symbol(2 | 2), Symbol(3 | 3), Symbol(4 | 4), Symbol(5 | 5), Symbol(6 | 6), Symbol(7 | 7), Symbol(8 | 8))
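
The integer codes printed above are consistent with reading each block as a number in base k (here k = 3): the pair (2, 1) becomes 2 * 3 + 1 = 7. This is an inference from the output, not a documented guarantee. A minimal plain-Python sketch of that encoding, with the hypothetical helper base_k_words, looks like this:

```python
def base_k_words(codes, k, length):
    """Encode each overlapping block of `length` codes as one base-k integer."""
    out = []
    for start in range(len(codes) - length + 1):
        value = 0
        for c in codes[start:start + length]:
            value = value * k + c  # shift by one base-k digit, add the next code
        out.append(value)
    return out

codes = [2, 1, 2, 2, 0, 2, 2, 1, 2, 1, 2, 1]
pairs = base_k_words(codes, k=3, length=2)

print(pairs)       # [7, 5, 8, 6, 2, 8, 7, 5, 7, 5, 7]
print(len(pairs))  # 11
```

With k symbols and word length L there are k**L possible words, which explains why the new alphabet has 9 symbols even though only some of them occur in this short sequence.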

When we have two sequences of the same length, recode combines them into one joint symbolic sequence. Each new symbol represents the pair observed at the same time step.

[22]:
posture = sq.Sequence([0, 0, 1, 1, 0], sq.Alphabet(["sitting", "standing"]))
attention = sq.Sequence([0, 1, 0, 1, 1], sq.Alphabet(["phone", "gate"]))

joint = sq.recode([posture, attention], new_alphabet=True, names=["posture", "attention"])

print(joint.ivals)
joint.alphabet.svals
[0 1 2 3 1]
[22]:
('posture_sitting+attention_phone',
 'posture_sitting+attention_gate',
 'posture_standing+attention_phone',
 'posture_standing+attention_gate')
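
The joint codes above match a simple mixed-radix rule: joint = posture * (number of attention symbols) + attention. The sketch below reproduces the codes and labels in plain Python under that assumption; it does not call scyseq, the variable names are illustrative, and both sequences are assumed to have the same length.

```python
from itertools import product

posture_labels = ("sitting", "standing")
attention_labels = ("phone", "gate")
posture = [0, 0, 1, 1, 0]
attention = [0, 1, 0, 1, 1]

# One joint code per time step: the posture code is the high "digit",
# the attention code is the low "digit".
k_attention = len(attention_labels)
joint = [p * k_attention + a for p, a in zip(posture, attention)]

# product() enumerates label pairs in the same order as the joint codes.
joint_labels = tuple(
    f"posture_{p}+attention_{a}"
    for p, a in product(posture_labels, attention_labels)
)

print(joint)  # [0, 1, 2, 3, 1]
print(joint_labels[joint[1]])  # posture_sitting+attention_gate
```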

Summarizing a sequence

Two quick summaries are often enough to check that a sequence looks reasonable: count gives the number of occurrences of each symbol, and frequency gives the same information as proportions.

[23]:
summary = list(zip(generated_behaviour.alphabet.svals, generated_behaviour.count(), generated_behaviour.frequency()))
summary
[23]:
[('sitting', np.int64(1), np.float64(0.08333333333333333)),
 ('standing', np.int64(4), np.float64(0.3333333333333333)),
 ('walking', np.int64(7), np.float64(0.5833333333333334))]
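
The same summary can be cross-checked by hand with the standard library. This sketch works on the plain integer codes of the sequence above and uses collections.Counter; the variable names are illustrative and nothing here depends on scyseq.

```python
from collections import Counter

labels = ("sitting", "standing", "walking")
codes = [2, 1, 2, 2, 0, 2, 2, 1, 2, 1, 2, 1]

tally = Counter(codes)
# Index by code so that symbols that never occur still get a count of 0.
counts = [tally[c] for c in range(len(labels))]
freqs = [n / len(codes) for n in counts]
summary = list(zip(labels, counts, freqs))

print(counts)  # [1, 4, 7]
for label, n, f in summary:
    print(f"{label:9s} {n:2d} {f:.3f}")
```

Counts sum to the sequence length and frequencies sum to 1, which makes a quick sanity check after loading or transforming data.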

This small workflow is enough to prepare data for the next layer of the package: entropy, Lempel-Ziv complexity, recurrence analysis, stochastic matrices, and visualization. We will introduce those tools one by one in later sections.