{ "cells": [ { "cell_type": "markdown", "id": "9dd39793", "metadata": {}, "source": [ "# Tutorial\n", "\n", "Symbolic sequences arise in very different contexts, from abstract symbolic dynamics to the encoding of behavioral sequences, information theory, and discrete Markov processes. Here we take an applied perspective, in the spirit of a tutorial for a broad audience.\n", "\n", "## The `Sequence` objects\n", "\n", "The basic objects of `scyseq` are symbolic _sequences_ (instances of the\n", "`Sequence` class). A sequence represents an ordered series of symbolic states.\n", "Each state is encoded internally as an integer, and those integers are interpreted\n", "through a finite set of allowed symbols called an _alphabet_ (an instance of the\n", "`Alphabet` class).\n", "\n", "Each element of an alphabet is a `Symbol`. A `Symbol` has a readable string label\n", "(`sval`) and, once it belongs to an alphabet, an integer code (`ival`). This means\n", "a `Sequence` can be read in two complementary ways: as integer values (`ivals`) or\n", "as string labels (`svals`).\n", "\n", "### ScySeq object hierarchy\n", "\n", "```text\n", "scyseq\n", "├── Symbol\n", "│ └── represents one possible state, with:\n", "│ ├── an integer code assigned by an Alphabet (ival)\n", "│ └── a readable string label (sval)\n", "│\n", "├── Alphabet\n", "│ └── defines the finite set of allowed Symbols\n", "│\n", "└── Sequence\n", " ├── represents an ordered symbolic time series\n", " ├── stores integer codes internally (ivals)\n", " ├── uses one Alphabet to interpret those codes\n", " └── can be read in two ways:\n", " ├── integer representation: ivals\n", " └── string representation: svals\n", "\n", "```\n", "\n", "### Example\n", "\n", "As an example, consider observing a person waiting for her plane at the airport. We can encode her behavior as `sitting` or `standing`. 
\n", "\n", "#### Symbols" ] }, { "cell_type": "code", "execution_count": 2, "id": "f1a68e62", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A single Symbol i.e. outside an alphabet: (- | sitting)\n", "Symbols inserted in an alphabet: ((0 | sitting), (1 | standing))\n" ] } ], "source": [ "import scyseq as sq\n", "\n", "st = sq.Symbol('sitting')\n", "print(f\"A single Symbol i.e. outside an alphabet: {st}\") # a single symbol does not have integer representation\n", "sd = sq.Symbol('standing')\n", "behav = sq.Alphabet([st,sd]) # symbols in an alphabet get integer representation\n", "print(f\"Symbols inserted in an alphabet: {behav}\")" ] }, { "cell_type": "markdown", "id": "c5e05de4-0e0c-4220-bf6c-7f44b3ca8061", "metadata": {}, "source": [ "#### Alphabets\n", "\n", "To prevent mistakes, the symbols of an alphabet are protected against accidental modification:\n", "\n", "* You cannot replace a Symbol by item assignment:" ] }, { "cell_type": "code", "execution_count": 3, "id": "920a6031-1da7-49ba-969b-04967ed9090a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raises an AlphabetAccessError: 'Alphabet' object does not support item assignment.\n" ] } ], "source": [ "try:\n", " behav[0] = sq.Symbol(\"test\") # raises an exception\n", "except sq.AlphabetAccessError as e:\n", " print(f\"Raises an AlphabetAccessError: {e}.\")\n", " " ] }, { "cell_type": "markdown", "id": "721a4d8a-26d6-4ef6-8685-b2bd38d15d6e", "metadata": {}, "source": [ "* For each Symbol, `ival` is read-only:" ] }, { "cell_type": "code", "execution_count": 4, "id": "3d6efba8-b019-408c-86bf-af9fb8b509fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raises a SymbolAccessError: ival is read-only.\n" ] } ], "source": [ "try:\n", " behav[0].ival = 99 # raises an exception\n", "except sq.SymbolAccessError as e:\n", " print(f\"Raises a SymbolAccessError: {e}.\")" ] }, { "cell_type": "markdown", "id": 
"71704215-9e00-49b7-ad24-eaac360dd4cf", "metadata": {}, "source": [ "* But you can rename a symbol, i.e. change its `sval`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "cda4f3a6-fd41-435c-9a4c-5baa9191e0b0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alphabet(Symbol(0 | assis), Symbol(1 | standing))" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "behav[0].sval = \"assis\"\n", "behav" ] }, { "cell_type": "markdown", "id": "1793fa7f-299a-4e82-ad48-b5f614ea2bc8", "metadata": {}, "source": [ "or rename symbols explicitly using a dictionary keyed by `ival`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "8837daa2-b796-40ab-a85d-13f1949418e8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Alphabet(Symbol(0 | sitting), Symbol(1 | debout))" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "behav.rename({0: \"sitting\", 1: \"debout\"})\n", "behav" ] }, { "cell_type": "markdown", "id": "7ba1cd10-1d95-4ea8-ab00-e2c8e7e5f296", "metadata": {}, "source": [ "If the new `sval` is already the label of that same symbol, a warning is issued:" ] }, { "cell_type": "code", "execution_count": 7, "id": "f3047ae8-b38f-4fe0-9f07-ae2b576c104e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/zarpe/Nextcloud/Projets/development/scyseq/src/scyseq/sequence.py:388: UserWarning: 'debout' is already the sval of Symbol 1.\n", " return _rename(self, replacement)\n" ] } ], "source": [ "behav.rename({0: \"assis\", 1: \"debout\"})" ] }, { "cell_type": "markdown", "id": "b666f67e-e8ca-4fbd-a944-815cce0c678a", "metadata": {}, "source": [ "But an exception is raised if the `sval` already belongs to another symbol of the Alphabet:" ] }, { "cell_type": "code", "execution_count": 8, "id": "2e6fe800-ebf6-4b37-8d4c-c1b4a7c4f8f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raises a AlphabetAccessError: Symbol 'assis' already 
exists in alphabet.\n" ] } ], "source": [ "try:\n", " behav.rename({1: \"assis\"})\n", "except sq.AlphabetAccessError as e:\n", " print(f\"Raises a AlphabetAccessError: {e}.\") " ] }, { "cell_type": "markdown", "id": "5c55f05f-1a58-461e-ac7b-3dc44612c6ca", "metadata": {}, "source": [ "#### Sequences\n", "\n", "Let's define a sequence:" ] }, { "cell_type": "code", "execution_count": 9, "id": "6883f201", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence as a string: [0 1 0 1 1]\n", "\n", "Sequence representation:\n" ] }, { "data": { "text/plain": [ "Sequence: [0 1 0 1 1]\n", "Alphabet(Symbol(0 | sitting), Symbol(1 | standing))\n", "N = 5 ; k = 2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "behav = sq.Alphabet([st,sd]) # back to English :-)\n", "bseq = sq.Sequence([0,1,0,1,1], behav)\n", "print(f\"Sequence as a string: {bseq}\") # __str__(self)\n", "print(\"\\nSequence representation:\")\n", "bseq # __repr__(self)" ] }, { "cell_type": "markdown", "id": "b2f483b5-2486-42d8-80be-1f11e8c58c31", "metadata": {}, "source": [ "The same sequence can be built more simply by passing the alphabet as an ordered sequence of strings:" ] }, { "cell_type": "code", "execution_count": 25, "id": "7bfbbe45-b594-4606-9859-63de396e9661", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lseq = sq.Sequence([0,1,0,1,1], ('sitting', 'standing'))\n", "all(lseq == bseq)" ] }, { "cell_type": "markdown", "id": "248e5fc1-2edc-4802-804a-7cdc12cf8114", "metadata": {}, "source": [ "The values of a sequence can be accessed in both representations:" ] }, { "cell_type": "code", "execution_count": 10, "id": "b92709df-51f7-4b4f-90a8-33ade0223487", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 0, 1, 1], dtype=uint8)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bseq.ivals # returns a 
**read-only** numpy.array" ] }, { "cell_type": "code", "execution_count": 12, "id": "f63859cc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sitting', 'standing', 'sitting', 'standing', 'standing')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bseq.svals # returns a tuple" ] }, { "cell_type": "markdown", "id": "867f1f3c-7861-4763-bdc6-3553fd669eb4", "metadata": {}, "source": [ "Integer values are read-only:" ] }, { "cell_type": "code", "execution_count": 27, "id": "dcbbe033-aa3d-43e9-bb35-18b799357f14", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raises a ValueError: assignment destination is read-only.\n" ] } ], "source": [ "try:\n", " bseq.ivals[1] = 22 # raises an exception\n", "except ValueError as e:\n", " print(f\"Raises a ValueError: {e}.\")" ] }, { "cell_type": "markdown", "id": "89974e6f", "metadata": {}, "source": [ "## A first workflow with symbolic sequences\n", "\n", "Now that we have seen the three basic objects, we can combine them in a small workflow. The usual pattern is:\n", "\n", "1. load or generate symbolic data,\n", "2. attach an alphabet, if needed, so that the integer codes have meaning (loaded sequences usually come with their own alphabet),\n", "3. transform the sequence when needed,\n", "4. summarize it with simple counts or frequencies.\n", "\n", "The examples below stay small on purpose. They are meant to show the package mechanics before moving to richer analyses.\n" ] }, { "cell_type": "markdown", "id": "5e9724ce", "metadata": {}, "source": [ "### Generating a symbolic sequence\n", "\n", "For examples, simulations, or tests, `scyseq.generator` can create sequences for us. `uniform_sequence` draws integer symbols uniformly between `0` and `k - 1`. 
If we want readable labels, we can reuse those integer values with a named `Alphabet`.\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "9de98aeb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sequence: [2 1 2 2 0 2 2 1 2 1 2 1]\n", "Alphabet(Symbol(0 | sitting), Symbol(1 | standing), Symbol(2 | walking))\n", "N = 12 ; k = 3" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from scyseq.generator import generate, uniform_sequence\n", "\n", "np.random.seed(123)\n", "\n", "behaviour_alphabet = sq.Alphabet([\"sitting\", \"standing\", \"walking\"])\n", "generated_codes = uniform_sequence(12, 3)\n", "generated_behaviour = sq.Sequence(generated_codes.ivals, behaviour_alphabet)\n", "\n", "generated_behaviour" ] }, { "cell_type": "markdown", "id": "4e785b5a", "metadata": {}, "source": [ "The same generator can also be called through `generate`. This is useful when the method name is chosen from a configuration, a script argument, or another part of a workflow.\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "2d3d8757", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('walking',\n", " 'standing',\n", " 'walking',\n", " 'walking',\n", " 'sitting',\n", " 'walking',\n", " 'walking',\n", " 'standing',\n", " 'walking',\n", " 'standing',\n", " 'walking',\n", " 'standing')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(123)\n", "generated_with_dispatch = generate(\"uniform\", 12, 3)\n", "sq.Sequence(generated_with_dispatch.ivals, behaviour_alphabet).svals" ] }, { "cell_type": "markdown", "id": "013c92ed", "metadata": {}, "source": [ "### Discretizing continuous observations\n", "\n", "Real observations are often continuous numbers rather than symbols. The `partition` function turns a numeric series into bins, then stores the bin numbers as a `Sequence`. 
Here a simple linearly increasing signal is divided into three symbolic levels.\n" ] }, { "cell_type": "code", "execution_count": 18, "id": "add61920", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]\n" ] }, { "data": { "text/plain": [ "Sequence: [0 0 0 0 1 1 1 2 2 2 2]\n", "Alphabet(Symbol(0 | 0), Symbol(1 | 1), Symbol(2 | 2))\n", "N = 11 ; k = 3" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scyseq.discretize import partition\n", "\n", "waiting_signal = np.linspace(0, 10, 11)\n", "level_sequence = partition(waiting_signal, method=\"histogram\", nbin=3)\n", "\n", "print(waiting_signal)\n", "level_sequence" ] }, { "cell_type": "markdown", "id": "6cc39aa8", "metadata": {}, "source": [ "The integer labels `0`, `1`, and `2` are valid symbols, but names make the example easier to read. We can attach a small alphabet describing the three levels.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "659f5860", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('low',\n", " 'low',\n", " 'low',\n", " 'low',\n", " 'medium',\n", " 'medium',\n", " 'medium',\n", " 'high',\n", " 'high',\n", " 'high',\n", " 'high')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "level_alphabet = sq.Alphabet([\"low\", \"medium\", \"high\"])\n", "named_levels = sq.Sequence(level_sequence.ivals, level_alphabet)\n", "named_levels.svals" ] }, { "cell_type": "markdown", "id": "7b46a718", "metadata": {}, "source": [ "### Transforming sequences\n", "\n", "Many operations return a new `Sequence` and leave the original one unchanged. 
For example, `roll` shifts the sequence, `reverse` reads it backwards, and `reduce` removes consecutive repetitions.\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "4a6fe428", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original: ('walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking', 'standing', 'walking', 'standing')\n", "rolled: ('walking', 'standing', 'walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking', 'standing')\n", "reversed: ('standing', 'walking', 'standing', 'walking', 'standing', 'walking', 'walking', 'sitting', 'walking', 'walking', 'standing', 'walking')\n", "reduced: ('walking', 'standing', 'walking', 'sitting', 'walking', 'standing', 'walking', 'standing', 'walking', 'standing')\n" ] } ], "source": [ "rolled = generated_behaviour.roll(2)\n", "reversed_sequence = generated_behaviour.reverse()\n", "reduced = generated_behaviour.reduce()\n", "\n", "print(\"original:\", generated_behaviour.svals)\n", "print(\"rolled: \", rolled.svals)\n", "print(\"reversed:\", reversed_sequence.svals)\n", "print(\"reduced: \", reduced.svals)" ] }, { "cell_type": "markdown", "id": "95ac33f0", "metadata": {}, "source": [ "The `words` function creates a new sequence from overlapping blocks. With word length 2, each new symbol represents a pair of consecutive states. 
This is a common first step before block entropy or other pattern analyses.\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "406912d8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[7 5 8 6 2 8 7 5 7 5 7]\n" ] }, { "data": { "text/plain": [ "Alphabet(Symbol(0 | 0), Symbol(1 | 1), Symbol(2 | 2), Symbol(3 | 3), Symbol(4 | 4), Symbol(5 | 5), Symbol(6 | 6), Symbol(7 | 7), Symbol(8 | 8))" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "two_state_words = sq.words(generated_behaviour, 2)\n", "\n", "print(two_state_words.ivals)\n", "two_state_words.alphabet" ] }, { "cell_type": "markdown", "id": "8b039b84", "metadata": {}, "source": [ "When we have two sequences of the same length, `recode` combines them into one joint symbolic sequence. Each new symbol represents the pair observed at the same time step.\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "5d2a6a01", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 3 1]\n" ] }, { "data": { "text/plain": [ "('posture_sitting+attention_phone',\n", " 'posture_sitting+attention_gate',\n", " 'posture_standing+attention_phone',\n", " 'posture_standing+attention_gate')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "posture = sq.Sequence([0, 0, 1, 1, 0], sq.Alphabet([\"sitting\", \"standing\"]))\n", "attention = sq.Sequence([0, 1, 0, 1, 1], sq.Alphabet([\"phone\", \"gate\"]))\n", "\n", "joint = sq.recode([posture, attention], new_alphabet=True, names=[\"posture\", \"attention\"])\n", "\n", "print(joint.ivals)\n", "joint.alphabet.svals" ] }, { "cell_type": "markdown", "id": "161ffdbe", "metadata": {}, "source": [ "### Summarizing a sequence\n", "\n", "Two quick summaries are often enough to check that a sequence looks reasonable: `count` gives the number of visits to each symbol, and `frequency` gives the same information as 
proportions.\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "a94ff12e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('sitting', np.int64(1), np.float64(0.08333333333333333)),\n", " ('standing', np.int64(4), np.float64(0.3333333333333333)),\n", " ('walking', np.int64(7), np.float64(0.5833333333333334))]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary = list(zip(generated_behaviour.alphabet.svals, generated_behaviour.count(), generated_behaviour.frequency()))\n", "summary" ] }, { "cell_type": "markdown", "id": "a6735df5", "metadata": {}, "source": [ "This small workflow is enough to prepare data for the next layer of the package: entropy, Lempel-Ziv complexity, recurrence analysis, stochastic matrices, and visualization. We will introduce those tools one by one in later sections.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.12" } }, "nbformat": 4, "nbformat_minor": 5 }