Reading data saved in HDF5 files

Introduction

Striptease provides an interface to the data acquired by the instrument and stored in HDF5 files, which allows the user to access both scientific and housekeeping data and returns them as (time, data) pairs of NumPy arrays.

Before reading this chapter, you should be aware of the way data are saved to disk when LSPE/Strip operates. The instrument sends all the scientific (PWR and DEM timelines) and housekeeping (biases, temperatures, …) data to a computer, where a program called the data server receives them and writes them to HDF5 files. The stream of data is usually continuous, but every now and then the data server stops writing to the current file and opens a new one. This ensures that files do not grow too big and that you can grab one of them for analysis while the instrument is still running. These files are saved in a hierarchical structure of directories, grouped according to the year and month of acquisition. So, assuming that these data files are stored in /storage/strip, you might find the following files in the Strip data storage:

/storage/strip/2021/10
    2021_10_31_15-42-44.h5
    2021_10_31_19-42-44.h5
    2021_10_31_23-42-44.h5
/storage/strip/2021/11
    2021_11_01_03-42-44.h5
    2021_11_01_07-42-44.h5
    2021_11_01_11-42-44.h5
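
Given this layout, the acquisition time of each file can be recovered directly from its name. The following snippet is not part of Striptease; it is just a sketch that assumes the year/month directory structure and the YYYY_MM_DD_HH-MM-SS.h5 naming pattern shown above:

from datetime import datetime
from pathlib import Path

# Hypothetical helper: parse the acquisition time out of a file name,
# assuming the naming pattern shown in the listing above
def acquisition_time(file_path: Path) -> datetime:
    return datetime.strptime(file_path.name, "%Y_%m_%d_%H-%M-%S.h5")

# List every HDF5 file under the storage path, sorted by acquisition time
storage = Path("/storage/strip")
for h5_file in sorted(storage.glob("*/*/*.h5"), key=acquisition_time):
    print(acquisition_time(h5_file), h5_file)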

It is now time to see the kind of tools that Striptease provides to access these files. The interface is based on two classes:

  • DataFile provides a high-level access to a single HDF5 file.
  • DataStorage provides a high-level access to a directory containing several HDF5 files, like the one above. It basically abstracts over the concept of a «file» and instead considers the data as being acquired continuously.

Accessing one file

Let’s begin with a small example that shows how to use the class DataFile:

from striptease import DataFile
fname = '/path_to/data.h5'

with DataFile(fname) as my_data:
    # Load a HK time series
    time, data = my_data.load_hk("POL_Y6", "BIAS", "VG4A_SET")

    # Load some scientific data
    time, data = my_data.load_sci("POL_G0", "DEM", "Q1")

The class DataFile assumes that the name of the file follows the convention used by the acquisition software during the system-level tests in Bologna and the nominal data acquisition in Tenerife. Be sure not to rename the HDF5 files!

Since the HDF5 files used in Strip have a complex structure, Striptease provides a few facilities to handle them. To load the timelines of housekeeping parameters and detector outputs, the DataFile class provides two methods: DataFile.load_hk() and DataFile.load_sci().

Moreover, a method DataFile.get_average_biases() can be used to retrieve the average level of biases within some time frame.

class striptease.DataFile(filepath, filemode='r')[source]

A HDF5 file containing timelines acquired by Strip

This is basically a high-level wrapper over a h5py.File object. It assumes that the HDF5 file was saved by the acquisition software used in Bologna and Tenerife, and it provides some tools to navigate through the data saved in the file.

Creating a DataFile object does not automatically open the file; this keeps resource usage low, as the file is opened lazily only when you call one of the methods that need to access its contents.

The two methods you are going to use most of the time are load_hk() and load_sci(), documented below.

You can access these class fields directly:

  • filepath: a Path object containing the full path of the
    HDF5 file
  • datetime: a Python datetime object containing the time
    when the acquisition started
  • mjd_range: a pair of float numbers representing the
    MJD of the first and last sample in the file. To initialize this field, you must call DataFile.read_file_metadata first.
  • computed_mjd_range: a Boolean that is set to True if
    the field mjd_range was computed through a complete scan of the datasets in the file, and to False if it was read from the file attributes. (In the former case, the user might want to save the MJD range back into the file attributes so that it is ready for future use, as scanning a HDF5 file can take considerable time.)
  • hdf5_groups: a list of str objects containing the names
    of the groups in the HDF5 file. To initialize this field, you must call DataFile.read_file_metadata first.
  • polarimeters: a Python set object containing the names
    of the polarimeters whose measurements have been saved in this file. To initialize this field, you must call DataFile.read_file_metadata first.
  • hdf5_file: if the file has been opened using
    read_file_metadata(), this is the h5py.File object.
  • tags: a list of Tag objects; you must call read_file_metadata() before reading it.

This class can be used in with statements; in this case, it will automatically open and close the file:

with DataFile(myfile) as inpf:
    # The variable "inpf" is a DataFile object in this context
    ...
get_average_biases(polarimeter, time_range=None, calibration_tables=None) → striptease.biases.BiasConfiguration[source]

Return a BiasConfiguration object containing the average values of biases for a polarimeter.

The parameter polarimeter must be a string containing the name of the polarimeter, e.g., Y0. The parameter time_range, if specified, is a 2-element tuple containing the start and end MJDs to consider in the average. If calibration_tables is specified, it must be an instance of the CalibrationTables class.

If calibration_tables is specified, the values in the returned BiasConfiguration object are calibrated to physical units; otherwise, they are expressed in ADUs.
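
The documentation above does not include an example, so here is a minimal sketch of how get_average_biases() might be called; the file name and the polarimeter name are placeholders, and no calibration tables are passed, so the returned values are in ADUs:

from striptease import DataFile

with DataFile("/path_to/data.h5") as my_data:
    # Average the biases of polarimeter Y0 over the whole file;
    # without calibration tables the result is expressed in ADUs
    biases = my_data.get_average_biases(polarimeter="Y0")
    print(biases)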

load_hk(group, subgroup, par, verbose=False)[source]

Loads the timeline of a housekeeping parameter

Parameters:
  • group (str) – Either BIAS or DAQ
  • subgroup (str) – Name of the housekeeping group. It can either be POL_XY or BOARD_X, with X being the letter identifying the module, and Y the polarimeter number within the module. Possible examples are POL_G0 and BOARD_Y.
  • par (str) – Name of the housekeeping parameter, e.g. ID4_DIV.
  • verbose (bool) – Whether to echo the housekeeping parameter being loaded. Default is False.
Returns:

the stream of times (using the astropy.time.Time datatype), and the stream of data.

Return type:

A tuple containing two NumPy arrays

Example:

from striptease.hdf5files import DataFile

f = DataFile(filename)
time, data = f.load_hk("POL_Y6", "BIAS", "VG4A_SET")
load_sci(polarimeter, data_type, detector=[])[source]

Loads scientific data from one detector of a given polarimeter

Parameters:
  • polarimeter (str) – Name of the polarimeter, in the form POL_XY or XY for short, with X being the module letter and Y the polarimeter number within the module.
  • data_type (str) – Type of data to load, either DEM or PWR.
  • detector (str or list of str) – Either Q1, Q2, U1 or U2. You can also pass a list, e.g., ["Q1", "Q2"]. If no value is provided for this parameter, all four detectors will be returned.
Returns:

the stream of times (using the astropy.time.Time datatype), and the stream of data. If more than one detector is requested, the latter is a structured NumPy array whose columns are named either DEMnn or PWRnn, where nn is the name of the detector.

Return type:

A tuple containing two NumPy arrays

Examples:

from striptease.hdf5files import DataFile
import numpy as np

f = DataFile(filename)

# Load the output of only one detector
time, data = f.load_sci("POL_G0", "DEM", "Q1")
print(f"Q1 mean output: {np.mean(data)}")

# Load the output of several detectors at once
time, data = f.load_sci("POL_G0", "DEM", ("Q1", "Q2"))
print(f"Q1 mean output: {np.mean(data['DEMQ1'])}")

# Load the output of all four detectors
time, data = f.load_sci("POL_G0", "DEM")
print(f"Q1 mean output: {np.mean(data['DEMQ1'])}")
read_file_metadata(force=False)[source]

Open the file and retrieve some basic metadata

This function opens the HDF5 file and retrieves the following information:

  • List of groups under the root node
  • List of boards for which some data was saved in the file
  • List of polarimeters that have some data saved in the file
  • List of tags
  • MJD of the first and last scientific/housekeeping sample in the file

This function is idempotent, in the sense that calling it twice will not force a re-read of the metadata. To override this behavior, pass force=True: the function will re-open the file and read all the metadata again.
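
Here is a sketch showing how read_file_metadata() and the class fields described above can be used together (the file name is a placeholder):

from striptease import DataFile

my_file = DataFile("/path_to/data.h5")
my_file.read_file_metadata()

# These fields are available once read_file_metadata() has been called
print("HDF5 groups: ", my_file.hdf5_groups)
print("Polarimeters:", my_file.polarimeters)
print("MJD range:   ", my_file.mjd_range)
print("Tags:        ", len(my_file.tags))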

Accessing the list of tags

Every DataFile object keeps the list of tags in its tags attribute, which is a list of objects of type Tag. Here is some code that searches for all the tags containing the string STABLE_ACQUISITION in their name:

from striptease import DataFile

with DataFile("test.h5") as inpf:
    list_of_tags = [
        t
        for t in inpf.tags
        if "STABLE_ACQUISITION" in t.name
    ]

for cur_tag in list_of_tags:
    print("Found a tag: ", cur_tag.name)

# Possible output:
#
# Found a tag: STABLE_ACQUISITION_R0
# Found a tag: STABLE_ACQUISITION_B0
# Found a tag: STABLE_ACQUISITION_R1
# Found a tag: STABLE_ACQUISITION_B1
class striptease.Tag(id: int, mjd_start: float, mjd_end: float, name: str, start_comment: str, end_comment: str)[source]

A tag marks a labelled time span in the acquisition; its fields are the numeric id, the MJD of the start and end of the span (mjd_start, mjd_end), the name of the tag, and the comments associated with its start and end (start_comment, end_comment).

Information about housekeeping parameters

As there are hundreds of housekeeping parameters used in Strip HDF5 files, Striptease provides the function get_hk_descriptions(). Pass the name of a group and a subgroup to it, and it will return a dict-like object of type HkDescriptionList that associates the name of each housekeeping parameter in the group/subgroup with a textual description. The object can be printed with print: this produces a (long!) table listing all the housekeeping parameters and their descriptions in alphabetical order.

class striptease.HkDescriptionList(group, subgroup, hklist)[source]

Result of a call to get_hk_descriptions

This class acts like a dictionary that associates the name of a housekeeping parameter with its description. It provides a nice textual representation when printed on the screen:

from striptease import get_hk_descriptions

l = get_hk_descriptions("BIAS", "POL")

# Print the description of one parameter
if "VG4A_SET" in l:
    print(l["VG4A_SET"])

# Print all the descriptions in a nicely-formatted table
print(l)
striptease.get_hk_descriptions(group, subgroup)[source]

Reads the list of housekeeping parameters with their own description.

Parameters:
  • group (str) – The group to load. It must be either BIAS or DAQ.
  • subgroup (str) – The subgroup to load. It can be either POL_XY or BOARD_X, with X being the module letter, and Y the number of the polarimeter.
Returns:

A dictionary containing the association between the name of the housekeeping parameter and its description.

Examples:

from striptease import get_hk_descriptions

hk_list = get_hk_descriptions("DAQ", "POL_G0")

Handling multiple HDF5 files

It is often the case that the data you are looking for spans more than one HDF5 file. In this case, it is tedious to read chunks of data from several files and knit them together. Luckily, Striptease provides the DataStorage class, which implements a database of HDF5 files and provides methods for accessing scientific and housekeeping data without worrying about which files should be read.

Here is an example:

from striptease import DataStorage

# This call might take some time if you have never used DataStorage
# before, as it needs to build an index of all the files
ds = DataStorage("/storage/strip")

# Wow! We are reading one whole day of housekeeping data!
times, data = ds.load_hk(
    mjd_range=(59530.0, 59531.0),
    group="BIAS",
    subgroup="POL_R0",
    par="VG1_HK",
)

Note that the script specifies the range of times as a MJD range; the DataStorage object looks at its list of files, determines which ones contain the requested information, and reads them. The return value is the same as for a call to DataFile.load_hk().

For the class DataStorage to work, a database of the HDF5 files in the specified path must be already present. You can create one using the command-line script build_hdf5_database.py:

./build_hdf5_database.py /storage/strip

Accessing data in a storage

The DataStorage class provides the methods DataStorage.get_tags(), DataStorage.load_sci(), and DataStorage.load_hk() to access tags, scientific data, and housekeeping parameters.

All these functions accept either a 2-tuple containing the start and end MJD or a Tag object that specifies the time range.

class striptease.DataStorage(path: Union[str, pathlib.Path], database_name='index.db', update_database=False, update_hdf5=False)[source]

The storage where HDF5 files are kept

This class builds an index of all the files in a directory containing the HDF5 files saved by the LSPE/Strip data server. It can be used to load scientific/housekeeping data without caring about file boundaries.

Example:

from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")
# One whole day of tags!
tags = ds.get_tags(mjd_range=(59530.0, 59531.0))
for cur_tag in tags:
    print(cur_tag)
files_in_range(mjd_range: Union[Tuple[float, float], Tuple[astropy.time.core.Time, astropy.time.core.Time], Tuple[str, str], striptease.hdf5files.Tag]) → List[striptease.hdf5db.HDF5FileInfo][source]

Return a list of the files that contain data within the MJD range
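
Here is a minimal sketch of how files_in_range() can be used to check which files would be read by a query (the MJD range is only an example):

from striptease import DataStorage

ds = DataStorage("/storage/strip")

# Print the path and MJD range of every file covering one day of data
for cur_file in ds.files_in_range((59530.0, 59531.0)):
    print(cur_file.path, cur_file.mjd_range)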

get_list_of_files() → List[striptease.hdf5db.HDF5FileInfo][source]

Return a list of all the files in the storage path

get_tags(mjd_range: Union[Tuple[float, float], Tuple[astropy.time.core.Time, astropy.time.core.Time], Tuple[str, str], striptease.hdf5files.Tag]) → List[striptease.hdf5files.Tag][source]

Return a list of all the tags falling within a MJD range

The function returns a list of all the tags found in the HDF5 files in the storage directory that fall (even partially) within the range of MJD specified by mjd_range. The range can either be:

  1. A pair of floating-point values, each representing a MJD date;
  2. A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
  3. A pair of instances of astropy.time.Time;
  4. A single instance of the Tag class.

The list of tags is always sorted in chronological order.

The function is quite fast because it uses a cache instead of reading the HDF5 files themselves.
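
For instance, the same day of tags could be requested using a pair of date strings instead of MJD values (a sketch; the dates are arbitrary):

from striptease import DataStorage

ds = DataStorage("/storage/strip")

# Pass a pair of strings (option 2 in the list above) instead of MJDs
tags = ds.get_tags(mjd_range=("2021-12-10 00:00:00", "2021-12-11 00:00:00"))
for cur_tag in tags:
    print(cur_tag.name, cur_tag.mjd_start, cur_tag.mjd_end)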

load_hk(mjd_range: Union[Tuple[float, float], striptease.hdf5files.Tag], *args, **kwargs)[source]

Load housekeeping data within a specified MJD time range

This function operates in the same way as DataFile.load_hk(), but it takes as input a time range that can cross the HDF5 file boundaries. The parameter mjd_range can be one of the following:

  1. A pair of floating-point values, each representing a MJD date;
  2. A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
  3. A pair of instances of astropy.time.Time;
  4. A single instance of the Tag class.

Example:

from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")
# Caution! One whole day of housekeeping data!
times, data = ds.load_hk(
    mjd_range=(59530.0, 59531.0),
    group="BIAS",
    subgroup="POL_R0",
    par="VG1_HK",
)
load_sci(mjd_range: Union[Tuple[float, float], Tuple[astropy.time.core.Time, astropy.time.core.Time], Tuple[str, str], striptease.hdf5files.Tag], *args, **kwargs)[source]

Load scientific data within a specified MJD time range

This function operates in the same way as DataFile.load_sci(), but it takes as input a time range that can cross the HDF5 file boundaries. The parameter mjd_range can be one of the following:

  1. A pair of floating-point values, each representing a MJD date;
  2. A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
  3. A pair of instances of astropy.time.Time;
  4. A single instance of the Tag class.

Example:

from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")
# Caution! One whole day of scientific data!
times, data = ds.load_sci(
    mjd_range=(59530.0, 59531.0),
    polarimeter="R0",
    data_type="DEM",
    detector=["Q1"],
)
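
Since these methods also accept a Tag object, you can combine get_tags() and load_sci() to load the scientific data acquired during a tagged time span. This is only a sketch, which assumes that a tag named STABLE_ACQUISITION_R0 exists within the chosen day:

from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")

# Pick a tag and use it directly as the time range
my_tag = [
    t for t in ds.get_tags(mjd_range=(59530.0, 59531.0))
    if t.name == "STABLE_ACQUISITION_R0"
][0]

times, data = ds.load_sci(
    mjd_range=my_tag,
    polarimeter="R0",
    data_type="DEM",
    detector=["Q1"],
)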

You can access a list of the files indexed by a DataStorage object using the method DataStorage.get_list_of_files(), which returns a list of HDF5FileInfo objects.

class striptease.HDF5FileInfo(path, size, mjd_range)

Basic information about a HDF5 data file

Fields are:

  • path: a pathlib.Path object containing the path to the file
  • size: size of the file, in bytes
  • mjd_range: a 2-tuple containing the MJD of the first and last scientific/housekeeping sample in the file (float values)
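
As a final sketch, the fields above can be combined with DataStorage.get_list_of_files(), for instance to report how much disk space the indexed files occupy (the storage path is a placeholder):

from striptease import DataStorage

ds = DataStorage("/storage/strip")

all_files = ds.get_list_of_files()
total_size_gb = sum(cur_file.size for cur_file in all_files) / 1e9
print(f"{len(all_files)} files indexed, {total_size_gb:.1f} GB in total")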