Reading data saved in HDF5 files
Introduction
Striptease provides an interface to the data acquired by the instrument and stored in HDF5 files. It lets the user access both scientific and housekeeping data, returning them as a pair of NumPy arrays (time, data).
Before reading this chapter, you should be aware of the way data are
saved to disk when LSPE/Strip operates. The instrument sends all the
scientific (PWR and DEM timelines) and housekeeping (biases,
temperatures…) timelines to a computer, where a program called the data
server takes them and writes them to HDF5 files. The stream of data
is usually continuous, but occasionally the data server stops
writing to a file and opens a new one. This ensures that files do not
get too big and that you can grab one of them for analysis while the
instrument is still running. These files are usually saved in a
hierarchical structure of directories, grouped according to the
year/month of acquisition. So, assuming that these data files are
stored in /storage/strip, you might find the following files in
the Strip data storage:
/storage/strip/2021/10
2021_10_31_15-42-44.h5
2021_10_31_19-42-44.h5
2021_10_31_23-42-44.h5
/storage/strip/2021/11
2021_11_01_03-42-44.h5
2021_11_01_07-42-44.h5
2021_11_01_11-42-44.h5
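The file names above appear to encode the date and time at which each file was opened, and the directories follow the year/month grouping described earlier. As a minimal sketch under that assumption (the helper names below are hypothetical and are not part of Striptease), the timestamp can be recovered with the standard library alone:

```python
from datetime import datetime
from pathlib import Path

# Assumed pattern, inferred from names such as 2021_10_31_15-42-44.h5
FILENAME_FORMAT = "%Y_%m_%d_%H-%M-%S"


def acquisition_start(file_path):
    """Parse the timestamp encoded in the name of an HDF5 file."""
    return datetime.strptime(Path(file_path).stem, FILENAME_FORMAT)


def expected_directory(storage_root, file_path):
    """Return the year/month directory where the file would be stored."""
    start = acquisition_start(file_path)
    return Path(storage_root) / f"{start.year:04d}" / f"{start.month:02d}"


start = acquisition_start("/storage/strip/2021/10/2021_10_31_15-42-44.h5")
print(start.isoformat())
print(expected_directory("/storage/strip", "2021_11_01_03-42-44.h5"))
```

A check of this kind can be handy for spotting files that were renamed or moved by hand before they are indexed.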
It is now time to see the kind of tools that Striptease provides to access these files. The interface is based on two classes:
- DataFile provides high-level access to a single HDF5 file.
- DataStorage provides high-level access to a directory containing several HDF5 files, like the one above. It basically abstracts over the concept of a «file» and instead considers the data as being acquired continuously.
Accessing one file
Let’s begin with a small example that shows how to use the class DataFile:
from striptease import DataFile

fname = '/path_to/data.h5'
with DataFile(fname) as my_data:
    # Load a HK time series (arguments: group, subgroup, parameter)
    time, data = my_data.load_hk("BIAS", "POL_Y6", "VG4A_SET")
    # Load some scientific data
    time, data = my_data.load_sci("POL_G0", "DEM", "Q1")
The class DataFile assumes that the name of the file follows
the convention used by the acquisition software used in the
system-level tests in Bologna and during nominal data acquisition in
Tenerife. Be sure not to mess with HDF5 file names!
Since the kind of HDF5 file used in Strip has a complex structure,
Striptease provides a few facilities to handle them. To load timelines
of housekeeping parameters and detector outputs, the DataFile
class provides two methods:
- DataFile.load_hk() loads the timeline of a housekeeping parameter;
- DataFile.load_sci() loads the output of one or more detectors.
Moreover, the method DataFile.get_average_biases() can be used
to retrieve the average level of biases within some time frame.
class striptease.DataFile(filepath, filemode='r')
A HDF5 file containing timelines acquired by Strip
This is basically a high-level wrapper over a h5py.File object. It assumes that the HDF5 file was saved by the acquisition software used in Bologna and Tenerife, and it provides some tools to navigate through the data saved in the file.
Creating a DataFile object does not automatically open the file; this is done to preserve space. The file is lazily opened once you call one of the methods that need to access file data.
The two methods you are going to use most of the time are:
- load_hk()
- load_sci()
You can access these class fields directly:
- filepath: a Path object containing the full path of the HDF5 file.
- datetime: a Python datetime object containing the time when the acquisition started.
- mjd_range: a pair of float numbers representing the MJD of the first and last sample in the file. To initialize this field, you must call DataFile.read_file_metadata first.
- computed_mjd_range: a Boolean that is set to True if the field mjd_range was computed through a complete scan of the datasets in the file, rather than read from the file attributes. (When it was computed through a scan, the user might want to save the MJD range back in the file and have it ready for the future, as the scanning of a HDF5 file can take considerable time.)
- hdf5_groups: a list of str objects containing the names of the groups in the HDF5 file. To initialize this field, you must call DataFile.read_file_metadata first.
- polarimeters: a Python set object containing the names of the polarimeters whose measurements have been saved in this file. To initialize this field, you must call DataFile.read_file_metadata first.
- hdf5_file: if the file has been opened using read_file_metadata(), this is the h5py.File object.
- tags: a list of Tag objects; you must call read_file_metadata() before reading it.
This class can be used in with statements; in this case, it will automatically open and close the file:
with DataFile(myfile) as inpf:
    # The variable "inpf" is a DataFile object in this context
    ...
get_average_biases(polarimeter, time_range=None, calibration_tables=None) → striptease.biases.BiasConfiguration
Return a BiasConfiguration object containing the average values of biases for a polarimeter.
The parameter polarimeter must be a string containing the name of the polarimeter, e.g., Y0. The parameter time_range, if specified, is a 2-element tuple containing the start and end MJDs to consider in the average. If calibration_tables is specified, it must be an instance of the CalibrationTables class.
The return value of this function is a BiasConfiguration object. If calibration_tables is specified, the values returned by this method are calibrated to physical units; otherwise, they are expressed in ADUs.
load_hk(group, subgroup, par, verbose=False)
Loads the timeline of a housekeeping parameter
Parameters:
- group (str) – Either BIAS or DAQ.
- subgroup (str) – Name of the housekeeping group. It can either be POL_XY or BOARD_X, with X being the letter identifying the module, and Y the polarimeter number within the module. Possible examples are POL_G0 and BOARD_Y.
- par (str) – Name of the housekeeping parameter, e.g. ID4_DIV.
- verbose (bool) – Whether to echo the HK being loaded. Default is False.
Returns: the stream of times (using the astropy.time.Time datatype), and the stream of data.
Return type: A tuple containing two NumPy arrays
Example:
from striptease.hdf5files import DataFile

f = DataFile(filename)
time, data = f.load_hk("BIAS", "POL_Y6", "VG4A_SET")
load_sci(polarimeter, data_type, detector=[])
Loads scientific data from one or more detectors of a given polarimeter
Parameters:
- polarimeter (str) – Name of the polarimeter, in the form POL_XY or XY for short, with X being the module letter and Y the polarimeter number within the module.
- data_type (str) – Type of data to load, either DEM or PWR.
- detector (str) – Either Q1, Q2, U1 or U2. You can also pass a list, e.g., ["Q1", "Q2"]. If no value is provided for this parameter, all the four detectors will be returned.
Returns: the stream of times (using the astropy.time.Time datatype), and the stream of data. For multiple detectors, the latter will be a list of tuples, where each column is named either DEMnn or PWRnn, where nn is the name of the detector.
Return type: A tuple containing two NumPy arrays
Examples:
from striptease.hdf5files import DataFile
import numpy as np

f = DataFile(filename)

# Load the output of only one detector
time, data = f.load_sci("POL_G0", "DEM", "Q1")
print(f"Q1 mean output: {np.mean(data)}")

# Load the output of several detectors at once
time, data = f.load_sci("POL_G0", "DEM", ("Q1", "Q2"))
print(f"Q1 mean output: {np.mean(data['DEMQ1'])}")

# Load the output of all the four detectors
time, data = f.load_sci("POL_G0", "DEM")
print(f"Q1 mean output: {np.mean(data['DEMQ1'])}")
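The naming conventions accepted by load_sci (a polarimeter given as POL_XY or XY, a detector given as a string, a list, or nothing at all) can be illustrated with a few plain-Python helpers. This is only a sketch of the documented conventions: the function names are hypothetical, and nothing here claims to reproduce how Striptease handles its arguments internally.

```python
ALL_DETECTORS = ("Q1", "Q2", "U1", "U2")


def normalize_polarimeter(name):
    """Return the long form POL_XY for either "POL_XY" or "XY"."""
    name = name.upper()
    return name if name.startswith("POL_") else f"POL_{name}"


def normalize_detectors(detector=()):
    """Return a tuple of detector names; an empty value means all four."""
    if isinstance(detector, str):
        detector = (detector,)
    detector = tuple(detector) or ALL_DETECTORS
    unknown = [d for d in detector if d not in ALL_DETECTORS]
    if unknown:
        raise ValueError(f"unknown detectors: {unknown}")
    return detector


def column_names(data_type, detector=()):
    """Build column names such as DEMQ1 or PWRU2."""
    return [f"{data_type}{d}" for d in normalize_detectors(detector)]


print(normalize_polarimeter("G0"))
print(column_names("DEM", ["Q1", "Q2"]))
print(column_names("PWR"))
```

The column names built here follow the DEMnn/PWRnn pattern described in the return value above.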
read_file_metadata(force=False)
Open the file and retrieve some basic metadata
This function opens the HDF5 file and retrieves the following information:
- List of groups under the root node
- List of boards for which some data was saved in the file
- List of polarimeters that have some data saved in the file
- List of tags
- MJD of the first and last scientific/housekeeping sample in the file
This function is idempotent, in the sense that calling it twice will not force a re-read of the metadata. To override this behavior, pass force=True: the function will re-open the file and read all the metadata again.
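The "read once unless forced" behavior described for read_file_metadata is a common caching pattern. The toy class below is a hypothetical stand-in written for illustration only (it does not touch HDF5 files and is not taken from Striptease); it shows how a force flag can bypass an otherwise idempotent read.

```python
class LazyMetadata:
    """Toy stand-in that loads metadata at most once unless forced."""

    def __init__(self, loader):
        self._loader = loader  # callable that performs the expensive read
        self._metadata = None  # nothing is loaded at construction time
        self.read_count = 0    # number of times the loader actually ran

    def read_metadata(self, force=False):
        if self._metadata is not None and not force:
            return self._metadata  # already loaded: skip the expensive read
        self._metadata = self._loader()
        self.read_count += 1
        return self._metadata


lazy = LazyMetadata(lambda: {"groups": ["POL_G0", "BOARD_G"]})
lazy.read_metadata()
lazy.read_metadata()            # no second read
lazy.read_metadata(force=True)  # explicit re-read
print(lazy.read_count)
```

Here the loader runs twice in total: once on the first call and once more when force=True is passed.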
Accessing the list of tags
Every DataFile object keeps the list of tags in the tags
attribute, which is a list of objects of type Tag. Here is some
code that searches for all the tags containing the string
STABLE_ACQUISITION in their name:
from striptease import DataFile

with DataFile("test.h5") as inpf:
    list_of_tags = [
        t
        for t in inpf.tags
        if "STABLE_ACQUISITION" in t.name
    ]

for cur_tag in list_of_tags:
    print("Found a tag: ", cur_tag.name)

# Possible output:
#
# Found a tag: STABLE_ACQUISITION_R0
# Found a tag: STABLE_ACQUISITION_B0
# Found a tag: STABLE_ACQUISITION_R1
# Found a tag: STABLE_ACQUISITION_B1
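Once tags have been filtered by name, a natural next step is to work out how long each tagged interval lasted. The sketch below uses a local stand-in type rather than the real Tag class: only the name attribute is documented in this chapter, so the mjd_start and mjd_end fields, as well as the sample values, are assumptions made purely for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a tag; only "name" is documented above,
# "mjd_start" and "mjd_end" are assumed for illustration.
ExampleTag = namedtuple("ExampleTag", ["name", "mjd_start", "mjd_end"])


def tag_duration_s(tag):
    """Duration of a tag in seconds (one MJD unit is one day)."""
    return (tag.mjd_end - tag.mjd_start) * 86400.0


# Made-up tags, not real acquisition data
tags = [
    ExampleTag("STABLE_ACQUISITION_R0", 59530.00, 59530.25),
    ExampleTag("BIAS_TUNING_R0", 59530.25, 59530.30),
    ExampleTag("STABLE_ACQUISITION_B0", 59530.50, 59531.00),
]

stable = [t for t in tags if "STABLE_ACQUISITION" in t.name]
for t in stable:
    print(f"{t.name}: {tag_duration_s(t):.0f} s")
```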
Information about housekeeping parameters
As there are hundreds of housekeeping parameters used in Strip HDF5
files, Striptease provides the function get_hk_descriptions().
You pass the name of a group and a subgroup to it, and it returns a
dict-like object of type HkDescriptionList that associates
the name of each housekeeping parameter in the group/subgroup with a textual
description. The object can be printed using print: it will
produce a (long!) table containing all the housekeeping parameters and
descriptions in alphabetical order.
class striptease.HkDescriptionList(group, subgroup, hklist)
Result of a call to get_hk_descriptions
This class acts like a dictionary that associates the name of a housekeeping parameter with a description. It provides a nice textual representation when printed on the screen:
l = get_hk_descriptions("BIAS", "POL")

# Print the description of one parameter
if "VG4A_SET" in l:
    print(l["VG4A_SET"])

# Print all the descriptions in a nicely-formatted table
print(l)
striptease.get_hk_descriptions(group, subgroup)
Reads the list of housekeeping parameters with their own description.
Parameters:
- group (str) – The group. It must be either BIAS or DAQ.
- subgroup (str) – The subgroup to load. It can either be POL_XY or BOARD_X, with X being the module letter, and Y the number of the polarimeter.
Returns: A dictionary containing the association between the name of the housekeeping parameter and its description.
Examples:
# Avoid naming the result "list", which would shadow the Python built-in
hk_descriptions = get_hk_descriptions("DAQ", "POL_G0")
Handling multiple HDF5 files
It is often the case that the data you are looking for spans more than
one HDF5 file. In this case, it is a tedious process to read chunks of
data from several files and knit them together. Luckily, Striptease
provides the DataStorage class, which implements a database
of HDF5 files and provides methods for accessing scientific and
housekeeping data without worrying about which files should be read.
Here is an example:
from striptease import DataStorage

# This call might take some time if you have never used DataStorage
# before, as it needs to build an index of all the files
ds = DataStorage("/storage/strip")

# Wow! We are reading one whole day of housekeeping data!
times, data = ds.load_hk(
    mjd_range=(59530.0, 59531.0),
    group="BIAS",
    subgroup="POL_R0",
    par="VG1_HK",
)
Note that the script provides the range of times as a MJD range; the
DataStorage object looks in the list of files, decides
which files contain this information, and reads them. The return value
is the same as for a call to DataFile.load_hk().
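If you think in calendar dates rather than MJD values, the range passed above can be derived with the standard library. The sketch below relies on the definition of the Modified Julian Date (MJD 0 is 1858-11-17 00:00) and deliberately ignores leap seconds and time-scale subtleties; the function names are hypothetical. When such details matter, astropy.time.Time, which this chapter already relies on, is the safer route.

```python
from datetime import datetime, timedelta

MJD_EPOCH = datetime(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17 00:00


def datetime_to_mjd(moment):
    """Convert a naive datetime (taken as UTC) to a Modified Julian Date."""
    return (moment - MJD_EPOCH) / timedelta(days=1)


def mjd_to_datetime(mjd):
    """Inverse conversion, again ignoring leap seconds."""
    return MJD_EPOCH + timedelta(days=mjd)


# One whole day starting at midnight of 12 November 2021
day_start = datetime_to_mjd(datetime(2021, 11, 12))
mjd_range = (day_start, day_start + 1.0)
print(mjd_range)
```

The resulting pair matches the (59530.0, 59531.0) range used in the example above.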
For the class DataStorage to work, a database of the HDF5
files in the specified path must already be present. You can create
one using the command-line script build_hdf5_database.py:
./build_hdf5_database.py /storage/strip
Accessing data in a storage
The DataStorage class provides the following methods to access
tags, scientific data and housekeeping parameters:
- DataStorage.get_tags() retrieves a list of tags;
- DataStorage.load_sci() retrieves scientific timelines;
- DataStorage.load_hk() retrieves housekeeping timelines.
All these functions accept either a 2-tuple containing the start and
end MJD or a Tag object that specifies the time range.
class striptease.DataStorage(path: Union[str, pathlib.Path], database_name='index.db', update_database=False, update_hdf5=False)
The storage where HDF5 files are kept
This class builds an index of all the files in a directory containing the HDF5 files saved by the LSPE/Strip data server. It can be used to load scientific/housekeeping data without worrying about file boundaries.
Example:
from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")

# One whole day of tags!
tags = ds.get_tags(mjd_range=(59530.0, 59531.0))
for cur_tag in tags:
    print(cur_tag)
files_in_range(mjd_range: Union[Tuple[float, float], Tuple[astropy.time.core.Time, astropy.time.core.Time], Tuple[str, str], striptease.hdf5files.Tag]) → List[striptease.hdf5db.HDF5FileInfo]
Return a list of the files that contain data within the MJD range
get_list_of_files() → List[striptease.hdf5db.HDF5FileInfo]
Return a list of all the files in the storage path
get_tags(mjd_range)
Return a list of all the tags falling within a MJD range
The function returns a list of all the tags found in the HDF5 files in the storage directory that fall (even partially) within the range of MJD specified by mjd_range. The range can either be:
- A pair of floating-point values, each representing a MJD date;
- A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
- A pair of instances of astropy.time.Time;
- A single instance of the Tag class.
The list of tags is always sorted in chronological order.
The function is quite fast because it uses a cache instead of reading the HDF5 files themselves.
load_hk(mjd_range: Union[Tuple[float, float], striptease.hdf5files.Tag], *args, **kwargs)
Load housekeeping data within a specified MJD time range
This function operates in the same way as DataFile.load_hk(), but it takes as input a time range that can cross the HDF5 file boundaries.
The parameter mjd_range can be one of the following:
- A pair of floating-point values, each representing a MJD date;
- A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
- A pair of instances of astropy.time.Time;
- A single instance of the Tag class.
Example:
from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")

# Caution! One whole day of housekeeping data!
times, data = ds.load_hk(
    mjd_range=(59530.0, 59531.0),
    group="BIAS",
    subgroup="POL_R0",
    par="VG1_HK",
)
load_sci(mjd_range: Union[Tuple[float, float], Tuple[astropy.time.core.Time, astropy.time.core.Time], Tuple[str, str], striptease.hdf5files.Tag], *args, **kwargs)
Load scientific data within a specified MJD time range
This function operates in the same way as DataFile.load_sci(), but it takes as input a time range that can cross the HDF5 file boundaries.
The parameter mjd_range can be one of the following:
- A pair of floating-point values, each representing a MJD date;
- A pair of strings, each representing a date (e.g., 2021-12-10 10:39:45);
- A pair of instances of astropy.time.Time;
- A single instance of the Tag class.
Example:
from striptease import DataStorage

ds = DataStorage("/database/STRIP/HDF5/")

# Caution! One whole day of scientific data!
times, data = ds.load_sci(
    mjd_range=(59530.0, 59531.0),
    polarimeter="R0",
    data_type="DEM",
    detector=["Q1"],
)
You can access a list of the files indexed by a DataStorage
object using the method DataStorage.get_list_of_files(),
which returns a list of HDF5FileInfo objects.
class striptease.HDF5FileInfo(path, size, mjd_range)
Basic information about a HDF5 data file
Fields are:
- path: a pathlib.Path object containing the path to the file
- size: size of the file, in bytes
- mjd_range: a 2-tuple containing the MJD of the first and last scientific/housekeeping sample in the file (float values)
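Because every indexed file carries its own mjd_range, choosing the files relevant to a request reduces to an interval-overlap test. The sketch below uses a local namedtuple that mirrors the three fields listed above, together with made-up file entries; it only illustrates the idea and is not the code behind DataStorage.files_in_range. In particular, treating both ends of each range as inclusive is an assumption.

```python
from collections import namedtuple
from pathlib import Path

# Local stand-in mirroring the three fields listed above
FileInfo = namedtuple("FileInfo", ["path", "size", "mjd_range"])


def overlapping_files(files, mjd_range):
    """Return the files whose MJD range intersects the requested one."""
    start, end = mjd_range
    selected = [
        f for f in files
        if f.mjd_range[0] <= end and f.mjd_range[1] >= start
    ]
    # Sort chronologically so that chunks can be concatenated in order
    return sorted(selected, key=lambda f: f.mjd_range[0])


# Hypothetical index entries (sizes and MJD values are invented)
files = [
    FileInfo(Path("2021_11_11_19-42-44.h5"), 1_000, (59529.82, 59529.99)),
    FileInfo(Path("2021_11_11_23-42-44.h5"), 1_000, (59529.99, 59530.16)),
    FileInfo(Path("2021_11_12_03-42-44.h5"), 1_000, (59530.16, 59530.32)),
]

for info in overlapping_files(files, (59530.0, 59530.1)):
    print(info.path)
```

With these invented entries, only the middle file overlaps the requested range, so it is the only one that would need to be opened.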