Band B is about the validity of the data: checking its faithfulness and representation. It can include components of exploratory data analysis.
What does it include?
- visualisations for exploratory analysis, e.g. PCA and hierarchical clustering
- noise characterisation
- missing values
- entity disambiguation, record linkage, duplicate detection
- anomaly detection
- sanity checks on the use of physical units (if used)
- data representation (vectorizing, word embeddings, etc.)? Or does this come in Band A?
- Was a column or columns accidentally perturbed (e.g. through a sort operation that missed one or more columns)?
- Was a gene name accidentally converted to a date?
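
A few of the checks above (missing values, duplicate detection, and gene names mangled into dates) can be sketched with pandas. This is a minimal illustration rather than a complete validity audit: the function name `validity_report`, the column names `gene` and `expression`, and the toy table are all hypothetical, and the date pattern only covers the common `2-Sep` / `Sep-2` style that spreadsheet software tends to produce from symbols such as SEPT2 or MARCH1.

```python
import re

import pandas as pd

# Assumption: date-mangled gene symbols look like "2-Sep" or "Sep-2".
# Other date formats (e.g. "2002-09-01") would need additional patterns.
_MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
DATE_LIKE = re.compile(
    rf"(?:\d{{1,2}}-(?:{_MONTHS})|(?:{_MONTHS})-\d{{1,2}})",
    re.IGNORECASE,
)


def validity_report(df: pd.DataFrame, gene_col: str) -> dict:
    """Collect a few basic validity indicators for a table (illustrative only)."""
    # Drop missing names before the pattern check; they are counted separately.
    genes = df[gene_col].dropna().astype(str)
    return {
        # Number of missing entries in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Gene names that look as if they were converted to dates.
        "date_like_gene_names": [g for g in genes if DATE_LIKE.fullmatch(g)],
    }


# Hypothetical toy table, invented purely for illustration.
toy = pd.DataFrame(
    {
        "gene": ["TP53", "2-Sep", "BRCA1", "TP53", "1-Mar"],
        "expression": [1.2, None, 3.4, 1.2, 0.7],
    }
)
report = validity_report(toy, gene_col="gene")
print(report)
```

Exact-match duplicate detection is the simplest case; record linkage and entity disambiguation generally need fuzzy matching, which this sketch does not attempt.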
At the end of Band B, we are ready to define a candidate question, i.e. the context.