Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

PhD thesis


Juddoo, S. 2022. Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach. PhD thesis. Middlesex University, Computer Science.
Type: PhD thesis
Title: Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach
Authors: Juddoo, S.
Abstract

The value that can be derived from data has been increasing steadily for some years. Both commercial and non-commercial organisations have realised the immense benefits that might be gained if all the data at their disposal could be analysed and used as the basis for decision making. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to generate value from huge amounts of data, often in unstructured format and produced at very high frequency. However, the potential value derivable depends on the general level of data governance and, more precisely, on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy for the Big Data context. This dissertation focuses on investigating effective methods to enhance data quality for Big Data. The principal deliverable of this research is a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual (that is, it is not a generalizable field), this research study applies the methodological approach to one use case, Electronic Health Records (EHR).
The first main contribution to knowledge of this study is a systematic investigation of which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence, with a special focus on machine learning, could be used to improve the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Based on the experiments carried out, regression algorithms proved better suited to accuracy-related issues and clustering algorithms to completeness-related issues. However, the limits of implementing and using machine learning algorithms to detect data quality issues in Big Data were also revealed and discussed in this research study. It can safely be deduced from this part of the research study that using machine learning to enhance the detection of data quality issues is a promising area, but not yet a panacea that automates the entire process. The third important contribution is a proposed guideline for undertaking data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered areas on which efficient data reparation algorithms must focus.
These three important contributions form the nucleus of a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed in the proposed methodological approach can, to a large extent, be transposed to other industries and use cases. The proposed data quality methodological approach can be used by practitioners of Big Data quality who follow a data-driven strategy. As opposed to existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific. It gives clear and proven methods for undertaking the main identified stages of a Big Data quality lifecycle and can therefore be applied by practitioners in the area.
This research study provides some promising results and deliverables. It also paves the way for further research in the area. Big Data technology is evolving rapidly, and future research should focus on new representations of Big Data, on the real-time streaming aspect, and on replicating the research methods used in this study on new technologies to validate the current results.

Sustainable Development Goals: 9 Industry, innovation and infrastructure
Middlesex University Theme: Creativity, Culture & Enterprise
Department name: Computer Science
Institution name: Middlesex University
Publication dates
Print: 06 Jan 2023
Publication process dates
Deposited: 06 Jan 2023
Accepted: 23 May 2022
Output status: Published
Accepted author manuscript
Language: English
Permalink: https://repository.mdx.ac.uk/item/8q372


Related outputs

Juddoo, S. 2022. Investigating data repair steps for EHR Big Data. 3rd International Conference on Next Generation Computing Applications (NextComp). Flic-en-Flac, Mauritius 06 - 08 Oct 2022 IEEE. https://doi.org/10.1109/nextcomp55567.2022.9932167
Juddoo, S. and George, C. 2020. A qualitative assessment of machine learning support for detecting data completeness and accuracy issues to improve data analytics in big data for the healthcare industry. ELECOM 2020 - 3rd International Conference on Emerging Trends in Electrical, Electronic and Communications Engineering (ELECOM). Mauritius, Mauritius 25 - 27 Nov 2020 IEEE. pp. 58-66 https://doi.org/10.1109/ELECOM49001.2020.9297009
Thanacoody, A., Bekaroo, G., Santokhee, A. and Juddoo, S. 2019. Analyzing the prospects and acceptance of mobile-based marine debris tracking. ELECOM 2018: 2nd International Conference on Emerging Trends in Electrical, Electronic and Communications Engineering. Mauritius 28 - 30 Nov 2018 Springer. pp. 256-267 https://doi.org/10.1007/978-3-030-18240-3_24
Juddoo, S. and George, C. 2018. Discovering the most important data quality dimensions in health big data using latent semantic analysis. 2018 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD). Durban, South Africa 06 - 07 Aug 2018 IEEE. https://doi.org/10.1109/ICABCD.2018.8465129
Juddoo, S., George, C., Duquenoy, P. and Windridge, D. 2018. Data governance in the health industry: investigating data quality dimensions within a big data context. Applied System Innovation. 1 (4), pp. 1-16. https://doi.org/10.3390/asi1040043
Vora, M., Bekaroo, G., Santokhee, A., Juddoo, S. and Roopowa, D. 2017. JarPi: A low-cost raspberry pi based personal assistant for small-scale fishermen. IEEE 4th International Conference on Soft Computing and Machine Intelligence (ISCMI). Port Louis, Mauritius 23 - 24 Nov 2017 IEEE. pp. 159-163 https://doi.org/10.1109/iscmi.2017.8279618
Ramrecha, V., Bekaroo, G., Santokhee, A. and Juddoo, S. 2017. Exploring the application and usability of NFC for promoting self-learning on energy consumption of household electronic appliances. IEEE 4th International Conference on Soft Computing and Machine Intelligence (ISCMI). Port Louis, Mauritius 23 - 24 Nov 2017 IEEE. pp. 154-158 https://doi.org/10.1109/ISCMI.2017.8279617