I started my data science career years ago in the area of geoinformatics: cartography, geographic information systems (GIS), geospatial analysis. It was an exciting time in which I was applying data and tools to real-world issues. Every day I went to work with something new I had learned the night before: a Python script to process imagery or a new method for manipulating geodemographic point data. This on-the-job experience led me to expand my formal education with courses in spatial statistics and advanced GIS practices. There, I was exposed to concepts such as the modifiable areal unit problem (MAUP) and the ecological fallacy: two of many important considerations in the analysis and representation of spatial data. However, these were of little concern in the classroom, where data was tailored to lessons and the consequences of false assessments were minimal.
A year later, I was building custom geospatial data architectures and producing geovisualizations for a customer. I often found that unless I was extremely explicit about what a visualization was saying about the data, the representation could be misunderstood within the larger narrative of a given project. I had to verbally convey the many meanings one could take away from something as complex as a thematic map; I labeled and placed caveats on everything I could to make sure the reader drew no misunderstanding. This process also entailed documenting the processing of my data, from receiving or generating it to applying it within my analysis. I was keenly aware of how critical it was, for both my customer and myself, that I take zero chances in misrepresenting my data through some form of neglect or obscured bias.
The above story is common among data science professionals. On a daily basis, many people within the larger field of data science have to manage risk and bias in their work: in part to meet their customers’ requirements, and in part to satisfy a personal ethic toward their work and careers. However, the need to progress and produce should not overshadow the professional prerogative and responsibility to make biases, gaps, and overall faults overt in the course of one’s daily tasks. Given that many data scientists are the sole arbiters on matters of data and its modeling, the utmost care must be taken in how the products of such deliberations are presented to customers and in how the customer incorporates those products within their procedures and operations.
In general, considerations of bias center on the ideas of ‘favor’ and ‘fairness’. However, these terms are largely subjective, depending on the perspectives and positions of various parties over time. In terms of data, there are those producing data and those providing the means through which people produce it; those who use software and those who develop it. Beyond this, there are those who collect the data, manipulate it, analyze it, represent it, and ultimately initiate action based on the analytic byproducts. This involves numerous technical and heuristic steps through which people apply software-driven solutions in the process of developing and answering questions. At each step in a given data process, people are reacting to and reproducing the socio-technical relations that define fairness and favor, based on how control of data, models, and algorithms is established and perpetuated. This is the nexus of complication through which bias in data science must be approached, assessed, and mitigated.
Keep in mind that data science is a practice: in part material, based on the spaces in which the demand for practice emerges and the places it is to take effect; in part immaterial, based on the conditions of mediation that data, software, and technology impose upon people who are tasked to act upon assigned intent to produce effect within an ethic of conduct. Ethics may be assigned by an employer, an organization, or the individuals themselves; even the non-existent ethic is an ethic unto itself. Therefore, the ethical practice of data science is, in its most basic form, the realization of responsibility toward the handling of data in the process of modeling and manipulation, coinciding with educating others about the complexity of the multiple biases of which they may not know or care to know. Often what is asked of a data scientist is simplicity in the form and structure of an answer to a given problematic, but this would deny the messiness of the real world and of the data that directly reflects this entropy. To not develop ethical practices that reflect the immensely complex, dimensional character of data science is to eschew any hope of validity in one’s work and credibility as a data scientist.
In this post, I will explore my thoughts on the risk of bias in a data science practice lacking ethical conduct. This will be followed by proposed practices and considerations for the mitigation of bias, which could potentially lead to the establishment, in part, of a code of conduct on the matter. I will conclude with a discussion of the state of ethical data science practices and postulate on how bias weaves into the larger scope of a general code of conduct in data science.
Risk of Bias
Data is an artifact; it is left in any number of forms by a past action that predicated its existence and resulted in its production. To extract the most value from said artifact, care must be taken in its excavation. It must be documented and contextualized if analysis is to incorporate the data and make it fruitful for any model formed (in part or in whole) from its characteristics. People then look upon data and are tasked with making sense of its conditions for the purpose of answering questions and taking action based on the outcomes of interrogation. However, it is often fallacious to consider this the point at which bias must be addressed and contextualized for the data scientist; it is far more complicated than that. There are biases that precede and follow the involvement of a data scientist, and those which data scientists themselves impose upon their data practices; both must be accounted for if the scope of bias within the overall problematic they are addressing is to be validated to its fullest extent.
The bias of provenance is the first that a data scientist may encounter. It is often the case that a data scientist is tasked to collect and/or use data for which very little history is provided on its origins, production, manipulation, and storage. These factors are practices in and of themselves, undertaken over time through unknown processes by unknown actors; all points which are eventually critical in the process of validating models. This incorporates the biases associated with reliability and obfuscation. If provenance cannot be established in order to review a data set’s efficacy for the purpose of analysis and modeling, there is very little assurance one can reliably make in the process of cross-validation. Furthermore, one cannot be sure of what biases were introduced to the data without some record of provenance from its conception to its present state. It is here where concerns about obfuscation are twofold: there is the unknown associated with the provenance and reliability of existing data sets intended for inclusion in one’s analysis; there is also the obfuscation the data scientist chooses to introduce in using this data within the scope of their own heuristics and assumptions. Whether the data scientist is conceptualizing and collecting original data, re-contextualizing existing data, or performing both, transparency in their methodologies and methods is critical when considering the complicated dimensions of bias.
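As an illustration only, one lightweight way to keep such a record of provenance from a data set’s conception to its present state is to log every manipulation alongside its actor and timestamp. This is a hypothetical sketch, not an established standard; the class and field names are my own invention:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """A minimal audit trail for a data set, from origin to present state."""
    source: str        # where the data came from (or "unknown")
    collected_by: str  # actor responsible for its production
    steps: list = field(default_factory=list)

    def log_step(self, description: str, actor: str) -> None:
        """Record one manipulation of the data, with actor and timestamp."""
        self.steps.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "who": actor,
            "what": description,
        })

# Example: documenting a re-contextualized data set whose origins are unclear
record = ProvenanceRecord(source="vendor export, origin undocumented",
                          collected_by="unknown")
record.log_step("dropped rows with null geometries", actor="analyst")
record.log_step("aggregated points to census tracts", actor="analyst")
```

Even a sketch this small makes the unknowns explicit (here, an undocumented source) and gives reviewers something concrete to interrogate during validation.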
The data scientist must also account for bias in interpretation and perception. Data science as a term and field encompasses numerous sub-disciplines; this diversity is a double-edged sword in terms of bias. A professional working in the position of a data scientist will ultimately come from a specialized disciplinary focus, one which implies a particular way of seeing the world from which the data is produced and in which it is contextualized. Therefore, the manner in which the data is interpreted is a direct byproduct of the individual or team through which the analysis and modeling are pursued. This interpretation is faceted in numerous ways based on the context of the problematic, questions, and hypotheses the data scientist perceives in the course of their research and analysis. The multiplicity of perspectives that construct the overall interpretation can lead to complications in provenance and reliability through the inadvertent (or intentional) obfuscation of the praxis undertaken by the individual or data science team.
All these biases must be considered under the assumption that some level of bias will be introduced via intrusion. Bias via intrusion is that which occurs through the dictates or guidance of those directing or funding research. It is here where data scientists must observe the utmost vigilance and responsibility with their data, analysis, and modeling. It may often be the case that, regardless of what the data actually says, a data scientist is directed to design and institute analyses, models, and algorithms that selectively manipulate variables and factors to produce a desired effect. If ethical considerations toward bias are to be observed in a larger context of global conduct, intrusion must be assumed in existing data and models, and explicitly documented in data and models being designed by the data scientist in that moment of ethical crisis.
In a world of ever-growing data sets, where the need for expansive, computationally driven analysis is ever present, considerations for the ethical redress of biases must be at the forefront of our conduct if “data science” is ever to be considered a science. In academic research, institutional review boards ensure researchers are aware of and in compliance with federal laws and institutional regulations dictating the ethical and humane use of data as a byproduct of research. There is no such review process in the private sector, and the requirement for similar activities in the public and defense sectors is tenuous at best. Therefore, it is up to the data scientist to adhere to an ethical code of conduct and to mitigate the effects of bias in the course of their work.
Mitigation of Bias
The above discussion of the risk of bias, despite the brevity of its form, is a burdensome consideration for any person or group in the course of research and analysis. Challenging any aspect of bias in data or processes may fall in direct opposition to the institution for which one labors. These institutions are driven to produce results within a time frame for a desired set of effects, of which the byproduct of one’s data science practice is a considerable part. Inserting a measure of ethical mitigation toward issues arising from bias will not come without its own issues, but it is an imperative for a researcher who aims to bolster their professional efficacy and the efficacy of their work.
Perhaps the first step is to establish and distinguish the need for data: its scale, its scope, and its relevance. As mentioned above, data scientists are often the arbiters in such matters, and it is here where a measure of mitigation can be introduced. Those seeking outcomes from one’s research and analysis will have varying forms of intrusion and politics attached to their rationale; in many ways, this problematic is why the data and the data scientist exist. Therefore, the data scientist must be proactive in interrogating that rationale and setting limits that account for the aforementioned considerations of bias-based risk. Often this will come in the form of an established lexicon through which the data scientist can convey said risks in a manner that elucidates the variance in expectations and makes the unknowns explicit if current demands are pursued.
The aim of the ethical approach to mitigating bias is to ground accountability. Given the compromises that must occur in the processes of research and analysis, data science practices must account for matters of security, privacy, and confidentiality for the people any data set ultimately represents. If the data scientist pursues this grounding in a transparent manner (within the limits of what one’s employer allows without compromising the safety of the data scientist), faults and failures can be relayed to the recipients of the analysis in a manner that is constructive and conducive to the least harmful outcome. Furthermore, failed experiments and lines of research can often be as telling as those which produce perceived successes; this content, and the ethics within which it is predicated, are ultimately up to the data scientist.
Mitigation must be approached with an array of considerations that are subjectively and temporally situated for the data scientist, given the nature and expectations of their work. The rationale for a line of research must be interrogated and challenged through an initial delineation between the necessity and the desire that affect what data is obtained and how it will be used. Presenting limits and establishing common terms are some of the many ways in which accountability is grounded for all parties involved, ensuring ethics becomes part of the discourse. In doing this, matters of security and transparency for the data are simplified for the data scientist, ensuring any unethical actions can be addressed and resolved accordingly.
This has been a brief discourse on the risks associated with bias in data science and a few of the ways in which such risks can be mitigated. This discussion has become critical given the persistent production of data in the everyday lives of people around the world. Smartphones and laptops create massive amounts of data connected to personal and professional conduct that increasingly define people and their futures. The increasing demand to use said data will require a great deal of personal courage on the part of data scientists to ensure that favor and fairness are contextualized by those whom a given data set represents, not by those who control the data set itself. Ethics are something we as data scientists get to define; the manner in which we conceive and practice the moral principles that govern our conduct will be what legitimizes us as a profession and as professionals. This means setting limits on, and challenging perceived misuse of, data, models, and methodologies.
To these ends, we need to ensure the practices of data science enable individual and communal empowerment through a common, global code of conduct. This ethical construct should bolster best practices that balance data ownership and provenance against the analytic effect on, and objectification of, the data upon which these practices persist and evolve. Mitigating bias in data collection and manipulation is the best way to avert complications with modeling and algorithmic formulation. Throughout the entirety of data science practice, communication and transparency must persist at all levels. Communication will permit the optimal limiting of scope, increasing efficiency while limiting potential breaches of privacy and security. Transparency, however, will continue to be an issue, as these practices are often connected to the intellectual property and trade secrets that underpin many private and public rationales for data science practices. This does not mean there is any less onus on the data scientist to ensure the most ethical practices in the course of their work.
Data science is challenging both to practice and to define, but these challenges should be embraced for the diversity and potential they represent for the profession and for those who see themselves as part of it. This requires those of us who identify with and practice data science to be vigilant of ourselves and others; to ensure the legitimacy and reputation of the profession is not left to organizations to define and dispose of, but is guided by a community of dedicated people working to ensure that our science pursues the highest ethical standards of conduct.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.