Noisify: All purpose data augmentation¶
Release v1.0.0. (Installation)
Noisify is a simple, lightweight library for augmenting and modifying data by adding ‘noise’.
Let’s make some noise:
Add some human noise (typos, things in the wrong boxes etc.)
>>> from noisify.recipes import human_error
>>> test_data = {'this': 1.0, 'is': 2, 'a': 'test!'}
>>> human_noise = human_error(5)
>>> print(list(human_noise(test_data)))
[{'a': 'tset!', 'this': 2, 'is': 1.0}]
>>> print(list(human_noise(test_data)))
[{'a': 0.0, 'this': 'test!', 'is': 2}]
Add some machine noise (gaussian noise, data collection interruptions etc.)
>>> from noisify.recipes import machine_error
>>> machine_noise = machine_error(5)
>>> print(list(machine_noise(test_data)))
[{'this': 1.12786393038729, 'is': 2.1387080616716307, 'a': 'test!'}]
If you want both, just add them together
>>> combined_noise = machine_error(5) + human_error(5)
>>> print(list(combined_noise(test_data)))
[{'this': 1.23854334573554, 'is': 20.77848220943227, 'a': 'tst!'}]
Add noise to numpy arrays
>>> import numpy as np
>>> test_array = np.arange(10)
>>> print(test_array)
[0 1 2 3 4 5 6 7 8 9]
>>> print(list(combined_noise(test_array)))
[[0.09172393 2.52539794 1.38823741 2.85571154 2.85571154 6.37596668
4.7135771 7.28358719 6.83600156 9.40973018]]
Read an image
>>> from PIL import Image
>>> test_image = Image.open('noisify.jpg')
>>> test_image.show()

And now with noise
>>> from noisify.recipes import human_error, machine_error
>>> combined_noise = machine_error(5) + human_error(5)
>>> for out_image in combined_noise(test_image):
...     out_image.show()

Noisify allows you to build flexible data augmentation pipelines for arbitrary objects. All pipelines are built from simple high-level objects, plugged together like Lego. Use noisify to stress test application interfaces, verify data cleaning pipelines, and make your ML algorithms more robust to real-world conditions.
Features¶
Noisify provides data augmentation through a simple high-level abstraction:
- Build reporters to apply augmentation to any object: images, dataframes, database interfaces, etc.
- Compose augmentations from configurable flaw objects
- Build recipes to deploy pipelines simply
- Everything is composable, everything is polymorphic
Noisify is built for Python 3.5+.
The Basics¶
A brief high-level guide to using noisify, mostly prose with illustrative examples.
Introduction¶
Background¶
Noisify is a project by Dstl (the Defence Science and Technology Laboratory). We are an executive agency of the UK Ministry of Defence.
Noisify was developed in part to expand the work done in image augmentation to other forms of data, and also to help test and perfect data cleaning and processing pipelines.
Copyright and usage information¶
Crown Copyright 2019
Noisify is released under the terms of the MIT licence.
Installation¶
Noisify is hosted on the PyPI central repo and can be installed as follows:
$ pip install noisify
The only dependency is Python 3.5+!
Build from Source¶
If you would prefer to install the latest version of the code, or perhaps to modify or contribute to it, the project is hosted on GitHub.
Quickstart¶
If Noisify is installed we can get to work with some examples!
Augmenting with recipes¶
Basic augmentation can be done very simply using the built-in recipes.
>>> from noisify.recipes import *
The built-in recipes are designed to work with a wide variety of different object types. Let’s give it a go with a simple Python dict.
>>> test_data = {'this': 1.0, 'is': 2, 'a': 'test!'}
>>> human_noise = human_error(5)
>>> print(human_noise(test_data))
<generator object Noisifier.generate_reports at 0x7f2d67e0f570>
Recipes create Noisifier objects, which generate observations based on what they are given. To get a simple list, cast to list. The built-in recipes take a single ‘severity’ argument; bigger numbers lead to bigger effects on the data.
>>> print(list(human_noise(test_data)))
[{'a': 'tset!', 'this': 2, 'is': 1.0}]
You can also use a noisifier on a list of data.
>>> test_data = [{'test%d' % (index): "This is test run number %d" % index} for index in range(5)]
>>> test_data
[{'test0': 'This is test run number 0'},
{'test1': 'This is test run number 1'},
{'test2': 'This is test run number 2'},
{'test3': 'This is test run number 3'},
{'test4': 'This is test run number 4'}]
>>> print(list(human_noise(test_data)))
[{'test0': 'This is test run number 0'},
{'test1': 'This is test run number 1'},
{'test2': 'hT iis testt unn umber2'},
{'test3': 'This is test run number 3'},
{'test4': 'This is test run number 4'}]
Let’s have a closer look at what human_noise does.
>>> print(human_noise)
{'Noisifier': {'Reporter': {'Attributes': [],
'Faults': [Fault: TypographicalFault {'likelihood': 0.5, 'severity': 0.5},
Fault: ScrambleAttributes {'likelihood': 0.5, 'attribute_identifiers': None}]}}}
That’s a lot of information! The main thing to focus on is the ‘Reporter’ entry. This contains attributes (which we’ll get to later) and Faults. Faults are the methods used to modify the incoming data stream. Here you can see the two being used: typographical faults, which scramble text and numbers, and attribute scrambling, which swaps values between keys in incoming dictionaries.
Let’s have a look at another recipe.
>>> print(machine_error(5))
{'Noisifier': {'Reporter': {'Attributes': [],
'Faults': [Fault: GaussianNoise {'sigma': 0.5},
Fault: InterruptionFault {'likelihood': 0.05}]}}}
Gaussian Noise is pretty self-explanatory. Interruption Fault leads to loss of data: some values will be replaced with None.
Applying Gaussian noise to a string doesn’t make much sense. That’s no issue here though: if noisify doesn’t know how to apply a given fault to a value, it won’t try.
>>> print(list(machine_error(5)(test_data)))
[{'test0': 'This is test run number 0'},
{'test1': 'This is test run number 1'},
{'test2': None},
{'test3': 'This is test run number 3'},
{'test4': 'This is test run number 4'}]
Custom Noisifiers¶
Imagine we have a series of medical records. People’s height and weight are generally measured in metres and kilograms. Occasionally, however, somebody has their weight entered in pounds and their height in inches. Let’s say we’ve built a mechanism to find these wrongly entered values and we want to test it. How do we create this data? And, more importantly, how do we tell when the noisifier has actually changed these values?
We need to create a custom noisifier.
First let’s create some data.
>>> import random
>>> def build_patient_record():
...     return {'height': random.gauss(1.7, 0.1), 'weight': random.gauss(85, 10)}
>>> build_patient_record()
{'weight': 79.0702693462696, 'height': 1.690377702784025}
Now let’s create some conversion functions for metric to imperial.
>>> def kilo_to_pounds(weight):
...     return weight * 2.205
...
>>> def metres_to_inches(height):
...     return height * 39.37
...
>>>
Now let’s create our attributes. These enable us to associate specific faults with specific values of the record. There are many different ways attributes can be looked up and modified; in this case we’re using dictionary lookups.
>>> from noisify.attribute_readers.attribute_readers import DictValue
>>> from noisify.faults import UnitFault
>>> height = DictValue('height', faults=UnitFault(likelihood=0.25, unit_modifier=metres_to_inches))
>>> weight = DictValue('weight', faults=UnitFault(likelihood=0.25, unit_modifier=kilo_to_pounds))
Attributes take an identifier; this can be a dictionary key or the name of an attribute on an object.
Now we build the reporter.
>>> from noisify.reporters import Reporter
>>> patient_reporter = Reporter(attributes=[height, weight])
That was easy. The reporter can be called on individual records, but won’t accept data series.
>>> patient_reporter(build_patient_record())
{'height': 1.8157596382670191, 'weight': 199.97545102729777}
To apply more generally, create a noisifier.
>>> from noisify.recipes import Noisifier
>>> patient_noise = Noisifier(reporter=patient_reporter)
Let’s build some data and noisify it.
>>> true_patients = [build_patient_record() for i in range(5)]
>>> true_patients
[{'height': 1.7831797462380368, 'weight': 84.70459461136014},
{'height': 1.7661108421633465, 'weight': 87.20572747494349},
{'height': 1.5047252739096044, 'weight': 102.7315276194823},
{'height': 1.9371269447064758, 'weight': 78.54807087351945},
{'height': 1.7624795973113694, 'weight': 76.47383227872784}]
>>> processed_patients = list(patient_noise(true_patients))
>>> processed_patients
[{'height': 1.7831797462380368, 'weight': 84.70459461136014},
{'height': 1.7661108421633465, 'weight': 192.2886290822504},
{'height': 59.24103403382112, 'weight': 102.7315276194823},
{'height': 76.26468781309394, 'weight': 78.54807087351945},
{'height': 1.7624795973113694, 'weight': 76.47383227872784}]
Report objects¶
Noisify reporters return report objects. These contain the observation made, but they also contain other information. These are stored as additional attributes on the object.
The faults triggered on an object can be retrieved through the triggered_faults attribute. Continuing from our example above:
>>> for patient in processed_patients:
...     print(patient.triggered_faults)
{'reporter': [], 'height': [], 'weight': []}
{'reporter': [], 'height': [], 'weight': [Fault: UnitFault {'unit_modifier': <function kilo_to_pounds at 0x7f0b1fd17400>}]}
{'reporter': [], 'height': [Fault: UnitFault {'unit_modifier': <function metres_to_inches at 0x7f0b1fd17488>}], 'weight': []}
{'reporter': [], 'height': [Fault: UnitFault {'unit_modifier': <function metres_to_inches at 0x7f0b1fd17488>}], 'weight': []}
{'reporter': [], 'height': [], 'weight': []}
The ground truth is also stored.
>>> for patient in processed_patients:
...     print(patient.truth)
{'height': 1.7831797462380368, 'weight': 84.70459461136014}
{'height': 1.7661108421633465, 'weight': 87.20572747494349}
{'height': 1.5047252739096044, 'weight': 102.7315276194823}
{'height': 1.9371269447064758, 'weight': 78.54807087351945}
{'height': 1.7624795973113694, 'weight': 76.47383227872784}
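With both the observation and the ground truth to hand, a detection mechanism can be scored directly. The following is a minimal, library-free sketch of that idea: plain dictionaries stand in for report objects, the values are rounded copies of the records above, and `looks_imperial` is a hypothetical detector invented for this illustration (it is not part of noisify).

```python
# Sketch only: plain dicts stand in for noisify report objects.
truths = [
    {'height': 1.78, 'weight': 84.7},
    {'height': 1.77, 'weight': 87.2},
    {'height': 1.50, 'weight': 102.7},
]
observed = [
    {'height': 1.78, 'weight': 84.7},
    {'height': 1.77, 'weight': 192.3},   # weight entered in pounds
    {'height': 59.2, 'weight': 102.7},   # height entered in inches
]


def looks_imperial(record):
    """Hypothetical detector: flag values that are implausible in metric units."""
    return record['height'] > 3.0 or record['weight'] > 180.0


def was_changed(truth, observation):
    """A record counts as faulted if any value differs from the ground truth."""
    return any(truth[key] != observation[key] for key in truth)


flagged = [looks_imperial(obs) for obs in observed]
changed = [was_changed(t, o) for t, o in zip(truths, observed)]
correct = sum(f == c for f, c in zip(flagged, changed))
print('%d of %d records classified correctly' % (correct, len(observed)))
```

With real report objects, the comparison would presumably use the `truth` and `triggered_faults` attributes shown above instead of a second list.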
Recipes¶
Recipes are simply factory functions for noisifiers. Consider the built in ‘human_error’ recipe.
>>> def human_error(scale):
...     return Noisifier(
...         reporter=Reporter(
...             faults=[TypographicalFault(likelihood=min(1, 0.1*scale), severity=0.1*scale),
...                     ScrambleAttributes(scrambledness=0.1*scale)]
...         ),
...         faults=None
...     )
>>>
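The scaling inside this recipe is plain arithmetic. As a rough, library-free illustration, the TypographicalFault arguments can be computed on their own; the helper name `human_error_parameters` is made up for this sketch and is not part of noisify.

```python
def human_error_parameters(scale):
    """Mirror the arithmetic used for TypographicalFault in the recipe (illustrative helper)."""
    return {
        'likelihood': min(1, 0.1 * scale),  # probabilities are capped at 1
        'severity': 0.1 * scale,            # severity keeps growing with scale
    }


for scale in (1, 5, 20):
    print(scale, human_error_parameters(scale))
```

At a scale of 5 this gives a likelihood and severity of 0.5, which matches the values printed for `human_error(5)` earlier in this guide. From a scale of 10 upwards the likelihood stays at 1.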
Combining reporters and noisifiers¶
The addition operator will combine reporters/noisifiers into composites which will apply all faults from both original reporters.
>>> from noisify.recipes import machine_error, human_error
>>> print(machine_error(5))
{'Noisifier': {'Reporter': {'Attributes': [],
'Faults': [Fault: GaussianNoise {'sigma': 0.5},
Fault: InterruptionFault {'likelihood': 0.05}]}}}
>>> print(human_error(5))
{'Noisifier': {'Reporter': {'Attributes': [],
'Faults': [Fault: TypographicalFault {'likelihood': 0.5, 'severity': 0.5},
Fault: ScrambleAttributes {'likelihood': 0.5, 'attribute_identifiers': None}]}}}
>>> print(machine_error(5) + human_error(5))
{'Noisifier': {'Reporter': {'Attributes': [],
'Faults': [Fault: GaussianNoise {'sigma': 0.5},
Fault: InterruptionFault {'likelihood': 0.05},
Fault: TypographicalFault {'likelihood': 0.5, 'severity': 0.5},
Fault: ScrambleAttributes {'likelihood': 0.5, 'attribute_identifiers': None}]}}}
For custom faults and adding new datatype handlers to faults, see the advanced section.
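Composition through addition can be pictured with a small stand-alone sketch. `FaultBundle` is a hypothetical class written for this illustration; it is not how noisify implements its reporters, it only shows the general idea of overloading `+` so that two fault lists are concatenated into a new composite.

```python
class FaultBundle:
    """Hypothetical stand-in for a reporter: just an ordered list of fault names."""

    def __init__(self, faults):
        self.faults = list(faults)

    def __add__(self, other):
        # Build a new composite; neither operand is modified.
        return FaultBundle(self.faults + other.faults)

    def __repr__(self):
        return 'FaultBundle(%r)' % self.faults


machine = FaultBundle(['GaussianNoise', 'InterruptionFault'])
human = FaultBundle(['TypographicalFault', 'ScrambleAttributes'])
combined = machine + human
print(combined)
```

The order of the combined list follows the order of the operands, which is consistent with the printed output of `machine_error(5) + human_error(5)` above.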
Advanced Usage¶
This guide covers more advanced topics in noisify.
Defining Faults¶
Faults are defined by subclassing the base Fault class:
>>> from noisify.faults import Fault
>>> import random
>>> class AddOneFault(Fault):
...     def __init__(self, likelihood=1.0):
...         self.likelihood = likelihood
...
...     @register_implementation(priority=1)
...     def add_to_string(self, string_object):
...         return string_object + "1"
...
Let’s unpack this definition.
We have the constructor, which behaves as expected; in this case it adds a likelihood attribute to the object.
We also have an implementation. This describes how a fault will act on the data it is given.
Implementations And The Dispatch Queue¶
The power of noisify lies in its ability to take a large variety of different data types and intelligently apply noise. This mechanism is managed through the Dispatch Queue.
When an implementation is written for a given fault, it is decorated using the @register_implementation(priority=x) decorator. This gives the implementation its place within the queue. When a fault is called upon an unknown object it will attempt to apply each implementation in the queue to it in sequence. If all fail it will return the original object unaffected. Bigger numbers come first in the queue, so in the below example numpy_array will be called before python_numeric.
Let’s look at some source code for an example
>>> class GaussianNoise(AttributeFault):
...     def __init__(self, sigma=0):
...         AttributeFault.__init__(self, sigma=sigma)
...         self.sigma = sigma
...         pass
...
...     @register_implementation(priority=10)
...     def numpy_array(self, array_like_object):
...         import numpy as np
...         noise_mask = np.random.normal(scale=self.sigma, size=array_like_object.size)
...         return array_like_object + noise_mask
...
...     @register_implementation(priority=1)
...     def python_numeric(self, python_numeric_object):
...         return random.gauss(python_numeric_object, self.sigma)
This fault will apply a Gaussian noise filter to the input data. If the python_numeric implementation is called on a numpy array then a single random value will be added to the entire array, which is not the desired behaviour. To fix this, a second implementation with higher priority kicks in for numpy array-like objects; this adds a separate offset to each value independently.
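The queueing behaviour described in this section can be sketched without the library. The code below is an assumption-laden illustration, not noisify's implementation: `dispatch`, `per_element_noise` and `scalar_noise` are hypothetical names, and plain lists stand in for numpy arrays. It tries implementations from highest to lowest priority and returns the input unchanged when every one fails.

```python
import random


def dispatch(implementations, value):
    """Try (priority, function) pairs from highest priority to lowest.

    Returns the first successful result, or the original value if every
    implementation raises TypeError, AttributeError or ImportError.
    """
    ordered = sorted(implementations, key=lambda pair: pair[0], reverse=True)
    for _priority, function in ordered:
        try:
            return function(value)
        except (TypeError, AttributeError, ImportError):
            continue
    return value


def per_element_noise(sequence):
    # Raises TypeError for objects that cannot be iterated, such as a float.
    return [random.gauss(item, 0.5) for item in sequence]


def scalar_noise(number):
    # Raises TypeError for strings, lists and None.
    return random.gauss(number, 0.5)


queue = [(1, scalar_noise), (10, per_element_noise)]
print(dispatch(queue, [1.0, 2.0, 3.0]))  # handled by per_element_noise
print(dispatch(queue, 4.0))              # falls through to scalar_noise
print(dispatch(queue, None))             # no implementation fits: unchanged
```

Because the queue is sorted on every call, the order in which implementations are registered does not matter; only the priority does.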
Dispatch Through Type Annotations¶
Dispatch should be handled through duck typing where possible. However, we recognise that cases exist where explicit dispatch on type is needed. This can be done through type annotations on the relevant implementations as follows.
>>> class TypographicalFault(AttributeFault):
...     @register_implementation(priority=1)
...     def impact_string(self, string_object: str):
...         return typo(string_object, self.severity)
...
...     @register_implementation(priority=1)
...     def impact_int(self, int_object: int):
...         return int(self.impact_string(str(int_object)) or 0)
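One way a decorator could honour such annotations is to inspect the function signature before calling it. The sketch below is a guess at the general technique using the standard `inspect` module; `accepts` and `first_match` are hypothetical helpers, and noisify's own mechanism may differ.

```python
import inspect


def accepts(function, value):
    """Return False when the first parameter is annotated with a type the
    value is not an instance of; unannotated parameters accept anything."""
    parameters = list(inspect.signature(function).parameters.values())
    annotation = parameters[0].annotation
    if annotation is inspect.Parameter.empty:
        return True
    return isinstance(value, annotation)


def shout(text: str):
    return text.upper()


def append_one(number: int):
    return int(str(number) + '1')


def first_match(functions, value):
    """Call the first function whose annotation matches; otherwise return the value."""
    for function in functions:
        if accepts(function, value):
            return function(value)
    return value


print(first_match([shout, append_one], 'testing annotation'))  # TESTING ANNOTATION
print(first_match([shout, append_one], 1234))                  # 12341
```

Checking the annotation first means an implementation is skipped outright, instead of being attempted and failing part-way through.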
Implementation Dispatch And Inheritance¶
Implementations are passed down through inheritance. The main example of this is the AttributeFault fault type, which adds a single implementation which will attempt to map the fault onto all elements of the input object. This can be given to a Reporter to cause it to apply the fault to all of its attributes. Negative priorities can be used in base class implementations to ensure that they are resolved last. Negative priorities should not be used in normal fault implementation annotation.
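The effect of a negative priority in a base class can be illustrated with a toy hierarchy. Everything here is hypothetical (`BaseFault`, `DoubleFault` and the `implementations` list are inventions for this sketch, not noisify classes): the base class contributes a mapping implementation at priority -1, so it only runs once the subclass's own implementation has failed.

```python
class BaseFault:
    # (priority, method name) pairs; subclasses extend this list.
    implementations = [(-1, 'map_over_dict')]

    def impact(self, value):
        ordered = sorted(self.implementations, key=lambda pair: pair[0], reverse=True)
        for _priority, name in ordered:
            try:
                return getattr(self, name)(value)
            except (TypeError, AttributeError):
                continue
        return value

    def map_over_dict(self, mapping):
        # Last resort: apply the fault to every value of a dict-like object.
        return {key: self.impact(item) for key, item in mapping.items()}


class DoubleFault(BaseFault):
    implementations = BaseFault.implementations + [(1, 'double_number')]

    def double_number(self, number):
        return number * 2


fault = DoubleFault()
print(fault.impact(21))
print(fault.impact({'a': 1, 'b': 2.5}))
```

Multiplying a dict by 2 raises a TypeError, so the dictionary falls through to the inherited mapping implementation, which then applies `double_number` to each value.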
The Community Guide¶
Our release process and community support process.
Support¶
File an Issue¶
If you spot a bug in noisify, or would like to suggest an additional feature, you can use our issue tracker on GitHub.
Community Updates¶
GitHub¶
The latest information on the status of the project is available on the GitHub repo.
Release and Version History¶
v1.0¶
- Initial release!
v0.9¶
- Looping behaviour for infinite generation
- Initial documentation
v0.8¶
- Ecosystem support for pandas, pil etc.
v0.7¶
- Type annotation dispatch added to priority dispatch mechanism
v0.6¶
- Renamed to Noisify
- First recipes
v0.5¶
- Priority dispatch mechanism first built
v0.4¶
- First reporter level faults
- Attribute introspection
v0.3¶
- Add composability by overloading addition to faults and reporters
v0.2¶
- Major rewrite from 0.1, focuses purely on fault generation.
v0.1¶
- Simulation and data augmentation together, too messy
Release Process and Rules¶
All code that adds new features will be required to implement unit tests to ensure that it does not introduce unexpected behaviour.
Pull requests that add new features will be very gladly accepted! Try to keep them small if possible; larger requests will naturally take longer for us to review. Please avoid adding any dependencies. If you’re adding support for an extra library, make sure this extra support is optional: importing a library inside an implementation will skip that implementation if the library is not installed, so please use this for ecosystem support.
Most importantly however, thank you for contributing back to Noisify!
Versioning follows the Semantic Versioning framework.
Major Releases¶
The first number in the version number is the major release (i.e. vX.0.0). Changes to the core API that are not backwards compatible will result in a new major release version. Releases of this nature will be infrequent.
Minor Releases¶
Minor releases will change the second number of the version number (i.e. v0.Y.0). These releases will add new features, but will be fully backwards compatible with prior versions.
Hotfix Releases¶
Hotfix releases will change the final number of the version (i.e. v0.0.Z). These releases will consist of bug fixes between versions.
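The three release types can be told apart mechanically from a pair of version tags. The following is a small sketch under the assumption that tags always have the form `vX.Y.Z`; the function names and the example tags are illustrative only.

```python
def parse_version(tag):
    """Split a tag such as 'v1.0.0' into a (major, minor, hotfix) tuple of ints."""
    major, minor, hotfix = tag.lstrip('v').split('.')
    return int(major), int(minor), int(hotfix)


def release_type(old_tag, new_tag):
    """Name the kind of release that takes old_tag to new_tag."""
    old, new = parse_version(old_tag), parse_version(new_tag)
    if new[0] != old[0]:
        return 'major'
    if new[1] != old[1]:
        return 'minor'
    if new[2] != old[2]:
        return 'hotfix'
    return 'none'


print(release_type('v1.0.0', 'v2.0.0'))  # major
print(release_type('v1.0.0', 'v1.1.0'))  # minor
print(release_type('v1.1.0', 'v1.1.1'))  # hotfix
```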
The API Documentation / Guide¶
Full documentation of the noisify API
API reference¶
noisify.attribute_readers package¶
noisify.attribute_readers.attribute_readers module¶
Attribute Readers allow faults to be directed to specific attributes of an input object. These do not need to be literal attributes, they can be values in a dictionary or columns in a database for example, as long as they can be accessed via a key.
class noisify.attribute_readers.attribute_readers.AttributeReader(attribute_identifier, faults=None)
Bases: noisify.helpers.fallible.Fallible
The AttributeReader interface describes a mechanism to read and write values from an object
    get_value(truth_object)
    (Part of the interface) Must return the ground truth for the given attribute of the original object
class noisify.attribute_readers.attribute_readers.DictValue(attribute_identifier, faults=None)
Bases: noisify.attribute_readers.attribute_readers.AttributeReader
Provides support for dictionary value lookups as attributes.
class noisify.attribute_readers.attribute_readers.ObjectAttribute(attribute_identifier, faults=None)
Bases: noisify.attribute_readers.attribute_readers.AttributeReader
Provides support for literal object attributes as attributes.
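The difference between the two readers comes down to how the identifier is resolved. The classes below are hypothetical stand-ins written for illustration (the `Sketch` suffix marks them as not being the library's classes); they implement only a `get_value` method in the spirit of the interface described above.

```python
class DictValueSketch:
    """Hypothetical reader that looks a value up by dictionary key."""

    def __init__(self, attribute_identifier):
        self.attribute_identifier = attribute_identifier

    def get_value(self, truth_object):
        return truth_object[self.attribute_identifier]


class ObjectAttributeSketch:
    """Hypothetical reader that looks a value up by attribute name."""

    def __init__(self, attribute_identifier):
        self.attribute_identifier = attribute_identifier

    def get_value(self, truth_object):
        return getattr(truth_object, self.attribute_identifier)


class Patient:
    def __init__(self, height):
        self.height = height


print(DictValueSketch('height').get_value({'height': 1.7}))
print(ObjectAttributeSketch('height').get_value(Patient(1.7)))
```

Both readers expose the same method, so code that consumes them does not need to know whether the underlying record is a dictionary or an object.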
noisify.attribute_readers.inspection_strategies module¶
Inspection strategies are used by reporters to create attribute_readers for given objects when none are specified.
noisify.attribute_readers.inspection_strategies.dictionary_lookup(unknown_dictionary, attribute_faults=None)
Generates attribute_readers for each key/value pair of a given dictionary, enables reporters to map faults across dictionaries without further specification.
noisify.attribute_readers.inspection_strategies.object_attributes_lookup(unknown_object, attribute_faults=None)
Generates attribute_readers for each attribute of a given object, enables reporters to map faults across objects without further specification. Ignores methods and private attributes marked with ‘_’.
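The filtering rule for objects (skip methods and names starting with ‘_’) can be approximated with the built-in `dir`, `getattr` and `callable`. This is a sketch of the idea only; `public_data_attributes` and `Sensor` are hypothetical, and the library may discover attributes differently.

```python
def public_data_attributes(unknown_object):
    """List attribute names that are neither private nor callable."""
    return sorted(
        name for name in dir(unknown_object)
        if not name.startswith('_')
        and not callable(getattr(unknown_object, name))
    )


class Sensor:
    def __init__(self):
        self.reading = 21.5
        self.unit = 'celsius'
        self._calibration = 0.3   # private: skipped

    def reset(self):              # method: skipped
        self.reading = 0.0


print(public_data_attributes(Sensor()))
```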
noisify.faults package¶
noisify.faults.fault module¶
The base classes for faults.
class noisify.faults.fault.AttributeFault(*args, **kwargs)
Derived base class for attribute_readers, adds mapping behaviour which enables attribute faults to be added at higher levels of data representation.
For example:
>>> from noisify.faults import GaussianNoise
>>> noise = GaussianNoise(sigma=0.5)
>>> noise.impact(100)
100.66812113455995
>>> noise.impact({'A group': 100, 'of numbers': 123})
{'of numbers': 122.83439465953323, 'A group': 99.69284150349345}
class noisify.faults.fault.Fault(*args, **kwargs)
Fault base class.
Requires implementations to be registered in its subclasses. Subclasses register implementations with the “register_implementation(priority=x)” decorator.
All implementations will be attempted using a try/except loop which catches Type, Attribute and Import errors. If no implementations succeed, the Fault will return the original object, unchanged.
By default faults are constitutively active; this can be overridden at instantiation by providing a ‘likelihood’ keyword argument with a probability of activation as a float.
Example Usage:
>>> class AddOneFault(Fault):
...     def condition(self, triggering_object):
...         return True
...
...     @register_implementation(priority=2)
...     def make_uppercase(self, lowercase_string):
...         return lowercase_string.upper()
...
...     @register_implementation(priority=1)
...     def add_to_int_string(self, integer_object):
...         return int(str(integer_object) + "1")
...
>>> adder = AddOneFault()
>>> adder.impact("testing priority")
'TESTING PRIORITY'
>>> adder.impact(1234)
12341
This decorator will also honour any type hints in the decorated function.
Example:
>>> class AddOneFault(Fault):
...     @register_implementation(priority=1)
...     def make_uppercase(self, lowercase_string: str):
...         print('Called uppercase function')
...         return lowercase_string.upper()
...
...     @register_implementation(priority=2)
...     def add_to_int_string(self, integer_object: int):
...         print('Called integer adding function')
...         return int(str(integer_object) + "1")
...
>>> adder = AddOneFault()
>>> adder.impact("testing annotation")
Called uppercase function
'TESTING ANNOTATION'
>>> adder.impact(1234)
Called integer adding function
12341
    apply(not_faulted_object)
    Applies the fault to an object, returns self and the new object if the activation condition is met.
    Parameters: not_faulted_object –
    Returns: self or None, impacted_object
    condition(triggering_object)
    Base condition method, applies fault either constitutively or according to a likelihood argument at instantiation.
    Parameters: triggering_object – Can be used to create object-type dependent activation in overridden methods
    Returns: Boolean of whether or not the fault applies
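The activation rule described for `condition` (always on by default, otherwise on with a given probability) can be sketched as a single comparison against a uniform random number. This is an assumed reading of the behaviour, written as a free function with an injectable random source; it is not the library's code.

```python
import random


def condition(likelihood=None, rng=random):
    """Always active when no likelihood is given, otherwise active with
    the given probability."""
    if likelihood is None:
        return True
    return rng.random() < likelihood


rng = random.Random(0)  # seeded so the sketch is repeatable
activations = sum(condition(0.25, rng) for _ in range(10000))
print(activations / 10000)
```

Over many calls the printed fraction should sit close to the requested likelihood of 0.25. A likelihood of 0 never activates and a likelihood of 1 always does, because `random()` returns values in the half-open range [0, 1).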
noisify.faults.attribute_faults module¶
Basic attribute level faults, mostly basic numeric manipulations. A good place to get started.
class noisify.faults.attribute_faults.CalibrationFault(offset=0)
Subclass of UnitFault, adds a constant offset to the input numeric.
>>> calibration_fault = CalibrationFault(10)
>>> calibration_fault.impact(200)
210
class noisify.faults.attribute_faults.GaussianNoise(sigma=0)
Applies a gaussian noise to a numeric object.
>>> noise = GaussianNoise(sigma=0.5)
>>> noise.impact(27)
28.08656007204934
Numpy array-like objects have noise applied separately to each element.
>>> import numpy as np
>>> test = np.arange(5)
>>> noise.impact(test)
array([0.56983913, 0.92835482, 2.36240306, 2.87398093, 3.92371237])
class noisify.faults.attribute_faults.InterruptionFault(likelihood=0)
Replaces input with None, activates according to set likelihood.
>>> interrupt = InterruptionFault(1.0)
>>> interrupt.impact('This can be anything')
>>>
class noisify.faults.attribute_faults.TypographicalFault(likelihood=0, severity=0)
Applies a rough misspelling to the input using faults.utilities.typo()
>>> from noisify.faults import TypographicalFault
>>> typo_fault = TypographicalFault(1.0, 1)
>>> typo_fault.impact('This is the original text')
'Thhiisith heiginal etxt'
class noisify.faults.attribute_faults.UnitFault(likelihood=1.0, unit_modifier=None)
Applies a user defined adjustment to the input numeric object. Useful for modelling unit errors.
>>> def celsius_to_kelvin(celsius_value):
...     return celsius_value + 273.15
...
>>> kelvin_fault = UnitFault(unit_modifier=celsius_to_kelvin)
>>> kelvin_fault.impact(21)
294.15
noisify.faults.report_faults module¶
Report level faults typically comprise faults that depend on multiple attributes. For example switching attribute values.
class noisify.faults.report_faults.ConfuseSpecificAttributes(attribute1, attribute2, likelihood=0)
Swaps a specific pair of attribute values in a given object
class noisify.faults.report_faults.LoseEntireReport(likelihood=0)
Replaces entire report with None, activates according to set likelihood.
noisify.faults.utilities module¶
Fault utility functions, general purpose code that is used by multiple functions.
noisify.faults.utilities.dropped_scramble(collection, scrambledness, confusion_range)
Scrambles objects in a collection, with a chance to lose some objects
Parameters:
- collection –
- scrambledness – How likely two objects are to be switched
- confusion_range – How far apart objects can be confused with one another
Returns:
noisify.faults.utilities.scramble(collection, scrambledness, confusion_range)
Scrambles the order of objects in a collection using a gaussian distribution, can lead to duplicate objects
Parameters:
- collection –
- scrambledness – How likely two objects are to be switched
- confusion_range – How far apart objects can be confused with one another
Returns:
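The description of `scramble` leaves the exact algorithm open, so the following is only one plausible reading, not the library's implementation. It assumes that each position is replaced, with probability `scrambledness`, by an item from a nearby position whose offset is drawn from a Gaussian with standard deviation `confusion_range`. The name `scramble_sketch` and the `rng` parameter are inventions for this example.

```python
import random


def scramble_sketch(collection, scrambledness, confusion_range, rng=random):
    """For each position, with probability `scrambledness`, take the item
    from a nearby position instead; the offset is Gaussian with standard
    deviation `confusion_range`. Items can therefore be duplicated."""
    items = list(collection)
    result = []
    for index in range(len(items)):
        source = index
        if rng.random() < scrambledness:
            offset = int(round(rng.gauss(0, confusion_range)))
            # Clamp so the source index stays inside the collection.
            source = min(max(index + offset, 0), len(items) - 1)
        result.append(items[source])
    return result


rng = random.Random(1)  # seeded so the sketch is repeatable
print(scramble_sketch('abcdefgh', scrambledness=0.5, confusion_range=2, rng=rng))
```

Because items are copied from neighbouring positions instead of being swapped, the output keeps its length but may repeat some items and omit others, in line with the note above about duplicate objects.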
noisify.helpers package¶
noisify.recipes package¶
Default recipes, these are extremely simple and are mainly to provide examples for developing your own code.
noisify.reporters package¶
noisify.reporters.reporter module¶
class noisify.reporters.reporter.Reporter(attributes=None, attribute_type=<function dictionary_lookup>, faults=None)
Bases: noisify.helpers.fallible.Fallible
The most important class in Noisify!
Reporters define how objects should be changed. They can be as specific or as general as needed.
noisify.reporters.series module¶
class noisify.reporters.series.Noisifier(reporter=None, faults=None)
Bases: noisify.helpers.fallible.Fallible
The Noisifier class handles pipelining objects through an underlying reporter class, it can also be configured to apply faults at the pipeline level, such as confusing elements from one object to another.
    get_series(source_truths, key=None, loop=False)
    Calling the noisifier object directly on an object will call this method.
    Parameters:
    - source_truths – a series of objects (or a single object)
    - key – function which will extract a name from each object to be used as an identifier for the resultant report.
    - loop – whether to generate indefinitely by looping over the source truths
    Returns: a report generator
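The `loop` flag turns a finite series into an endless generator. A minimal sketch of that behaviour, using only `itertools`, might look as follows; `get_series_sketch` and `add_one` are hypothetical names, the sketch handles only a series (not a single object), and it ignores the `key` argument.

```python
import itertools


def get_series_sketch(source_truths, report, loop=False):
    """Yield report(truth) for each truth, repeating forever when loop=True."""
    truths = list(source_truths)
    iterator = itertools.cycle(truths) if loop else iter(truths)
    for truth in iterator:
        yield report(truth)


def add_one(value):
    """Stand-in for a reporter; a real one would apply faults."""
    return value + 1


finite = list(get_series_sketch([1, 2, 3], add_one))
# An infinite series must be sliced; casting it to a list would never finish.
looped = list(itertools.islice(get_series_sketch([1, 2, 3], add_one, loop=True), 7))
print(finite)  # [2, 3, 4]
print(looped)  # [2, 3, 4, 2, 3, 4, 2]
```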