Stegafoto: a lens which embeds audio and text inside images


Stegafoto is a Windows Phone Lens which enables the user to embed a piece of audio or text within an image. This article explains the theory used to embed the content (virtually "loss free") along with technical details about the implementation.

Article Metadata
Code Example
Tested with SDK: Windows Phone 8.0 SDK
Compatibility
Platform(s): Windows Phone 8 and later
Windows Phone 8
Article
Keywords: Steganography
Created: vnuckcha (28 Jan 2013)
Last edited: vnuckcha (31 Jan 2013)


Introduction

Embedding text or audio within an image can make it easier for the photographer to vividly re-live the experience when browsing an image months after it has been taken. The technique used here is "fat-free" (does not increase the size of the image) and does not visibly distort or affect the quality of the image. Briefly put, the technique uses the principle of Steganography with a simple "even-odd" encoding scheme in the least significant bits of the pixels in the image.

The article explains how this result is achieved at two levels. The first part is structured so that even someone with no programming experience should be able to get a feel for how it works - all you need is an open mind. The "Technical Details" parts that follow assume that the reader is familiar with C# and JavaScript programming.

The video below demonstrates this experience:


Walkthrough

To keep the explanation simple, I will use the case of embedding text rather than audio; once this case is understood, the audio part is explained on top of it. So, let's examine the case of embedding text into an image with the following series of steps:

  1. Taking a picture with the Stegafoto lens.
  2. Entering a piece of text to be embedded inside the image. Text reads "Hi".
  3. Transferring the stegafoto from the device to the PC in order to view it as I did not bother to implement an "upload-to-picture-service" feature.
  4. Opening the picture from the web-browser via a loaded page with special Javascript code.
  5. The captured image + the embedded text "Hi". Note that the greenish tint on the image comes from the 3rd party PNG encoder that I am using and not from the data fusion process.

Also note that the method described below is one of many ways of performing this task. While writing the application I used a test-driven approach coupled with rapid prototyping, and this is why I ended up with this series of steps. I did not bother refactoring the algorithm to optimize the solution (e.g. implement a fault-tolerant scheme or a picture-upload functionality) as I am not interested in writing a product, merely keen on proving the concept. Also, I used a 3rd party library, ImageTools for Silverlight, as the PNG encoder for saving the captured image stream to a file. Finally, the code provided with this article is fragile in the sense that it is just a prototype proving a concept. For example, if you run it on your device do not record audio for too long as it will crash the app (4-5s is ok).

Fusing the data

In terms of the above scenario, the algorithm is as follows:

  1. Take a picture and get a reference to the captured pixels.
  2. Get the entered text as a String and convert it to an array of bits. For example, say the user entered the text "Hi". This is then converted to the binary sequence '0100100001101001' according to the following process:
    1. For each character in the string convert it to its ASCII value which is an integer. (So, for 'H' we get 72 and for 'i' we have 105)
    2. Transform the integer representation of that value to a byte representation (binary sequence). That is, transform the integer value to an appropriate sequence of 1's and 0's. Given that it is a byte representation, then for all characters the sequence will be of length 8. (So, for 72 we get 01001000 and for 105 we get 01101001)
    3. So, when we examine the sequence of characters 'H' and 'i' (i.e. the string "Hi") we have the binary sequence of 0100100001101001.
  3. Next, we take the sequence and encode it into the pixels according to the following scheme:
    1. We are going to use all the channels (Alpha, Red, Green and Blue) in a pixel to encode the data. This means that each byte of data (a sequence of 8 bits) will be encoded using 2 pixels. Hence, with this scheme, the maximum amount of data that we can encode is (image width x image height) / 2 bytes. For example, if we use a 640 x 480 image resolution, then we can store (640 x 480 / 2 =) 153600 bytes of data (roughly 7 seconds of audio based on the idea described at the bottom of this section; of course, there are other ways to increase the amount of audio that can be embedded in the image but that is not the purpose of this article).
      1. A side note about the anatomy of a pixel: a pixel is encoded as a sequence of 4 bytes where each byte corresponds to one of the above-mentioned channels. There is no hierarchy among these bytes; they are equal in weight but do distinct things. For example, the channel known as Alpha is responsible for defining the transparency level of that pixel, while the one known as Blue sets the amount of blue color to be used in that very same pixel. So, given a pixel value of 0xFF1A2302 (in hexadecimal (base 16) form), the Alpha channel has a value of 0xFF (255 in integer terms), the Red channel has a value of 0x1A (26), the Green channel has a value of 0x23 (35) and the Blue channel has a value of 0x02 (2). Now, because we are talking about a byte, we are referring to a sequence of 8 bits with a positional order. This means that when we examine a byte, there is a most-significant and a least-significant position. The least-significant bit (i.e. the bit in the least significant position) corresponds to the integer 1 while the most-significant corresponds to the integer 128. For example, the sequence 10000001 represents the integer 129. Therefore, when I change the least-significant bit, I am effectively making the least possible change in the integer representation of that byte. For example, if I change the least significant bit of 10000001, I get 10000000 which is 128 (129 - 128 = 1).
    2. Finally, the encoding only touches the least-significant bit of each channel in a given pixel, using an "even-odd" scheme, so that the distortion of the image is indiscernible. Let's examine this in detail with an example:
      1. Let's say we want to encode the letter 'H'. Per our algorithm, we are going to encode the binary sequence '01001000'. This means that we are going to need 2 (adjacent) pixels (p1 and p2). Hence, in the case of the image that we have captured (starting on the top-left corner of the image), p1 is 0xFF1A2302 and p2 is 0xFF1A2203. In terms of integers (i.e. decimal base system), the ARGB channels of those pixels can be expressed as follows: p1 = {255, 26, 35, 2} and p2 = {255, 26, 34, 3}.
      2. Therefore, for our purpose, we modify the pixels so that the value of the corresponding channel is odd if the corresponding bit is 1, and even otherwise. That is, in terms of an even-odd sequence, '01001000' becomes 'even odd even even odd even even even'. With this in mind, those pixels are transformed into p1 = {254, 25, 34, 2} and p2 = {255, 26, 34, 2}. As you can observe, the change in the values of the ARGB channels between the modified pixels and their original values is minimal. In terms of an even-odd sequence, 'p1p2' which was initially 'odd even odd even odd even even odd' becomes 'even odd even even odd even even even', which is exactly the data we wish to encode. The schematic below illustrates this logic in action, and a short code sketch after it shows the same transformation:
The state of pixels p1 and p2 before and after the fusion process.
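
To make the fusion step concrete, here is a minimal C# sketch using the example channel values above. It is not taken from the app's source; it follows the Encode method shown later in the Technical details, which makes an even value odd by incrementing it, so the Red channel of p1 becomes 27 rather than the 25 used in the prose above (either choice keeps the change within plus or minus one).

// Minimal sketch of the even-odd fusion step, using the example pixels p1 = {255, 26, 35, 2} and p2 = {255, 26, 34, 3}.
using System;

class FusionSketch
{
    // Make a channel value even for bit 0 and odd for bit 1, changing it by at most 1.
    static byte EmbedBit(byte channel, bool bit)
    {
        bool isOdd = (channel % 2) == 1;
        if (isOdd && !bit) return (byte)(channel - 1);
        if (!isOdd && bit) return (byte)(channel + 1);
        return channel;
    }

    static void Main()
    {
        byte[] channels = { 255, 26, 35, 2, 255, 26, 34, 3 }; // ARGB channels of p1 followed by p2
        char c = 'H';                                         // 'H' = 72 = 01001000

        for (int i = 0; i < 8; i++)
        {
            bool bit = ((c >> (7 - i)) & 1) == 1;             // most-significant bit first
            channels[i] = EmbedBit(channels[i], bit);
        }

        Console.WriteLine(string.Join(", ", channels));       // prints: 254, 27, 34, 2, 255, 26, 34, 2
    }
}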

Now, when it comes to recording audio (as in the above video), the process is the same except that we have to convert the sound captured from the microphone into a string, and this is done in the following manner:

  1. Get the recorded sound bite as PCM data and apply the relevant header in order to convert it to WAV. The algorithm for converting PCM to WAV can be found here.
  2. Next take WAV data as a sequence of bytes and convert it to Base64 encoding in order to get a string representation.
  3. Take the string and pass it to the convert-to-binary method mentioned above.
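
Assuming the recorded WAV data sits in the _audioStream used by the recording code below, the whole chain can be summarised with a small helper like this (the method name is illustrative and not part of the app's source):

// Illustrative helper: turn the recorded WAV stream into the bit array that gets fused into the image.
private bool[] ConvertRecordedAudioToBitArray()
{
    byte[] wavBytes = _audioStream.ToArray();                 // PCM samples with the WAV header applied
    string asText = System.Convert.ToBase64String(wavBytes);  // Base64 gives a plain string representation
    return ConvertStringToBitArray(asText);                   // same routine that is used for typed text
}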

Technical details

In this sub-section I am going to give code snippets, with explanations in the comments, showing how the above algorithm has been implemented.

Converting a String to an array of bits (boolean):

private bool[] ConvertStringToBitArray(String str)
{
    bool[] bitArray = new bool[8 * str.Length];
    int j = 0;
    foreach (char c in str)
    {
        for (int i = 0; i < 8; i++)
        {
            // Test each bit of the character, most significant bit first.
            bitArray[j + i] = ((c >> (7 - i)) & 0x00000001) == 1;
        }
        j += 8;
    }

    return bitArray;
}

Encoding a bit in a pixel:

/**
 * Method below is called from the following context:
 *
 *   for (int i = 0; i < embeddedDataAsBitArray.Length; i++)
 *   {
 *       // pngImage.Pixels[i] refers to one channel of a pixel. E.g. when i % 4 == 1 we are accessing the Red channel of the pixel.
 *       pngImage.Pixels[i] = Encode(embeddedDataAsBitArray[i], pngImage.Pixels[i]);
 *   }
 */
private byte Encode(bool bit, byte val)
{
    if (val % 2 == 1)
    {
        if (bit == false) // byte is odd and we would like to write a 0
        {
            val--;
        }
    }
    else
    {
        if (bit == true) // byte is even and we would like to write a 1
        {
            val++;
        }
    }
    return val;
}
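
For completeness, the inverse operation on the reading side is just a parity test. In this article the actual extraction is done in JavaScript (shown further down), but expressed in C# it would look roughly like this:

// Recover an embedded bit from a channel value: odd means 1, even means 0.
private bool Decode(byte val)
{
    return (val % 2) == 1;
}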

Converting PCM to WAV (the header layout follows the WAVE format description linked in the code comments):

Encoding ENCODING = System.Text.Encoding.UTF8;

// User has pressed the 'Record Audio' button:
private void RecordAudio(object sender, GestureEventArgs e)
{
    e.Handled = true;
    Debug.WriteLine("Recording Audio ...");
    if (_mic.State == MicrophoneState.Stopped)
    {
        Debug.WriteLine("Audio Sample Rate: {0}", _mic.SampleRate);
        _audioStream.SetLength(0);

        // Write a header to the stream so that we end up with a WAV file.
        // This document was used to create the header: https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
        _audioStream.Write(ENCODING.GetBytes("RIFF"), 0, 4);
        // ChunkSize: filled in later, once the recording is done and we know the size of the data.
        _audioStream.Write(BitConverter.GetBytes(0), 0, 4);
        // WAVE is made up of 2 parts: the format chunk ("fmt ") which describes the audio data (channels,
        // bitrate, etc.) and the data chunk ("data") which holds the actual audio samples.
        _audioStream.Write(ENCODING.GetBytes("WAVE"), 0, 4);
        // Writing the format part:
        _audioStream.Write(ENCODING.GetBytes("fmt "), 0, 4);
        // Subchunk1Size: 16 for PCM audio.
        _audioStream.Write(BitConverter.GetBytes(16), 0, 4);
        // AudioFormat: 1 = PCM.
        _audioStream.Write(BitConverter.GetBytes((short)1), 0, 2);
        // NumChannels: 1 (mono).
        _audioStream.Write(BitConverter.GetBytes((short)1), 0, 2);
        _audioStream.Write(BitConverter.GetBytes(_mic.SampleRate), 0, 4);
        // ByteRate = SampleRate * NumChannels * BytesPerSample.
        _audioStream.Write(BitConverter.GetBytes(_mic.SampleRate * BYTES_PER_SAMPLE), 0, 4);
        _audioStream.Write(BitConverter.GetBytes((short)BYTES_PER_SAMPLE), 0, 2);
        _audioStream.Write(BitConverter.GetBytes((short)BITS_PER_SAMPLE), 0, 2);
        // Writing the data part:
        _audioStream.Write(ENCODING.GetBytes("data"), 0, 4);
        // Subchunk2Size: also filled in once the recording is done.
        _audioStream.Write(BitConverter.GetBytes(0), 0, 4);
        _mic.Start();
        StopRecordingButton.Visibility = Visibility.Visible;
        RecordAudioButton.Visibility = Visibility.Collapsed;
    }
}

// User has pressed the 'Stop Audio Recording' button:
private void StopRecording(object sender, GestureEventArgs e)
{
    e.Handled = true;
    Debug.WriteLine("Stop recording Audio ...");
    if (_mic.State == MicrophoneState.Started)
    {
        _mic.Stop();
        _audioStream.Flush();
        long endOfStream = _audioStream.Position;
        int streamLength = (int)_audioStream.Length;
        _audioStream.Seek(4, SeekOrigin.Begin); // Move the 'cursor' to the 1st placeholder in the WAVE header.
        _audioStream.Write(BitConverter.GetBytes(streamLength - 8), 0, 4); // ChunkSize = stream size minus the first 8 header bytes.
        _audioStream.Seek(40, SeekOrigin.Begin); // Move the 'cursor' to the 2nd placeholder, 36 bytes further on.
        _audioStream.Write(BitConverter.GetBytes(streamLength - 44), 0, 4); // Subchunk2Size = stream size minus the 44-byte header.
        _audioStream.Seek(endOfStream, SeekOrigin.Begin);
        Debug.WriteLine("Recorded {0}s of audio", _mic.GetSampleDuration(streamLength));
        RecordAudioButton.Visibility = Visibility.Visible;
        StopRecordingButton.Visibility = Visibility.Collapsed;
        // Converting the WAV data into String format:
        _capturedHiddenData = System.Convert.ToBase64String(_audioStream.ToArray());
        if (!String.IsNullOrEmpty(_capturedHiddenData))
        {
            _capturedType = AUDIO_TYPE;
            ShareImageButton.IsEnabled = true;
        }
        else
        {
            _capturedType = UNDEFINED_TYPE;
            ShareImageButton.IsEnabled = false;
        }
    }
}

Extracting embedded data

In this section I discuss how the embedded data is retrieved from the image. The most likely context for this is the user browsing photos hosted on a web service, so the viewer application would be a web browser. The photos could also be hosted locally (as in my examples above), and the browser remains the ideal tool in either case. In an earlier experiment I implemented the "decoding" with the basic <canvas> element, but this was not successful: the get-pixels function (getImageData) of the <canvas> object returns pixels with premultiplied alpha, which for us means that the embedded data is destroyed. Some vendors provide flags to turn off the premultiplication, but that would mean an almost bespoke solution for each browser, and more often than not turning off the flags does not help anyway. I therefore took the WebGL route, which turned out to be much simpler (I tested the solution on Firefox and Chrome and the script worked flawlessly without any alterations). Note that Internet Explorer does not support WebGL, but an equivalent solution could be built with Silverlight 5. You can check whether your browser supports WebGL by visiting the Can I Use website.

The algorithm for this part of the solution is equally simple:

  1. First prepare your canvas so that it can leverage the WebGL APIs.
  2. Get a reference to the resource (i.e. URI to the image) and render it as a texture on the canvas.
  3. Then, read the pixels of the texture (not off the canvas via the getImageData() method) and create a binary sequence out of the ARGB channels of the pixels based on the even-or-odd nature of those values.
  4. Finally, convert the sequence of bits into a String and process according to the type of data we are dealing with. If the type of data is:
    1. pure text, then display it somewhere (e.g. step 5 in the Walkthrough of this article).
    2. audio, then prepend "data:audio/wav;base64," to the data and set the resulting string as the source of the <Audio> element. As the WAV data was Base64-encoded by the Lens and HTML media elements support data URLs, the sound comes through when the user presses the play button.

Technical details

Because this solution permits the user to embed either text or audio, the system must be able to distinguish one from the other in order to deliver the appropriate experience to the user. This issue is tackled by fixing the format in which the embedded data is encoded. For the sake of the demo, a very simple scheme is used: a header is prepended to the data before it is encoded. The header has the following structure: ST#<type_of_data>#<length_of_actual_data_in_bytes>#. For example, in the case described in the Walkthrough of this article, the encoded data is: ST#T#2#Hi. With this in mind, let's have a look at some code snippets to see how the above algorithm has been implemented.
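
As a side note on the encoding side, the header itself is trivial to construct before the payload is converted to bits and fused into the pixels. The Lens code for this is not shown in this article; below is a minimal illustrative sketch based on the format above (the helper name BuildPayload is made up for this example):

// Illustrative sketch: builds the "ST#<type>#<length>#<data>" payload described above,
// which would then be passed to ConvertStringToBitArray and encoded into the pixels.
private string BuildPayload(string typeOfData, string data)
{
    // typeOfData is "T" for plain text or "A" for Base64-encoded audio.
    return "ST#" + typeOfData + "#" + data.Length + "#" + data;
}

// Example: BuildPayload("T", "Hi") returns "ST#T#2#Hi".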

Preparing <Canvas> to use WebGL:

var _canvas = null;
var _gl = null;
var _shaderProgram = null;

// Creates the shader based on the ID of the shader description found in the DOM:
function GetShader(id) {
    var shader;

    var shaderScriptNode = document.getElementById(id);
    if (!shaderScriptNode) {
        throw "Could not find a shader script descriptor with ID [" + id + "]";
    }

    // Walk down the node and construct the shader script:
    var script = "";
    var currChild = shaderScriptNode.firstChild;
    while (currChild) {
        if (currChild.nodeType == currChild.TEXT_NODE) {
            script += currChild.textContent;
        }
        currChild = currChild.nextSibling;
    }

    // Identify the type of shader (Vertex or Fragment):
    if (shaderScriptNode.type == "x-shader/x-vertex") {
        shader = _gl.createShader(_gl.VERTEX_SHADER);
    } else if (shaderScriptNode.type == "x-shader/x-fragment") {
        shader = _gl.createShader(_gl.FRAGMENT_SHADER);
    } else {
        throw "Could not find a valid shader-type descriptor";
    }

    // Load the script into the shader object and compile:
    _gl.shaderSource(shader, script);
    _gl.compileShader(shader);
    if (!_gl.getShaderParameter(shader, _gl.COMPILE_STATUS)) {
        throw "Compilation error in script [" + id + "]: " + _gl.getShaderInfoLog(shader);
    }
    return shader;
}

function CreateShaderProgram(vsId, fsId) {
    var vs = GetShader(vsId);
    var fs = GetShader(fsId);
    var shaderProgram = _gl.createProgram();
    _gl.attachShader(shaderProgram, vs);
    _gl.attachShader(shaderProgram, fs);
    _gl.linkProgram(shaderProgram);

    if (!_gl.getProgramParameter(shaderProgram, _gl.LINK_STATUS)) {
        throw "Unable to create shader program with provided shaders.";
    }
    _gl.useProgram(shaderProgram);
    return shaderProgram;
}

function InitWebGL(canvasId, VertexShaderScriptId, FragmentShaderScriptId) {
    _canvas = document.getElementById(canvasId);

    if (!_canvas) {
        throw "Could not locate a canvas element with id '" + canvasId + "'";
    } else {
        try {
            _gl = _canvas.getContext("webgl") || _canvas.getContext("experimental-webgl");
            console.log("Created WebGL context ...");
            // Crucial for this use case: make sure the pixel values are not altered when the image is uploaded as a texture.
            _gl.pixelStorei(_gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, false);
            _gl.pixelStorei(_gl.UNPACK_COLORSPACE_CONVERSION_WEBGL, false);
            _shaderProgram = CreateShaderProgram(VertexShaderScriptId, FragmentShaderScriptId);
            console.log("Created Shader Program ...");
        } catch (e) {
            _gl = null;
            throw "Err: WebGL not supported by this browser.";
        }
    }
}

// Initializing the canvas called 'WorkingArea':
function Initialize() {
    try {
        InitWebGL("WorkingArea", "ImgVertexShader", "ImgPixelShader");

        // Set the canvas dimensions in the Shader Program (Vertex Shader):
        _gl.uniform2f(_gl.getUniformLocation(_shaderProgram, "uCanvasRes"), _canvas.width, _canvas.height);

        // Create a buffer for the Texture Coordinates:
        _gl.bindBuffer(_gl.ARRAY_BUFFER, _gl.createBuffer());
        _gl.bufferData(_gl.ARRAY_BUFFER, new Float32Array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]), _gl.STATIC_DRAW);
        var texCoordLocation = _gl.getAttribLocation(_shaderProgram, "aTextureCoord");
        _gl.enableVertexAttribArray(texCoordLocation);
        _gl.vertexAttribPointer(texCoordLocation, 2, _gl.FLOAT, false, 0, 0);

        // Create a texture in order to load the image into it later:
        _gl.bindTexture(_gl.TEXTURE_2D, _gl.createTexture());
        _gl.texParameteri(_gl.TEXTURE_2D, _gl.TEXTURE_WRAP_S, _gl.CLAMP_TO_EDGE);
        _gl.texParameteri(_gl.TEXTURE_2D, _gl.TEXTURE_WRAP_T, _gl.CLAMP_TO_EDGE);
        _gl.texParameteri(_gl.TEXTURE_2D, _gl.TEXTURE_MIN_FILTER, _gl.NEAREST);
        _gl.texParameteri(_gl.TEXTURE_2D, _gl.TEXTURE_MAG_FILTER, _gl.NEAREST);

        // Create a buffer for the rectangle that will "host" the texture:
        _gl.bindBuffer(_gl.ARRAY_BUFFER, _gl.createBuffer());
        var positionLocation = _gl.getAttribLocation(_shaderProgram, "aVertexPosition");
        _gl.enableVertexAttribArray(positionLocation);
        _gl.vertexAttribPointer(positionLocation, 2, _gl.FLOAT, false, 0, 0);
    } catch (e) {
        alert(e);
    }
}

Defining the shader objects with GLSL so that the image can be rendered correctly as a texture:

<script id="ImgPixelShader" type="x-shader/x-fragment">
precision mediump float;
uniform sampler2D uImage;
varying vec2 vTextureCoord;
 
void main() {
gl_FragColor = texture2D(uImage, vTextureCoord);
}
</script>
<script id="ImgVertexShader" type="x-shader/x-vertex">
attribute vec2 aVertexPosition;
attribute vec2 aTextureCoord;
uniform vec2 uCanvasRes;
varying vec2 vTextureCoord;
 
void main() {
// The coordinate system is a different geometry to how we usually treat images. In an image the "origin" of the coordinate
// system is on the top-left corner. In this system, the origin is at the 'center' with a [-1, 1] range. Hence, we must perform
// the following transformation below in order to calibrate things. The end result is a coordinate system called the Clip-Space
// coordinate system with the origin at the bottom-left.
vec2 inCSCoordPos = ((aVertexPosition/uCanvasRes) * 2.0) - 1.0;
gl_Position = vec4(inCSCoordPos * vec2(1, -1), 0, 1);
vTextureCoord = aTextureCoord;
}
</script>

Extracting the embedded data from the loaded texture:

function ConvertBitArrayToString(bitArr) {
    var str = "";
    for (var i = 0; i < bitArr.length; i += 8) {
        var val = 0;
        for (var j = 0, shiftCtr = 7; j < 8; j++, shiftCtr--) {
            val += (bitArr[i + j] << shiftCtr);
        }
        str += String.fromCharCode(val);
    }
    return str;
}

function ReadLine(lineNumber) {
    var bitArray = new Array();
    var pixelArrayInRGBA = new Uint8Array(4 * _canvas.width);
    _gl.readPixels(0, lineNumber, _canvas.width, 1, _gl.RGBA, _gl.UNSIGNED_BYTE, pixelArrayInRGBA);
    for (var i = 0; i < pixelArrayInRGBA.length; i++) {
        // The embedded bit is simply the parity of the channel value.
        bitArray[i] = pixelArrayInRGBA[i] % 2;
    }
    return ConvertBitArrayToString(bitArray);
}

function DecodeAsStegaFotoImage() {
    console.log("Decoding as a StegaFoto ...");
    // readPixels counts rows from the bottom of the canvas, so the 1st pixel row of the image
    // (where the embedded data starts) is read at index _canvas.height - 1:
    var lineCounter = _canvas.height - 1;
    var line = ReadLine(lineCounter);
    var indexOfDelimeter = line.indexOf(_MESSAGE_DELIM);
    if (indexOfDelimeter > 0) {
        var arrParts = line.substring(0, indexOfDelimeter).split(":");
        var lengthOfData = parseInt(arrParts[2], 10);
        var typeOfEmbeddedData = arrParts[1];
        var data = "";
        var fromIdx = indexOfDelimeter + _MESSAGE_DELIM.length;
        var numberOfLinesToRead = ((lengthOfData * 2) + fromIdx) % _canvas.width;
        if (numberOfLinesToRead > 0) {
            for (var i = 1; i <= numberOfLinesToRead; i++) {
                line += ReadLine(lineCounter - i);
            }
        }
        data = line.slice(fromIdx, fromIdx + lengthOfData);
        // 'T' means the data must be interpreted as pure text, while 'A' means it must be treated as the data part of a data URL:
        if (typeOfEmbeddedData == "T") {
            document.getElementById("EmbeddedMessage").value = data;
        } else if (typeOfEmbeddedData == "A") {
            document.getElementById("AudioPlayer").src = "data:audio/wav;base64," + data;
        }
    }
}

Conclusion

I hope that the way I have structured the article is not too overwhelming if you are not familiar with programming. Hopefully, the video as well as the images (schematics) support your understanding of how the Stegafoto app works.
