A Tale of Two Strings: The Deceptive Power of Unicode's Invisible Characters

Consider these two strings of text:

Hello, World! Hello, World!

They look identical. If you were to copy and paste them, they would seem to be the same. But what if I told you that one of them contains a hidden, five-letter secret, while the other is completely clean? This isn’t a trick of the eye; it’s a demonstration of one of the most subtle and clever forms of text steganography, using the hidden power of Unicode’s invisible characters.

This is the exact principle you must master in Venatus Level 6. It’s a technique that preys on the difference between what we see on our screens and what the computer actually stores in its memory.

What is Unicode and Why Does It Have Invisible Characters?

To understand the trick, we first need to understand Unicode. In the early days of computing, text was simple (mostly English characters), and systems like ASCII could represent everything with just 128 different codes. Today, our computers need to handle thousands of characters from virtually every language on Earth, plus emojis, mathematical symbols, and more.

Unicode is the universal standard that assigns a unique number (a “code point”) to every single one of these characters. This is what allows a computer in Japan to correctly display text written on a computer in Brazil.

However, writing systems are complex. Some languages require characters to join together or change shape based on their neighbors. To handle this, the Unicode standard includes a set of special invisible characters (formally “format characters”, often loosely called “control characters”) that are not meant to be seen. They provide instructions to the text rendering engine.

The most famous of these are the zero-width characters:

* Zero-Width Space (U+200B): This character tells a computer where it’s okay to break a line, without creating a visible space. It’s similar to a soft hyphen, except that no hyphen appears at the break.
* Zero-Width Joiner (U+200D): This character tells two other characters to “join” together and form a single glyph or ligature. It’s famously used to create many complex emojis, like the “person with red hair” emoji, which is actually [Person] + [Zero-Width Joiner] + [Red Hair].
* Zero-Width Non-Joiner (U+200C): This does the opposite, telling two characters that would normally join to stay separate.

Because these characters have no visual width, they are completely invisible to the human eye in most applications.
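A minimal Python sketch makes the gap between appearance and storage concrete. The position of the zero-width space here is an assumption chosen for illustration, not the placement used in the two strings at the top of this article.

```python
# Two strings that render the same in most applications but differ in memory.
clean = "Hello, World!"
# Assumption for illustration: a single zero-width space (U+200B) after the "H".
secret = "H\u200bello, World!"

print(clean == secret)           # False: the stored code points differ
print(len(clean), len(secret))   # 13 14
# ascii() escapes every non-ASCII character, so the hidden one becomes visible.
print(ascii(secret))             # 'H\u200bello, World!'
```

Comparing lengths or printing an escaped form is often the quickest first check when two strings look the same but do not compare equal.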

Hiding Data in the Gaps

The steganographic potential here is immediately obvious. If you can insert invisible characters into a piece of text, you can encode a message.

Let’s revisit our example: Hello, World!

The “clean” version is just 13 characters long. The “secret” version could be constructed like this (with [ZWSP] representing an invisible Zero-Width Space):

H[ZWSP][ZWSP][ZWSP][ZWSP]e[ZWSP][ZWSP]...

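By using a simple binary code (e.g., one zero-width space = 0, two zero-width spaces = 1), you can embed a message bit by bit in the gaps between visible characters. With one bit per gap, the 12 gaps in Hello, World! carry only 12 bits, so a longer secret needs either a longer cover text or a denser code in which the length of each run (like the run of four above) carries more than one bit. To a human, the text is unchanged. To a computer program reading the raw data, the string is full of hidden information.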
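The one-or-two scheme can be sketched in a few lines of Python. The function names `hide_bits` and `reveal_bits` are hypothetical, and this is only one possible reading of the scheme (one bit per gap, filled from the left), not necessarily the encoding used in the puzzle.

```python
import re

ZWSP = "\u200b"  # zero-width space

def hide_bits(cover: str, bits: str) -> str:
    """Put one ZWSP (bit 0) or two (bit 1) into successive gaps of `cover`."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    if len(bits) > max(len(cover) - 1, 0):
        raise ValueError("cover text has too few gaps for this many bits")
    out = []
    for i, ch in enumerate(cover):
        out.append(ch)
        if i < len(bits):
            out.append(ZWSP * (2 if bits[i] == "1" else 1))
    return "".join(out)

def reveal_bits(stego: str) -> str:
    """Read each run of ZWSPs back as a bit: a run of one is 0, of two is 1."""
    bits = []
    for run in re.findall(ZWSP + "+", stego):
        if len(run) > 2:
            raise ValueError("run length not used by this scheme")
        bits.append("1" if len(run) == 2 else "0")
    return "".join(bits)

stego = hide_bits("Hello, World!", "1011")
print(reveal_bits(stego))                          # 1011
print(stego.replace(ZWSP, "") == "Hello, World!")  # True
```

Because a 13-character cover has only 12 gaps, this particular scheme tops out at 12 bits for Hello, World!, which is why practical encoders usually pack more information into each gap.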

Real-World Applications and Dangers

This isn’t just a fun puzzle. This technique is used in the real world for both clever and malicious purposes.

1. Social Media Watermarking

Have you ever wondered how a platform like TikTok might tell that a post has been copied from another account? One technique available for text is embedding an invisible, unique tracking ID made of zero-width characters into a title or description when it is first published. If someone copies and pastes that text into their own upload, the invisible watermark comes with it, allowing the platform to flag the copy as unoriginal content. This marks only the text, not the video itself, so it survives copy and paste but not retyping.
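As a rough sketch of how such a text watermark could work (an illustration only, not a description of any real platform's system), the following Python appends a numeric ID as a run of invisible characters. The names `tag_text` and `read_tag`, the 16-bit width, and the choice of ZWNJ for 0 and ZWJ for 1 are all assumptions.

```python
from typing import Optional

ZERO, ONE = "\u200c", "\u200d"  # assumed mapping: ZWNJ = 0, ZWJ = 1

def tag_text(text: str, tracking_id: int, width: int = 16) -> str:
    """Append `tracking_id` to `text` as `width` invisible characters."""
    if not 0 <= tracking_id < 2 ** width:
        raise ValueError("tracking_id does not fit in the chosen width")
    bits = format(tracking_id, f"0{width}b")
    return text + "".join(ONE if b == "1" else ZERO for b in bits)

def read_tag(text: str) -> Optional[int]:
    """Recover the ID from any ZWNJ/ZWJ characters in `text`, if present."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return int(bits, 2) if bits else None

tagged = tag_text("My holiday video", 48879)
print(read_tag(tagged))               # 48879
print(read_tag("My holiday video"))   # None
```

A sketch this simple would misread text that already contains joiners for legitimate reasons, such as emoji sequences, so a real scheme would need a more careful marker.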

2. Phishing and URL Deception

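This is where the technique becomes dangerous. An attacker can craft a link that looks identical to a real one but contains an invisible character. For example:

* Real Domain: apple.com
* Malicious Domain: app​le.com (with a zero-width space after the second ‘p’)

To a user, these look exactly the same in many email clients and browsers. When they click the link, they think they are going to Apple’s website, but they may actually be taken to a phishing site controlled by the attacker. Domain registration rules (IDNA) generally disallow the zero-width space in a registered name, so in practice the invisible character is more likely to appear in the link as written in an email or message, where it can also throw off scanners that compare URLs against lists of known addresses.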
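One defensive check is to scan a URL for invisible format characters before trusting it. The sketch below uses Python's standard `unicodedata` module; the function name `find_invisible` is hypothetical, and treating every character in Unicode category Cf as suspicious is a simplifying assumption.

```python
import unicodedata

def find_invisible(url: str) -> list:
    """Return (index, code point) for every format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(url)
        if unicodedata.category(ch) == "Cf"
    ]

print(find_invisible("https://apple.com"))        # []
print(find_invisible("https://app\u200ble.com"))  # [(11, 'U+200B')]
```

A non-empty result does not prove malicious intent, but in a hostname it is a strong signal that the link deserves a closer look.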

3. Bypassing Content Filters

Automated moderation bots often work by scanning text for forbidden keywords. By inserting zero-width characters into a banned word (e.g., F[ZWSP]O[ZWSP]R[ZWSP]B[ZWSP]I[ZWSP]D[ZWSP]D[ZWSP]E[ZWSP]N), an attacker can bypass the filter. The word looks normal to human readers, but the automated script sees a sequence of individual letters separated by other characters and fails to recognize the keyword.
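The bypass, and one common countermeasure, can be sketched in Python. The block list and function names are hypothetical; the countermeasure shown is to strip format characters (Unicode category Cf) from a copy of the text before matching, which is one approach among several.

```python
import unicodedata

BANNED = {"forbidden"}  # hypothetical block list

def naive_filter(text: str) -> bool:
    """Flag text that contains a banned word as a plain substring."""
    lowered = text.lower()
    return any(word in lowered for word in BANNED)

def strip_format_chars(text: str) -> str:
    """Drop invisible format characters (category Cf), such as U+200B."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def hardened_filter(text: str) -> bool:
    """Match against a cleaned copy; the original text is left untouched."""
    return naive_filter(strip_format_chars(text))

evasive = "\u200b".join("FORBIDDEN")  # F[ZWSP]O[ZWSP]R[ZWSP]...
print(naive_filter(evasive))     # False: the keyword is split apart
print(hardened_filter(evasive))  # True: the invisible characters are removed first
```

Stripping is applied only to the copy used for matching, since removing joiners from the displayed text could break emoji sequences and scripts that rely on them.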

The Tools of the Detective

As you discovered in Level 6, you cannot trust your eyes to solve this kind of puzzle. You need a tool that can show you the raw, unfiltered truth of the data.

* Unicode Character Inspectors: There are many free online tools where you can paste a string of text, and they will break it down character by character, revealing the name and code point of every single one, including the invisible ones.
* “Diff” Checkers: A “difference checker” tool is designed to compare two pieces of text and highlight what’s different. When you paste the two “identical” strings from a puzzle, it will immediately highlight the invisible characters present in one but not the other.
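Both kinds of tool are easy to approximate with Python's standard library. This is a minimal sketch rather than a replacement for a full inspector; the function names `inspect_chars` and `extra_chars` are hypothetical.

```python
import difflib
import unicodedata

def inspect_chars(text: str) -> list:
    """List the code point and Unicode name of every character in `text`."""
    return [f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def extra_chars(clean: str, suspect: str) -> list:
    """Code points that a character-level diff finds only in `suspect`."""
    matcher = difflib.SequenceMatcher(None, clean, suspect, autojunk=False)
    found = []
    for op, _a1, _a2, b1, b2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            found.extend(f"U+{ord(ch):04X}" for ch in suspect[b1:b2])
    return found

for row in inspect_chars("H\u200bi"):
    print(row)
# U+0048  LATIN CAPITAL LETTER H
# U+200B  ZERO WIDTH SPACE
# U+0069  LATIN SMALL LETTER I

print(extra_chars("Hi", "H\u200bi"))  # ['U+200B']
```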

This technique is a powerful lesson in digital forensics: what you see is not always what you get. True analysis requires looking beyond the surface and examining the underlying data itself.