CTF-Writeups

AI Translator


Challenge description

I found this dusty old AI translator device in a pawn shop, so I tried translating the flag, and it worked! Later I wanted to play with it some more, but totally forgot what the flag is. Now all I have is /flag.wav, but maybe you can help me translate it back?
IP: 10.240.2.50
Hint: AI audio translator / /flag.wav on the box.

The goal: recover the original MCTF{...} flag from a WAV file that was generated by an “AI translator” web service.


Recon

First, scan the box to see what’s running:

nmap -sC -sV 10.240.2.50

Result (relevant part):

5000/tcp open  http    Werkzeug httpd 3.1.3 (Python 3.12.12)
|_http-title: AI Translator
|_http-server-header: Werkzeug/3.1.3 Python/3.12.12

So it’s a small Python web app (very likely Flask) on port 5000, serving the “AI Translator”.

Grab the WAV file mentioned in the description:

curl -v http://10.240.2.50:5000/flag.wav -o flag.wav
file flag.wav

Output:

flag.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz

So /flag.wav is indeed a valid audio file.
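
The same header check can be done from Python, which is handy because the later analysis scripts rely on the sample rate and sample width. This is a minimal sketch using only the standard-library wave module; the helper name wav_info is made up for illustration.

```python
import wave

def wav_info(path):
    """Return (channels, sample_width_bytes, sample_rate_hz, duration_s)."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        width = wf.getsampwidth()   # bytes per sample, 2 means 16-bit PCM
        rate = wf.getframerate()
        frames = wf.getnframes()
    return channels, width, rate, frames / rate
```

For the file above, wav_info("flag.wav") should report 1 channel, a width of 2 bytes and a rate of 44100 Hz, in line with the output of file.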


Understanding the web app

Check the main page:

curl -s http://10.240.2.50:5000/ | head -n 80

Relevant part of the HTML/JS:

<form id="form">
  <label for="inputText">Enter your text:</label>
  <textarea id="text" required maxlength="100"></textarea>
  <button type="submit">Translate</button>
  ...
</form>

<script>
document.getElementById('form').addEventListener('submit', async function(e) {
  e.preventDefault();
  const text = document.getElementById('text').value;
  const resp = await fetch('/translate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });
  const blob = await resp.blob();
  ...
});
</script>

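Key points:

  • The form sends the entered text as JSON ({"text": ...}) in a POST request to /translate.
  • The response is read as a binary blob, i.e. the generated audio.
  • The input is limited to 100 characters (maxlength="100"), at least on the client side.
  • The part of the page shown above only offers the text → audio direction.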

So we cannot just POST audio and get text back. Instead, we need to reverse what the translator does.
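
The request the page makes can be reproduced from Python as well. The following is a minimal sketch with urllib from the standard library; the function names build_translate_request and translate are hypothetical, and it assumes the endpoint answers with the raw WAV bytes, as the curl examples in this writeup suggest.

```python
import json
import urllib.request

BASE_URL = "http://10.240.2.50:5000"  # address from the challenge description

def build_translate_request(text, base_url=BASE_URL):
    """Build the same JSON POST that the page's JavaScript sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/translate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text, out_path, base_url=BASE_URL, timeout=10):
    """Send text to the translator and save the response body to out_path."""
    req = build_translate_request(text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)  # number of bytes written
```

Calling translate("MCTF25{", "prefix.wav") would be the Python counterpart of a curl request against /translate.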


Inspecting the audio

On the VM, play or view the audio:

# One of these, depending on what’s installed
aplay flag.wav
# or
ffplay flag.wav

Subjective analysis: the recording is a series of short beeps of varying pitch, separated by brief silences.

You can also open flag.wav in Audacity, switch the track to Spectrogram view, and visually see the repeating beep “bars” that hint at a structured encoding.

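With some analysis tooling (Python, FFT, etc.), you can observe:

  • The beeps and the gaps between them have very regular lengths, so the timing carries no information; the frequencies do.
  • The first two and the last two beeps act as start and end markers.
  • Each character in between is encoded as a pair of beeps, and there are 24 such pairs.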

So the flag encoded in flag.wav is 24 characters long.
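
The length follows from simple arithmetic on the number of detected beeps. The sketch below assumes the layout used by the scripts in the appendix (a two-beep start marker, a two-beep end marker, and two beeps per character); the function name char_count is made up for illustration.

```python
MARKER_BEEPS = 2    # beeps in the start marker and again in the end marker (assumed layout)
BEEPS_PER_CHAR = 2  # each character is a pair of beeps

def char_count(total_beeps):
    """Number of encoded characters for a given total number of beeps."""
    payload = total_beeps - 2 * MARKER_BEEPS
    if payload < 0 or payload % BEEPS_PER_CHAR:
        raise ValueError(f"unexpected beep count: {total_beeps}")
    return payload // BEEPS_PER_CHAR
```

Under these assumptions a 24-character flag corresponds to 2 + 48 + 2 = 52 beeps, so char_count(52) returns 24.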


Key insight: matching the prefix (Audacity spectrogram)

The challenge text says /flag.wav was produced by translating the flag with the AI translator. Since we also have access to the live translator, we can compare its output with flag.wav.

First, generate a WAV from a guessed prefix:

curl -s -H "Content-Type: application/json" \
     -d '{"text":"MCTF25{"}' \
     http://10.240.2.50:5000/translate \
     -o prefix.wav

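Then open both flag.wav and prefix.wav in Audacity and switch their tracks to Spectrogram view. Visually comparing the spectrograms shows that, after the start marker, the beeps of prefix.wav line up with the leading beeps of flag.wav: the first seven character pairs are identical.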

So we can confidently conclude that the flag starts with:

MCTF25{

This also confirms the encoding is deterministic and that each character is represented by a distinctive pair of beep frequencies. From there, we can use the translator as an oracle and brute-force the remaining characters by comparing beep pairs.
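
The visual comparison can also be expressed numerically once both files have been reduced to lists of frequency pairs. This is a rough sketch: the helper name matching_prefix_len and the 1 Hz tolerance are assumptions, and the input is a list of (f1, f2) tuples with the markers already stripped.

```python
def matching_prefix_len(flag_pairs, test_pairs, tol_hz=1.0):
    """Count how many leading (f1, f2) pairs agree within tol_hz."""
    n = 0
    for (f1, f2), (g1, g2) in zip(flag_pairs, test_pairs):
        if abs(f1 - g1) > tol_hz or abs(f2 - g2) > tol_hz:
            break
        n += 1
    return n
```

If the result equals the length of the guessed prefix, every guessed character matches the flag at its position.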


Strategy: use the translator as an oracle

Instead of trying to figure out the full mapping from frequencies to characters, we can do this character by character:

  1. We already know that the flag starts with MCTF25{.
  2. For each unknown position i in the flag:
    • We know the prefix up to i-1.
    • We guess the character at position i from a set of allowed characters (e.g. A–Z, a–z, 0–9, {, }, _).
    • For each candidate character c:
      1. Ask the translator to generate audio for <prefix_so_far><c>:
        curl -s -H "Content-Type: application/json" \
             -d '{"text":"<prefix><candidate>"}' \
             http://10.240.2.50:5000/translate \
             -o test.wav
        
      2. Extract the beep pair at position i from the generated test.wav.
      3. Compare it with the beep pair at position i in flag.wav.
    • When the pair matches, c is the correct character at position i.
  3. Append the found character to the prefix and repeat until we hit the closing }.

This reduces the problem to pattern matching, not full-blown audio decoding.
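
The steps above can be sketched as a single loop. In this illustration the oracle is passed in as a function that takes a text and returns its list of frequency pairs, so the logic can be tried offline; in the real solve that function would call /translate and analyse the returned WAV. The names recover and ALPHABET are hypothetical.

```python
import string

# Candidate characters, following the allowed set named in the strategy
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "{}_"

def recover(flag_pairs, oracle, prefix="", alphabet=ALPHABET):
    """Extend prefix one character at a time until '}' or the end of flag_pairs.

    flag_pairs: list of (f1, f2) tuples extracted from flag.wav
    oracle:     callable mapping a text to its list of (f1, f2) tuples
    """
    known = prefix
    while len(known) < len(flag_pairs) and not known.endswith("}"):
        pos = len(known)
        for c in alphabet:
            pairs = oracle(known + c)
            # Compare only the pair at the position being guessed
            if len(pairs) > pos and pairs[pos] == flag_pairs[pos]:
                known += c
                break
        else:
            raise LookupError(f"no candidate matched at position {pos}")
    return known
```

Each position costs at most one request per candidate character, so a 24-character flag needs at most a few thousand requests in the worst case.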


Recovering the full flag

We already know:

Index  0 1 2 3 4 5 6
Char   M C T F 2 5 {

Using the oracle approach and comparing beep pairs (helped by inspecting spectrograms in Audacity and confirming that each character maps to a unique frequency pair), we brute-forced each position from index 7 onward until we got the closing }.

Step by step, the recovered characters form:

MCTF25{t4Lk_b1rdY_t0_m3}

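This matches the length of 24 characters derived from the beep count, starts with the known prefix MCTF25{ and ends with the closing }.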


Final flag

MCTF25{t4Lk_b1rdY_t0_m3}

Appendix: Python scripts used

Below are the key Python scripts used during analysis and solving. Some were exploratory (Morse/binary attempts), others were part of the final oracle-based solution.


1. decode_beeps.py – initial Morse-style attempt (dead end)

#!/usr/bin/env python3
import wave
import struct
import sys
from collections import defaultdict

MORSE_TABLE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
    "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J",
    "-.-": "K", ".-..": "L", "--": "M", "-.": "N", "---": "O",
    ".--.": "P", "--.-": "Q", ".-.": "R", "...": "S", "-": "T",
    "..-": "U", "...-": "V", ".--": "W", "-..-": "X", "-.--": "Y",
    "--..": "Z",
    "-----": "0", ".----": "1", "..---": "2", "...--": "3", "....-": "4",
    ".....": "5", "-....": "6", "--...": "7", "---..": "8", "----.": "9",
}

def load_wav(path):
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            print("Warning: not mono, using first channel only")
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)
        sampwidth = wf.getsampwidth()
        framerate = wf.getframerate()
    if sampwidth != 2:
        raise RuntimeError(f"Unsupported sample width: {sampwidth} bytes")
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    return samples, framerate

def detect_beeps(samples, framerate, window_ms=10, threshold_factor=0.3):
    window_size = int(framerate * window_ms / 1000)
    if window_size <= 0:
        window_size = 1
    mags = []
    for i in range(0, len(samples), window_size):
        chunk = samples[i : i + window_size]
        if not chunk:
            break
        avg = sum(abs(s) for s in chunk) / len(chunk)
        mags.append(avg)
    max_mag = max(mags) if mags else 1.0
    threshold = max_mag * threshold_factor

    bits = [1 if m >= threshold else 0 for m in mags]
    return bits, window_size

def compress_runs(bits):
    runs = []
    if not bits:
        return runs
    cur = bits[0]
    length = 1
    for b in bits[1:]:
        if b == cur:
            length += 1
        else:
            runs.append((cur, length))
            cur = b
            length = 1
    runs.append((cur, length))
    return runs

def classify_morse(runs):
    beep_lengths = [l for v, l in runs if v == 1]
    gap_lengths  = [l for v, l in runs if v == 0]

    if not beep_lengths:
        print("No beeps detected")
        return ""

    min_beep = min(beep_lengths)
    min_gap = min(gap_lengths) if gap_lengths else min_beep

    dot_dash_boundary = min_beep * 1.5
    intra_letter_boundary = min_gap * 1.5
    letter_gap_boundary = min_gap * 3.5

    morse = []
    current_symbol = ""

    def flush_symbol():
        nonlocal current_symbol
        if current_symbol:
            morse.append(current_symbol)
            current_symbol = ""

    for value, length in runs:
        if value == 1:
            if length <= dot_dash_boundary:
                current_symbol += "."
            else:
                current_symbol += "-"
        else:
            if length <= intra_letter_boundary:
                pass
            elif length <= letter_gap_boundary:
                flush_symbol()
            else:
                flush_symbol()
                morse.append(" / ")

    flush_symbol()
    return " ".join(morse)

def morse_to_text(morse):
    out = []
    for token in morse.split(" "):
        if token == "/":
            out.append(" ")
        elif token.strip() == "":
            continue
        else:
            ch = MORSE_TABLE.get(token)
            if ch:
                out.append(ch)
            else:
                out.append("?")
    return "".join(out)

def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} flag.wav")
        sys.exit(1)
    path = sys.argv[1]
    samples, framerate = load_wav(path)
    bits, window_size = detect_beeps(samples, framerate)
    runs = compress_runs(bits)
    print(f"Detected {len(runs)} beep/silence runs")
    morse = classify_morse(runs)
    print("Morse guess:")
    print(morse)
    text = morse_to_text(morse)
    print("Decoded text guess:")
    print(text)

if __name__ == "__main__":
    main()

This attempt showed that the audio was not Morse (everything decoded as EEEEEE...), which pushed the analysis towards a binary or pitch-based encoding.


2. analyze_beeps.py – binary run-length inspection

#!/usr/bin/env python3
import wave, struct, sys
from math import gcd

def load_wav(path):
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            print("Warning: not mono, using first channel only")
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)
        sampwidth = wf.getsampwidth()
        framerate = wf.getframerate()
    if sampwidth != 2:
        raise RuntimeError(f"Unsupported sample width: {sampwidth} bytes")
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    return samples, framerate

def detect_beeps(samples, framerate, window_ms=5, threshold_factor=0.4):
    window_size = int(framerate * window_ms / 1000)
    if window_size <= 0:
        window_size = 1
    mags = []
    for i in range(0, len(samples), window_size):
        chunk = samples[i : i + window_size]
        if not chunk:
            break
        avg = sum(abs(s) for s in chunk) / len(chunk)
        mags.append(avg)
    max_mag = max(mags) if mags else 1.0
    threshold = max_mag * threshold_factor
    bits = [1 if m >= threshold else 0 for m in mags]
    return bits, window_size

def compress_runs(bits):
    runs = []
    if not bits:
        return runs
    cur = bits[0]
    length = 1
    for b in bits[1:]:
        if b == cur:
            length += 1
        else:
            runs.append((cur, length))
            cur = b
            length = 1
    runs.append((cur, length))
    return runs

def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} flag.wav")
        sys.exit(1)

    samples, framerate = load_wav(sys.argv[1])
    bits, ws = detect_beeps(samples, framerate)
    runs = compress_runs(bits)
    print(f"Window size: {ws} samples")
    print(f"Total runs: {len(runs)}")

    on_lengths = sorted(set(l for v, l in runs if v == 1))
    off_lengths = sorted(set(l for v, l in runs if v == 0))
    print("Unique ON lengths:", on_lengths)
    print("Unique OFF lengths:", off_lengths)

    g = 0
    for _, l in runs:
        g = gcd(g, l) if g else l
    if g == 0:
        print("No runs?")
        return

    print("GCD of run lengths:", g)
    norm = [(v, l // g) for v, l in runs]
    print("First 40 normalized runs (value, units):")
    print(norm[:40])

    bitstream = []
    for v, units in norm:
        bitstream.extend([str(v)] * units)
    bitstring = "".join(bitstream)
    print("First 160 bits of bitstream:")
    print(bitstring[:160])

    print("\nASCII attempt from bitstream (8 bits per char, from start):")
    chars = []
    for i in range(0, len(bitstring) - 7, 8):
        b = bitstring[i:i+8]
        val = int(b, 2)
        if 32 <= val <= 126:
            chars.append(chr(val))
        else:
            chars.append(".")
    ascii_guess = "".join(chars)
    print(ascii_guess[:80])

if __name__ == "__main__":
    main()

This showed very regular ON/OFF lengths, hinting that timing wasn’t carrying the main information; instead the frequencies were.
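
The "very regular lengths" observation can be turned into a quick check on the run list. This is a small sketch that takes runs in the same (value, length) format produced by compress_runs; the helper name timing_is_uniform and the tolerance of one window are assumptions.

```python
def timing_is_uniform(runs, value, tol=1):
    """True if all runs with the given value (1 = beep, 0 = silence)
    have lengths within tol windows of each other."""
    lengths = [length for v, length in runs if v == value]
    return bool(lengths) and max(lengths) - min(lengths) <= tol
```

Leading and trailing silence may be longer than the gaps between beeps, so it can make sense to drop the first and last run before checking the silences.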


3. decode_flag.py – frequency mapping via a training WAV

This version used NumPy + a training WAV (train.wav) generated by sending a known alphabet to /translate. It learned a mapping from beep frequency pairs → characters, then applied it to flag.wav.

#!/usr/bin/env python3
import wave, struct, sys
import numpy as np

# MUST match the text sent to /translate for train.wav
TRAIN_TEXT = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789{}_"

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    if sw != 2:
        raise RuntimeError(f"Unsupported sample width: {sw}")
    samples = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

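def segment_beeps(samples, sr, smooth_ms=5, thresh_ratio=0.3,
                  min_beep_ms=20, min_gap_ms=10):
    # thresh_ratio is a fraction of the peak envelope, not a duration.
    # min_gap_ms is accepted but currently unused.
    win = int(sr * smooth_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win)/win, mode="same")
    thr = env.max() * thresh_ratio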
    mask = env > thr

    beeps = []
    n = len(mask)
    i = 0
    while i < n:
        v = mask[i]
        j = i + 1
        while j < n and mask[j] == v:
            j += 1
        length = j - i
        if v:
            if length >= sr * min_beep_ms / 1000:
                beeps.append((i, j))
        i = j
    return beeps

def beep_freqs(samples, sr, beeps):
    freqs = []
    for start, end in beeps:
        length = end - start
        n = min(length, 2048)
        if n < 256:
            continue
        seg = samples[start:start+n]
        N = 1
        while N < n:
            N *= 2
        w = np.hanning(n)
        seg_win = seg * w
        spec = np.fft.rfft(seg_win, n=N)
        mag = np.abs(spec)
        mag[0] = 0
        k = np.argmax(mag)
        freq = k * sr / N
        freqs.append(round(freq, 1))
    return freqs

def pairs_from_freqs(freqs):
    mid = freqs[2:-2]
    assert len(mid) % 2 == 0
    return [(mid[i*2], mid[i*2+1]) for i in range(len(mid)//2)]

def main():
    flag_samples, sr1 = load_wav("flag.wav")
    train_samples, sr2 = load_wav("train.wav")
    if sr1 != sr2:
        raise RuntimeError("Sample rates differ")

    flag_beeps = segment_beeps(flag_samples, sr1)
    train_beeps = segment_beeps(train_samples, sr2)

    flag_freqs = beep_freqs(flag_samples, sr1, flag_beeps)
    train_freqs = beep_freqs(train_samples, sr2, train_beeps)

    print(f"Flag beeps: {len(flag_freqs)}, Train beeps: {len(train_freqs)}")

    flag_pairs = pairs_from_freqs(flag_freqs)
    train_pairs = pairs_from_freqs(train_freqs)

    print(f"Flag pairs: {len(flag_pairs)}, Train pairs: {len(train_pairs)}")
    if len(train_pairs) != len(TRAIN_TEXT):
        print("Warning: TRAIN_TEXT length and train_pairs length differ!")
        print(f"TRAIN_TEXT length: {len(TRAIN_TEXT)}")

    pair_map = {}
    for i, p in enumerate(train_pairs):
        if i >= len(TRAIN_TEXT):
            break
        key = (round(p[0], 1), round(p[1], 1))
        pair_map[key] = TRAIN_TEXT[i]

    decoded = []
    unknown = []
    for p in flag_pairs:
        key = (round(p[0], 1), round(p[1], 1))
        ch = pair_map.get(key, "?")
        decoded.append(ch)
        if ch == "?":
            unknown.append(key)

    decoded_text = "".join(decoded)
    print("Decoded flag guess:")
    print(decoded_text)
    if unknown:
        print("Unknown pairs (no mapping in training):")
        for u in sorted(set(unknown)):
            print(u)

if __name__ == "__main__":
    main()

This was useful to inspect how many pairs matched / didn’t match, but the final solve used a more direct oracle approach.
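
One likely reason for unmatched pairs is that the lookup above requires the rounded frequencies to be exactly equal. With a 2048-point FFT at 44100 Hz, one bin is about 21.5 Hz wide, so a peak that lands in a neighbouring bin produces a different key. A tolerant lookup is sketched below; the helper name nearest_char and the default tolerance of 25 Hz (slightly more than one bin) are assumptions, not values taken from the service.

```python
def nearest_char(pair, pair_map, tol_hz=25.0):
    """Return the character whose trained (f1, f2) pair is closest to pair,
    or "?" if even the closest one differs by more than tol_hz."""
    best_char, best_dist = None, None
    for (f1, f2), ch in pair_map.items():
        # Distance is the larger of the two per-beep frequency differences
        dist = max(abs(pair[0] - f1), abs(pair[1] - f2))
        if best_dist is None or dist < best_dist:
            best_char, best_dist = ch, dist
    if best_dist is not None and best_dist <= tol_hz:
        return best_char
    return "?"
```

The tolerance has to stay below the spacing between the frequencies of different characters, otherwise neighbouring characters become indistinguishable.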


4. analyze_flag.py – extracting frequency pairs from flag.wav

This script extracts all character beep pairs from flag.wav and saves them to flag_pairs.txt:

#!/usr/bin/env python3
import wave, struct, numpy as np

SR = 44100

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    assert sw == 2
    samples = np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

def segment_beeps(samples, sr, win_ms=10, thresh_ratio=0.2):
    win = int(sr * win_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    thr = env.max() * thresh_ratio
    mask = env > thr

    segs = []
    n = len(mask)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        segs.append((start, i))
    return segs

def dominant_freq(samples, sr, start, end):
    seg = samples[start:end]
    n = len(seg)
    if n < 200:
        return None
    win = np.hanning(n)
    segw = seg * win
    N = 1
    while N < n:
        N *= 2
    spec = np.fft.rfft(segw, n=N)
    mag = np.abs(spec)
    mag[0] = 0
    k = np.argmax(mag)
    return round(k * sr / N, 1)

def main():
    samples, sr = load_wav("flag.wav")
    segs = segment_beeps(samples, sr)
    freqs = [dominant_freq(samples, sr, s, e) for s, e in segs]
    freqs = [f for f in freqs if f is not None]

    print(f"Total beep freqs: {len(freqs)}")
    start_pair = tuple(freqs[:2])
    end_pair = tuple(freqs[-2:])
    mid = freqs[2:-2]
    pairs = [tuple(mid[i*2:(i+1)*2]) for i in range(len(mid)//2)]

    print("Start marker:", start_pair)
    print("End marker:  ", end_pair)
    print("Char pairs (index: (f1, f2)):")
    for i, p in enumerate(pairs):
        print(i, p)

    with open("flag_pairs.txt", "w") as f:
        for p in pairs:
            f.write(f"{p[0]} {p[1]}\n")

if __name__ == "__main__":
    main()
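
One way to sanity-check the segmentation and FFT code before trusting it on flag.wav is to feed it a synthetic file with known frequencies. The sketch below writes such a file with the standard library only; the function name write_beeps, the beep and gap durations and the amplitude are arbitrary choices for illustration and not the timings used by the service.

```python
import math
import struct
import wave

def write_beeps(path, freqs, sr=44100, beep_ms=100, gap_ms=50, amp=0.5):
    """Write a mono 16-bit WAV with one sine beep per entry in freqs,
    each followed by a short silence. Returns the number of frames written."""
    beep_n = sr * beep_ms // 1000
    gap_n = sr * gap_ms // 1000
    frames = bytearray()
    for f in freqs:
        for i in range(beep_n):
            sample = amp * math.sin(2 * math.pi * f * i / sr)
            frames += struct.pack("<h", int(sample * 32767))
        frames += b"\x00\x00" * gap_n  # silence between beeps
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sr)
        wf.writeframes(bytes(frames))
    return len(frames) // 2
```

Running the extraction on such a file should report frequencies close to the ones passed in, within the resolution of the FFT.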

5. test_char.py – oracle-based brute force of each character

This script is the core of the final solution: for a given known prefix and target index, it brute-forces the next character by comparing beep pairs to flag.wav.

#!/usr/bin/env python3
import wave, struct, numpy as np
import json, sys, subprocess

SR = 44100

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    assert sw == 2
    samples = np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

def segment_beeps(samples, sr, win_ms=10, thresh_ratio=0.2):
    win = int(sr * win_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    thr = env.max() * thresh_ratio
    mask = env > thr
    segs = []
    n = len(mask)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        segs.append((start, i))
    return segs

def dominant_freq(samples, sr, start, end):
    seg = samples[start:end]
    n = len(seg)
    if n < 200:
        return None
    win = np.hanning(n)
    segw = seg * win
    N = 1
    while N < n:
        N *= 2
    spec = np.fft.rfft(segw, n=N)
    mag = np.abs(spec)
    mag[0] = 0
    k = np.argmax(mag)
    return round(k * sr / N, 1)

def get_pairs(path):
    samples, sr = load_wav(path)
    segs = segment_beeps(samples, sr)
    freqs = [dominant_freq(samples, sr, s, e) for s, e in segs]
    freqs = [f for f in freqs if f is not None]
    mid = freqs[2:-2]
    pairs = [tuple(mid[i*2:(i+1)*2]) for i in range(len(mid)//2)]
    return pairs

def main():
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <known_prefix> <position_index>")
        sys.exit(1)

    prefix = sys.argv[1]
    pos = int(sys.argv[2])

    flag_pairs = []
    with open("flag_pairs.txt") as f:
        for line in f:
            a, b = line.strip().split()
            flag_pairs.append((float(a), float(b)))
    target_pair = flag_pairs[pos]

    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789{}_"

    for ch in alphabet:
        text = prefix + ch
        print(f"Trying {ch} ...", end="", flush=True)

        payload = json.dumps({"text": text})
        cmd = [
            "curl", "-s", "-H", "Content-Type: application/json",
            "-d", payload, "http://10.240.2.50:5000/translate",
            "-o", "test.wav"
        ]
        subprocess.run(cmd, check=True)

        pairs = get_pairs("test.wav")
        if len(pairs) <= pos:
            print(" (too short)")
            continue
        if pairs[pos] == target_pair:
            print(" MATCH")
            print(f"Found char at position {pos}: {ch}")
            return
        else:
            print(" no")

    print("No match found in alphabet")

if __name__ == "__main__":
    main()

Usage example:

# After generating flag_pairs.txt with analyze_flag.py
# and knowing the prefix "MCTF25{"

python3 test_char.py 'MCTF25{' 7     # find char at index 7
python3 test_char.py 'MCTF25{t' 8    # then with updated prefix, etc.

Repeating this for each position eventually yielded the full flag:

MCTF25{t4Lk_b1rdY_t0_m3}