CTF-Writeups

AI Translator

Category: Audio / Web
Author: Pētēris #0001
Flag: MCTF25{t4Lk_b1rdY_t0_m3}

Challenge description

I found this dusty old AI translator device in a pawn shop, so I tried translating the flag, and it worked! Later I wanted to play with it some more, but totally forgot what the flag is. Now all I have is /flag.wav, but maybe you can help me translate it back?
IP: 10.240.2.50
Hint: AI audio translator / /flag.wav on the box.

The goal: recover the original MCTF25{...} flag from a WAV file that was generated by an “AI translator” web service.

Recon

First, scan the box to see what’s running:

nmap -sC -sV 10.240.2.50

Result (relevant part):

5000/tcp open  http    Werkzeug httpd 3.1.3 (Python 3.12.12)
|_http-title: AI Translator
|_http-server-header: Werkzeug/3.1.3 Python/3.12.12

So it’s a small Python web app (very likely Flask) on port 5000, serving the “AI Translator”.

Grab the WAV file mentioned in the description:

curl -v http://10.240.2.50:5000/flag.wav -o flag.wav
file flag.wav

Output:

flag.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz

So /flag.wav is indeed a valid audio file.
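The same header fields can also be read from Python with the standard-library wave module, which is a convenient sanity check before running any analysis script on the file. This is a minimal sketch; the helper name wav_info is ours for illustration and is not part of the challenge.

```python
import wave

def wav_info(path):
    """Return the basic PCM parameters of a WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

Calling wav_info("flag.wav") should agree with the output of file above: 1 channel, 2 bytes per sample, 44100 Hz.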

Understanding the web app

Check the main page:

curl -s http://10.240.2.50:5000/ | head -n 80

Relevant part of the HTML/JS:

<form id="form">
  <label for="inputText">Enter your text:</label>
  <textarea id="text" required maxlength="100"></textarea>
  <button type="submit">Translate</button>
  ...
</form>

<script>
document.getElementById('form').addEventListener('submit', async function(e) {
  e.preventDefault();
  const text = document.getElementById('text').value;
  const resp = await fetch('/translate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });
  const blob = await resp.blob();
  ...
});
</script>

Key points:

The app only supports text → audio:
- POST /translate
- Content-Type: application/json
- Body is { "text": "<your text>" }
The response is a WAV file (like flag.wav).

So we cannot just POST audio and get text back. Instead, we need to reverse what the translator does.
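For scripting, the request the page sends can be reproduced with the Python standard library. The sketch below mirrors the fetch call from the page's JavaScript; the function names and the timeout value are our own choices, and we assume the response body is the raw WAV data, as the blob handling in the page suggests.

```python
import json
import urllib.request

BASE_URL = "http://10.240.2.50:5000"  # challenge host from the description

def build_translate_request(text, base_url=BASE_URL):
    """Build the same POST /translate request the page's JavaScript sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/translate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text, out_path, base_url=BASE_URL, timeout=10):
    """Send the text to the translator and save the returned bytes (assumed WAV)."""
    req = build_translate_request(text, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, translate("MCTF25{", "prefix.wav") would save the audio for that text to prefix.wav.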

Inspecting the audio

On the VM, play or view the audio:

# One of these, depending on what’s installed
aplay flag.wav
# or
ffplay flag.wav

Subjective analysis:

The sound is a sequence of short, high-pitched beeps.
Each beep has (roughly) the same duration.
The information seems to be encoded in pitch, not in speech or Morse timing.

You can also open flag.wav in Audacity, switch the track to Spectrogram view, and visually see the repeating beep “bars” that hint at a structured encoding.
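If no spectrogram viewer is at hand, the pitch of a single clean beep can be roughly estimated by counting zero crossings. Note that this is a cruder technique than the FFT used in the appendix scripts and only works for a single pure tone; the helper names and the synthetic test tone below are ours, used purely for illustration.

```python
import math

def zero_crossing_freq(samples, sample_rate):
    """Estimate the frequency of a clean tone from its sign changes."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    # A sine wave crosses zero twice per cycle.
    return crossings / (2 * duration)

def sine(freq, sample_rate, seconds):
    """Synthetic tone standing in for one beep."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]
```

On a 0.1 s synthetic tone the estimate lands within a few Hz of the true frequency, which is enough to tell two clearly different pitches apart but not a replacement for the FFT-based measurement.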

With some analysis tooling (Python, FFT, etc.), you can observe:

There are 52 beeps in total.
The pattern is:
- 2 beeps at the start (start marker)
- 2 beeps at the end (end marker)
- In between: beeps come in pairs → 2 beeps per character

That gives:

(52 total beeps - 4 marker beeps) / 2 = 24 characters

So the flag encoded in flag.wav is 24 characters long.
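The counting step can be sketched in plain Python: build a coarse amplitude envelope, count the quiet-to-loud transitions, then apply the marker arithmetic. The window size and threshold are assumptions that happened to suit evenly spaced beeps, and synth only generates a synthetic stand-in signal, not the real flag.wav.

```python
import math

def count_beeps(samples, sample_rate, window_ms=10, threshold_factor=0.3):
    """Count bursts of sound using a coarse amplitude envelope."""
    window = max(1, int(sample_rate * window_ms / 1000))
    env = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        env.append(sum(abs(s) for s in chunk) / len(chunk))
    if not env:
        return 0
    threshold = max(env) * threshold_factor
    loud = [e > threshold for e in env]
    # A beep starts wherever a quiet window is followed by a loud one.
    return sum(1 for prev, cur in zip([False] + loud, loud) if cur and not prev)

def char_count(total_beeps, marker_beeps=4, beeps_per_char=2):
    """Number of encoded characters once start/end markers are removed."""
    payload = total_beeps - marker_beeps
    if payload < 0 or payload % beeps_per_char:
        raise ValueError("beep count does not fit the marker/pair layout")
    return payload // beeps_per_char

def synth(n_beeps, sample_rate=8000, beep_ms=100, gap_ms=50, freq=1000):
    """Synthetic test signal: equal beeps separated by silence."""
    beep = [math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(sample_rate * beep_ms // 1000)]
    gap = [0.0] * (sample_rate * gap_ms // 1000)
    out = []
    for _ in range(n_beeps):
        out += beep + gap
    return out
```

With 52 detected beeps, char_count(52) gives the 24 characters computed above.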

Key insight: matching the prefix (Audacity spectrogram)

The challenge text says /flag.wav was produced by translating the flag with the AI translator. Since we also have access to the live translator, we can compare its output with flag.wav.

First, generate a WAV from a guessed prefix:

curl -s -H "Content-Type: application/json" -d '{"text":"MCTF25{"}' http://10.240.2.50:5000/translate -o prefix.wav

Then open both flag.wav and prefix.wav in Audacity and switch their tracks to Spectrogram view. Visually comparing the spectrograms shows:

The start marker beeps look identical.
The pattern of beep “bars” for MCTF25{ in prefix.wav matches exactly the beginning of flag.wav.

So we can confidently conclude that the flag starts with:

MCTF25{

This also shows that the encoding is deterministic and that the beeps for a prefix do not change when more text is appended, with each character contributing one pair of beep frequencies. From there, we can use the translator as an oracle and brute-force the remaining characters by comparing beep pairs.

Strategy: use the translator as an oracle

Instead of trying to figure out the full mapping from frequencies to characters, we can do this character by character:

We already know that the flag starts with MCTF25{.
For each unknown position i in the flag:
- We know the prefix up to i-1.
- We guess the character at position i from a set of allowed characters (e.g. A–Z, a–z, 0–9, {, }, _).
- For each candidate character c:
  1. Ask the translator to generate audio for <prefix_so_far><c>:
    curl -s -H "Content-Type: application/json" -d '{"text":"<prefix><candidate>"}' http://10.240.2.50:5000/translate -o test.wav
  2. Extract the beep pair at position i from the generated test.wav.
  3. Compare it with the beep pair at position i in flag.wav.
- When the pair matches, c is the correct character at position i.
Append the found character to the prefix and repeat until we hit the closing }.

This reduces the problem to pattern matching, not full-blown audio decoding.
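The loop described above can be sketched independently of the audio handling. Here oracle is a hypothetical callable that takes a text and returns the list of character frequency pairs the translator produces for it (markers already stripped); the 1 Hz tolerance is our assumption, not a measured property of the service.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "{}_"

def pairs_close(p, q, tol_hz=1.0):
    """Compare two (f1, f2) pairs with a small tolerance in Hz."""
    return all(abs(a - b) <= tol_hz for a, b in zip(p, q))

def recover(flag_pairs, prefix, oracle, alphabet=ALPHABET):
    """Extend `prefix` one character at a time until `}` or a dead end.

    `oracle(text)` is a hypothetical helper returning the character
    pairs the translator emits for `text`.
    """
    known = prefix
    while len(known) < len(flag_pairs) and not known.endswith("}"):
        pos = len(known)
        for ch in alphabet:
            pairs = oracle(known + ch)
            if len(pairs) > pos and pairs_close(pairs[pos], flag_pairs[pos]):
                known += ch
                break
        else:
            # No candidate matched: the alphabet or the tolerance is wrong.
            raise ValueError(f"no candidate matched at position {pos}")
    return known
```

In the worst case this costs one request per alphabet character per position, i.e. at most 65 requests for each of the 17 unknown characters.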

Recovering the full flag

We already know:

Index  0 1 2 3 4 5 6
Char   M C T F 2 5 {

Using the oracle approach and comparing beep pairs (helped by inspecting spectrograms in Audacity and confirming that each character maps to a unique frequency pair), we brute-forced each position from index 7 onward until we got the closing }.

Step by step, the recovered characters form:

MCTF25{t4Lk_b1rdY_t0_m3}

This matches:

Length: 24 characters (consistent with (52 - 4) / 2).
Proper flag format: MCTF25{...}.
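Both checks are easy to automate. A small sketch follows; the character class in the regular expression is an assumption based on the alphabet we brute-forced over, not an official flag specification.

```python
import re

FLAG = "MCTF25{t4Lk_b1rdY_t0_m3}"
TOTAL_BEEPS = 52

def looks_valid(flag, total_beeps):
    """Check the flag format and that its length fits the beep count."""
    well_formed = re.fullmatch(r"MCTF25\{[A-Za-z0-9_]+\}", flag) is not None
    expected_len = (total_beeps - 4) // 2  # 2 start + 2 end marker beeps
    return well_formed and len(flag) == expected_len
```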

Final flag

MCTF25{t4Lk_b1rdY_t0_m3}

Appendix: Python scripts used

Below are the key Python scripts used during analysis and solving. Some were exploratory (Morse/binary attempts), others were part of the final oracle-based solution.

1. `decode_beeps.py` – initial Morse-style attempt (dead end)

#!/usr/bin/env python3
import struct
import sys
import wave

MORSE_TABLE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
    "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J",
    "-.-": "K", ".-..": "L", "--": "M", "-.": "N", "---": "O",
    ".--.": "P", "--.-": "Q", ".-.": "R", "...": "S", "-": "T",
    "..-": "U", "...-": "V", ".--": "W", "-..-": "X", "-.--": "Y",
    "--..": "Z",
    "-----": "0", ".----": "1", "..---": "2", "...--": "3", "....-": "4",
    ".....": "5", "-....": "6", "--...": "7", "---..": "8", "----.": "9",
}

def load_wav(path):
    with wave.open(path, "rb") as wf:
        nchannels = wf.getnchannels()
        if nchannels != 1:
            print("Warning: not mono, using first channel only")
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)
        sampwidth = wf.getsampwidth()
        framerate = wf.getframerate()
    if sampwidth != 2:
        raise RuntimeError(f"Unsupported sample width: {sampwidth} bytes")
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    # Channels are interleaved, so keep every nchannels-th sample.
    samples = samples[::nchannels]
    return samples, framerate

def detect_beeps(samples, framerate, window_ms=10, threshold_factor=0.3):
    window_size = int(framerate * window_ms / 1000)
    if window_size <= 0:
        window_size = 1
    mags = []
    for i in range(0, len(samples), window_size):
        chunk = samples[i : i + window_size]
        if not chunk:
            break
        avg = sum(abs(s) for s in chunk) / len(chunk)
        mags.append(avg)
    max_mag = max(mags) if mags else 1.0
    threshold = max_mag * threshold_factor

    bits = [1 if m >= threshold else 0 for m in mags]
    return bits, window_size

def compress_runs(bits):
    runs = []
    if not bits:
        return runs
    cur = bits[0]
    length = 1
    for b in bits[1:]:
        if b == cur:
            length += 1
        else:
            runs.append((cur, length))
            cur = b
            length = 1
    runs.append((cur, length))
    return runs

def classify_morse(runs):
    beep_lengths = [l for v, l in runs if v == 1]
    gap_lengths  = [l for v, l in runs if v == 0]

    if not beep_lengths:
        print("No beeps detected")
        return ""

    min_beep = min(beep_lengths)
    min_gap = min(gap_lengths) if gap_lengths else min_beep

    dot_dash_boundary = min_beep * 1.5
    intra_letter_boundary = min_gap * 1.5
    letter_gap_boundary = min_gap * 3.5

    morse = []
    current_symbol = ""

    def flush_symbol():
        nonlocal current_symbol
        if current_symbol:
            morse.append(current_symbol)
            current_symbol = ""

    for value, length in runs:
        if value == 1:
            if length <= dot_dash_boundary:
                current_symbol += "."
            else:
                current_symbol += "-"
        else:
            if length <= intra_letter_boundary:
                pass
            elif length <= letter_gap_boundary:
                flush_symbol()
            else:
                flush_symbol()
                morse.append("/")  # word gap; tokens are joined with single spaces

    flush_symbol()
    return " ".join(morse)

def morse_to_text(morse):
    out = []
    for token in morse.split(" "):
        if token == "/":
            out.append(" ")
        elif token.strip() == "":
            continue
        else:
            ch = MORSE_TABLE.get(token)
            if ch:
                out.append(ch)
            else:
                out.append("?")
    return "".join(out)

def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} flag.wav")
        sys.exit(1)
    path = sys.argv[1]
    samples, framerate = load_wav(path)
    bits, window_size = detect_beeps(samples, framerate)
    runs = compress_runs(bits)
    print(f"Detected {len(runs)} beep/silence runs")
    morse = classify_morse(runs)
    print("Morse guess:")
    print(morse)
    text = morse_to_text(morse)
    print("Decoded text guess:")
    print(text)

if __name__ == "__main__":
    main()

The output was a long run of E's (EEEEEE...), meaning every beep was classified as a dot. That ruled out Morse and pushed the analysis towards a binary or pitch-based encoding.

2. `analyze_beeps.py` – binary run-length inspection

#!/usr/bin/env python3
import wave, struct, sys
from math import gcd

def load_wav(path):
    with wave.open(path, "rb") as wf:
        nchannels = wf.getnchannels()
        if nchannels != 1:
            print("Warning: not mono, using first channel only")
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)
        sampwidth = wf.getsampwidth()
        framerate = wf.getframerate()
    if sampwidth != 2:
        raise RuntimeError(f"Unsupported sample width: {sampwidth} bytes")
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    # Channels are interleaved, so keep every nchannels-th sample.
    samples = samples[::nchannels]
    return samples, framerate

def detect_beeps(samples, framerate, window_ms=5, threshold_factor=0.4):
    window_size = int(framerate * window_ms / 1000)
    if window_size <= 0:
        window_size = 1
    mags = []
    for i in range(0, len(samples), window_size):
        chunk = samples[i : i + window_size]
        if not chunk:
            break
        avg = sum(abs(s) for s in chunk) / len(chunk)
        mags.append(avg)
    max_mag = max(mags) if mags else 1.0
    threshold = max_mag * threshold_factor
    bits = [1 if m >= threshold else 0 for m in mags]
    return bits, window_size

def compress_runs(bits):
    runs = []
    if not bits:
        return runs
    cur = bits[0]
    length = 1
    for b in bits[1:]:
        if b == cur:
            length += 1
        else:
            runs.append((cur, length))
            cur = b
            length = 1
    runs.append((cur, length))
    return runs

def main():
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} flag.wav")
        sys.exit(1)

    samples, framerate = load_wav(sys.argv[1])
    bits, ws = detect_beeps(samples, framerate)
    runs = compress_runs(bits)
    print(f"Window size: {ws} samples")
    print(f"Total runs: {len(runs)}")

    on_lengths = sorted(set(l for v, l in runs if v == 1))
    off_lengths = sorted(set(l for v, l in runs if v == 0))
    print("Unique ON lengths:", on_lengths)
    print("Unique OFF lengths:", off_lengths)

    g = 0
    for _, l in runs:
        g = gcd(g, l) if g else l
    if g == 0:
        print("No runs?")
        return

    print("GCD of run lengths:", g)
    norm = [(v, l // g) for v, l in runs]
    print("First 40 normalized runs (value, units):")
    print(norm[:40])

    bitstream = []
    for v, units in norm:
        bitstream.extend([str(v)] * units)
    bitstring = "".join(bitstream)
    print("First 160 bits of bitstream:")
    print(bitstring[:160])

    print("\nASCII attempt from bitstream (8 bits per char, from start):")
    chars = []
    for i in range(0, len(bitstring) - 7, 8):
        b = bitstring[i:i+8]
        val = int(b, 2)
        if 32 <= val <= 126:
            chars.append(chr(val))
        else:
            chars.append(".")
    ascii_guess = "".join(chars)
    print(ascii_guess[:80])

if __name__ == "__main__":
    main()

This showed very regular ON/OFF lengths, hinting that timing wasn’t carrying the main information; instead the frequencies were.

3. `decode_flag.py` – frequency mapping via a training WAV

This version used NumPy + a training WAV (train.wav) generated by sending a known alphabet to /translate. It learned a mapping from beep frequency pairs → characters, then applied it to flag.wav.

#!/usr/bin/env python3
import wave
import numpy as np

# MUST match the text sent to /translate for train.wav
TRAIN_TEXT = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789{}_"

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    if sw != 2:
        raise RuntimeError(f"Unsupported sample width: {sw}")
    samples = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

def segment_beeps(samples, sr, smooth_ms=5, thresh_ratio=0.3,
                  min_beep_ms=20):
    # thresh_ratio is a fraction of the peak envelope, not a duration.
    win = int(sr * smooth_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win)/win, mode="same")
    thr = env.max() * thresh_ratio
    mask = env > thr

    beeps = []
    n = len(mask)
    i = 0
    while i < n:
        v = mask[i]
        j = i + 1
        while j < n and mask[j] == v:
            j += 1
        length = j - i
        if v:
            if length >= sr * min_beep_ms / 1000:
                beeps.append((i, j))
        i = j
    return beeps

def beep_freqs(samples, sr, beeps):
    freqs = []
    for start, end in beeps:
        length = end - start
        n = min(length, 2048)
        if n < 256:
            continue
        seg = samples[start:start+n]
        N = 1
        while N < n:
            N *= 2
        w = np.hanning(n)
        seg_win = seg * w
        spec = np.fft.rfft(seg_win, n=N)
        mag = np.abs(spec)
        mag[0] = 0
        k = np.argmax(mag)
        freq = k * sr / N
        freqs.append(round(freq, 1))
    return freqs

def pairs_from_freqs(freqs):
    mid = freqs[2:-2]
    assert len(mid) % 2 == 0
    return [(mid[i*2], mid[i*2+1]) for i in range(len(mid)//2)]

def main():
    flag_samples, sr1 = load_wav("flag.wav")
    train_samples, sr2 = load_wav("train.wav")
    if sr1 != sr2:
        raise RuntimeError("Sample rates differ")

    flag_beeps = segment_beeps(flag_samples, sr1)
    train_beeps = segment_beeps(train_samples, sr2)

    flag_freqs = beep_freqs(flag_samples, sr1, flag_beeps)
    train_freqs = beep_freqs(train_samples, sr2, train_beeps)

    print(f"Flag beeps: {len(flag_freqs)}, Train beeps: {len(train_freqs)}")

    flag_pairs = pairs_from_freqs(flag_freqs)
    train_pairs = pairs_from_freqs(train_freqs)

    print(f"Flag pairs: {len(flag_pairs)}, Train pairs: {len(train_pairs)}")
    if len(train_pairs) != len(TRAIN_TEXT):
        print("Warning: TRAIN_TEXT length and train_pairs length differ!")
        print(f"TRAIN_TEXT length: {len(TRAIN_TEXT)}")

    pair_map = {}
    for i, p in enumerate(train_pairs):
        if i >= len(TRAIN_TEXT):
            break
        key = (round(p[0], 1), round(p[1], 1))
        pair_map[key] = TRAIN_TEXT[i]

    decoded = []
    unknown = []
    for p in flag_pairs:
        key = (round(p[0], 1), round(p[1], 1))
        ch = pair_map.get(key, "?")
        decoded.append(ch)
        if ch == "?":
            unknown.append(key)

    decoded_text = "".join(decoded)
    print("Decoded flag guess:")
    print(decoded_text)
    if unknown:
        print("Unknown pairs (no mapping in training):")
        for u in sorted(set(unknown)):
            print(u)

if __name__ == "__main__":
    main()

This was useful to inspect how many pairs matched / didn’t match, but the final solve used a more direct oracle approach.
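One possible refinement, not something the script above does: an exact dictionary lookup on rounded frequencies fails as soon as a peak moves by a single FFT bin, so a nearest-match lookup with a tolerance is more forgiving. The sketch below assumes a pair_map shaped like the one built in decode_flag.py; the 2 Hz tolerance is an arbitrary illustrative value.

```python
def nearest_char(pair, pair_map, tol_hz=2.0):
    """Return the character whose training pair is closest to `pair`.

    Returns "?" if no training pair lies within `tol_hz` on both beeps.
    """
    best_ch, best_dist = "?", None
    for (f1, f2), ch in pair_map.items():
        d1, d2 = abs(f1 - pair[0]), abs(f2 - pair[1])
        if d1 > tol_hz or d2 > tol_hz:
            continue
        dist = d1 + d2
        if best_dist is None or dist < best_dist:
            best_ch, best_dist = ch, dist
    return best_ch

def decode(pairs, pair_map, tol_hz=2.0):
    """Decode a sequence of (f1, f2) pairs, using "?" for unknown pairs."""
    return "".join(nearest_char(p, pair_map, tol_hz) for p in pairs)
```

The tolerance has to stay well below the spacing between the frequencies of neighbouring characters, otherwise distinct characters would start to collide.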

4. `analyze_flag.py` – extracting frequency pairs from `flag.wav`

This script extracts all character beep pairs from flag.wav and saves them to flag_pairs.txt:

#!/usr/bin/env python3
import wave
import numpy as np

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    assert sw == 2
    samples = np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

def segment_beeps(samples, sr, win_ms=10, thresh_ratio=0.2):
    win = int(sr * win_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    thr = env.max() * thresh_ratio
    mask = env > thr

    segs = []
    n = len(mask)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        segs.append((start, i))
    return segs

def dominant_freq(samples, sr, start, end):
    seg = samples[start:end]
    n = len(seg)
    if n < 200:
        return None
    win = np.hanning(n)
    segw = seg * win
    N = 1
    while N < n:
        N *= 2
    spec = np.fft.rfft(segw, n=N)
    mag = np.abs(spec)
    mag[0] = 0
    k = np.argmax(mag)
    return round(k * sr / N, 1)

def main():
    samples, sr = load_wav("flag.wav")
    segs = segment_beeps(samples, sr)
    freqs = [dominant_freq(samples, sr, s, e) for s, e in segs]
    freqs = [f for f in freqs if f is not None]

    print(f"Total beep freqs: {len(freqs)}")
    start_pair = tuple(freqs[:2])
    end_pair = tuple(freqs[-2:])
    mid = freqs[2:-2]
    pairs = [tuple(mid[i*2:(i+1)*2]) for i in range(len(mid)//2)]

    print("Start marker:", start_pair)
    print("End marker:  ", end_pair)
    print("Char pairs (index: (f1, f2)):")
    for i, p in enumerate(pairs):
        print(i, p)

    with open("flag_pairs.txt", "w") as f:
        for p in pairs:
            f.write(f"{p[0]} {p[1]}\n")

if __name__ == "__main__":
    main()

5. `test_char.py` – oracle-based brute force of each character

This script is the core of the final solution: for a given known prefix and target index, it brute-forces the next character by comparing beep pairs to flag.wav.

#!/usr/bin/env python3
import json
import subprocess
import sys
import wave

import numpy as np

def load_wav(path):
    with wave.open(path, "rb") as wf:
        fr = wf.getframerate()
        n = wf.getnframes()
        ch = wf.getnchannels()
        sw = wf.getsampwidth()
        data = wf.readframes(n)
    assert sw == 2
    samples = np.frombuffer(data, dtype="<i2").astype(np.float32) / 32768.0
    if ch > 1:
        samples = samples.reshape(-1, ch)[:, 0]
    return samples, fr

def segment_beeps(samples, sr, win_ms=10, thresh_ratio=0.2):
    win = int(sr * win_ms / 1000)
    if win < 1:
        win = 1
    env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
    thr = env.max() * thresh_ratio
    mask = env > thr
    segs = []
    n = len(mask)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        segs.append((start, i))
    return segs

def dominant_freq(samples, sr, start, end):
    seg = samples[start:end]
    n = len(seg)
    if n < 200:
        return None
    win = np.hanning(n)
    segw = seg * win
    N = 1
    while N < n:
        N *= 2
    spec = np.fft.rfft(segw, n=N)
    mag = np.abs(spec)
    mag[0] = 0
    k = np.argmax(mag)
    return round(k * sr / N, 1)

def get_pairs(path):
    samples, sr = load_wav(path)
    segs = segment_beeps(samples, sr)
    freqs = [dominant_freq(samples, sr, s, e) for s, e in segs]
    freqs = [f for f in freqs if f is not None]
    mid = freqs[2:-2]
    pairs = [tuple(mid[i*2:(i+1)*2]) for i in range(len(mid)//2)]
    return pairs

def main():
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <known_prefix> <position_index>")
        sys.exit(1)

    prefix = sys.argv[1]
    pos = int(sys.argv[2])

    flag_pairs = []
    with open("flag_pairs.txt") as f:
        for line in f:
            a, b = line.strip().split()
            flag_pairs.append((float(a), float(b)))
    target_pair = flag_pairs[pos]

    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789{}_"

    for ch in alphabet:
        text = prefix + ch
        print(f"Trying {ch} ...", end="", flush=True)

        payload = json.dumps({"text": text})
        cmd = [
            "curl", "-s", "-H", "Content-Type: application/json",
            "-d", payload, "http://10.240.2.50:5000/translate",
            "-o", "test.wav"
        ]
        subprocess.run(cmd, check=True)

        pairs = get_pairs("test.wav")
        if len(pairs) <= pos:
            print(" (too short)")
            continue
        if pairs[pos] == target_pair:
            print(" MATCH")
            print(f"Found char at position {pos}: {ch}")
            return
        else:
            print(" no")

    print("No match found in alphabet")

if __name__ == "__main__":
    main()

Usage example:

# After generating flag_pairs.txt with analyze_flag.py
# and knowing the prefix "MCTF25{"

python3 test_char.py 'MCTF25{' 7     # find char at index 7
python3 test_char.py 'MCTF25{t' 8    # then with updated prefix, etc.

Repeating this for each position eventually yielded the full flag:

MCTF25{t4Lk_b1rdY_t0_m3}