IA174

Assignment #2 - Merkled Åmgard

Update (2/11 9:30): Due to a mistake on our side the API was not responding correctly till the 1st of November at 11:00, when we fixed it. We apologize and postpone the deadline by one day. More details are in the discussion forum.

Deadline: 15.11.2023 23:59:59 Brno time
Points: 10 points
Responsible contact: Jan Kvapil <408788@mail.muni.cz>
Discussion forum: here
Submit: here

Task

The aim of this task is to access the contents of a secret file. The secret file is stored on our server. You can interact with the server by sending it commands through an API. However, the commands need to be authenticated with a custom hash-based authentication scheme. To access the secret file you will need to break this authentication scheme and its implementation. The authentication is based on the Merkle-Damgård construction. The hash-function has all the properties (strengths and weaknesses, such as the length extension attack) you would expect from such an MD-based function.

Solving this task does not require deep analysis of the hash-function itself, only of its general structure/construction.
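To make the general structure concrete, here is a minimal sketch of a Merkle-Damgård style hash and of a length extension against it. Everything in it is an illustrative assumption: the compression function, the 8-byte block size, the padding layout, the names toy_hash and md_padding, and the sample secret. It is not the hash from merkled_amgard.py, and its padding need not match the one used on the server.

```python
import struct

BLOCK = 8                      # toy block size in bytes (assumption)
IV = 0x0123456789ABCDEF        # toy initial state (assumption)
MASK = (1 << 64) - 1


def compress(state: int, block: bytes) -> int:
    """Toy compression function: mix one 8-byte block into a 64-bit state."""
    (word,) = struct.unpack(">Q", block)
    state ^= word
    state = (state * 0x9E3779B97F4A7C15 + 0x7F4A7C15) & MASK
    state ^= state >> 29
    return state


def md_padding(msg_len: int) -> bytes:
    """0x80, zero bytes up to a block boundary, then the length in bytes."""
    zeros = b"\x00" * (-(msg_len + 1) % BLOCK)
    return b"\x80" + zeros + struct.pack(">Q", msg_len)


def toy_hash(msg: bytes, state: int = IV, prior_len: int = 0) -> int:
    """Hash msg block by block; state and prior_len allow resuming."""
    data = msg + md_padding(prior_len + len(msg))
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state  # the digest is the final internal state


# "Server" side: tag = H(secret || message)
secret = b"hypothetical-secret!"          # unknown to the attacker
original = b"cmd=ls"
tag = toy_hash(secret + original)

# "Attacker" side: knows tag, original and len(secret), but not secret.
glue = md_padding(len(secret) + len(original))
suffix = b"&extra=1"
forged_msg = original + glue + suffix
forged_tag = toy_hash(
    suffix,
    state=tag,  # resume from the published digest
    prior_len=len(secret) + len(original) + len(glue),
)

# The forged tag matches what the server would compute for forged_msg.
print(forged_tag == toy_hash(secret + forged_msg))
```

The attack works because the digest equals the internal state after the last block, so anyone who knows the digest and the total input length can continue hashing without knowing the secret prefix.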

API description

You will be interacting with the server through the API described below. The API lets you send Unix-like commands, which it then executes on the server. To prevent the execution of potentially malicious commands, the API expects a valid signature. A hash-based authentication function is used to calculate the signatures. To authorize a command (i.e., to obtain a valid signature), use the Authorize endpoint. Then, to run the command, use the Run endpoint. The Run endpoint recognizes two commands: ls and cat. These are Unix-like commands to list and concatenate (or view) files. However, the commands on the server are only simple variants of the Unix commands and do not support any options (or flags). If you are familiar with the Unix ls and cat commands, think of our variants as being invoked as ls <filename> or cat <filename> (again, no flags are supported).

There are a few important points in this task where understanding the quirks of Python helps. As this is not something we are trying to test, there is always a comment explaining the quirk (if it is necessary for solving the task).

Hint

At some point, you will need to work directly with the state of the hash-function used on the server, so we provide its implementation below. This function has more weaknesses than a well-made Merkle-Damgård hash-function, but you will not need to use those to solve this task.

Code & API

You can download the code of the hash-function here: merkled_amgard.py. You can also download a minimal working example of interacting with the endpoints in Python here: mwe.py. We suggest you start from mwe.py, especially if you are not familiar with how the HTTP protocol works, i.e., how clients and a server exchange messages with each other. You can also have a look at the documentation for the Requests Python module. For now, we intentionally leave aside how the query string should be constructed for GET requests. As understanding this is crucial, we return to it at the end of the assignment.

Common functions

Now, we present some common functions used in the server code and the API endpoints that will process the requests you send to the server. We suggest you open the mwe.py file now and try to see how the client code (i.e., mwe.py) and the server code (presented here) interact with each other. You don't need the complete server code, and therefore we provide only the necessary parts. In case you wonder, the server is implemented in the Flask web framework. Understanding the following functions is crucial for figuring out the attack.

# The authentication_prefix is a randomly generated 32 byte fixed secret value
authentication_prefix = b"..."
len(authentication_prefix) == 32

def verify(data, signature):
    """Verify that `signature` is valid for `data` by re-computing it and comparing."""
    if signature != MerkledAmgard(authentication_prefix + data).hexdigest():
        raise ValueError('Invalid signature')

def sign(data):
    """Sign `data` by prepending the authentication_prefix and hashing the whole thing."""
    return MerkledAmgard(authentication_prefix + data).hexdigest()

def parse(query_string):
    """
    Parse query string and also get it in bytes.

    Returns two objects:
     - First a dictionary with parsed query string.
       Multiple values are overridden (e.g., "?cmd=ls&cmd=aaaa" -> {"cmd": "aaaa"})
     - Secondly the raw query string unquoted into bytes.
       (i.e., unquotes raw bytes: %00 -> b'\x00')
    """
    return dict(parse_qsl(query_string, errors='ignore')), unquote_to_bytes(query_string)
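To see what the two return values of parse look like, the following minimal sketch calls a local copy of it on byte strings, which is the type of Flask's request.query_string. The sample inputs are made up for illustration and stick to ASCII-safe percent escapes.

```python
from urllib.parse import parse_qsl, unquote_to_bytes


def parse(query_string):
    # Local copy of the server's parse() shown above.
    return dict(parse_qsl(query_string, errors='ignore')), unquote_to_bytes(query_string)


# Duplicate keys: the dictionary keeps only the last value,
# while the raw bytes still contain the whole query string.
args, raw = parse(b"cmd=ls&cmd=aaaa")
print(args)  # dictionary view
print(raw)   # raw bytes view

# Percent escapes are decoded in both views.
args2, raw2 = parse(b"cmd=ls%00")
print(args2, raw2)
```

Note that the dictionary keys and values are bytes here, which is why the endpoints look up b'cmd' rather than 'cmd'.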

And now, let's see the actual API endpoints. For example, when you execute the authorize function inside the mwe.py, the following authorize API endpoint will get called on the server with the values that you pass to it via the query string. If there is an error on the server-side, you will get an error response such as 'No command specified' and the HTTP status code 400.

Authorize

Request

Authorize a command.

Path: /hw02/authorize/<uco>/
Method: GET
Query string: A cmd key with the command to execute.

Response

A JSON dictionary with one key, authorized, whose value is the authorized query string. The response also contains a Set-Cookie header carrying the signature that authorizes the query string.

@hw02.route("/authorize/<int:uco>/")
def authorize(uco):
    args, decoded = parse(request.query_string)
    command = args.get(b'cmd')
    if not command:
        return jsonify({'error': 'No command specified'}), 400
    if command != b'ls':
        # Only allows ls command
        return jsonify({'error': 'Bad command'}), 403
    # Construct the expiration timestamp
    expiry = int((datetime.now(timezone.utc) + timedelta(seconds=15)).timestamp())
    expiry_arg = b'expiry=' + str(expiry).encode() + b'&'
    # Construct the response with the authorized query string
    resp = jsonify({'authorized': (expiry_arg + request.query_string).decode()})
    # Authorize the query string
    resp.set_cookie('signature', sign(expiry_arg + decoded), max_age=30)
    return resp
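The following sketch reproduces, on the client side, the bytes that the authorize endpoint passes to sign (the server then prepends the secret authentication_prefix before hashing). The helper name signed_payload and the fixed timestamp are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import unquote_to_bytes


def signed_payload(raw_query: bytes, now: datetime, lifetime: int = 15) -> bytes:
    """Rebuild expiry_arg + decoded the same way the endpoint above does."""
    expiry = int((now + timedelta(seconds=lifetime)).timestamp())
    expiry_arg = b'expiry=' + str(expiry).encode() + b'&'
    return expiry_arg + unquote_to_bytes(raw_query)


# A fixed, made-up point in time so the result is reproducible.
now = datetime(2023, 11, 1, 12, 0, 0, tzinfo=timezone.utc)
payload = signed_payload(b"cmd=ls", now)
print(payload)
print(len(payload))  # length of the signed data, excluding the 32-byte prefix
```

The signature covers the unquoted bytes, whereas the authorized value in the JSON response contains the query string as it was sent, i.e., still percent-encoded.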

Run

Request

Run an authorized command.

Path: /hw02/run/<uco>/
Method: GET
Query string: An authorized query string, containing a cmd key with the command to execute and the expiry key.
Cookie: Requires a valid signature cookie that authorizes the query string (obtained from the Authorize endpoint).

Response

Depending on the command, either a list of files (for the ls command) or the contents of a file (for the cat command).

@hw02.route("/run/<int:uco>/")
def run(uco: int):
    args, decoded = parse(request.query_string)
    signature = request.cookies.get('signature')
    # Verify that the signature is valid
    if not signature:
        return jsonify({'error': 'Missing signature'}), 403
    try:
        verify(decoded, signature)
    except ValueError as error:
        return jsonify({'error': 'Invalid signature', 'detail': str(error)}), 403
    # Verify that the signature is not expired
    expiry = float(args.get(b'expiry'))
    if datetime.now(timezone.utc).timestamp() >= expiry:
        return jsonify({'error': 'Signature has expired'}), 403

    # Get the command and execute it (if it's "ls" or "cat")
    command = args.get(b'cmd')
    if command == b"ls":
        return list_files(uco)
    elif command.startswith(b"cat "):
        fname = command.split(b" ")[1].decode()
        return cat_file(uco, fname)
    else:
        return jsonify({'error': 'Unknown command'}), 403
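The command dispatch at the end of run can be tried out locally. The sketch below mirrors only the branching logic; the function name describe and the returned strings are hypothetical stand-ins for list_files and cat_file, whose code is not shown here.

```python
def describe(command: bytes) -> str:
    """Mirror the branching of the run endpoint without touching any files."""
    if command == b"ls":
        return "ls"
    elif command.startswith(b"cat "):
        # Only the first token after "cat " is used as the file name.
        fname = command.split(b" ")[1].decode()
        return "cat " + fname
    else:
        return "unknown"


print(describe(b"ls"))
print(describe(b"cat notes.txt"))
print(describe(b"cat a.txt b.txt"))  # the second file name is ignored
print(describe(b"ls -l"))            # flags are not supported
```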

Important notes on query string encoding

The API expects some parameters as part of the query string, that is, within the URL itself. In general, passing values between programs or services requires both sides (the sender and the receiver) to interpret the transmitted data in the same way. ASCII characters are usually handled with ease, but special characters (where "special" depends on the context) and raw bytes can be troublesome. As an example, a sender may intend to send the zero byte 0x00, but the receiver might interpret it as the four characters 0, x, 0, 0. As another example, the Python comparison bytearray(b'\x00')[0] == 0x00 evaluates to True, but to send the zero byte as part of a URL you need to encode it differently and send %00. While these differences may seem nitpicky, they really aren't. You can read more about URL encoding on Wikipedia and about query strings in general.
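The difference between a raw zero byte, its percent-encoded form, and the literal text 0x00 can be checked with the standard library. This is a small sketch with made-up sample data; the choice of safe characters ('=' and '&') is an assumption that keeps the query string structure readable.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = b"cmd=ls\x00"                          # seven bytes, the last one is 0x00

# Percent-encode for use in a URL; '=' and '&' are left untouched.
encoded = quote_from_bytes(raw, safe="=&")
print(encoded)

# Decoding the percent-encoded form gives the original bytes back.
decoded = unquote_to_bytes(encoded)
print(decoded == raw)

# The literal text "0x00" is four ordinary characters, not a zero byte.
literal = unquote_to_bytes("cmd=ls0x00")
print(len(raw), len(literal))
```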

Going back to mwe.py, you might notice that we intentionally do not use the way recommended by the docs, which is to send the query parameters using the params keyword argument. Since you are free to pick a different language, be careful and pay attention to how the query string is actually created. For Python, have a look at the functions unquote_to_bytes and quote_from_bytes that we import in mwe.py.
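As a rough illustration of why building the query string by hand matters, the sketch below contrasts a manually encoded URL with what happens when an already percent-encoded value is handed to a helper that encodes parameters for you (shown here with the standard library's urlencode). The host name, the uco, the timestamp, and the helper run_url are hypothetical placeholders; mwe.py may be structured differently.

```python
from urllib.parse import quote_from_bytes, urlencode

BASE_URL = "https://example.invalid"  # hypothetical placeholder host


def run_url(uco: int, raw_query: bytes) -> str:
    """Build a Run URL with a hand-encoded query string."""
    query = quote_from_bytes(raw_query, safe="=&")
    return f"{BASE_URL}/hw02/run/{uco}/?{query}"


# Raw bytes are encoded exactly once.
manual = run_url(123456, b"expiry=1700000000&cmd=ls\x00")
print(manual)

# Passing an already-encoded value to an encoding helper escapes the
# percent sign itself, so the server would see the text "%00".
double = urlencode({"cmd": "ls%00"})
print(double)
```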

Submission

You should submit a zip-file containing three items, or four if you include the optional llm.txt:

A solution.txt file that contains the contents of the secret file you were supposed to obtain (i.e., it has four space-separated words in ASCII on one line).
- Do not put the words on separate lines.
- Do not put tabs or many spaces between the words.
- Submit an empty file in case you do not recover the contents of the secret file.
A description.txt file that contains a description of how you solved the task (i.e., what you did in order to obtain the contents of the secret file, how you came up with it, etc.). Please keep the descriptions reasonably short and clear (4-5 paragraphs should be enough). Cite any external sources for the solution ideas and code that you have used accordingly. Use of large language models (LLMs, e.g., ChatGPT, Bing Chat, Bard, Copilot, etc.) must be declared in the description.txt file. Specify which LLMs you used and give a one-sentence description of the purpose for which you used them. You don't have to cite this course's materials and links provided in this assignment's text. In your description, please answer the following question:
- What property of the used hash-function did you (ab)use in order to solve the task?
A code directory that contains any source code used in solving the task.
[Optional] An llm.txt file. Include it only if you used an LLM during your work. If you used an LLM with a chat-like natural-language interface (ChatGPT, Bing Chat, etc.), the file should include a transcript of the relevant part of the interaction with the model (your prompts and the model's responses). If you used an LLM focused on code completion (such as Copilot), write a very brief description of how you feel the model helped you with writing your code.

Grading

Recovering the correct contents of the secret file is worth 7 points, and the description is worth the remaining 3 points. However, a submission with just the contents of the secret file and no description is worth 0 points. If you don't complete the task, submit a description of where you got stuck and the code you used. A submission that does not conform to the format above incurs a 0.5 point penalty.