Posted on: 06 11 2024.

File Streaming to REST APIs using Python

Author: Amar Zlojić, Lead Engineer

What is streaming and why do we need it?

In software engineering terms, streaming is sending a sequence of data over a period of time. Basically, it’s taking a big package, splitting it into many smaller packages and sending those over one by one.

We need streaming because we’re usually limited by one or more of these resources: storage, memory, I/O or network bandwidth. Even when we’re not limited by any of them, we can still be susceptible to network issues that could cause a single large batch operation to fail.

A great use of streaming is in the implementation of data protection, data backup or data recovery since it helps to parallelize and improve operations without needing additional resources.

REST API file uploads

File uploads through REST APIs are usually done using multipart/form-data. This lets you send multiple files in one request without having to URL-encode the payload, and it allows sending raw bytes.

Multipart/form-data requests contain multiple parts separated by a boundary delimiter. The root HTTP message defines the boundary (usually a randomly generated string) so that the HTTP server knows where one part ends and the next begins. Each part can also carry additional headers that help the server interpret it; some of these are listed below, with a sketch of the resulting wire format after the list:

  • Content-Disposition Header: required for each part of a form-data request; it carries the field name and, for file parts, the file name. It looks something like Content-Disposition: form-data; name="file"; filename="report.pdf".
  • Content-Type Header: used for defining the MIME type of the file being sent. For raw binary data we would use application/octet-stream; however, if we know what type of file is being sent, we should use that specific type, for example application/pdf. The default for a part’s Content-Type is text/plain. In most cases application/octet-stream would work, but if we want to be precise it’s best to define it properly.
  • Content-Transfer-Encoding Header: used to indicate the type of transformation that has been used to represent the body in an acceptable manner for transport. It is mostly seen in SMTP and IMAP implementations and is rarely used outside of this scenario. It looks something like Content-Transfer-Encoding: base64.
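
To make this concrete, here is a small, illustrative sketch (the endpoint, field name and file contents are placeholders) that uses requests to prepare, but not send, a multipart request so we can inspect the headers and body it would put on the wire:

import requests

# Prepare (but don't send) a multipart/form-data request so we can inspect
# what would actually go over the wire.
request = requests.Request(
    "POST",
    "https://acme.com/api/v1/files",
    files={"file": ("report.pdf", b"%PDF-1.7 ...", "application/pdf")},
)
prepared = request.prepare()

print(prepared.headers["Content-Type"])  # multipart/form-data; boundary=<random string>
print(prepared.body[:200])               # each part starts with --<boundary> and its own headers

Printing the body shows each part wrapped between --boundary lines, each carrying its own Content-Disposition and Content-Type headers.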

Uploading files to REST API using Python

When it comes to working with REST APIs in Python, we will almost always reach for the requests library to make the API calls and handle all of the request-response logic.

Simply put, with the requests library we make a request to the HTTP server with a specific method and parameters, and we get a response object in return.

import requests

response = requests.get("https://acme.com/api/v1/user", params={"username": "test"})
print(response.content)

This would be the equivalent of executing the following curl command:

curl -X GET "https://acme.com/api/v1/user?username=test"

For a POST request, the details depend on the HTTP server and what type of data it accepts, but a simple POST request would look something like:

import requests

response = requests.post("https://acme.com/api/v1/user", headers={"Accept": "application/json"}, data={"username": "new-username"})
print(response.content)

The curl equivalent is:

curl -X POST -H 'Accept: application/json' -d 'username=new-username' 'https://acme.com/api/v1/user'

However, the most interesting and challenging part arises when uploading a file.

The simplest approach is to upload the entire file at once, which involves loading the whole file into memory and sending it in a single request. The requests library conveniently handles multipart/form-data uploads for entire files using the files parameter in the post method.

import requests

response = requests.post("https://acme.com/api/v1/files", file={"filename.pdf", open("file-to-upload.pdf", "rb")})
print(response.content)

This sends file-to-upload.pdf over to the HTTP server as a multipart/form-data request and tells the server that the file’s name is filename.pdf.

And requests also supports streaming uploads out of the box: you pass a file-like object (or a generator of chunks) as the request body via the data parameter.

with open("file.pdf", "rb") as file:
    response = requests.post("https://acme.com/v1/upload", files=file)

This will work for some servers, but many servers expect a properly encoded multipart/form-data body with a file field. Unfortunately, the requests library can’t stream multipart/form-data out of the box.

To make this work, we must use an extension of the requests library called requests_toolbelt. It contains additional features that aren’t part of the requests library because they’re either not used often enough or simply don’t belong there.

requests_toolbelt provides a MultipartEncoder, which prepares the data to be streamed in exactly the format we need, as well as a MultipartEncoderMonitor for keeping an eye on the streaming upload.

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

encoded_data = MultipartEncoder(
    fields={
        "id": "this-is-some-id",
        "file": ("name-when-uploaded.pdf", open("file-to-upload.pdf", "rb"), "application/pdf"),
    }
)

r = requests.post(
    "https://acme.com/v1/upload",
    data=encoded_data,
    headers={"Content-Type": encoded_data.content_type},
)
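
As a quick sketch of that monitoring (reusing the same placeholder endpoint and file as above), MultipartEncoderMonitor wraps the encoder and invokes a callback every time a piece of the body is read:

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder, MultipartEncoderMonitor

encoded_data = MultipartEncoder(
    fields={
        "file": ("name-when-uploaded.pdf", open("file-to-upload.pdf", "rb"), "application/pdf"),
    }
)

# The callback receives the monitor itself each time a chunk of the body is read.
monitor = MultipartEncoderMonitor(encoded_data, lambda m: print(f"{m.bytes_read} bytes read"))

r = requests.post(
    "https://acme.com/v1/upload",
    data=monitor,
    headers={"Content-Type": monitor.content_type},
)

Since the callback gets the monitor, monitor.bytes_read can drive a progress bar or log upload progress.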

This works great for almost every HTTP server, but unfortunately not for all of them. MultipartEncoder requires the fields to be given as a dictionary, so for a file you need to do something like MultipartEncoder(fields={"file": open("file.pdf", "rb")}), but some HTTP servers require us to send over the raw data without any file field at all.

The first solution to that problem would be to fall back to the requests library’s generator streaming, but that sets the Transfer-Encoding header to chunked, which some HTTP servers handle poorly or reject outright.
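
For reference, that generator-based fallback might look like the following sketch (the chunk size and endpoint are arbitrary placeholders); because the generator has no known length, requests switches to chunked transfer encoding:

import requests

def file_chunks(path, chunk_size=1024 * 1024):
    # Yield the file in 1 MiB chunks; a generator has no known length,
    # so requests falls back to Transfer-Encoding: chunked.
    with open(path, "rb") as file:
        while chunk := file.read(chunk_size):
            yield chunk

response = requests.post("https://acme.com/v1/upload", data=file_chunks("file-to-upload.pdf"))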

To avoid this, we once again turn to requests_toolbelt, this time with StreamingIterator: we give it a generator and the total size of the data being streamed, and it crafts the request and streams it over without switching the Transfer-Encoding to chunked, because the known size lets requests send a Content-Length header instead.

import os
import requests
from requests_toolbelt.streaming_iterator import StreamingIterator

file_path = "file-to-upload.pdf"
size = os.path.getsize(file_path)  # StreamingIterator needs the total size up front
content_type = "application/pdf"

# Any iterator of bytes works here; a binary file object is itself an iterator.
generator = open(file_path, "rb")

streamer = StreamingIterator(size, generator)
response = requests.post(
    "https://acme.com/v1/upload",
    data=streamer,
    headers={"Content-Type": content_type}
)

If you’re unsure of the content type of the data you’re sending over, you can always default to application/octet-stream, which covers arbitrary binary data.

Conclusion

Streaming file uploads can get very complicated, and a lot of the details depend on the HTTP server you’re uploading to. You need to be aware of how the server wants the data formatted, which fields it will accept and which headers it expects.

A great use of streaming is backup and recovery. Instead of loading a file into memory or saving it to intermediate storage, we stream it chunk by chunk to be either backed up or restored.

For example, suppose we backed up a large file to AWS S3. When executing a restore, the machine performing it may not have enough memory or storage to hold the file, so we would stream it directly from S3 to the REST API we want to upload it to. That way, we don’t strain the available resources.
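
As an illustration only, with a hypothetical bucket, key and upload endpoint, such a restore could be sketched with boto3 and the same StreamingIterator:

import boto3
import requests
from requests_toolbelt.streaming_iterator import StreamingIterator

s3 = boto3.client("s3")
# Hypothetical bucket and key, purely for illustration.
backup = s3.get_object(Bucket="backups", Key="large-backup.pdf")

# Stream straight from S3 to the upload API, chunk by chunk, without ever
# holding the whole object in memory or writing it to local disk.
streamer = StreamingIterator(backup["ContentLength"], backup["Body"].iter_chunks(1024 * 1024))
response = requests.post(
    "https://acme.com/v1/upload",
    data=streamer,
    headers={"Content-Type": "application/pdf"},
)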

The safest option for streaming data uploads in Python is the combination of the requests and requests_toolbelt libraries, which lets us stream data in almost any shape or form to almost any HTTP server and REST API.