Author: Amar Zlojić, Lead Engineer
In software engineering terms, streaming is sending a sequence of data over a period of time. Basically, it's taking one big package, splitting it into many smaller packages and sending those over one by one.
We need streaming because we're usually limited by one or more of the following: storage, memory, I/O or network bandwidth. Even when none of these are a constraint, we can still be susceptible to network issues that could cause a single large batch operation to fail.
Streaming is a great fit for data protection, data backup and data recovery, since it helps parallelize and improve operations without needing additional resources.
File uploads through REST APIs are usually done using multipart/form-data. It lets you send multiple files in one request and allows raw bytes to be sent without URL-encoding them.
A multipart/form-data request contains multiple parts separated by a boundary delimiter. The root HTTP message defines the boundary delimiter so that the HTTP server knows where each part begins and ends; the boundary is usually a randomly generated string. Each part can also carry additional headers that help the server understand the part better, some of these are:
- Content-Disposition: form-data, together with the field name and, for files, the filename of the part.
- Content-Type: usually application/octet-stream for files; however, if we know what type of file is being sent, we should use that specific one, for example application/pdf. The default for Content-Type is text/plain. In most cases application/octet-stream would work, but if we want to be precise it's best to define it properly.
- Content-Transfer-Encoding: for example base64.
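To make this concrete, here is a small sketch of what requests itself builds for such an upload; the endpoint URL and the inline PDF bytes are just placeholders:
import requests

# Prepare (but don't send) a multipart/form-data request to a hypothetical endpoint
request = requests.Request(
    "POST",
    "https://acme.com/api/v1/files",
    files={"file": ("filename.pdf", b"%PDF-1.4 ...", "application/pdf")},
)
prepared = request.prepare()

print(prepared.headers["Content-Type"])  # multipart/form-data; boundary=<random string>
print(prepared.body[:200])               # boundary line, then Content-Disposition and Content-Type of the first part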
When it comes to working with REST APIs in Python, we will almost always reach for the requests library to make the API calls and handle everything regarding request-response logic.
Simply put, with the requests library we make a request to the HTTP server with a specific method and parameters, and we get a response in return.
import requests
response = requests.get("https://acme.com/api/v1/user", params={"username": "test"})
print(response.content)
This would be the equivalent of executing the curl command
curl -X GET "https://acme.com/api/v1/user?username=test"
And for a POST request, it depends on the HTTP server and what type of data it accepts, but a simple POST request would look something like:
import requests
response = requests.post("https://acme.com/api/v1/user", headers={"Accept": "application/json"}, data={"username": "new-username"})
print(response.content)
And the curl equivalent is
curl -X POST -H 'Accept: application/json' -d 'username=new-username' 'https://acme.com/api/v1/user'
However, the most interesting and challenging part arises when uploading a file.
The simplest approach is to upload the entire file at once, which involves loading the whole file into memory and sending it in a single request. The requests library conveniently handles multipart/form-data uploads of entire files through the files parameter of the post method.
import requests
# files= builds the multipart/form-data body; the tuple sets the filename reported to the server
response = requests.post("https://acme.com/api/v1/files", files={"file": ("filename.pdf", open("file-to-upload.pdf", "rb"))})
print(response.content)
This sends file-to-upload.pdf to the HTTP server, tells it that the file's name is filename.pdf, and does so using multipart/form-data.
And requests also supports streaming uploads out of the box: you can pass an open file object or a generator directly as the request body.
with open("file.pdf", "rb") as file:
    response = requests.post("https://acme.com/v1/upload", data=file)
This will work for some servers, but many servers expect a properly encoded multipart/form-data body with a file field. Unfortunately, the requests library doesn't have that functionality out of the box.
To make this work, we must use an extension of the requests library called requests_toolbelt. requests_toolbelt contains additional features that aren't part of the requests library because they're not used often enough or simply don't belong there.
requests_toolbelt provides MultipartEncoder, which prepares the data to be streamed in exactly the format we need, as well as ways to monitor the streaming upload.
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

# MultipartEncoder builds the multipart/form-data body and streams it part by part
encoded_data = MultipartEncoder(
    fields={
        "id": "this-is-some-id",
        "file": ("name-when-uploaded.pdf", open("file-to-upload.pdf", "rb"), "application/pdf"),
    }
)

r = requests.post(
    "https://acme.com/v1/upload",
    data=encoded_data,
    # content_type includes the generated boundary, so the server can split the parts
    headers={"Content-Type": encoded_data.content_type},
)
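For the monitoring part mentioned above, requests_toolbelt also ships a MultipartEncoderMonitor that wraps the encoder and invokes a callback as bytes are read. A minimal sketch, assuming the same hypothetical endpoint and file:
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder, MultipartEncoderMonitor

encoder = MultipartEncoder(
    fields={"file": ("name-when-uploaded.pdf", open("file-to-upload.pdf", "rb"), "application/pdf")}
)

# The callback receives the monitor itself; bytes_read and len give the upload progress
def print_progress(monitor):
    print(f"Uploaded {monitor.bytes_read} of {monitor.len} bytes")

monitor = MultipartEncoderMonitor(encoder, print_progress)

response = requests.post(
    "https://acme.com/v1/upload",
    data=monitor,
    headers={"Content-Type": monitor.content_type},
)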
This works great for almost every HTTP server, but unfortunately not for all of them. MultipartEncoder requires the dictionary approach to fields, so for a file you need to do something like MultipartEncoder(fields={"file": open("file.pdf", "rb")}), but some HTTP servers require us to send raw data without a file field.
The first solution to that problem would be to use the requests library's generator streaming, but that sets the Transfer-Encoding header to chunked, and some HTTP servers behave poorly with chunked requests or even reject them immediately.
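For illustration, here is roughly what that generator-based approach looks like (the chunk size and URL are just assumptions); because the body length is unknown up front, requests falls back to chunked transfer encoding:
import requests

def file_chunks(path, chunk_size=1024 * 1024):
    # Yield the file in 1 MiB pieces instead of loading it all into memory
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# A generator body has no known length, so requests sends Transfer-Encoding: chunked
response = requests.post("https://acme.com/v1/upload", data=file_chunks("file-to-upload.pdf"))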
To avoid this, we would once again use requests_toolbelt, but this time with StreamingIterator: we give it a generator and the total size of the data being streamed, and it crafts the request and streams it over without switching the Transfer-Encoding to chunked.
import requests
from requests_toolbelt.streaming_iterator import StreamingIterator

generator = open("file-to-upload.pdf", "rb")
size = 3126474  # total size of the upload in bytes
content_type = "application/pdf"

# StreamingIterator exposes the size up front, so requests sets Content-Length
# instead of falling back to Transfer-Encoding: chunked
streamer = StreamingIterator(size, generator)

response = requests.post(
    "https://acme.com/v1/upload",
    data=streamer,
    headers={"Content-Type": content_type},
)
If you're unsure of the content type of the data you're sending over, you can always default to application/octet-stream, which encompasses any binary data.
Streaming file uploads can get very complicated and a lot of details of uploading a file depend on the HTTP server that you’re trying to upload to. You need to be aware of how the HTTP server wants the data to be formatted, which fields it will accept and which headers it wants.
A great use of streaming is a backup or a recovery. Instead of loading a file into memory or saving it to intermediary storage, we stream it chunk by chunk to be backed up or restored.
For example, suppose we backed up a large file to AWS S3. When executing a restore, we may not have enough memory or storage on the machine performing the restore, so we stream the file directly from AWS S3 to the REST API we want to upload it to. That way, we don't stress the available resources.
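Here is a hedged sketch of that flow, assuming boto3 is available and using made-up bucket, key and upload URL values; the S3 object is streamed in chunks straight into the upload request:
import boto3
import requests
from requests_toolbelt.streaming_iterator import StreamingIterator

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="backups", Key="large-backup.pdf")

# Wrap the S3 body chunks so the upload has a known size and is never buffered locally
streamer = StreamingIterator(obj["ContentLength"], obj["Body"].iter_chunks())

response = requests.post(
    "https://acme.com/v1/upload",
    data=streamer,
    headers={"Content-Type": "application/pdf"},
)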
The safest option for streaming upload of data in Python is to use the combination of the requests and requests_toolbelt libraries, which enable us to stream data in almost any shape or form to almost any HTTP server and REST API.