lighttpd exhaust RAM while downloading big file

changed milestone to %Turris OS 5.1.2

I tried this by downloading a 700 MB file on a MOX with 512 MB RAM. The php processes are not the problem: they consume a sane amount of memory, and php is also limited by the php.ini configuration.

It looks like lighttpd tries to "cache" or pull the file into memory before sending it to the HTTP client.

Also note that uploading the same file was no problem for the system and worked like a charm.

changed the description

added Bug High To Do labels

probably related

assigned to @mhrusecky

This has been here since the beginning. I think this bug cannot block another fixup release, which we need to ship ASAP. We need to prioritize %Turris OS 5.1.2 for Sentinel fixes, mostly in the firewall, so this is going to land in one of the upcoming fixup releases of Turris OS 5.1.

no problem, yet still High

changed milestone to %Turris OS 5.1.3

changed milestone to %Turris OS 5.1.4

changed milestone to %Turris OS 5.1.5

@vmyslivec What version of lighttpd is being used and can you reproduce this? If so, I'd like to fix it. (I am a lighttpd developer)

An immediate workaround might be server.stream-response-body = 2 https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stream-response-bodyDetails

I believe that the bug you are seeing here was fixed in lighttpd 1.4.56. The bug was visible with FastCGI backends, large files, and (the default) server.stream-response-body = 0

https://redmine.lighttpd.net/issues/3033

In the future, please consider reporting lighttpd bugs upstream. I do try to fix things that are reported.

https://redmine.lighttpd.net/projects/lighttpd/issues

https://github.com/CZ-NIC/turris-os-packages/pull/63 includes commit to upgrade to lighttpd 1.4.56

Hi @gstrauss

What version of lighttpd is being used?

It's version 1.4.55 in our stable release now. However, I guess you have already found that out.

Can you reproduce this?

Yes. One of our users reported issues with downloading large files through Nextcloud; I was able to reproduce it, so I created this issue.

I am a lighttpd developer

Nice! You are welcome here!

I believe that the bug you are seeing here was fixed in lighttpd 1.4.56.

Perfect, looking forward to testing the updated version.

In the future, please consider reporting lighttpd bugs upstream. I do try to fix things that are reported.

I was quite confident it was an issue with the configuration on our side. The issue has been postponed since then, so I didn't focus on it.

https://github.com/CZ-NIC/turris-os-packages/pull/63

Wow! Big thanks, also for the configuration optimizations.

assigned to @gstrauss and unassigned @mhrusecky

changed milestone to %Turris OS 5.2.0

added Doing label and removed To Do label

added Patch Review labels

mentioned in merge request !422 (merged)

This issue is fixed with the upgrade to lighttpd 1.4.58, which is part of !422 (merged) (pending review for merge)

Just to be sure, I think this should be tested. I am assigning this to @vmyslivec to test it (of course only once !422 (merged) is merged and built).

I just tested HBL (future %Turris OS 5.2.0) and I still had RAM issues while downloading large files from Nextcloud.

I need to look into the issue a bit more since I didn't recognize the process consuming the memory. There is somewhat higher memory usage by mysqld and php, which is needed by NextCloud, but these did not grow much during the file transfer...

With lighttpd 1.4.59 released as part of Turris OS 5.2.0, this issue ("lighttpd exhaust RAM while downloading big file") should be resolved.

I need to look into the issue a bit more since I didn't recognize the process consuming the memory.

@vmyslivec could you post some of those details here? Is lighttpd memory growing and triggering OOM killer? That should not happen. If the issue is elsewhere, maybe a new issue with Nextcloud should be open to better describe where to look and how to reproduce. Thanks.

removed Doing label

assigned to @vmyslivec and unassigned @gstrauss

@kkoci since this issue is marked 'Bug' and 'High', should this issue get the label 'Needs run-testing' or 'In Test'? How about the 'Doing' label? ;)

Review does that for us. @vmyslivec tested it and it seems that it still happens. Let's wait for him to re-investigate the issue.

removed Review label

added To Do label

changed milestone to %Turris OS 5.2.1

changed milestone to %Turris OS 5.3.0

added Doing label and removed To Do label

The problem is still present: downloading a big file via NextCloud exhausts RAM completely and leads the OOM killer to kill a mysqld (MariaDB) process.

Current versions are:

Turris OS 5.2.2
nextcloud 19.0.3-3
lighttpd 1.4.59-1
php 7.2.34-3
mariadb-server 10.4.18-1

The difference between current and original state is that I don't see which process consumes the vast amount of memory now.

mysqld consumes a lot of memory, but constantly the same amount even via (obviously)
php processes consume a little more memory during a file download, but not significantly more than in the "idle" state
the lighttpd process consumes CPU time (obviously) but no extra memory during the file transfer

It's hard to debug because, once memory is exhausted, everything is stuck and the MOX does not respond. The htop command did not help much with locating the issue, and it displays inconsistent data.

We can probably create a new issue and close this one, as it seems lighttpd is no longer causing the problem.

I also created #775 (closed) as I think the mysqld process is not well tuned with regard to memory consumption (especially on a 512MB MOX), but it is probably not the root cause of this issue.

PS: Again, uploading a file still works like a charm with no significant or unexpected load on the device.

OK, I found the root cause now! I realized that our /tmp lives in RAM; it is 222 MB in size in my case, and the downloaded file is at most 222 MB.

I can confirm that during a file download lighttpd creates temporary files in the /tmp directory named lighttpd-upload-<random-letters>. Every file is 1 MB in size, and lighttpd creates up to about 200 of them, which (together with other temp files) consume 100 % of the /tmp file system and thus about half of the total RAM of my 512MB MOX. (That's also why htop didn't reveal the cause of the RAM consumption.)

When /tmp is exhausted, the browser offers the file for download, but its size is only about 220 MB (i.e. not the whole file!). The oom-killer also kicks in at about this time.

After several seconds, these temporary files disappear and the router comes back to normal. However, mysqld has been killed and restarted, and the downloaded file is incomplete.

@gstrauss can we do something about that? Is it possible to configure lighttpd to recycle temporary files to not exhaust /tmp while downloading a file? Thanks for the reply

marked this issue as related to #775 (closed)

There are multiple options.

MOX would have an additional ~160 MB memory free if reForis and supporting services were not so bloated and always running on the system: #705 (closed) "lighttpd: reduce memory usage of foris and reforis" (160 MB is almost 1/3 of the entire 512 MB Mox memory!!!)

lighttpd supports server.max-request-size to limit the request size. (probably not applicable to this issue)

lighttpd supports streaming the request or response, and I mentioned this above 7 months ago: (#665 (comment 185700))

$HTTP["url"] =~ "^/nextcloud" {  # or whatever appropriate url-path
    server.stream-request-body = 2
    server.stream-response-body = 2
}

https://wiki.lighttpd.net/Server_stream-response-bodyDetails

https://wiki.lighttpd.net/Server_stream-request-bodyDetails

Those settings will reduce the amount of intermediate buffering performed by lighttpd. However, they also disable the default behavior of lighttpd, e.g. offloading the response as quickly as possible from the backend. With the above settings, the backend is now busy sending the response for (almost) as long as the client takes to download. This feature is more important when the backend is a heavy scripting language, such as PHP (nextcloud), running via CGI, where every request is an independent PHP process.

lighttpd supports X-Sendfile response header from backends, e.g. so that a PHP backend can tell lighttpd to directly read a file from the filesystem for the response, rather than PHP reading the file and copying to lighttpd, having lighttpd store the response in temp files, only to then send it to the client. It looks like NextCloud developers were unable to implement this cleanly in their system: https://github.com/nextcloud/server/issues/13082

Alternatively -- and possibly a better solution when discussing a system that has a secondary storage device (e.g. the nextcloud target volume) -- is to configure lighttpd server.upload-dirs to use a tmp/ directory on the persistent storage hosting nextcloud, e.g. /nextcloud/tmp

If the nextcloud feature is enabled in Mox, this could be set in e.g. /etc/lighttpd/conf.d/nextcloud.conf

server.upload-dirs := ( "/nextcloud/tmp" )  # or whatever appropriate path

lighttpd server.upload-dirs is a global setting in lighttpd and defaults to /var/tmp, since a sufficiently large storage location is intended. OpenWRT and Turris OS lighttpd.conf set upload-dirs = ( "/tmp" ) in /etc/lighttpd/lighttpd.conf. On Mox, /tmp is a limited in-memory filesystem.

server.upload-dirs supports multiple directories for tmp file creation (same directories for both upload and download) and if the first dir fills up (e.g. /dev/shm), then lighttpd will begin to use the next in the list (e.g. /var/tmp). This can work well on a system with an appropriately sized in-memory /dev/shm (default 1/2 memory) and a disk-backed /var/tmp.
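As a sketch of the fallback list described above (untested on Turris OS; the two paths are the examples from the paragraph, and the `:=` override follows the usage elsewhere in this thread to replace a value already set in lighttpd.conf):

```
# Global scope only: server.upload-dirs is a global setting.
# lighttpd uses the first directory and moves on to the next one
# once the first fills up.
server.upload-dirs := ( "/dev/shm", "/var/tmp" )
```

This only makes sense where /dev/shm is reasonably sized and /var/tmp is disk-backed; both are assumptions about the target system.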

For small-memory systems which support large uploads to persistent storage, it is often a better idea to set server.upload-dirs to a location on persistent storage, and to not use in-memory filesystems.

server.upload-temp-file-size controls the size of each tmp file, default 1 MB. The idea is that as soon as the tmp file is consumed (upstream or downstream), it can be removed to free up space, rather than always taking up the entire size of the request body or response body.
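For illustration, the default can be written out explicitly. This is a sketch, assuming the value is given in bytes; 1048576 simply restates the 1 MB default and is not a tuning recommendation:

```
# Size of each lighttpd-upload-* temp file before a new one is started.
server.upload-temp-file-size = 1048576
```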

Aside: lighttpd mod_webdav is not as featureful as NextCloud, but is much faster since lighttpd mod_webdav uploads files directly to the persistent storage location (instead of to temporary directories), and atomically renames the uploaded file into place.

Thanks for the comprehensive analysis and description @gstrauss. From my point of view, server.upload-dirs and server.stream-response-body are two possible solutions.

To take advantage of setting a different temporary upload dir, we must make sure it is on external storage (as we need to avoid excessive writes to the internal storage). This is something the storage plugin can take care of, IMO.

Streaming the response/request is something that could work in the case the device lacks external storage.

To sum it up, we need to edit/update lighttpd configuration based on the state of the device/configuration. This should be handled by managing conf.d/ configuration snippets in certain TOS packages. What do you think @kkoci?

@kkoci please enable "Notifications" for me on this issue. For some reason, I do not have permission to do so myself, hence the reason I did not see @vmyslivec update 4 weeks ago, but I did get an email today when he referenced me @gstrauss

You are listed as a participant. I do not have the rights to configure your account (and I am not sure anyone actually has). The notification switch on the right panel has to be enabled to get mail. I suspect that you have it disabled and possibly unable to enable it? I can't change that. That might be Gitlab bug or something. I can report it to our admins but please check first what is the state of that button.

I noticed that someone else has complained about missing notifications from GitLab issues as well. I will try to figure it out and discuss it with the GitLab administrators.

I suspect that you have it disabled and possibly unable to enable it? I can't change that. That might be Gitlab bug or something. I can report it to our admins but please check first what is the state of that button.

Yes, "Notifications" is disabled and the control is grayed-out. I do not have the ability to enable it for this issue. Also, since you did not mention me @gstrauss in your response, I did not get any notification of your response. Until this is sorted, please mention me @gstrauss in your posts if you would like me to see your post in a timely fashion. Thank you.

Our GitLab instance was updated recently. Please check the notification toggle now @gstrauss

Thanks! I am now able to enable Notifications on this issue.

@vmyslivec did you have any questions about my response above? #665 (comment 221544)

assigned to @kkoci and unassigned @vmyslivec

@vmyslivec wrote:

From my point of view, server.upload-dirs and server.stream-response-body are two possible solutions.

To sum it up, we need to edit/update lighttpd configuration based on the state of the device/configuration. This should be handled by managing conf.d/ configuration snippets in certain TOS packages. What do you think @kkoci?

Please keep in mind that server.upload-dirs is a global setting. If external storage is available, it is desirable to use (globally) on the server instead of using internal storage (with more limited write cycles).

server.stream-response-body and server.stream-request-body can be configured with any lighttpd.conf condition, e.g. for any URL.

server.upload-dirs and server.stream-response-body and server.stream-request-body can be used together. They are not mutually exclusive.
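A minimal sketch of the two used together, assuming NextCloud is served under /nextcloud and that /nextcloud/tmp exists on persistent storage (both paths are placeholders, as in the snippets above):

```
# Global: put temp files on persistent storage instead of in-memory /tmp.
server.upload-dirs := ( "/nextcloud/tmp" )

# Per URL: reduce intermediate buffering for NextCloud traffic only.
$HTTP["url"] =~ "^/nextcloud" {
    server.stream-request-body = 2
    server.stream-response-body = 2
}
```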

FYI: I wrote some documentation on the lighttpd wiki which explains how to use server.upload-dirs, server.stream-response-body, and server.stream-request-body: lighttpd resource tuning

@kkoci: In the interest of "a working solution now is better than a perfect solution in another year": server.stream-response-body = 2 can be applied immediately. With server.stream-response-body = 2, everything should work and lighttpd will not fill up /tmp, which happens without server.stream-response-body = 2.

It would be better if applied only to requests to NextCloud, e.g.

$HTTP["url"] =~ "^/nextcloud" {  # or whatever appropriate url-path
    server.stream-request-body = 2
    server.stream-response-body = 2
}

lighttpd resource tuning describes the behavior of server.stream-response-body = 2 in more detail.

A longer term solution would be a lighttpd include file if NextCloud is configured, e.g. /etc/lighttpd/conf.d/nextcloud.conf

server.upload-dirs := ( "/nextcloud/tmp" )  # or whatever appropriate path to large persistent storage

or a similar configuration created and included by lighttpd if Turris OS configures a large external storage device and the user enables this external device by specifying a /tmp directory on the device with 1777 permissions. Creating and enabling such an include file for lighttpd.conf should be a secondary effect of some system-wide storage management solution so that it is clear that the external device is being designated for temp file use.

I can do both with ease as we provide that configuration file as part of our distribution.

Do you think that it is safe to enable server.stream-request-body and server.stream-response-body server wide to prevent issues with memory with any setup? I read through the document you linked and it seems to me that the default on devices with low RAM (by today's standards) should be 2. I also do not see a reason why it should not be set to 2 on low-traffic sites (I can see issues if the site has high traffic). My understanding is that it can be selectively set to 0 for applications that we know can't trigger OOM this way.

Edit: Just to explain my thinking. The issue seems to be generic for any deployment of an "upload/download" capable web application on Turris. The solutions seem to be mutually exclusive: it makes no sense to set upload-dirs when we are streaming data instead. Thus it seems to me that the streaming solution is easier and more generic. I am just not sure why you are suggesting upload-dirs as the solution over streaming.

Do you think that it is safe to enable server.stream-request-body and server.stream-response-body server wide to prevent issues with memory with any setup?

Yes, it should be safe to do so.

Offloading requests and responses from backends will be reduced, which appears to be ok for the Turris environment. Also, mod_deflate will not operate on streaming responses, which is also likely acceptable for the Turris environment.
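A sketch of the server-wide variant, placed at global scope in a conf.d snippet (the file name in the comment is hypothetical):

```
# e.g. /etc/lighttpd/conf.d/streaming.conf  (hypothetical file name)
# No $HTTP["url"] condition, so this applies to every request.
server.stream-request-body = 2
server.stream-response-body = 2
```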

Below, I'll try to answer in more detail your question about why streaming is not enabled by default in lighttpd

I am just not sure why you are suggesting upload-dirs as the solution over streaming.

I tried to describe in lighttpd resource tuning that there are tradeoffs between streaming and not streaming.

Disabling streaming (the default) allows lighttpd to offload requests and responses from backends, which is especially useful for low resource systems. lighttpd easily runs on routers with 64 MB of memory or even less. On memory-constrained systems, it is often desirable that CGI programs run for as short a time as possible. Too many CGI programs running in parallel might overload a small system. Too many CGI programs running in parallel might be avoided by not starting the CGI program until the entire request body has been received, and by reading the response body as quickly as possible from the CGI program, allowing the CGI program to finish and exit more quickly.

Independently, if a system supports large file uploads and downloads, that might suggest the presence of a large disk of persistent storage. If a large disk is present, then server.upload-dirs on the large disk should be considered, rather than using a very small in-memory filesystem for tempfiles. (The very small part is emphasized.)

I believe that server.upload-dirs on persistent storage is the better long-term solution for upload and download of large files.

Independently, enabling streaming for backends is recommended for large requests and responses for which offloading from the backend is a lower priority, or if full offloading might cause resource issues for the machine on which lighttpd is running.

Just for fun: computing resources have grown exponentially over the past few decades. Remember when 64 MB of RAM was a huge amount of memory? Another scenario (unlikely to affect users of Turris OS): for dumb clients, HTTP/1.1 streaming responses will send Transfer-Encoding: chunked. Without streaming responses, lighttpd is able to send Content-Length, even if the backend sent Transfer-Encoding: chunked to lighttpd. (Transfer-Encoding: chunked and Content-Length are both part of the HTTP/1.1 specification, but some dumb clients historically expected only Content-Length.)

Thank you for the explanation as well as for the feedback.

BTW, if short-term changes are made to lighttpd config, I would suggest making those changes in turris-root.conf, and not in the main lighttpd.conf since (eventually?) someone might review the patches I proposed in #474 (closed) "lighttpd: Use upstream version instead of ours"

We try to use files in /etc/lighttpd/conf.d for every update because they are not marked as configuration files, while the top-level lighttpd.conf is. Being marked as a configuration file means that changes do not propagate automatically if the user has modified the file.

mentioned in merge request !802 (closed)

#665 (comment 221544)

lighttpd supports X-Sendfile response header from backends, e.g. so that a PHP backend can tell lighttpd to directly read a file from the filesystem for the response, rather than PHP reading the file and copying to lighttpd, having lighttpd store the response in temp files, only to then send it to the client. It looks like NextCloud developers were unable to implement this cleanly in their system: https://github.com/nextcloud/server/issues/13082

FYI: I added a note to https://github.com/nextcloud/server/issues/13082 with some suggestions for how to add X-Sendfile support to NextCloud.

Back in 2019, one of the NextCloud developers had posted

It still this sounds like a nice feature, but the requests for this are quite low.

It might be nice if someone from the Turris team would like to post on behalf of the Turris organization. Having the Turris organization add a "this is useful to users running NextCloud on home routers" may help to get more support for adding the feature to NextCloud.

mentioned in merge request turris/foris-controller/foris-controller-storage-module!24 (merged)

ping

It might be nice if someone from the Turris team would like to post on behalf of the Turris organization. Having the Turris organization add a "this is useful to users running NextCloud on home routers" may help to get more support for adding the feature to NextCloud.

https://github.com/nextcloud/server/issues/13082

The RAM exhaustion should be resolved with the new foris-controller-storage-plugin release (!816 (merged)). That resolves it for me. What remains here is the request from @gstrauss to post support on behalf of Turris. Honestly, I do not know the topic well enough for my comment there to be beneficial. I think that @mhrusecky should do it, considering his history with the Nextcloud project and community. Thus I am keeping this open and assigning it to @mhrusecky for that.

added Review label and removed Doing Patch labels

assigned to @mhrusecky and unassigned @kkoci

@kkoci: In an effort to reduce noise on the issue board, please go ahead and close this. I do not see a reason to keep this issue open for a request to comment on an external github issue.

@mhrusecky: you have a github handle? Would you subscribe to https://github.com/nextcloud/server/issues/13082 and perhaps add a note there? Thanks.

Why not. I am not against that.

closed

removed Review label
