Serving the Quantitative Finance Community

  • 1
  • 6
  • 7
  • 8
  • 9
  • 10
 
User avatar
tags
Posts: 3602
Joined: February 21st, 2010, 12:58 pm

Re: Excel tricks

April 16th, 2025, 7:28 am

Hi everyone. I have an Excel range (about 40K cells) and I need to create call it a mask with background color of the Excel cell (hence object return would be a dataframe with column being Excel column, row being Excel rows and values being Excel cells background color). I tried out multipled strategies in python but my code is way to slow, Best I could get was around 230s. Using cython I went down to 94s. And about the same time with pybind11, that is 90s. That is still way too slow. The slowness is openpyxl library's internal functions, specifically in the XML parsing and object creation phases.
What should I do next? My use case is not crazy innovative, there must be some ready-to-use solutions??  Many thanks.
 
User avatar
Cuchulainn
Posts: 22929
Joined: July 16th, 2004, 7:38 am

Re: Excel tricks

April 16th, 2025, 9:37 am

Do you have a snippet code where the bottleneck is?

Is Excel DNA still used?

Plan B VBA with dll??


And PythonNet?
https://pythonnet.github.io/


 XML parsing and object creation phases.
Killers
Attachments
Mastering_Cython.pdf
(1.99 MiB) Downloaded 13 times
 
User avatar
Cuchulainn
Posts: 22929
Joined: July 16th, 2004, 7:38 am

Re: Excel tricks

April 16th, 2025, 9:46 am

BTW nanobind is follow up act to pybind11
Wha' about JSoN??
 
User avatar
bearish
Posts: 5906
Joined: February 3rd, 2011, 2:19 pm

Re: Excel tricks

April 16th, 2025, 10:35 am

Where do the colors come from? If they’re functions of the corresponding cell values, it may be easier to replicate that relationship in code.
 
User avatar
Cuchulainn
Posts: 22929
Joined: July 16th, 2004, 7:38 am

Re: Excel tricks

April 16th, 2025, 10:47 am

what about xlwings?

https://www.xlwings.org/book

Haven't used it ...
 
User avatar
tags
Posts: 3602
Joined: February 21st, 2010, 12:58 pm

Re: Excel tricks

April 16th, 2025, 1:51 pm

Thank you for replies (I acknowledge "use case" is quite meaningless...)

Below is the profile output with the top 20 operations that take the most time. If I understand well the functions related to openpyxl.descriptors.serialisable, openpyxl.descriptors.base, and the underlying xml.etree.ElementTree are to blame.
| **Count**          | **Time (s)** | **CPU Time (s)** | **Wall Time (s)** | **Location**                                                                 |
|--------------------|--------------|------------------|-------------------|------------------------------------------------------------------------------|
| 3444787/37649      | 42.943       | 0.000            | 140.063           | C:\Python311\Lib\site-packages\openpyxl\descriptors\serialisable.py:46(from_tree) |
| 21477798           | 23.222       | 0.000            | 33.982            | C:\Python311\Lib\site-packages\openpyxl\descriptors\base.py:39(__set__)       |
| 5231626            | 11.335       | 0.000            | 21.015            | C:\Python311\Lib\xml\etree\ElementTree.py:1637(_start)                        |
| 9443777/9443749    | 10.466       | 0.000            | 28.238            | C:\Python311\Lib\site-packages\openpyxl\descriptors\base.py:68(__set__)       |
| 57816954/57816952  | 9.624        | 0.000            | 9.624             | {built-in method builtins.isinstance}                                         |
| 8625               | 8.404        | 0.001            | 39.484            | {method 'Parse' of 'pyexpat.xmlparser' objects}                               |
| 4846463            | 8.148        | 0.000            | 14.491            | C:\Python311\Lib\site-packages\openpyxl\xml\functions.py:77(localname)        |
| 23870387           | 7.710        | 0.000            | 7.710             | C:\Python311\Lib\site-packages\openpyxl\descriptors\base.py:24(__set__)       |
| 2528893            | 7.338        | 0.000            | 15.009            | C:\Python311\Lib\site-packages\openpyxl\worksheet\_reader.py:189(parse_cell)  |
| 5231626            | 6.569        | 0.000            | 6.569             | {method 'start' of 'xml.etree.ElementTree.TreeBuilder' objects}               |
| 90                 | 5.383        | 0.060            | 81.916            | C:\Python311\Lib\site-packages\openpyxl\worksheet\_reader.py:367(bind_cells)  |
| 5065495            | 5.338        | 0.000            | 20.358            | C:\Python311\Lib\site-packages\openpyxl\descriptors\base.py:164(__set__)      |
| 2528893            | 4.229        | 0.000            | 4.608             | C:\Python311\Lib\site-packages\openpyxl\utils\cell.py:193(coordinate_to_tuple)|
| 11495787/11463855  | 4.204        | 0.000            | 4.314             | {built-in method builtins.getattr}                                            |
| 4865685            | 4.180        | 0.000            | 4.180             | {method 'match' of 're.Pattern' objects}                                      |
| 16243267           | 3.974        | 0.000            | 3.974             | C:\Python311\Lib\xml\etree\ElementTree.py:1620(_fixname)                      |
| 5240342            | 3.767        | 0.000            | 5.163             | C:\Python311\Lib\xml\etree\ElementTree.py:1310(read_events)                   |
| 5231626            | 3.762        | 0.000            | 5.785             | C:\Python311\Lib\xml\etree\ElementTree.py:1649(_end)                          |
| 5231626            | 3.577        | 0.000            | 10.064            | C:\Python311\Lib\xml\etree\ElementTree.py:1578(handler)                       |
| 5529298/5528595    | 3.324        | 0.000            | 4.194             | C:\Python311\Lib\site-packages\openpyxl\descriptors\base.py:53(_convert)      |
I couldn't find alternatives to openpyxl (the extension xlsm of my Excel and the for free nature narrow down options dramatically).

Maybe to be more specific about what I'm willing to achieve. I have an Excel range, and the cells in that range can have one of 4 possible background colors (as seen by the user on his/her screen). And I'm interested in returning the column/row of cells with lightred background. In more technical terms these lightred cells are distinguishable from the 3 other background colors by either (or both) theme and schema as determined as follows

# Load the workbook and select the sheet
workbook = openpyxl.load_workbook(filename)
sheet = workbook['Sheet1']

# Get the cell
cell = sheet['P95']

# Get the fill pattern of the cell
fill = cell.fill

# Get the cell's theme and schema (if applicable)
theme = cell.fill.start_color.theme if cell.fill.start_color.theme else None # 5
schema = cell.fill.start_color.tint if cell.fill.start_color.tint else None # 0.7999511703848384

No other background color has a theme == 5 and/or a schema == 0.7999511703848384

And I'm looking for the right tools to do that super force for a whole range of Excel cells.

xlwings would be way too slow based on my experience. And i really need execution speed to go down to 2-3s.

(I acknowledge relying on Excel cell background colors can be disastrous! Heopfully what i meant with "background color" is more clear now)
 
User avatar
bearish
Posts: 5906
Joined: February 3rd, 2011, 2:19 pm

Re: Excel tricks

April 16th, 2025, 4:41 pm

This won’t be pretty, but probably not that hard. You can unzip the Excel file, e.g. with something like

tar -xf filename.xlsx

In the file sheet1.xml in the resulting worksheets directory you will find a list of all the cells in the workbook with their respective properties. The background color is indicated with the code “s” immediately following the address, i.e. something like

<c r=“B8” s=“2”><v>0.235</v></c>

Once you have identified the value(s) in this scheme of the color code you care about, it should be pretty straightforward to filter this file and extract the corresponding cell addresses.
 
User avatar
bearish
Posts: 5906
Joined: February 3rd, 2011, 2:19 pm

Re: Excel tricks

April 16th, 2025, 4:44 pm

The task can be done with a one-line VBA function using the x.interior.color property (of cell x), but that doesn’t scale to your problem size at all, even making Python look fast…
 
User avatar
tags
Posts: 3602
Joined: February 21st, 2010, 12:58 pm

Re: Excel tricks

April 27th, 2025, 12:05 am

So I finally went for a C++ solution (OpenXLSX, minizip-ng, tinyxml2).
Execution time goes down to hardly a few secs although Excel worbook parsed is stored on OneDrive seated on a slow server (don't make fun of me!). I'm really impressed I have to say. I mean it.
 
User avatar
tags
Posts: 3602
Joined: February 21st, 2010, 12:58 pm

Re: Excel tricks

July 4th, 2025, 12:41 pm

tags:
So I finally went for a C++ solution (OpenXLSX, minizip-ng, tinyxml2).
Execution time goes down to hardly a few secs although Excel worbook parsed is stored on OneDrive seated on a slow server (don't make fun of me!). I'm really impressed I have to say. I mean it.

well, ...   in my current setup (I'm not the one who make technical choices here, unfortunately,...), using C++ was say cumbersome.
i then went back to exploring a python solution.
if relevant to anyone, I manage to detect the "cells with red light background" very fast (a very large worksheet under 0.5s using a notebook and without any code opimization.
i first investigated those cells using openpyxl:
[*]fill.patternType == "solid"
[*]fill.fgColor.type == "theme"
[*]fgColor.theme == 5 and tint ≈ 0.8
and then I wrote down code that flag cells union of these conditions (it is trivial to code...)
and that's it.
I'm not sure why I previously had rough time to get this done under a couple of/several minutes.