Python: socket realtime prices and Pandas dataframes
Posted: April 8th, 2021, 4:48 pm
Hi everyone,
I have a Python script that uses the API of a broker platform I use. It connects (on localhost) to the platform software via a socket, and with a socket.recv() call I receive the requested price-feed information (time, symbol, bid, ask, last, last quantity, volume, etc.).
Now I want to analyze these records, so I append them to one big pandas DataFrame called 'priceAll':
Code: Select all
import pandas as pd
from io import StringIO

....  # socket setup elided

while True:
    msg = mysock.recv(16384)
    msg_stringa = str(msg, 'utf-8')
    read_df = pd.read_csv(StringIO(msg_stringa), sep=';',
                          error_bad_lines=False,  # deprecated since pandas 1.3; newer versions use on_bad_lines='skip'
                          index_col=None, header=None,
                          engine='c', names=range(33),
                          decimal='.')
    # append() already returns a new frame, so no extra .copy() is needed
    priceAll = priceAll.append(read_df, ignore_index=True)
I then read and analyze 'priceAll', making local copies inside other functions, e.g.:
- summing all quantities per symbol and price, so I have the volume at each specific price level
- summing all quantities per second, to get second-based volumes
- etc.
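For reference, the two aggregations above can be sketched like this (the column names `time`, `symbol`, `price`, `quantity` are assumptions, since the actual 8-column schema isn't shown in the post):

```python
import pandas as pd

# Toy frame standing in for priceAll (hypothetical schema)
priceAll = pd.DataFrame({
    'time': pd.to_datetime(['2021-04-08 15:00:00.2',
                            '2021-04-08 15:00:00.7',
                            '2021-04-08 15:00:01.1']),
    'symbol': ['AAPL', 'AAPL', 'MSFT'],
    'price': [130.0, 130.0, 250.5],
    'quantity': [100, 50, 200],
})

# Volume per symbol and price level
vol_by_price = priceAll.groupby(['symbol', 'price'])['quantity'].sum()

# Volume per second (resample on the time column)
vol_by_second = priceAll.resample('1s', on='time')['quantity'].sum()
```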
This whole workflow works quite well (each loop takes roughly 100-200 ms, and I almost never encounter missed values or reading problems), and the DataFrame 'priceAll' grows to roughly 4,000-5,000 rows by 8 columns. Rows older than 10 minutes are automatically dropped with this:
Code: Select all
# keep only rows from the last 600 s (10 minutes)
priceAll = priceAll[(now - priceAll['time']).astype('timedelta64[s]') < 600].copy()
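For context, here is the same 10-minute trim on a toy frame, written with `.dt.total_seconds()` (equivalent to the `astype('timedelta64[s]')` spelling, and portable across pandas versions; the column names are assumptions):

```python
import pandas as pd

now = pd.Timestamp('2021-04-08 15:00:00')
priceAll = pd.DataFrame({
    'time': [now - pd.Timedelta(minutes=15),   # stale row, will be dropped
             now - pd.Timedelta(minutes=5)],   # recent row, will be kept
    'last': [100.0, 101.0],
})

# keep only rows whose timestamp is younger than 600 s (10 minutes)
priceAll = priceAll[(now - priceAll['time']).dt.total_seconds() < 600].copy()
```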
MY QUESTION: Is there another workflow better suited to this purpose using a pandas DataFrame? I am aware that DataFrames are not exactly the best choice for realtime and "append"-heavy tasks, but I cannot find a better solution that is as fast and as simple for handling tabular data (sums, averages, groupby's, etc.).
Thanks in advance!