Welcome to today’s daily kōrero!
Anyone can make the thread, first in first served. If you are here on a day and there’s no daily thread, feel free to create it!
Anyway, it’s just a chance to talk about your day, what you have planned, what you have done, etc.
So, how’s it going?
I’m a data engineer working mostly in Python and SQL, embedded in a data analytics team. Our main use cases are ingestion pipelines (API sources, Glue scripts, batch jobs in Airflow and AWS), plus some work in pandas that doesn’t fit into our dbt SQL models. I think it’s also nice for data exploration and sharing via Jupyter/Colab notebooks.
What are you thinking of using it for?
There are a few different reasons that I’ve thought about for now:
A lot of the data that we are working with is quite large, and it’s sometimes a struggle to work with it in Google Sheets / Excel (Unfortunately our workplace uses both for some reason)
I have some weekly reports that I’ve somehow ended up generating (getting data via SQL, massaging the data, and presenting it via a dashboard or a shared spreadsheet).
For creating a repeatable set of calculations when someone asks for something (which I’m sort of doing via Power Query or Google Apps Script)
I’m quite big on visualizations, so I want to give Matplotlib a go.
And I do a bit of coding already (JavaScript & C++ (Arduino)), and have always wanted to add Python to my list of skills, especially recently, as I begin to delve more into data.
Those sound like perfect scenarios! One of the first projects that got me hooked on Python was processing large CSV files instead of opening them in Excel and running Visual Basic on them.
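Just to give a rough idea of what that looks like (the file name and column names here are made up, not from a real project), pandas can process a CSV in chunks so the whole thing never has to fit in memory at once:

```python
import pandas as pd

# Process a large CSV in chunks rather than loading it all at once.
# "sales.csv", "region" and "amount" are placeholder names.
totals = {}
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    grouped = chunk.groupby("region")["amount"].sum()
    for region, amount in grouped.items():
        totals[region] = totals.get(region, 0) + amount

summary = pd.Series(totals).sort_values(ascending=False)
print(summary)
```

Once it’s a script like that, you can rerun it every week instead of redoing the spreadsheet steps by hand.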
If you haven’t already, you should check out DuckDB for working with your larger data sets, too. It’s pretty neat. https://duckdb.org/
I’ve had a brief look into DuckDB, and I’m not too sure if I’m interpreting its use case correctly, but does it basically allow you to use SQL within your Python to query large datasets that you have locally?
That’s right. You can read in structured files and query them locally without having to load them into a database. It’s nice when you’d rather write analytical SQL, or want to move between SQL and pandas. It’s very quick at loading and querying files, and it can connect to databases, too.
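Here’s roughly the shape of it (file and column names are made up, just for illustration):

```python
import duckdb

# Query a local CSV directly with SQL -- no load step into a database first.
# "events.csv" and "user_id" are placeholder names.
con = duckdb.connect()  # in-memory by default

top_users = con.execute("""
    SELECT user_id, count(*) AS n_events
    FROM read_csv_auto('events.csv')
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").df()  # .df() returns the result as a pandas DataFrame

print(top_users)
```

You can also query an existing pandas DataFrame by referencing its variable name in the SQL, which makes going back and forth between the two pretty painless.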
Oooh, that sounds pretty promising - I’ve been struggling with how to handle quite large datasets when they don’t live in a database.
Thank you for enlightening me! :) I might send you some messages later if I have any questions, if that’s okay with you?
Sure thing!
Thank you :)