Figuring Out Fabric: Learn Fabric in 30 minutes.

Each week I’ll be interviewing experts and users alike on their experience with Fabric, warts and all. I can guarantee that we’ll have voices you aren’t used to and perspectives you won’t expect.

Each episode will be 30 minutes long with a single topic, so you can listen during your commute or while you exercise. Skip the topics you aren’t interested in. This will be a podcast that respects your time and your intelligence. No 2 hour BS sessions.

Ep 1. Data Movement with Kristyna Ferris

January 20, 2025 • Eugene Meidinger • Season 1 • Episode 1

In this episode, I interview Kristyna Ferris about the different types of data movement in Microsoft Fabric. Specifically, we talk about Gen2 dataflows, data pipelines, and Spark notebooks. We see how you start simple and work your way up. Kristyna shares the "faucets first" approach at P3 Adaptive.

Eugene Meidinger (00:00.844)
All right, welcome to the inaugural episode of Figuring Out Fabric, where I ask dumb questions and you get smart answers. Hopefully, anyway. And my first guest is Kristyna. So you are a Microsoft MVP, you're a solution architect at P3 Adaptive, and you're just kind of fun to be around. I'm looking, it's true. I don't know. You're always, like, just very, very uplifting or warm or, I don't know the way to put it. I'm bad with words, but so yes.

Kristyna Ferris (00:17.404)
Aww.

Kristyna Ferris (00:26.008)
I like hanging out with you too, Eugene. It's a mutually beneficial relationship.

Eugene Meidinger (00:30.296)
Finally, if only there was like a word for that, right? Friendship. Yeah, jinx. I have to ask, how is it working at P3 Adaptive? Because I listen to the podcast occasionally and it sounds real, like, fancy and prestigious, but, like, you know, how is it behind the scenes? No. Very nice.

Kristyna Ferris (00:33.948)
Friendship?

Kristyna Ferris (00:52.168)
It's a bunch of nerds. It's amazing. I love it. Yeah. Yeah. It's a bunch of nerds who started a company with a bunch of other nerds. They call them data geners. But I think I love it. You know, it's a good group of people with good goals and I love that. They're all about helping other people. Yeah.

Eugene Meidinger (01:10.382)
That's awesome.

Very cool. So you're gonna help me understand data movement in Fabric, and just to set the stage, up until about two weeks ago I could not tell anyone successfully when to use Power Query, data pipelines, or Spark notebooks in Fabric. And it very much felt like, I don't know, did you ever see that scene in Titanic where they have all the cutlery out?

And he's like, what do I do with all this? And the woman next to him is like, well, you just work from the outside in. And so I kind of feel like that's been my thing. He's like, OK, so you start with your salad fork, which is Power Query. And then you work your way up to your entree fork, which is pipelines. And then when you need a real tool, then your dessert... I don't know why you need a dessert fork. But your dessert fork is Spark. Right? That's about as, for as long as time, I guess.

Kristyna Ferris (01:45.436)
Yeah.

Eugene Meidinger (02:10.126)
And then about two weeks ago, I had to give a live training, and the first day was on Fabric, so I had to at least, like, do some research. And then over this past week, I just enjoy misery, so I did a benchmarking blog post. And so I at least tested some Spark, I ran into some issues with data flows in terms of performance, and I did some pipeline stuff.

Kristyna Ferris (02:17.734)
Nice.

Eugene Meidinger (02:38.584)
So I've gotten my hands a little bit dirtier, but still fairly lost. So hoping you can help out. Okay.

Kristyna Ferris (02:44.72)
I think I can. So I actually like to think of it. I love your fork analogy, but I think of it more as like lobster. Like when I'm eating lobster, canned crab or canned lobster imitation, whatever you want to call it. That to me is data flows. It works. You get it done quick, but the quality is not amazing. Right. It's not very scalable. Yeah.

Eugene Meidinger (02:51.351)
Okay.

Eugene Meidinger (02:57.898)
yeah.

Yeah, yeah, yeah.

Eugene Meidinger (03:05.998)
Quick, quick, quick aside. So I grew up poor, or at least partly, right? My parents were separated. And so when I visited my mom, our meals, 'cause I would visit her on the weekends, would be like mac and cheese, instant mashed potatoes, and then imitation crab. Right. And so it was years, probably my mid-twenties, when I first had like real crab, I feel like. I don't know.

Kristyna Ferris (03:25.04)
Yum.

Eugene Meidinger (03:34.925)
And so for me, it's pollock, that's the fish, and then they just, like, put crab juice on it and stuff. In my head, that is what crab tastes like. So that's probably how... and maybe that's the same with Power Query, like that's my favorite tool, but that makes sense. Sorry, go ahead.

Kristyna Ferris (03:43.885)
no.

Kristyna Ferris (03:50.094)
No, you're great. I mean, I didn't have real crab until my first PASS Summit. Yeah, doing... I also don't like seafood. So it was kind of a big jump for me to go to a lobster pot kind of dinner anyway. Now I loved it. But yeah, so I think you know exactly what I'm talking about, where it's like, yeah, it works. It's familiar. It's dependable for the most part. But if you are trying to get something done at the highest quality, that's not usually what you're going to pick up.

Eugene Meidinger (03:55.027)
wow.

Eugene Meidinger (04:02.786)
Gotcha. Gotcha. Yeah.

Kristyna Ferris (04:19.908)
Right? So when we move into pipelines, you know, if we go into that data movement strategy, that's kind of like you've got real lobster, you know, but it's been cooked and the tail is kind of out, you know, it's like kind of packaged and you can just eat it right away. I think of notebooks as you're cracking the legs. You know what I mean? Like the little bit more work, but you can get really fine tuned with it, right? You might get more.

Eugene Meidinger (04:31.032)
Hehe.

Yeah.

Kristyna Ferris (04:49.156)
out of it that way. You get a lot more out of it typically. So you kind of got these scales of like complexity and quality and knowing where you need to fall in that is really tricky. But

Eugene Meidinger (05:03.48)
Yeah, no, I mean, and that makes a lot of sense. And it's hard. One of the frustrations I have around sometimes some of the discussions is people will maybe say that more options is better. But there's an opportunity cost in figuring out like which one to go with. And there's an opportunity cost in understanding, like it took me a while to understand, was it just a skill and preference thing? Right? And they're interchangeable. Like is it crab versus lobster?

Kristyna Ferris (05:16.572)
All right.

Kristyna Ferris (05:29.702)
Right.

Eugene Meidinger (05:32.754)
Or is it something where, okay, no, there's a meaningful difference. I mean, it sounds like there is. There's a pretty meaningful difference in terms of usability and performance and all that kind of stuff.

Kristyna Ferris (05:37.243)
Yeah.

Kristyna Ferris (05:43.068)
Well, and you've seen that too, even in your benchmarking tests. I mean, it does come down to, you know, what's scalable, what's going to be performant and at the end of the day, what's maintainable. So I think that's something that people don't talk about enough. Sure. If you know that you want to go to notebooks in the future, you might want to start there, but if nobody on your team knows how to write Python.

Eugene Meidinger (06:07.725)
Yes.

Kristyna Ferris (06:07.97)
If nobody on your team is comfortable with a notebook and they see it as this black box, don't let a consultant come in and install that for you because you're never going to be able to maintain it. And data changes constantly. You need maintenance. Yeah.

Eugene Meidinger (06:21.196)
Yeah, is it scalable with people, right? Like, have you met Johnny Winter from Greyskull Analytics? So he was, yeah. Yeah. Well, he was teasing me on the Slack because I did some initial benchmarks and my numbers were off. But it looks like data flows might be an order of magnitude more expensive for what I was trying to do. Now, I probably wasn't taking advantage of fast copy, so there'll be some follow-up, but let's assume it was.

Kristyna Ferris (06:30.0)
Yeah, gosh, so cool.

Eugene Meidinger (06:50.574)
He was like, well, how much would it cost an analytics engineer to figure it out and set the thing up? And I'm like, I sincerely I'm like, if it takes them more than half an hour to figure out how to move the data, the company has lost money, right? Because yes, maybe that data flow, like let's assume the worst case scenario, my numbers are right, would have been $15 to move 194 gigabytes of CSVs via data flows, right?

Kristyna Ferris (07:17.284)
Mm-hmm.

Eugene Meidinger (07:17.71)
even in that worst case scenario, how much are you paying for an analytics engineer? Like how much are you paying on someone trying to learn Python from the ground up? And so you're right, like, yes, maybe it's $15 for that initial load, but you can easily waste a hundred or $200 on people spinning their wheels. Or if they pay for one of us, now they're spending $200 for an hour or 45 minutes, right? So I will say like,

Kristyna Ferris (07:21.851)
Right.

Kristyna Ferris (07:37.051)
Yeah.

100%.
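
Editor's sketch: the trade-off Eugene describes here can be written as a small break-even calculation. The $15 load cost and the $200-per-hour rate come from the conversation; the $1.50 alternative is only an assumption standing in for "an order of magnitude" cheaper, and the function name is made up for illustration.

```python
def break_even_minutes(simple_cost: float, tuned_cost: float, hourly_rate: float) -> float:
    """Minutes of paid engineering time that cost the same as the compute saved per load."""
    savings_per_load = simple_cost - tuned_cost
    return savings_per_load / hourly_rate * 60


# $15 is the worst-case dataflow figure quoted in the episode.
# $1.50 is an ASSUMPTION standing in for "an order of magnitude" cheaper.
# $200/hour is the consultant rate mentioned in the episode.
minutes = break_even_minutes(simple_cost=15.0, tuned_cost=1.5, hourly_rate=200.0)
print(f"Break-even after about {minutes:.1f} minutes of paid time per load")
```

Under those numbers the cheaper tool only pays off on a single load if setting it up takes roughly four minutes of paid time. A load that repeats every day changes the math, because the savings accumulate while the setup cost is paid once.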

Eugene Meidinger (07:47.18)
My dog Tyson's barking. Hopefully he's not getting picked up. Good. Awesome. So I will say that, believe it or not, the light bulb went off and I saw the vision of Fabric when it came to Power Query, because I deal with small data. I do not deal with big data. I think the largest data set I've dealt with was like a hundred, maybe three hundred million rows.

Kristyna Ferris (07:49.616)
I can't hear him. Yeah.

Eugene Meidinger (08:17.098)
and we're talking tens of gigabytes, like nothing crazy. And all my customers are like small, medium businesses. And when I realized, like, Dataflows Gen2 is just Power Query, but then when you get to the SQL endpoint, you can do a visual SQL query. It's just Power Query. Because you look at the SQL it generates, and first you're like, that's ugly as sin. And then you're like, but I know that ugly. I've seen that ugly before.

Kristyna Ferris (08:17.382)
Nice.

Kristyna Ferris (08:35.386)
Yeah, it's just Power Query. Yeah.

Kristyna Ferris (08:43.802)
Yup.

Eugene Meidinger (08:45.492)
No human puts a dollar sign before a table name. I know exactly where that came from, right? But I got really excited because the idea of asking a business user to work within the equivalent of Databricks is just insane, right? But to be able to, the fact that Microsoft is going, calm down, you're used to the tip of the iceberg with Power BI.

Kristyna Ferris (09:01.294)
It's overwhelming. Yeah.

Eugene Meidinger (09:11.33)
We're just expanding the iceberg a little bit. You can still use the tools you know. Maybe it's a little bit slower, maybe it's a little bit less performant. But you know what's really unperformant? The project failing because it never gets to the finish line. Right? Yeah.

Kristyna Ferris (09:24.718)
Exactly. I think people are underestimating the number of projects that get six months of work and never come to fruition. Right. And that little shameless plug, but that's one of the things that really attracted me to P3 is they have this faucets first mentality. So you start with where the water needs to come out and you just attach a weird hose to it.

Eugene Meidinger (09:42.808)
Okay.

Eugene Meidinger (09:46.848)
Okay. Yeah, yeah, yeah.

Kristyna Ferris (09:49.112)
And then once you figure out that the water is useful and can get there and is clean, then you go back and build all the pipes.

Eugene Meidinger (09:56.408)
Well, and that makes sense to me because you've probably dealt with this, but one of the most challenging things with writing reports is nobody knows what they want. Like code, it'll compile, you can write unit tests. It may make mistakes, but you'll get an error message. Data, you can look at the resulting data. You can see if it looks like what you expect. Reports don't compile, right? Yeah.

Kristyna Ferris (10:07.312)
Yes.

Kristyna Ferris (10:16.067)
Mm-hmm.

Kristyna Ferris (10:23.013)
No!

Eugene Meidinger (10:24.302)
And, like, there's an apocryphal quote from Ford, of when he invented the, like, the Model T, the T80, T9, I don't even know. The Ford. He's like, if... No, no, I'm not either, but the original car, right? There's a quote that's been attributed, that's apocryphal, that probably didn't happen. But it's, "If I asked my customers what they wanted, they would have said faster horses." And so when you're doing report design, people ask you for faster horses. And so BI is,

Kristyna Ferris (10:34.618)
I'm not a car person, sorry.

Kristyna Ferris (10:46.625)
Yeah.

Eugene Meidinger (10:53.678)
from my experience working in IT and doing some software development, it's the most prone to scope creep and it's the most iterative out of like all these IT things. Like you just have to assume that that first iteration is wrong. And so you either build a vertical slice or like you kind of described, let's get a minimum viable product because we know we're going to have to iterate. And then like you said, later we can work our way back and push stuff upstream or

Kristyna Ferris (11:02.138)
Yes.

Eugene Meidinger (11:21.378)
do something that's performant instead of like some ugly SQL query or something. Yeah, yeah, yeah.

Kristyna Ferris (11:24.668)
Right. Because you don't want to have to... I mean, worst nightmare is that that scope creep gets back into data engineering. Because I think as report developers, if you haven't figured this out already, you're going to, that everything you do is not quite right. I think that in data engineering, there is less room for that. Right? Similar to when you're laying down a foundation in a house, you need those forms to be kind of ready.

Eugene Meidinger (11:33.357)
Yeah, yeah.

Eugene Meidinger (11:50.574)
Sure.

Kristyna Ferris (11:54.042)
You can't really adjust the concrete later. I can change the paint later, you know.

Eugene Meidinger (11:57.24)
But I was told with data lakes, we just shove it in there and then we do schema-on-read. Are you telling me I was lied to? Are you telling me that there's engineering in data engineering? What? No.

Kristyna Ferris (12:05.468)
Yeah, it depends.

Eugene Meidinger (12:13.846)
No, but I think you're right. The cost of changing the shape of a chart or the color is far more minimal than changing the data type on a type. And it's so platform dependent too. Have you ever, probably not, but have you ever poked at the Contoso generated data set from the Italians?

Kristyna Ferris (12:23.237)
Yeah.

Kristyna Ferris (12:32.027)
Yeah.

Kristyna Ferris (12:40.101)
I've poked at it, yes. Yeah. Yeah, not to the extent you have. I've seen some of the stuff you've done. Yeah.

Eugene Meidinger (12:41.538)
Yeah, yeah, yeah. No, yeah. Well, I'm like, I need big data, right? Because the whole... so as an aside, but it'll work its way back. The whole reason I started doing the benchmarking is they announced SQL databases in Fabric. And there was this whole rigmarole on some discussion lists about like, okay, is this better for Power BI import or

Kristyna Ferris (12:50.051)
Yeah.

Kristyna Ferris (13:00.817)
Yes.

Kristyna Ferris (13:10.054)
Mm.

Eugene Meidinger (13:10.636)
Lakehouse or whatever. And I'm like, well, maybe I'll do some benchmarking. And I was advised, like, if you really want to be testing at scale, you need like a billion rows, you need big data, otherwise you're just testing the plumbing and not the efficiency, right? And so I'm like, okay, yeah.

Kristyna Ferris (13:23.876)
Right. Which is crazy, by the way, that you have very small differences until you get to a billion rows. Yeah.

Eugene Meidinger (13:33.036)
Yeah, so I'm like, well, I need big data and I don't know the whole like TPC-H stuff never appealed to me. The Stack Overflow database, like it's hard to relate. Power BI reports are in star schema. I need a billion rows of star schema data. So I wanted to do some testing and then I ended up, if you remember the term yak shaving. No, okay, all right, you can Google it.

Kristyna Ferris (13:58.267)
No.

Eugene Meidinger (14:00.91)
It's pretty simple. So it's a piece of slang. It doesn't mean anything weird or dirty or anything like that. But it's basically whenever you start doing a task or a chore or whatever, you're like, well, before I can really set up this notebook, I need to install Anaconda. And before I install Anaconda, I need to install this. And so it's like that if you give a mouse a cookie kind of chain of events. Yeah, yeah, yeah. And so it's basically like, you know, your boss comes in.

Kristyna Ferris (14:19.964)
that's exactly what I call it. Yeah.

Eugene Meidinger (14:28.374)
And it's like, Kristyna, why isn't this report done? It's like, well, I'm busy shaving a yak right now. And as soon as I finish that, then I can do x and I can do y. So it's the, you give a mouse a cookie kind of situation. And so that's why I'm like, well, okay, if I'm gonna be benchmarking that, I should probably benchmark the load into these sources and all this stuff. But because people kept asking about, like, compute units, how much does it cost? Not just how fast it is. The reason for that whole story is,

Kristyna Ferris (14:44.668)
and hence the latest.

Eugene Meidinger (14:56.396)
You're right about the foundation piece, because when I wanted to try and put this into SQL, or what is it even called, Fabric Data Warehouse, right? I tried to do a script-as-create for my local copy of the Contoso database. And it was like, this is incompatible, for a couple reasons. So one, because it's all based on Parquet, you don't have indexes, right? So it's like, you don't have filegroups, you don't have indexes, throw that out, fine.

Kristyna Ferris (15:04.656)
Yeah.

Kristyna Ferris (15:11.974)
Nice.

Kristyna Ferris (15:21.241)
No, you don't.

Eugene Meidinger (15:25.868)
And then it's like, we don't have nchar and nvarchar. Those data types don't exist in Parquet. no.

Kristyna Ferris (15:32.118)
Eugene, I can't tell you how many hours I spent because I had to migrate a process that had a ton of varchar(max) columns into a warehouse. So I had to go in and change everything by hand to, like, varchar(8000).
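
Editor's sketch: the hand edits Kristyna describes could be roughed out with a few regular expressions over the generated DDL. This assumes the only problem types are the ones named in the conversation (nchar, nvarchar, and varchar(max)); the helper name is hypothetical, the current Fabric documentation should be checked for the supported type list, and capping a column at 8000 can truncate data, so the output still needs a human review.

```python
import re


def make_warehouse_friendly(ddl: str) -> str:
    """Rewrite the column types that the conversation says caused trouble."""
    # VARCHAR(MAX) and NVARCHAR(MAX) become VARCHAR(8000).
    ddl = re.sub(r"\bN?VARCHAR\s*\(\s*MAX\s*\)", "VARCHAR(8000)", ddl, flags=re.IGNORECASE)
    # Remaining NVARCHAR(n) becomes VARCHAR(n); NCHAR(n) becomes CHAR(n).
    ddl = re.sub(r"\bNVARCHAR\b", "VARCHAR", ddl, flags=re.IGNORECASE)
    ddl = re.sub(r"\bNCHAR\b", "CHAR", ddl, flags=re.IGNORECASE)
    return ddl


source_ddl = "CREATE TABLE t (a NVARCHAR(50), b varchar(max), c NCHAR(2), d NVARCHAR(MAX))"
print(make_warehouse_friendly(source_ddl))
```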

Eugene Meidinger (15:36.94)
Ha

Eugene Meidinger (15:41.825)
no. Yeah.

Eugene Meidinger (15:48.59)
Yeah, no, I believe it, right? Because if I recall, when you go past 8,000, like, it moves it... was it like the off file store or whatever? I think they've got varchar(max) in preview for some things now, right? Yeah, I saw some sort of preview answer. But the whole point is, like, you think that, well, pipes are just pipes. Right? And then you find out that, like, if you pick the wrong one, you've got lead pipes

Kristyna Ferris (15:51.334)
Terrible.

Kristyna Ferris (16:01.68)
I would love that.

Kristyna Ferris (16:17.018)
Mm-hmm.

Eugene Meidinger (16:17.184)
and you'll get poisoned, right? And so, yeah, it makes sense. The faucets-first approach kind of makes sense in that, like, the thing that's most likely to change and the thing that is easiest to change is gonna be the front end. And so, unless you have a very good idea of how you want the data to be shaped, you really don't wanna start from the backend. But also, there's a classic book called Rapid Development. So I recommend trying it out. I guess...

Kristyna Ferris (16:20.122)
Yes.

Eugene Meidinger (16:46.494)
podcast full of show notes so I'll put it in the show notes. But it's about software development, but it talks about all the ways that it can fail and the ways that you deal with a project that's become a death march. And in the very first chapter it talks about how consistently large projects take 50% longer than was estimated, and the problem is when it's a two-million-dollar project

Eugene Meidinger (17:13.304)
finding an extra million dollars in budget and time in the couch cushions is not so easy, right? And so they're so much more likely to fail. I mean, that all makes sense. Pivoting a little bit, so what are some other challenges you've run into moving around data in Fabric? How's that been?

Kristyna Ferris (17:18.074)
Insane.

Kristyna Ferris (17:32.348)
Well, so I was so excited about Azure SQL Preview. I'm just gonna throw that out there. I am really excited for Azure SQL to be in Fabric. Being able to have identity columns alone, to me, is so worth it. Not having to make my own incremental IDs and trust that nobody's going to manually insert rows themselves. Things like that.

Eugene Meidinger (17:37.304)
Okay.

Eugene Meidinger (17:53.096)
geez. Side note, I don't even understand why SQL Warehouse is a thing. Like, I get some of the needs and stuff, but it's like one of the features is you can insert rows with SQL. If this is supposed to be an append-only analytical data store based off of Parquet, why would I insert rows with SQL? Like, the whole point.

Kristyna Ferris (18:17.382)
Yep.

Eugene Meidinger (18:22.528)
is that you take all... because I understand the VertiPaq engine, so I just make assumptions about Parquet. The whole point is you take a million rows and you scrunch them all down into a single file. And so what do you think happens when you insert 10 rows because Frank in accounting wanted to update a table? Well, it makes a little teeny tiny file, or depending on the engine, it'll kind of hold those in a holding area until it has enough to make a bigger file. But in any case, it's, it's,

Kristyna Ferris (18:47.791)
Right.

Eugene Meidinger (18:50.306)
bass-ackwards. It's basically, you know, it just doesn't make sense.
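
Editor's sketch: the "holding area" idea Eugene mentions can be shown with a toy model. Nothing here is how any particular engine works; the class, the threshold, and the use of Python lists as stand-ins for files are all assumptions made for illustration.

```python
class BufferedAppender:
    """Toy model: hold small inserts in memory until there is enough for one decent-sized file."""

    def __init__(self, min_rows_per_file: int):
        self.min_rows_per_file = min_rows_per_file
        self.buffer = []
        self.files = []  # each entry is a list of rows standing in for one file

    def insert(self, rows):
        self.buffer.extend(rows)
        if len(self.buffer) >= self.min_rows_per_file:
            self.flush()

    def flush(self):
        if self.buffer:
            self.files.append(self.buffer)
            self.buffer = []


# 100 tiny inserts of 10 rows each, the "Frank in accounting" pattern.
inserts = [[(batch, i) for i in range(10)] for batch in range(100)]

naive_file_count = len(inserts)  # one tiny file per insert

appender = BufferedAppender(min_rows_per_file=500)
for rows in inserts:
    appender.insert(rows)
appender.flush()  # write out whatever is left in the holding area

print(naive_file_count, "files without buffering")
print(len(appender.files), "files with buffering")
```

The same thousand rows end up in far fewer files once small inserts are held back, which is the gap between the two behaviours described above.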

Kristyna Ferris (18:53.498)
It's not made for OLAP, right? Or is it OLTP? That's the one. Thank you. I literally always mix them up. Do I know what they mean when someone says them to me? Yes. Do I say them backwards all the time? And that's why I love Azure SQL. It's made for that. It's made for inserting those rows. You know, it's like, yeah, this is...

Eugene Meidinger (18:55.895)
OLTP. the transaction, yeah, OLAP is cube, OLTP is transaction.

Eugene Meidinger (19:05.23)
Yeah, yeah, yeah.

Eugene Meidinger (19:09.42)
No worries.

Eugene Meidinger (19:16.782)
Turns out a transactional database was designed for transactions. Just wild, yeah. I know.

Kristyna Ferris (19:20.172)
What? Yeah. And I think it works faster, in my opinion, in Power BI than a warehouse does.

Eugene Meidinger (19:27.168)
Yeah, I'm interested. Back to the yak shaving, what I'm hoping to do is I want to test both, like, a full import, because I expect it to be way faster for that, because my assumption is, okay, you have a lakehouse, so you're converting column store into row store over the wire for TDS and then back again. Whereas why not just store it as row store and send it over? And then I'm gonna test like a 1% load on a date filter, because that simulates incremental refresh.

Kristyna Ferris (19:31.494)
Mm-hmm.

Kristyna Ferris (19:36.028)
Mm-hmm.

Kristyna Ferris (19:45.113)
Yep.

Eugene Meidinger (19:56.93)
You know what I mean, right? And again, a clustered index or non-clustered secondary index on date filters should get you that super fast. Whereas maybe if you're lucky with the data skipping for Parquet, it'll get you there. If it's sorted on that. But I also learned this week, I knew and then I forgot, that V-Order is a thing made up by Microsoft. 'Cause I'm like, well wait, someone was asking about, like, performance comparison between, like, Direct Lake

Kristyna Ferris (19:57.121)
nice. Yeah.

Kristyna Ferris (20:09.455)
Right.

Eugene Meidinger (20:25.752)
and, like, import. I'm like, well, I think V-Order is just on one column because Z-Order is on multiple columns, but I don't know. And so I go on the Databricks subreddit and I'm asking and I'm like, so can someone... I can't find any docs on this. And someone was like, what the heck's V-Order? And I'm like, is it a Microsoft thing? And they're like, yeah. And I'm like, great, I look dumb. Admittedly, I was told about this, like, when Fabric first came out and it just fell out of my head.

Kristyna Ferris (20:36.782)
Like whoops.
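
Editor's sketch: the data skipping Eugene mentions a moment earlier ("if it's sorted on that") relies on per-chunk minimum and maximum statistics. The toy below is pure Python, not Parquet itself; the group size, the dates, and the function names are assumptions picked to make the effect visible.

```python
from datetime import date, timedelta


def build_row_groups(rows, group_size):
    """Split rows into fixed-size groups and record the min/max date of each group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        dates = [d for d, _ in chunk]
        groups.append({"min": min(dates), "max": max(dates), "rows": chunk})
    return groups


def scan(groups, start, end):
    """Return the matching rows and how many groups had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] < start or group["min"] > end:
            continue  # the statistics prove no row can match, so skip the group
        groups_read += 1
        matches.extend(r for r in group["rows"] if start <= r[0] <= end)
    return matches, groups_read


first_day = date(2024, 1, 1)
rows = [(first_day + timedelta(days=i), i) for i in range(100)]
recent = (first_day + timedelta(days=90), first_day + timedelta(days=99))

sorted_groups = build_row_groups(rows, 10)
# Same rows, deliberately scattered so every group spans almost the whole date range.
scattered_groups = build_row_groups(sorted(rows, key=lambda r: r[1] % 10), 10)

print(scan(sorted_groups, *recent)[1], "group(s) read when sorted by date")
print(scan(scattered_groups, *recent)[1], "group(s) read when scattered")
```

Both scans return the same ten rows, but the scattered layout has to open every group. That is the "if you're lucky" part: a date filter only skips work when the data is laid out by date.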

Kristyna Ferris (20:52.288)
There's so much, there's so much to learn about all this stuff. It just feels like you can, it's a drowning ocean, right? Like the waves come.

Eugene Meidinger (20:53.698)
Yeah. Yeah.

Eugene Meidinger (21:00.206)
And hopefully this podcast will be like a little life preserver. That's the goal.

Kristyna Ferris (21:05.188)
Yeah. Yeah. Or like, Hey, you know, you can duck, you can duck this way. We got you. You know, you don't have to be swept up in it every time.

Eugene Meidinger (21:14.872)
So another question then, let's say we're in, like, have you ever played Animal Crossing? come on. Fine. Well, the most recent one, like, instead of being in a little cozy town, you're on a cozy tropical beach. So let's say I'm on my cozy tropical beach of Power Query, and I know all the people there, and I'm in this ocean of overwhelm, right? So when would I use, say, notebooks

Kristyna Ferris (21:22.04)
No, I haven't, but I know.

Kristyna Ferris (21:34.052)
Uh-huh.

Kristyna Ferris (21:38.139)
Yeah.

Eugene Meidinger (21:44.926)
if I understand Power Query and I can use data flows and all that kind of stuff. When would I leave when I get on my boat and leave my cozy little island village?

Kristyna Ferris (21:54.812)
So I have a hot take for you. When data flows start to break, that is when I would move. So I would start learning it now if you can, if you've got that bandwidth in your day to day, if you see that your data flows are taking really long to refresh, you're seeing like, shoot, this is taking a lot of capacity. You're just waiting for that tipping point moment. If you're there, try and start learning it now, because the last thing you want to do

Eugene Meidinger (22:01.538)
Makes sense to me.

Kristyna Ferris (22:23.472)
And this, I've been in this situation before where something finally breaks that you knew was gonna break eventually because it was just.

Eugene Meidinger (22:30.99)
Right. The data kept getting bigger or wider or slower. Yeah.

Kristyna Ferris (22:33.87)
Yeah, your refresh was taking an hour 15 and you're like, if it's taking, you're like, this is it. If it's getting to that point, that's when you should try and learn it instead of like I've done this many times in my career. It hit that point. It breaks and it's a business critical report. I'm now working two 24 hour days to get it back online. You don't want to be in that situation.

Eugene Meidinger (22:38.606)
And you hit that two-hour window and you're like, yeah

Eugene Meidinger (23:00.588)
Yeah, yep. It's not a problem until it is. And that's what I've talked about with Power BI in general, with, like, performance and scaling: it's not a problem until it suddenly is, because most of the time Power Query is doing query folding and the VertiPaq model is compressing everything really, really well, right? And so it covers a multitude of sins until there's too much of a multitude and, like, you have to do that. And like,

Kristyna Ferris (23:11.429)
Yeah.

Exactly.

Kristyna Ferris (23:18.299)
Mm-hmm.

Kristyna Ferris (23:25.435)
Yeah.

Eugene Meidinger (23:29.856)
Learning the notebook stuff is hard. Like, I was trying to do that. And first, my biggest complaint: if you want to add a new line in DAX, you hit Alt+Enter. Because if you hit Enter, heaven forbid, it'll just save it, right? Guess how it works in notebooks. If you need a new line, you hit Enter. If you do Alt+Enter, Control+Enter, or Shift+Enter, it's going to run the code, which may be expensive code, and then it may make a new cell. So there was that.

Kristyna Ferris (23:54.501)
I know.

Eugene Meidinger (23:59.31)
And then I spent a good amount of time just trying to talk to ChatGPT to help me out. Now I will say, two features that they're doing which are great: they've got Data Wrangler, so it's like Power Query for Python, that's awesome, and they've got snippets. But I still have yet to figure out the code I need to convert CSV to Delta and save it, so I need to do some research on that, and that's like basic stuff.

Kristyna Ferris (24:09.307)
Yes.

Kristyna Ferris (24:24.25)
I have it saved somewhere, like I've got a list of links right now, because it's hard. I agree with you. I've tried using ChatGPT to do this stuff and it doesn't know Fabric commands. It doesn't know that I need something in a Spark data frame. It doesn't understand the difference.
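
Editor's sketch: for readers stuck on the same step, a common PySpark pattern for turning a CSV into a Delta table looks roughly like this. It assumes a notebook where a `spark` session already exists and a default lakehouse is attached; the file path, the table name, and the wrapper function are hypothetical, and this is not necessarily what either speaker ended up using.

```python
def csv_to_delta(spark, csv_path, table_name):
    """Read a CSV with Spark and save it as a Delta table.

    `spark` is the SparkSession, passed in explicitly so the function does not
    depend on a notebook global.
    """
    df = (
        spark.read
        .option("header", True)       # first row holds the column names
        .option("inferSchema", True)  # let Spark guess column types (costs an extra pass over the file)
        .csv(csv_path)
    )
    # Overwrite mode replaces the table if it already exists.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return df


# Hypothetical usage inside a notebook with a default lakehouse attached:
# csv_to_delta(spark, "Files/raw/sales.csv", "sales")
```

For large or messy files, supplying an explicit schema instead of `inferSchema` avoids the extra pass and the occasional wrong guess.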

Eugene Meidinger (24:29.056)
nice.

Eugene Meidinger (24:40.222)
Yes, yes. I ran into that a lot. So I don't know if search is included in the free version. So for people who don't know, one, it's very important to be logged in if you're going to use ChatGPT. I don't know how it is for Copilot, but for ChatGPT, if you're like some rando on the internet and you don't have an account, they'll default you to 4o mini for the model, and you'll get categorically worse results. I pay for it.

Kristyna Ferris (25:03.932)
Mm.

Eugene Meidinger (25:09.454)
And so multiple times what I would have to do is I would ask it something. And if it was something older, like Azure portal, it was pretty good. But if it was something Fabric related, because it's so new, it would be off. So what I just do is I just tell it, research this. So it'll give me a bad answer. I'll tell it, research this. And then it'll start doing, like, the Bing search. And then usually the answers are better. But even then, so like there's a NotebookUtils library, formerly MSSparkUtils,

Kristyna Ferris (25:19.995)
Yeah.

Eugene Meidinger (25:39.688)
and you've probably met Sandeep Pawar. He's like, you should try fast copy for copying the CSV files, so it's in my benchmark. And the docs are very, very sparse. And then he has a blog post which is short but helpful. And so it was funny because I asked ChatGPT, research this, and the only thing it found were those two things, right?

Kristyna Ferris (25:43.535)
It's awesome.

Kristyna Ferris (26:01.424)
Yeah.

Eugene Meidinger (26:02.68)
So even then, sometimes there's just not a ton of stuff out there, right? And I'm sure if someone was using Spark all the time, they could figure it out. But for me, it was challenging.

Kristyna Ferris (26:10.108)
Mm-hmm.

Well, you know, what's helped me the most is actually finding people who are doing things like the SemPy library. And it's not explicitly called out what they're doing, but you can kind of learn from them the steps. Like, this is how they made that kind of data frame. This is how they pulled the data in. So that's been helping me a lot. I also have been learning. I mean, I've been doing this for a few months now, you know, cut my teeth on it a bit, but I think,

Eugene Meidinger (26:19.143)
sure.

Eugene Meidinger (26:39.286)
Yeah,

Kristyna Ferris (26:43.758)
One of the first things that got me off the starting block and might be the same for other Power BI specific users is looking at all the admin stuff you can do. So running things like from the SemPy library about my environment, that was interesting to me and got me learning the Python. So if you're trying to learn something and you're a Power BI admin, why not try that?

Kristyna Ferris (27:11.162)
and then at least you can learn it without having to do a bunch of crazy stuff.

Eugene Meidinger (27:15.06)
Yeah, that, I think that makes sense because it's something practical and you're invested in it. Because one of my biggest challenges, and this is like a personal roadblock when it comes to learning PySpark, is I have a Pluralsight subscription because I'm an author and I have a LinkedIn Learning subscription and I'll try to like watch a course. And the problem is it's like baby's first data frame. And they'll be like, here's how you can filter data.

Here's how you can project columns. And I'm like, I know SQL. Why, why? Why would I do this? Even if I'm in Spark, I can do Spark SQL. You know what I mean? And so there's so much stuff that does not properly show off the power and capability and efficiencies of Spark because it's like, they're just doing it a different way. And it's so hard to understand that.

Kristyna Ferris (27:46.812)
Yep.

Kristyna Ferris (28:02.011)
Hmm.

Kristyna Ferris (28:07.248)
right.

Eugene Meidinger (28:11.586)
breaking point that you talk about, right? What is the straw that breaks the camel's back that would cause me to go to PySpark? If I know Power Query and I know SQL, I can get 95, 99% of what I need done and decently well, especially if I know how to push it back. It's when you start getting into like certain scenarios where it's like, no, you really need to learn some Spark. And identifying those is hard, right?

Kristyna Ferris (28:33.392)
Well, it is. I always think of it this way, you know, like path of least resistance, my friend. So I tend to put whatever I can in SQL, especially if I'm working with teams that are SQL oriented, right? SQL is their strong suit.

Eugene Meidinger (28:47.436)
Yeah. It's the universal language of data, other than Excel. And we really should not put it in Fabric.

Kristyna Ferris (28:51.525)
Yeah.

No, don't, don't do that. Yeah. But can you imagine running DAX queries as your ETL strategy? My gosh. Kill me.

Eugene Meidinger (29:01.358)
I will say I did some earlier benchmarks with local sources just to test it. I tested file formats, or like local SQL. So the fastest one was Parquet, no-brainer. Do you want to guess what the second fastest format was? This is a file type.

Kristyna Ferris (29:11.526)
Mm-hmm.

Kristyna Ferris (29:20.87)
JSON.

Eugene Meidinger (29:22.29)
JSON was awful. MS Access. MS Access! Parquet took seven seconds to read a hundred million rows of sales data. MS Access took nine. Everything else, like CSV and stuff, was taking like 17, and then JSON was like 50. It was awful. It's so bad, it's so bad. Yeah, no, MS Access.
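
A read-time comparison like the one described here could be set up roughly as follows. This is only a minimal sketch, not the benchmark from the episode: the sales schema, the row count, and every function name are hypothetical, and it covers only CSV and JSON because those can be read with the Python standard library (Parquet and MS Access need extra drivers). Timings on a toy file say nothing about a hundred million rows, so the sketch reports them without claiming any ranking.

```python
import csv
import json
import os
import tempfile
import time

FIELDS = ["order_id", "region", "amount"]  # hypothetical sales schema


def make_rows(n):
    """Build n synthetic sales rows."""
    return [
        {"order_id": i, "region": "R" + str(i % 5), "amount": i * 1.5}
        for i in range(n)
    ]


def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def write_json(path, rows):
    with open(path, "w") as f:
        json.dump(rows, f)


def read_csv(path):
    # CSV carries no types, so cast each column back by hand.
    with open(path, newline="") as f:
        return [
            {
                "order_id": int(r["order_id"]),
                "region": r["region"],
                "amount": float(r["amount"]),
            }
            for r in csv.DictReader(f)
        ]


def read_json(path):
    with open(path) as f:
        return json.load(f)


def run_benchmark(n=10_000):
    """Write the same rows in each format, then time reading them back."""
    rows = make_rows(n)
    formats = [("csv", write_csv, read_csv), ("json", write_json, read_json)]
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, writer, reader in formats:
            path = os.path.join(tmp, "sales." + name)
            writer(path, rows)
            start = time.perf_counter()
            loaded = reader(path)
            seconds = time.perf_counter() - start
            results[name] = {
                "rows": len(loaded),
                "seconds": seconds,
                "bytes": os.path.getsize(path),
                "same": loaded == rows,  # every format must return identical data
            }
    return results


if __name__ == "__main__":
    for name, stats in run_benchmark().items():
        print(name, stats)
```

Checking that each format returns identical rows before comparing times keeps the comparison fair, and recording file size alongside read time lets size and speed be looked at together.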

Kristyna Ferris (29:24.028)
MS Access.

Ahhhh!

Kristyna Ferris (29:36.438)
What?

Kristyna Ferris (29:44.092)
I just threw JSON out there. I was like, it's not going to be CSV. So it feels obvious. Yeah.

Eugene Meidinger (29:50.114)
That's the big data technology of the future. It was like 10 times as big. Like it was literally like half a gigabyte for the file and now it goes up to two gigabytes. So it doesn't compress like Parquet, but if you need to read with Power BI, just... Yeah. All right. So I think we're starting to come up on time. Two quick questions. So one, what is your favorite Fabric feature? If you can think of something. Or your least favorite, we can bash on it, it's up to you.

Kristyna Ferris (30:00.336)
Ha ha ha!

I guess, yeah.

Kristyna Ferris (30:09.306)
yes.

Kristyna Ferris (30:15.396)
My favorite fabric.

Kristyna Ferris (30:20.188)
I'll go negative then positive because I don't want to be negative because I like using it. Yep. So the negative for me, MERGE and IDENTITY statements in a warehouse. And I can't query anything from the SQL endpoint inside of Azure SQL, which to me is bonkers. I would love to be able to just insert from my lakehouse directly into an Azure SQL environment. Right now you can't do that.

Eugene Meidinger (30:24.034)
Yeah, yeah, yeah, Compliment sandwich. Let's do it.

Eugene Meidinger (30:42.382)
Yeah.

Kristyna Ferris (30:49.796)
Things I love and adore, number one has to be no longer having to call Power BI REST APIs and just using the SemPy notebook. I have made my own like tenant activity trackers because I don't love the built-in ones, and that has been amazing. That and Best Practice Analyzer, having that just built into my processes now.

Eugene Meidinger (31:07.704)
Yeah, yeah, yeah.

that off.

Eugene Meidinger (31:16.824)
Built... it works with SemPy? Is that what... that's awesome. I, I, this is, this is moving higher up on my backlog of things to learn. Michael Kovalsky is everywhere. It's creepy. Like, 'cause I was trying to list out for my course, I was trying to list out tools for performance tuning. And it's like, all right. So he wrote the Best Practice Analyzer for the model and for reporting. And he's involved with like the Semantic Link Labs and all this stuff. And it's like, when does this man sleep?

Kristyna Ferris (31:19.708)
Yeah. So Semantic Link Labs. Highly recommend.

He is.

Kristyna Ferris (31:41.532)
you

Kristyna Ferris (31:46.671)
I don't

Eugene Meidinger (31:47.586)
Yeah, no, that's awesome. I need to learn more of that. And so then the other question is, you know, where can people find you? There's like 20 social networks now. And then like, is there anything you wanna, I can't think of the word, but like, you know, pitch, push, publish, anything you're doing or just like, I don't know.

Kristyna Ferris (32:07.43)
Yeah, so I co-write a blog with my dad. So it's a technical blog, dataonwheels.com. Definitely love doing that with him. I'm also on LinkedIn and then I'm on X and Bluesky. So, but I check LinkedIn the most. It's probably a good way to put that.

Eugene Meidinger (32:13.836)
Yep, perfect.

Eugene Meidinger (32:24.782)
Yeah, the pain is real if you want to have any sort of personal brand or marketing presence. My Linktree is like a mile long, but very cool. Well, it has been a pleasure talking with you. Thank you so much, and I'm glad you could share some of your expertise.

Kristyna Ferris (32:32.048)
Yeah. Yeah.

Kristyna Ferris (32:41.562)
Thanks. Yeah, this is great.

Eugene Meidinger (32:43.778)
All right, and stop.

Eugene Meidinger

Host

Kristyna Ferris

Guest